2016-04-06 15:13:39

by Tyler Baicar

[permalink] [raw]
Subject: [PATCH V2 0/9] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64

Add support for Generic Hardware Error Source (GHES) v2, which introduces the
capability for the OS to acknowledge the consumption of the error record
generated by the Reliability, Availability and Serviceability (RAS) controller.
This eliminates potential race conditions between the OS and the RAS controller.

Add support for the timestamp field added to the Generic Error Data Entry v3,
allowing the OS to log the time that the error is generated by the firmware,
rather than the time the error is consumed. This improves the correctness of
event sequences when analyzing error logs. The timestamp is added in
ACPI 6.1, reference Table 18-343 Generic Error Data Entry.

Add support for ARMv8 Common Platform Error Record (CPER) per UEFI 2.6
specification. ARMv8 specific processor error information is reported as part of
the CPER records. This provides more detail on for processor error logs.

Synchronous External Abort (SEA) represents a specific processor error condition
in ARM systems. A handler is added to recognize SEA errors, and a notifier is
added to parse and report the errors before the process is killed. Refer to
section N.2.1.1 in the Common Platform Error Record appendix of the UEFI 2.6
specification.

Currently the kernel ignores CPER records that are unrecognized.
On the other hand, UEFI spec allows for non-standard (eg. vendor
proprietary) error section type in CPER (Common Platform Error Record),
as defined in section N2.3 of UEFI version 2.5. Therefore, user
is not able to see hardware error data of non-standard section.

If section Type field of Generic Error Data Entry is unrecognized,
prints out the raw data in dmesg buffer, and also adds a tracepoint
for reporting such hardware errors.

Currently even if an error status block's severity is fatal, the kernel
does not honor the severity level and panic. With the firmware first
model, the platform could inform the OS about a fatal hardware error
through the non-NMI GHES notification type. The OS should panic when a
hardware error record is received with this severity.

Depends on: [PATCH v9] acpi, apei, arm64: APEI initial support for aarch64.
https://lkml.org/lkml/2016/4/5/522

Depends on: [PATCH 00/30] ACPICA: 20160318 Release
https://lkml.org/lkml/2016/3/23/649

V2: Add PSCI state print for the ARMv8 error type.
Separate timestamp year into year and century using BCD format.
Rebase on top of ACPICA 20160318 release and remove header file changes
in include/acpi/actbl1.h.
Add panic OS with fatal error status block patch.
Add processing of unrecognized CPER error section patches with updates
from previous comments. Original patches: https://lkml.org/lkml/2015/9/8/646

V1: https://lkml.org/lkml/2016/2/5/544

Jonathan (Zhixiong) Zhang (1):
acpi: apei: panic OS with fatal error status block

Tyler Baicar (8):
acpi: apei: read ack upon ghes record consumption
ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
efi: parse ARMv8 processor error
arm64: exception: handle Synchronous External Abort
arm64: exception: handle instruction abort at current EL
acpi: apei: handle SEA notification type for ARMv8
efi: print unrecognized CPER section
ras: acpi / apei: generate trace event for unrecognized CPER section

arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/system_misc.h | 13 ++
arch/arm64/kernel/entry.S | 19 +++
arch/arm64/mm/fault.c | 58 ++++++--
drivers/acpi/apei/Kconfig | 14 ++
drivers/acpi/apei/ghes.c | 188 ++++++++++++++++++++++++-
drivers/acpi/apei/hest.c | 7 +-
drivers/firmware/efi/cper.c | 264 ++++++++++++++++++++++++++++++++---
drivers/ras/ras.c | 1 +
include/acpi/ghes.h | 1 +
include/linux/cper.h | 72 ++++++++++
include/ras/ras_event.h | 45 ++++++
12 files changed, 646 insertions(+), 37 deletions(-)

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


2016-04-06 15:13:46

by Tyler Baicar

[permalink] [raw]
Subject: [PATCH V2 1/9] acpi: apei: read ack upon ghes record consumption

A RAS (Reliability, Availability, Serviceability) controller
may be a separate processor running in parallel with OS
execution, and may generate error records for consumption by
the OS. If the RAS controller produces multiple error records,
then they may be overwritten before the OS has consumed them.

The Generic Hardware Error Source (GHES) v2 structure
introduces the capability for the OS to acknowledge the
consumption of the error record generated by the RAS
controller. A RAS controller supporting GHESv2 shall wait for
the acknowledgment before writing a new error record, thus
eliminating the race condition.

Signed-off-by: Jonathan (Zhixiong) Zhang <[email protected]>
Signed-off-by: Richard Ruigrok <[email protected]>
Signed-off-by: Tyler Baicar <[email protected]>
Signed-off-by: Naveen Kaje <[email protected]>
---
drivers/acpi/apei/ghes.c | 41 +++++++++++++++++++++++++++++++++++++++++
drivers/acpi/apei/hest.c | 7 +++++--
include/acpi/ghes.h | 1 +
3 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 60746ef..9b0543e 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -45,6 +45,7 @@
#include <linux/aer.h>
#include <linux/nmi.h>

+#include <acpi/actbl1.h>
#include <acpi/ghes.h>
#include <acpi/apei.h>
#include <asm/tlbflush.h>
@@ -244,10 +245,22 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)
struct ghes *ghes;
unsigned int error_block_length;
int rc;
+ struct acpi_hest_header *hest_hdr;

ghes = kzalloc(sizeof(*ghes), GFP_KERNEL);
if (!ghes)
return ERR_PTR(-ENOMEM);
+
+ hest_hdr = (struct acpi_hest_header *)generic;
+ if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR_V2) {
+ ghes->generic_v2 = (struct acpi_hest_generic_v2 *)generic;
+ rc = apei_map_generic_address(
+ &ghes->generic_v2->read_ack_register);
+ if (rc)
+ goto err_unmap;
+ } else
+ ghes->generic_v2 = NULL;
+
ghes->generic = generic;
rc = apei_map_generic_address(&generic->error_status_address);
if (rc)
@@ -270,6 +283,9 @@ static struct ghes *ghes_new(struct acpi_hest_generic *generic)

err_unmap:
apei_unmap_generic_address(&generic->error_status_address);
+ if (ghes->generic_v2)
+ apei_unmap_generic_address(
+ &ghes->generic_v2->read_ack_register);
err_free:
kfree(ghes);
return ERR_PTR(rc);
@@ -279,6 +295,9 @@ static void ghes_fini(struct ghes *ghes)
{
kfree(ghes->estatus);
apei_unmap_generic_address(&ghes->generic->error_status_address);
+ if (ghes->generic_v2)
+ apei_unmap_generic_address(
+ &ghes->generic_v2->error_status_address);
}

static inline int ghes_severity(int severity)
@@ -648,6 +667,22 @@ static void ghes_estatus_cache_add(
rcu_read_unlock();
}

+static int ghes_do_read_ack(struct acpi_hest_generic_v2 *generic_v2)
+{
+ int rc;
+ u64 val = 0;
+
+ rc = apei_read(&val, &generic_v2->read_ack_register);
+ if (rc)
+ return rc;
+ val &= generic_v2->read_ack_preserve <<
+ generic_v2->read_ack_register.bit_offset;
+ val |= generic_v2->read_ack_write;
+ rc = apei_write(val, &generic_v2->read_ack_register);
+
+ return rc;
+}
+
static int ghes_proc(struct ghes *ghes)
{
int rc;
@@ -660,6 +695,12 @@ static int ghes_proc(struct ghes *ghes)
ghes_estatus_cache_add(ghes->generic, ghes->estatus);
}
ghes_do_proc(ghes, ghes->estatus);
+
+ if (ghes->generic_v2) {
+ rc = ghes_do_read_ack(ghes->generic_v2);
+ if (rc)
+ return rc;
+ }
out:
ghes_clear_estatus(ghes);
return 0;
diff --git a/drivers/acpi/apei/hest.c b/drivers/acpi/apei/hest.c
index 792a0d9..ef725a9 100644
--- a/drivers/acpi/apei/hest.c
+++ b/drivers/acpi/apei/hest.c
@@ -52,6 +52,7 @@ static const int hest_esrc_len_tab[ACPI_HEST_TYPE_RESERVED] = {
[ACPI_HEST_TYPE_AER_ENDPOINT] = sizeof(struct acpi_hest_aer),
[ACPI_HEST_TYPE_AER_BRIDGE] = sizeof(struct acpi_hest_aer_bridge),
[ACPI_HEST_TYPE_GENERIC_ERROR] = sizeof(struct acpi_hest_generic),
+ [ACPI_HEST_TYPE_GENERIC_ERROR_V2] = sizeof(struct acpi_hest_generic_v2),
};

static int hest_esrc_len(struct acpi_hest_header *hest_hdr)
@@ -146,7 +147,8 @@ static int __init hest_parse_ghes_count(struct acpi_hest_header *hest_hdr, void
{
int *count = data;

- if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR)
+ if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR ||
+ hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR_V2)
(*count)++;
return 0;
}
@@ -157,7 +159,8 @@ static int __init hest_parse_ghes(struct acpi_hest_header *hest_hdr, void *data)
struct ghes_arr *ghes_arr = data;
int rc, i;

- if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR)
+ if (hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR &&
+ hest_hdr->type != ACPI_HEST_TYPE_GENERIC_ERROR_V2)
return 0;

if (!((struct acpi_hest_generic *)hest_hdr)->enabled)
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 720446c..d0108b6 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -14,6 +14,7 @@

struct ghes {
struct acpi_hest_generic *generic;
+ struct acpi_hest_generic_v2 *generic_v2;
struct acpi_hest_generic_status *estatus;
u64 buffer_paddr;
unsigned long flags;
--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2016-04-06 15:13:54

by Tyler Baicar

[permalink] [raw]
Subject: [PATCH V2 2/9] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1

Currently when a RAS error is reported it is not timestamped.
The ACPI 6.1 spec adds the timestamp field to the generic error
data entry v3 structure. The timestamp of when the firmware
generated the error is now being reported.

Signed-off-by: Jonathan (Zhixiong) Zhang <[email protected]>
Signed-off-by: Richard Ruigrok <[email protected]>
Signed-off-by: Tyler Baicar <[email protected]>
Signed-off-by: Naveen Kaje <[email protected]>
---
drivers/acpi/apei/ghes.c | 35 ++++++++++++--
drivers/firmware/efi/cper.c | 111 +++++++++++++++++++++++++++++++++++++-------
2 files changed, 126 insertions(+), 20 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 9b0543e..a6848706 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -419,7 +419,15 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
int flags = -1;
int sec_sev = ghes_severity(gdata->error_severity);
struct cper_sec_mem_err *mem_err;
- mem_err = (struct cper_sec_mem_err *)(gdata + 1);
+ struct acpi_hest_generic_data_v300 *gdata_v3 = NULL;
+
+ if ((gdata->revision >> 8) >= 0x03)
+ gdata_v3 = (struct acpi_hest_generic_data_v300 *)gdata;
+
+ if (gdata_v3)
+ mem_err = (struct cper_sec_mem_err *)(gdata_v3 + 1);
+ else
+ mem_err = (struct cper_sec_mem_err *)(gdata + 1);

if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
return;
@@ -449,14 +457,27 @@ static void ghes_do_proc(struct ghes *ghes,
{
int sev, sec_sev;
struct acpi_hest_generic_data *gdata;
+ struct acpi_hest_generic_data_v300 *gdata_v3 = NULL;
+ uuid_le sec_type;

sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
sec_sev = ghes_severity(gdata->error_severity);
- if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
+ sec_type = *(uuid_le *)gdata->section_type;
+
+ if ((gdata->revision >> 8) >= 0x03)
+ gdata_v3 = (struct acpi_hest_generic_data_v300 *)gdata;
+
+ if (!uuid_le_cmp(sec_type,
CPER_SEC_PLATFORM_MEM)) {
struct cper_sec_mem_err *mem_err;
- mem_err = (struct cper_sec_mem_err *)(gdata+1);
+
+ if (gdata_v3)
+ mem_err = (struct cper_sec_mem_err *)
+ (gdata_v3 + 1);
+ else
+ mem_err = (struct cper_sec_mem_err *)
+ (gdata + 1);
ghes_edac_report_mem_error(ghes, sev, mem_err);

arch_apei_report_mem_error(sev, mem_err);
@@ -466,7 +487,13 @@ static void ghes_do_proc(struct ghes *ghes,
else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
CPER_SEC_PCIE)) {
struct cper_sec_pcie *pcie_err;
- pcie_err = (struct cper_sec_pcie *)(gdata+1);
+
+ if (gdata_v3)
+ pcie_err = (struct cper_sec_pcie *)
+ (gdata_v3 + 1);
+ else
+ pcie_err = (struct cper_sec_pcie *)
+ (gdata + 1);
if (sev == GHES_SEV_RECOVERABLE &&
sec_sev == GHES_SEV_RECOVERABLE &&
pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index d425374..23f62962 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -32,6 +32,8 @@
#include <linux/acpi.h>
#include <linux/pci.h>
#include <linux/aer.h>
+#include <linux/printk.h>
+#include <linux/bcd.h>

#define INDENT_SP " "

@@ -392,6 +394,10 @@ static void cper_estatus_print_section(
uuid_le *sec_type = (uuid_le *)gdata->section_type;
__u16 severity;
char newpfx[64];
+ struct acpi_hest_generic_data_v300 *gdata_v3 = NULL;
+
+ if ((gdata->revision >> 8) >= 0x03)
+ gdata_v3 = (struct acpi_hest_generic_data_v300 *)gdata;

severity = gdata->error_severity;
printk("%s""Error %d, type: %s\n", pfx, sec_no,
@@ -403,14 +409,24 @@ static void cper_estatus_print_section(

snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
- struct cper_sec_proc_generic *proc_err = (void *)(gdata + 1);
+ struct cper_sec_proc_generic *proc_err;
+
+ if (gdata_v3)
+ proc_err = (void *)(gdata_v3 + 1);
+ else
+ proc_err = (void *)(gdata + 1);
printk("%s""section_type: general processor error\n", newpfx);
if (gdata->error_data_length >= sizeof(*proc_err))
cper_print_proc_generic(newpfx, proc_err);
else
goto err_section_too_small;
} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
- struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
+ struct cper_sec_mem_err *mem_err;
+
+ if (gdata_v3)
+ mem_err = (void *)(gdata_v3 + 1);
+ else
+ mem_err = (void *)(gdata + 1);
printk("%s""section_type: memory error\n", newpfx);
if (gdata->error_data_length >=
sizeof(struct cper_sec_mem_err_old))
@@ -419,7 +435,12 @@ static void cper_estatus_print_section(
else
goto err_section_too_small;
} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
- struct cper_sec_pcie *pcie = (void *)(gdata + 1);
+ struct cper_sec_pcie *pcie;
+
+ if (gdata_v3)
+ pcie = (void *)(gdata_v3 + 1);
+ else
+ pcie = (void *)(gdata + 1);
printk("%s""section_type: PCIe error\n", newpfx);
if (gdata->error_data_length >= sizeof(*pcie))
cper_print_pcie(newpfx, pcie, gdata);
@@ -434,10 +455,38 @@ err_section_too_small:
pr_err(FW_WARN "error section length is too small\n");
}

+static void cper_estatus_print_section_v300(const char *pfx,
+ const struct acpi_hest_generic_data_v300 *gdata, int sec_no)
+{
+ __u8 hour, min, sec, day, mon, year, century, *timestamp;
+
+ if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
+ timestamp = (__u8 *)&(gdata->time_stamp);
+ memcpy(&sec, timestamp, 1);
+ memcpy(&min, timestamp + 1, 1);
+ memcpy(&hour, timestamp + 2, 1);
+ memcpy(&day, timestamp + 4, 1);
+ memcpy(&mon, timestamp + 5, 1);
+ memcpy(&year, timestamp + 6, 1);
+ memcpy(&century, timestamp + 7, 1);
+ printk("%stime: ", pfx);
+ printk("%7s", 0x01 & *(timestamp + 3) ? "precise" : "");
+ printk(" %02d:%02d:%02d %02d%02d-%02d-%02d\n",
+ bcd2bin(hour), bcd2bin(min), bcd2bin(sec),
+ bcd2bin(century), bcd2bin(year), bcd2bin(mon),
+ bcd2bin(day));
+ }
+
+ cper_estatus_print_section(pfx,
+ (const struct acpi_hest_generic_data *)gdata,
+ sec_no);
+}
+
void cper_estatus_print(const char *pfx,
const struct acpi_hest_generic_status *estatus)
{
struct acpi_hest_generic_data *gdata;
+ struct acpi_hest_generic_data_v300 *gdata_v3 = NULL;
unsigned int data_len, gedata_len;
int sec_no = 0;
char newpfx[64];
@@ -451,13 +500,28 @@ void cper_estatus_print(const char *pfx,
printk("%s""event severity: %s\n", pfx, cper_severity_str(severity));
data_len = estatus->data_length;
gdata = (struct acpi_hest_generic_data *)(estatus + 1);
+ if ((gdata->revision >> 8) >= 0x03)
+ gdata_v3 = (struct acpi_hest_generic_data_v300 *)gdata;
+
snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
- while (data_len >= sizeof(*gdata)) {
- gedata_len = gdata->error_data_length;
- cper_estatus_print_section(newpfx, gdata, sec_no);
- data_len -= gedata_len + sizeof(*gdata);
- gdata = (void *)(gdata + 1) + gedata_len;
- sec_no++;
+
+ if (gdata_v3) {
+ while (data_len >= sizeof(*gdata_v3)) {
+ gedata_len = gdata_v3->error_data_length;
+ cper_estatus_print_section_v300(newpfx, gdata_v3,
+ sec_no);
+ data_len -= gedata_len + sizeof(*gdata_v3);
+ gdata_v3 = (void *)(gdata_v3 + 1) + gedata_len;
+ sec_no++;
+ }
+ } else {
+ while (data_len >= sizeof(*gdata)) {
+ gedata_len = gdata->error_data_length;
+ cper_estatus_print_section(newpfx, gdata, sec_no);
+ data_len -= gedata_len + sizeof(*gdata);
+ gdata = (void *)(gdata + 1) + gedata_len;
+ sec_no++;
+ }
}
}
EXPORT_SYMBOL_GPL(cper_estatus_print);
@@ -478,6 +542,7 @@ EXPORT_SYMBOL_GPL(cper_estatus_check_header);
int cper_estatus_check(const struct acpi_hest_generic_status *estatus)
{
struct acpi_hest_generic_data *gdata;
+ struct acpi_hest_generic_data_v300 *gdata_v3 = NULL;
unsigned int data_len, gedata_len;
int rc;

@@ -486,15 +551,29 @@ int cper_estatus_check(const struct acpi_hest_generic_status *estatus)
return rc;
data_len = estatus->data_length;
gdata = (struct acpi_hest_generic_data *)(estatus + 1);
- while (data_len >= sizeof(*gdata)) {
- gedata_len = gdata->error_data_length;
- if (gedata_len > data_len - sizeof(*gdata))
+
+ if ((gdata->revision >> 8) >= 0x03) {
+ gdata_v3 = (struct acpi_hest_generic_data_v300 *)gdata;
+ while (data_len >= sizeof(*gdata_v3)) {
+ gedata_len = gdata_v3->error_data_length;
+ if (gedata_len > data_len - sizeof(*gdata_v3))
+ return -EINVAL;
+ data_len -= gedata_len + sizeof(*gdata_v3);
+ gdata_v3 = (void *)(gdata_v3 + 1) + gedata_len;
+ }
+ if (data_len)
+ return -EINVAL;
+ } else {
+ while (data_len >= sizeof(*gdata)) {
+ gedata_len = gdata->error_data_length;
+ if (gedata_len > data_len - sizeof(*gdata))
+ return -EINVAL;
+ data_len -= gedata_len + sizeof(*gdata);
+ gdata = (void *)(gdata + 1) + gedata_len;
+ }
+ if (data_len)
return -EINVAL;
- data_len -= gedata_len + sizeof(*gdata);
- gdata = (void *)(gdata + 1) + gedata_len;
}
- if (data_len)
- return -EINVAL;

return 0;
}
--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2016-04-06 15:14:14

by Tyler Baicar

[permalink] [raw]
Subject: [PATCH V2 5/9] arm64: exception: handle instruction abort at current EL

Add a handler for instruction aborts at the current EL
(ESR_ELx_EC_IABT_CUR) so they are no longer handled in el1_inv.
This allows firmware first handling for possible SEA
(Synchronous External Abort) caused instruction abort at
current EL.

Signed-off-by: Tyler Baicar <[email protected]>
Signed-off-by: Naveen Kaje <[email protected]>
---
arch/arm64/kernel/entry.S | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 12e8d2b..f257856 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -336,6 +336,8 @@ el1_sync:
lsr x24, x1, #ESR_ELx_EC_SHIFT // exception class
cmp x24, #ESR_ELx_EC_DABT_CUR // data abort in EL1
b.eq el1_da
+ cmp x24, #ESR_ELx_EC_IABT_CUR // instruction abort in EL1
+ b.eq el1_ia
cmp x24, #ESR_ELx_EC_SYS64 // configurable trap
b.eq el1_undef
cmp x24, #ESR_ELx_EC_SP_ALIGN // stack alignment exception
@@ -363,6 +365,23 @@ el1_da:
// disable interrupts before pulling preserved data off the stack
disable_irq
kernel_exit 1
+el1_ia:
+ /*
+ * Instruction abort handling
+ */
+ mrs x0, far_el1
+ enable_dbg
+ // re-enable interrupts if they were enabled in the aborted context
+ tbnz x23, #7, 1f // PSR_I_BIT
+ enable_irq
+1:
+ orr x1, x1, #1 << 24 // use reserved ISS bit for instruction aborts
+ mov x2, sp // struct pt_regs
+ bl do_mem_abort
+
+ // disable interrupts before pulling preserved data off the stack
+ disable_irq
+ kernel_exit 1
el1_sp_pc:
/*
* Stack or PC alignment exception handling
--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2016-04-06 15:14:26

by Tyler Baicar

[permalink] [raw]
Subject: [PATCH V2 7/9] acpi: apei: panic OS with fatal error status block

From: "Jonathan (Zhixiong) Zhang" <[email protected]>

Even if an error status block's severity is fatal, the kernel does not
honor the severity level and panic.

With the firmware first model, the platform could inform the OS about a
fatal hardware error through the non-NMI GHES notification type. The OS
should panic when a hardware error record is received with this
severity.

Call panic() after CPER data in error status block is printed if
severity is fatal, before each error section is handled.

Signed-off-by: Jonathan (Zhixiong) Zhang <[email protected]>

---
drivers/acpi/apei/ghes.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index aae5ca0..3814895 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -137,6 +137,8 @@ static unsigned long ghes_estatus_pool_size_request;
static struct ghes_estatus_cache *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
static atomic_t ghes_estatus_cache_alloced;

+static int ghes_panic_timeout __read_mostly = 30;
+
static int ghes_ioremap_init(void)
{
ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
@@ -725,6 +727,12 @@ static int ghes_proc(struct ghes *ghes)
if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
ghes_estatus_cache_add(ghes->generic, ghes->estatus);
}
+ if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
+ if (panic_timeout == 0)
+ panic_timeout = ghes_panic_timeout;
+ panic("Fatal hardware error!");
+ }
+
ghes_do_proc(ghes, ghes->estatus);

if (ghes->generic_v2) {
@@ -869,8 +877,6 @@ static atomic_t ghes_in_nmi = ATOMIC_INIT(0);

static LIST_HEAD(ghes_nmi);

-static int ghes_panic_timeout __read_mostly = 30;
-
static void ghes_proc_in_irq(struct irq_work *irq_work)
{
struct llist_node *llnode, *next;
--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2016-04-06 15:14:42

by Tyler Baicar

[permalink] [raw]
Subject: [PATCH V2 9/9] ras: acpi / apei: generate trace event for unrecognized CPER section

UEFI spec allows for non-standard section in Common Platform Error
Record. This is defined in section N.2.3 of UEFI version 2.5.

Currently if the CPER section's type (UUID) does not match with
any section type that the kernel knows how to parse, trace event
is not generated for such section. And thus user is not able to know
happening of such hardware error, including error record of
non-standard section.

This commit generates a trace event which contains raw error data
for unrecognized CPER section.

Signed-off-by: Jonathan (Zhixiong) Zhang <[email protected]>
---
drivers/acpi/apei/ghes.c | 19 ++++++++++++++++++-
drivers/ras/ras.c | 1 +
include/ras/ras_event.h | 45 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 3814895..dcfd879 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -49,6 +49,7 @@
#include <acpi/ghes.h>
#include <acpi/apei.h>
#include <asm/tlbflush.h>
+#include <ras/ras_event.h>

#ifdef CONFIG_HAVE_ACPI_APEI_SEA
#include <asm/system_misc.h>
@@ -465,6 +466,9 @@ static void ghes_do_proc(struct ghes *ghes,
struct acpi_hest_generic_data *gdata;
struct acpi_hest_generic_data_v300 *gdata_v3 = NULL;
uuid_le sec_type;
+ uuid_le *fru_id;
+ char *fru_text = "";
+ void *raw_err;

sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
@@ -474,6 +478,13 @@ static void ghes_do_proc(struct ghes *ghes,
if ((gdata->revision >> 8) >= 0x03)
gdata_v3 = (struct acpi_hest_generic_data_v300 *)gdata;

+ if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
+ fru_id = (uuid_le *)gdata->fru_id;
+ else
+ fru_id = &NULL_UUID_LE;
+ if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
+ fru_text = gdata->fru_text;
+
if (!uuid_le_cmp(sec_type,
CPER_SEC_PLATFORM_MEM)) {
struct cper_sec_mem_err *mem_err;
@@ -490,7 +501,7 @@ static void ghes_do_proc(struct ghes *ghes,
ghes_handle_memory_failure(gdata, sev);
}
#ifdef CONFIG_ACPI_APEI_PCIEAER
- else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
+ else if (!uuid_le_cmp(sec_type,
CPER_SEC_PCIE)) {
struct cper_sec_pcie *pcie_err;

@@ -528,6 +539,12 @@ static void ghes_do_proc(struct ghes *ghes,

}
#endif
+ else {
+ raw_err = gdata + 1;
+ trace_unknown_sec_event(&sec_type,
+ fru_id, fru_text, sec_sev,
+ raw_err, gdata->error_data_length);
+ }
}
}

diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
index b67dd36..fb2500b 100644
--- a/drivers/ras/ras.c
+++ b/drivers/ras/ras.c
@@ -27,3 +27,4 @@ subsys_initcall(ras_init);
EXPORT_TRACEPOINT_SYMBOL_GPL(extlog_mem_event);
#endif
EXPORT_TRACEPOINT_SYMBOL_GPL(mc_event);
+EXPORT_TRACEPOINT_SYMBOL_GPL(unknown_sec_event);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 1443d79..6782ca7 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -162,6 +162,51 @@ TRACE_EVENT(mc_event,
);

/*
+ * Unknown Section Report
+ *
+ * This event is generated when hardware detected a hardware
+ * error event, which may be of non-standard section as defined
+ * in UEFI spec appendix "Common Platform Error Record", or may
+ * be of sections for which TRACE_EVENT is not defined.
+ *
+ */
+TRACE_EVENT(unknown_sec_event,
+
+ TP_PROTO(const uuid_le *sec_type,
+ const uuid_le *fru_id,
+ const char *fru_text,
+ u8 sev,
+ const u8 *err,
+ const u32 len),
+
+ TP_ARGS(sec_type, fru_id, fru_text, sev, err, len),
+
+ TP_STRUCT__entry(
+ __array(char, sec_type, 16)
+ __array(char, fru_id, 16)
+ __string(fru_text, fru_text)
+ __field(u8, sev)
+ __field(u32, len)
+ __dynamic_array(u8, buf, len)
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->sec_type, sec_type, sizeof(uuid_le));
+ memcpy(__entry->fru_id, fru_id, sizeof(uuid_le));
+ __assign_str(fru_text, fru_text);
+ __entry->sev = sev;
+ __entry->len = len;
+ memcpy(__get_dynamic_array(buf), err, len);
+ ),
+
+ TP_printk("severity: %d; sec type:%pU; FRU: %pU %s; data len:%d; raw data:%s",
+ __entry->sev, __entry->sec_type,
+ __entry->fru_id, __get_str(fru_text),
+ __entry->len,
+ __print_hex(__get_dynamic_array(buf), __entry->len))
+);
+
+/*
* PCIe AER Trace event
*
* These events are generated when hardware detects a corrected or
--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2016-04-06 15:14:36

by Tyler Baicar

[permalink] [raw]
Subject: [PATCH V2 8/9] efi: print unrecognized CPER section

UEFI spec allows for non-standard section in Common Platform Error
Record. This is defined in section N.2.3 of UEFI version 2.5.

Currently if the CPER section's type (UUID) does not match with
one of the section types that the kernel knows how to parse, the
section is skipped. Therefore, user is not able to see
such CPER data, for instance, error record of non-standard section.

For above mentioned case, this change prints out the raw data in
hex in dmesg buffer. Data length is taken from Error Data length
field of Generic Error Data Entry.

Following is a sample output from dmesg:
[ 115.771702] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 115.779042] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 115.787456] {1}[Hardware Error]: event severity: corrected
[ 115.792927] {1}[Hardware Error]: Error 0, type: corrected
[ 115.798415] {1}[Hardware Error]: fru_id: 00000000-0000-0000-0000-000000000000
[ 115.805596] {1}[Hardware Error]: fru_text:
[ 115.816105] {1}[Hardware Error]: section type: d2e2621c-f936-468d-0d84-15a4ed015c8b
[ 115.823880] {1}[Hardware Error]: section length: 88
[ 115.828779] {1}[Hardware Error]: 00000000: 01000001 00000002 5f434345 525f4543
[ 115.836153] {1}[Hardware Error]: 00000010: 0000574d 00000000 00000000 00000000
[ 115.843531] {1}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
[ 115.850908] {1}[Hardware Error]: 00000030: 00000000 00000000 00000000 00000000
[ 115.858288] {1}[Hardware Error]: 00000040: fe800000 00000000 00000004 5f434345
[ 115.865665] {1}[Hardware Error]: 00000050: 525f4543 0000574d

Signed-off-by: Jonathan (Zhixiong) Zhang <[email protected]>
---
drivers/firmware/efi/cper.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 40373e7..b9bf94a 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -584,8 +584,19 @@ static void cper_estatus_print_section(
cper_print_proc_armv8(newpfx, armv8_err);
else
goto err_section_too_small;
- } else
- printk("%s""section type: unknown, %pUl\n", newpfx, sec_type);
+ } else {
+ const void *raw_err;
+
+ if (gdata_v3)
+ raw_err = (void *)(gdata_v3 + 1);
+ else
+ raw_err = (void *)(gdata + 1);
+ printk("%ssection type: %pUl\n", newpfx, sec_type);
+ printk("%ssection length: %d\n", newpfx,
+ gdata->error_data_length);
+ print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4, raw_err,
+ gdata->error_data_length, 0);
+ }

return;

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2016-04-06 15:14:34

by Tyler Baicar

[permalink] [raw]
Subject: [PATCH V2 6/9] acpi: apei: handle SEA notification type for ARMv8

ARM APEI extension proposal added SEA (Synchrounous External
Abort) notification type for ARMv8.
Add a new GHES error source handling function for SEA. If an error
source's notification type is SEA, then this function can be registered
into the SEA exception handler. That way GHES will parse and report
SEA exceptions when they occur.

Signed-off-by: Jonathan (Zhixiong) Zhang <[email protected]>
Signed-off-by: Tyler Baicar <[email protected]>
Signed-off-by: Naveen Kaje <[email protected]>
---
arch/arm64/Kconfig | 1 +
drivers/acpi/apei/Kconfig | 14 ++++++++
drivers/acpi/apei/ghes.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 98 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 08952ec..6a7e166 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -4,6 +4,7 @@ config ARM64
select ACPI_GENERIC_GSI if ACPI
select ACPI_REDUCED_HARDWARE_ONLY if ACPI
select HAVE_ACPI_APEI if ACPI
+ select HAVE_ACPI_APEI_SEA if ACPI
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
select ARCH_HAS_ELF_RANDOMIZE
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index b0140c8..2d6cd9b 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -4,6 +4,20 @@ config HAVE_ACPI_APEI
config HAVE_ACPI_APEI_NMI
bool

+config HAVE_ACPI_APEI_SEA
+ bool "APEI Synchronous External Abort logging/recovering support"
+ help
+ This option should be enabled if the system supports
+ firmware first handling of SEA (Synchronous External Abort).
+ SEA happens with certain faults of data abort or instruction
+ abort synchronous exceptions on ARMv8 systems. If a system
+ supports firmware first handling of SEA, the platform analyzes
+ and handles hardware error notifications with SEA, and it may then
+ form a HW error record for the OS to parse and handle. This
+ option allows the OS to look for such HW error record, and
+ take appropriate action.
+
config ACPI_APEI
bool "ACPI Platform Error Interface (APEI)"
select MISC_FILESYSTEMS
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a6848706..aae5ca0 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -50,6 +50,10 @@
#include <acpi/apei.h>
#include <asm/tlbflush.h>

+#ifdef CONFIG_HAVE_ACPI_APEI_SEA
+#include <asm/system_misc.h>
+#endif
+
#include "apei-internal.h"

#define GHES_PFX "GHES: "
@@ -789,6 +793,62 @@ static struct notifier_block ghes_notifier_sci = {
.notifier_call = ghes_notify_sci,
};

+#ifdef CONFIG_HAVE_ACPI_APEI_SEA
+static LIST_HEAD(ghes_sea);
+
+static int ghes_notify_sea(struct notifier_block *this,
+ unsigned long event, void *data)
+{
+ struct ghes *ghes;
+ int ret = NOTIFY_DONE;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(ghes, &ghes_sea, list) {
+ if (!ghes_proc(ghes))
+ ret = NOTIFY_OK;
+ }
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static struct notifier_block ghes_notifier_sea = {
+ .notifier_call = ghes_notify_sea,
+};
+
+static int ghes_sea_add(struct ghes *ghes)
+{
+ mutex_lock(&ghes_list_mutex);
+ if (list_empty(&ghes_sea))
+ sea_register_handler_chain(&ghes_notifier_sea);
+ list_add_rcu(&ghes->list, &ghes_sea);
+ mutex_unlock(&ghes_list_mutex);
+ return 0;
+}
+
+static void ghes_sea_remove(struct ghes *ghes)
+{
+ mutex_lock(&ghes_list_mutex);
+ list_del_rcu(&ghes->list);
+ if (list_empty(&ghes_sea))
+ sea_unregister_handler_chain(&ghes_notifier_sea);
+ mutex_unlock(&ghes_list_mutex);
+}
+#else /* CONFIG_HAVE_ACPI_APEI_SEA */
+static inline int ghes_sea_add(struct ghes *ghes)
+{
+ pr_err(GHES_PFX "ID: %d, trying to add SEA notification which is not supported\n",
+ ghes->generic->header.source_id);
+ return -ENOTSUPP;
+}
+
+static inline void ghes_sea_remove(struct ghes *ghes)
+{
+ pr_err(GHES_PFX "ID: %d, trying to remove SEA notification which is not supported\n",
+ ghes->generic->header.source_id);
+}
+#endif /* CONFIG_HAVE_ACPI_APEI_SEA */
+
#ifdef CONFIG_HAVE_ACPI_APEI_NMI
/*
* printk is not safe in NMI context. So in NMI handler, we allocate
@@ -1033,6 +1093,14 @@ static int ghes_probe(struct platform_device *ghes_dev)
case ACPI_HEST_NOTIFY_EXTERNAL:
case ACPI_HEST_NOTIFY_SCI:
break;
+ case ACPI_HEST_NOTIFY_SEA:
+ if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_SEA)) {
+ pr_warn(GHES_PFX "Generic hardware error source: %d notified via SEA is not supported\n",
+ generic->header.source_id);
+ rc = -ENOTSUPP;
+ goto err;
+ }
+ break;
case ACPI_HEST_NOTIFY_NMI:
if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_NMI)) {
pr_warn(GHES_PFX "Generic hardware error source: %d notified via NMI interrupt is not supported!\n",
@@ -1044,6 +1112,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
pr_warning(GHES_PFX "Generic hardware error source: %d notified via local interrupt is not supported!\n",
generic->header.source_id);
goto err;
+ case ACPI_HEST_NOTIFY_GPIO:
+ case ACPI_HEST_NOTIFY_SEI:
+ case ACPI_HEST_NOTIFY_GSIV:
+ pr_warn(GHES_PFX "Generic hardware error source: %d notified via notification type %u is not supported\n",
+ generic->header.source_id, generic->header.source_id);
+ rc = -ENOTSUPP;
+ goto err;
default:
pr_warning(FW_WARN GHES_PFX "Unknown notification type: %u for generic hardware error source: %d\n",
generic->notify.type, generic->header.source_id);
@@ -1098,6 +1173,11 @@ static int ghes_probe(struct platform_device *ghes_dev)
list_add_rcu(&ghes->list, &ghes_sci);
mutex_unlock(&ghes_list_mutex);
break;
+ case ACPI_HEST_NOTIFY_SEA:
+ rc = ghes_sea_add(ghes);
+ if (rc)
+ goto err_edac_unreg;
+ break;
case ACPI_HEST_NOTIFY_NMI:
ghes_nmi_add(ghes);
break;
@@ -1140,6 +1220,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
unregister_acpi_hed_notifier(&ghes_notifier_sci);
mutex_unlock(&ghes_list_mutex);
break;
+ case ACPI_HEST_NOTIFY_SEA:
+ ghes_sea_remove(ghes);
+ break;
case ACPI_HEST_NOTIFY_NMI:
ghes_nmi_remove(ghes);
break;
--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2016-04-06 15:14:11

by Tyler Baicar

[permalink] [raw]
Subject: [PATCH V2 4/9] arm64: exception: handle Synchronous External Abort

SEA exceptions are often caused by an uncorrected hardware
error, and are handled when data abort and instruction abort
exception classes have specific values for their Fault Status
Code.
When SEA occurs, before killing the process, go through
the handlers registered in the notification list.
Update fault_info[] with specific SEA faults so that the
new SEA handler is used.

Signed-off-by: Jonathan (Zhixiong) Zhang <[email protected]>
Signed-off-by: Tyler Baicar <[email protected]>
Signed-off-by: Naveen Kaje <[email protected]>
---
arch/arm64/include/asm/system_misc.h | 13 ++++++++
arch/arm64/mm/fault.c | 58 +++++++++++++++++++++++++++++-------
2 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/system_misc.h b/arch/arm64/include/asm/system_misc.h
index 57f110b..90daf4a 100644
--- a/arch/arm64/include/asm/system_misc.h
+++ b/arch/arm64/include/asm/system_misc.h
@@ -64,4 +64,17 @@ extern void (*arm_pm_restart)(enum reboot_mode reboot_mode, const char *cmd);

#endif /* __ASSEMBLY__ */

+/*
+ * The functions below are used to register and unregister callbacks
+ * that are to be invoked when a Synchronous External Abort (SEA)
+ * occurs. An SEA is raised by certain fault status codes that have
+ * either data or instruction abort as the exception class, and
+ * callbacks may be registered to parse or handle such hardware errors.
+ *
+ * Registered callbacks are run in an interrupt/atomic context. They
+ * are not allowed to block or sleep.
+ */
+int sea_register_handler_chain(struct notifier_block *nb);
+void sea_unregister_handler_chain(struct notifier_block *nb);
+
#endif /* __ASM_SYSTEM_MISC_H */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 95df28b..04a25e8 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -39,6 +39,22 @@
#include <asm/pgtable.h>
#include <asm/tlbflush.h>

+/*
+ * GHES SEA handler code may register a notifier call here to
+ * handle HW error record passed from platform.
+ */
+static ATOMIC_NOTIFIER_HEAD(sea_handler_chain);
+
+int sea_register_handler_chain(struct notifier_block *nb)
+{
+ return atomic_notifier_chain_register(&sea_handler_chain, nb);
+}
+
+void sea_unregister_handler_chain(struct notifier_block *nb)
+{
+ atomic_notifier_chain_unregister(&sea_handler_chain, nb);
+}
+
static const char *fault_name(unsigned int esr);

/*
@@ -395,6 +411,28 @@ static int do_bad(unsigned long addr, unsigned int esr, struct pt_regs *regs)
return 1;
}

+/*
+ * This abort handler deals with Synchronous External Abort.
+ * It calls notifiers, and then returns "fault".
+ */
+static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
+{
+ struct siginfo info;
+
+ atomic_notifier_call_chain(&sea_handler_chain, 0, NULL);
+
+ pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
+ fault_name(esr), esr, addr);
+
+ info.si_signo = SIGBUS;
+ info.si_errno = 0;
+ info.si_code = 0;
+ info.si_addr = (void __user *)addr;
+ arm64_notify_die("", regs, &info, esr);
+
+ return 0;
+}
+
static struct fault_info {
int (*fn)(unsigned long addr, unsigned int esr, struct pt_regs *regs);
int sig;
@@ -417,22 +455,22 @@ static struct fault_info {
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 1 permission fault" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 2 permission fault" },
{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 3 permission fault" },
- { do_bad, SIGBUS, 0, "synchronous external abort" },
+ { do_sea, SIGBUS, 0, "synchronous external abort" },
{ do_bad, SIGBUS, 0, "unknown 17" },
{ do_bad, SIGBUS, 0, "unknown 18" },
{ do_bad, SIGBUS, 0, "unknown 19" },
- { do_bad, SIGBUS, 0, "synchronous abort (translation table walk)" },
- { do_bad, SIGBUS, 0, "synchronous abort (translation table walk)" },
- { do_bad, SIGBUS, 0, "synchronous abort (translation table walk)" },
- { do_bad, SIGBUS, 0, "synchronous abort (translation table walk)" },
- { do_bad, SIGBUS, 0, "synchronous parity error" },
+ { do_sea, SIGBUS, 0, "level 0 SEA (trans tbl walk)" },
+ { do_sea, SIGBUS, 0, "level 1 SEA (trans tbl walk)" },
+ { do_sea, SIGBUS, 0, "level 2 SEA (trans tbl walk)" },
+ { do_sea, SIGBUS, 0, "level 3 SEA (trans tbl walk)" },
+ { do_sea, SIGBUS, 0, "synchronous parity or ECC err" },
{ do_bad, SIGBUS, 0, "unknown 25" },
{ do_bad, SIGBUS, 0, "unknown 26" },
{ do_bad, SIGBUS, 0, "unknown 27" },
- { do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
- { do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
- { do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
- { do_bad, SIGBUS, 0, "synchronous parity error (translation table walk)" },
+ { do_sea, SIGBUS, 0, "level 0 synch parity error" },
+ { do_sea, SIGBUS, 0, "level 1 synch parity error" },
+ { do_sea, SIGBUS, 0, "level 2 synch parity error" },
+ { do_sea, SIGBUS, 0, "level 3 synch parity error" },
{ do_bad, SIGBUS, 0, "unknown 32" },
{ do_alignment_fault, SIGBUS, BUS_ADRALN, "alignment fault" },
{ do_bad, SIGBUS, 0, "unknown 34" },
--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2016-04-06 15:14:09

by Tyler Baicar

[permalink] [raw]
Subject: [PATCH V2 3/9] efi: parse ARMv8 processor error

Add support for ARMv8 Common Platform Error Record (CPER).
UEFI 2.6 specification adds support for ARMv8 specific
processor error information to be reported as part of the
CPER records. This provides more detail on for processor error logs.

Signed-off-by: Jonathan (Zhixiong) Zhang <[email protected]>
Signed-off-by: Tyler Baicar <[email protected]>
Signed-off-by: Naveen Kaje <[email protected]>
---
drivers/firmware/efi/cper.c | 138 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/cper.h | 72 +++++++++++++++++++++++
2 files changed, 210 insertions(+)

diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 23f62962..40373e7 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -109,12 +109,15 @@ void cper_print_bits(const char *pfx, unsigned int bits,
static const char * const proc_type_strs[] = {
"IA32/X64",
"IA64",
+ "ARMv8",
};

static const char * const proc_isa_strs[] = {
"IA32",
"IA64",
"X64",
+ "ARM A32/T32",
+ "ARM A64",
};

static const char * const proc_error_type_strs[] = {
@@ -183,6 +186,129 @@ static void cper_print_proc_generic(const char *pfx,
printk("%s""IP: 0x%016llx\n", pfx, proc->ip);
}

+static void cper_print_proc_armv8(const char *pfx,
+ const struct cper_sec_proc_armv8 *proc)
+{
+ int i, len;
+ struct cper_armv8_err_info *err_info;
+ __u64 *qword = NULL;
+ char newpfx[64];
+
+ printk("%ssection length: %d\n", pfx, proc->section_length);
+ printk("%sMIDR: 0x%016llx\n", pfx, proc->midr);
+
+ len = proc->section_length - (sizeof(*proc) +
+ proc->err_info_num * (sizeof(*err_info)));
+ if (len < 0) {
+ printk("%ssection length is too small.\n", pfx);
+ printk("%sERR_INFO_NUM is %d.\n", pfx, proc->err_info_num);
+ return;
+ }
+
+ if (proc->validation_bits & CPER_ARMV8_VALID_MPIDR)
+ printk("%sMPIDR: 0x%016llx\n", pfx, proc->mpidr);
+ if (proc->validation_bits & CPER_ARMV8_VALID_AFFINITY_LEVEL)
+ printk("%serror affinity level: %d\n", pfx,
+ proc->affinity_level);
+ if (proc->validation_bits & CPER_ARMV8_VALID_RUNNING_STATE) {
+ printk("%srunning state: %d\n", pfx, proc->running_state);
+ printk("%sPSCI state: %d\n", pfx, proc->psci_state);
+ }
+
+ snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
+
+ err_info = (struct cper_armv8_err_info *)(proc + 1);
+ for (i = 0; i < proc->err_info_num; i++) {
+ printk("%sError info structure %d:\n", pfx, i);
+ printk("%sversion:%d\n", newpfx, err_info->version);
+ printk("%slength:%d\n", newpfx, err_info->length);
+ if (err_info->validation_bits &
+ CPER_ARMV8_INFO_VALID_MULTI_ERR) {
+ if (err_info->multiple_error == 0)
+ printk("%ssingle error.\n", newpfx);
+ else if (err_info->multiple_error == 1)
+ printk("%smultiple errors.\n", newpfx);
+ else
+ printk("%smultiple errors count:%d.\n",
+ newpfx, err_info->multiple_error);
+ }
+ if (err_info->validation_bits & CPER_ARMV8_INFO_VALID_FLAGS) {
+ if (err_info->flags & CPER_ARMV8_INFO_FLAGS_FIRST)
+ printk("%sfirst error captured.\n", newpfx);
+ if (err_info->flags & CPER_ARMV8_INFO_FLAGS_LAST)
+ printk("%slast error captured.\n", newpfx);
+ if (err_info->flags & CPER_ARMV8_INFO_FLAGS_PROPAGATED)
+ printk("%spropagated error captured.\n",
+ newpfx);
+ }
+ printk("%serror_type: %d, %s\n", newpfx, err_info->type,
+ err_info->type < ARRAY_SIZE(proc_error_type_strs) ?
+ proc_error_type_strs[err_info->type] : "unknown");
+ printk("%serror_info: 0x%016llx\n", newpfx,
+ err_info->error_info);
+ if (err_info->validation_bits & CPER_ARMV8_INFO_VALID_VIRT_ADDR)
+ printk("%svirtual fault address: 0x%016llx\n",
+ newpfx, err_info->virt_fault_addr);
+ if (err_info->validation_bits &
+ CPER_ARMV8_INFO_VALID_PHYSICAL_ADDR)
+ printk("%sphysical fault address: 0x%016llx\n",
+ newpfx, err_info->physical_fault_addr);
+ err_info += 1;
+ }
+
+ if (len < sizeof(*qword) && proc->context_info_num > 0) {
+ printk("%ssection length is too small.\n", pfx);
+ printk("%sCTX_INFO_NUM is %d.\n", pfx, proc->context_info_num);
+ return;
+ }
+ for (i = 0; i < proc->context_info_num; i++) {
+ qword = (__u64 *)err_info;
+ printk("%sProcessor context info structure %d:\n", pfx, i);
+ printk("%sException level %d.\n", newpfx,
+ (int)((*qword & CPER_ARMV8_CTX_EL_MASK)
+ >> CPER_ARMV8_CTX_EL_SHIFT));
+ printk("%sSecure bit: %d.\n", newpfx,
+ (int)((*qword & CPER_ARMV8_CTX_NS_MASK)
+ >> CPER_ARMV8_CTX_NS_SHIFT));
+ if ((*qword & CPER_ARMV8_CTX_TYPE_MASK) == 0) {
+ if (len < CPER_AARCH32_CTX_LEN) {
+ printk("%ssection length is too small.\n", pfx);
+ printk("%sremaining length is %d.\n", pfx, len);
+ return;
+ }
+ printk("%sAArch32 execution context.\n", newpfx);
+ qword++;
+ print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4,
+ qword, CPER_AARCH32_CTX_LEN - sizeof(*qword),
+ 0);
+ len -= CPER_AARCH32_CTX_LEN;
+ } else if ((*qword & CPER_ARMV8_CTX_TYPE_MASK) == 1) {
+ if (len < CPER_AARCH64_CTX_LEN) {
+ printk("%ssection length is too small.\n", pfx);
+ printk("%sremaining length is %d.\n", pfx, len);
+ return;
+ }
+ printk("%sAArch64 execution context.\n", newpfx);
+ qword++;
+ print_hex_dump(newpfx, "", DUMP_PREFIX_OFFSET, 16, 4,
+ qword, CPER_AARCH64_CTX_LEN - sizeof(*qword),
+ 0);
+ len -= CPER_AARCH64_CTX_LEN;
+ } else {
+ printk("%scontext type is incorrect 0x%016llx.\n",
+ pfx, *qword);
+ return;
+ }
+ }
+
+ if (len > 0) {
+ printk("%sVendor specific error info has %d bytes.\n", pfx,
+ len);
+ print_hex_dump(pfx, "", DUMP_PREFIX_OFFSET, 16, 4, qword, len,
+ 0);
+ }
+}
+
static const char * const mem_err_type_strs[] = {
"unknown",
"no error",
@@ -446,6 +572,18 @@ static void cper_estatus_print_section(
cper_print_pcie(newpfx, pcie, gdata);
else
goto err_section_too_small;
+ } else if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_ARMV8)) {
+ struct cper_sec_proc_armv8 *armv8_err;
+
+ if (gdata_v3)
+ armv8_err = (void *)(gdata_v3 + 1);
+ else
+ armv8_err = (void *)(gdata + 1);
+ printk("%ssection_type: ARMv8 processor error\n", newpfx);
+ if (gdata->error_data_length >= sizeof(*armv8_err))
+ cper_print_proc_armv8(newpfx, armv8_err);
+ else
+ goto err_section_too_small;
} else
printk("%s""section type: unknown, %pUl\n", newpfx, sec_type);

diff --git a/include/linux/cper.h b/include/linux/cper.h
index dcacb1a..d1efbef 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -162,6 +162,11 @@ enum {
* corrective action before the data is consumed
*/
#define CPER_SEC_LATENT_ERROR 0x0020
+/*
+ * If set, the section contains an error that is propagated. The error
+ * did not originate from the hardware associated with this section.
+ */
+#define CPER_SEC_PROPAGATED 0x0040

/*
* Section type definitions, used in section_type field in struct
@@ -180,6 +185,10 @@ enum {
#define CPER_SEC_PROC_IPF \
UUID_LE(0xE429FAF1, 0x3CB7, 0x11D4, 0x0B, 0xCA, 0x07, 0x00, \
0x80, 0xC7, 0x3C, 0x88, 0x81)
+/* Processor Specific: ARMv8 */
+#define CPER_SEC_PROC_ARMV8 \
+ UUID_LE(0xE19E3D16, 0xBC11, 0x11E4, 0x9C, 0xAA, 0xC2, 0x05, \
+ 0x1D, 0x5D, 0x46, 0xB0)
/* Platform Memory */
#define CPER_SEC_PLATFORM_MEM \
UUID_LE(0xA5BC1114, 0x6F64, 0x4EDE, 0xB8, 0x63, 0x3E, 0x83, \
@@ -255,6 +264,34 @@ enum {

#define CPER_PCIE_SLOT_SHIFT 3

+#define CPER_ARMV8_ERR_INFO_NUM_MASK 0x00000000000000FF
+#define CPER_ARMV8_CTX_INFO_NUM_MASK 0x0000000000FFFF00
+#define CPER_ARMV8_CTX_INFO_NUM_SHIFT 8
+
+#define CPER_ARMV8_VALID_MPIDR 0x00000001
+#define CPER_ARMV8_VALID_AFFINITY_LEVEL 0x00000002
+#define CPER_ARMV8_VALID_RUNNING_STATE 0x00000004
+#define CPER_ARMV8_VALID_VENDOR_INFO 0x00000008
+
+#define CPER_ARMV8_INFO_VALID_MULTI_ERR 0x0001
+#define CPER_ARMV8_INFO_VALID_FLAGS 0x0002
+#define CPER_ARMV8_INFO_VALID_ERR_INFO 0x0004
+#define CPER_ARMV8_INFO_VALID_VIRT_ADDR 0x0008
+#define CPER_ARMV8_INFO_VALID_PHYSICAL_ADDR 0x0010
+
+#define CPER_ARMV8_INFO_FLAGS_FIRST 0x0001
+#define CPER_ARMV8_INFO_FLAGS_LAST 0x0002
+#define CPER_ARMV8_INFO_FLAGS_PROPAGATED 0x0004
+
+#define CPER_AARCH64_CTX_LEN 368
+#define CPER_AARCH32_CTX_LEN 256
+
+#define CPER_ARMV8_CTX_TYPE_MASK 0x000000000000000F
+#define CPER_ARMV8_CTX_EL_MASK 0x0000000000000070
+#define CPER_ARMV8_CTX_NS_MASK 0x0000000000000080
+#define CPER_ARMV8_CTX_EL_SHIFT 4
+#define CPER_ARMV8_CTX_NS_SHIFT 7
+
/*
* All tables and structs must be byte-packed to match CPER
* specification, since the tables are provided by the system BIOS
@@ -340,6 +377,41 @@ struct cper_ia_proc_ctx {
__u64 mm_reg_addr;
};

+/* ARMv8 Processor Error Section */
+struct cper_sec_proc_armv8 {
+ __u32 validation_bits;
+ __u16 err_info_num; /* Number of Processor Error Info */
+ __u16 context_info_num; /* Number of Processor Context Info Records*/
+ __u32 section_length;
+ __u8 affinity_level;
+ __u8 reserved[3]; /* must be zero */
+ __u64 mpidr;
+ __u64 midr;
+ __u32 running_state; /* Bit 0 set - Processor running. PSCI = 0 */
+ __u32 psci_state;
+};
+
+/* ARMv8 Processor Error Information Structure */
+struct cper_armv8_err_info {
+ __u8 version;
+ __u8 length;
+ __u16 validation_bits;
+ __u8 type;
+ __u16 multiple_error;
+ __u8 flags;
+ __u64 error_info;
+ __u64 virt_fault_addr;
+ __u64 physical_fault_addr;
+};
+
+/* ARMv8 AARCH64 Processor Context Information Structure */
+struct cper_armv8_aarch64_ctx {
+ __u8 type_el_ns;
+ __u8 reserved[7]; /* must be zero */
+ __u8 gpr[288];
+ __u8 spr[68];
+};
+
/* Old Memory Error Section UEFI 2.1, 2.2 */
struct cper_sec_mem_err_old {
__u64 validation_bits;
--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2016-04-06 15:36:53

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH V2 5/9] arm64: exception: handle instruction abort at current EL

On 06/04/16 16:12, Tyler Baicar wrote:
> Add a handler for instruction aborts at the current EL
> (ESR_ELx_EC_IABT_CUR) so they are no longer handled in el1_inv.
> This allows firmware first handling for possible SEA
> (Synchronous External Abort) caused instruction abort at
> current EL.
>
> Signed-off-by: Tyler Baicar <[email protected]>
> Signed-off-by: Naveen Kaje <[email protected]>
> ---
> arch/arm64/kernel/entry.S | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 12e8d2b..f257856 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -336,6 +336,8 @@ el1_sync:
> lsr x24, x1, #ESR_ELx_EC_SHIFT // exception class
> cmp x24, #ESR_ELx_EC_DABT_CUR // data abort in EL1
> b.eq el1_da
> + cmp x24, #ESR_ELx_EC_IABT_CUR // instruction abort in EL1
> + b.eq el1_ia
> cmp x24, #ESR_ELx_EC_SYS64 // configurable trap
> b.eq el1_undef
> cmp x24, #ESR_ELx_EC_SP_ALIGN // stack alignment exception
> @@ -363,6 +365,23 @@ el1_da:
> // disable interrupts before pulling preserved data off the stack
> disable_irq
> kernel_exit 1
> +el1_ia:
> + /*
> + * Instruction abort handling
> + */
> + mrs x0, far_el1
> + enable_dbg
> + // re-enable interrupts if they were enabled in the aborted context
> + tbnz x23, #7, 1f // PSR_I_BIT
> + enable_irq
> +1:
> + orr x1, x1, #1 << 24 // use reserved ISS bit for instruction aborts
> + mov x2, sp // struct pt_regs
> + bl do_mem_abort
> +
> + // disable interrupts before pulling preserved data off the stack
> + disable_irq
> + kernel_exit 1
> el1_sp_pc:
> /*
> * Stack or PC alignment exception handling
>

What happens if you were running at EL2 when this faults gets injected?
It looks like KVM needs something similar, doesn't it?

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2016-04-06 15:53:45

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH V2 1/9] acpi: apei: read ack upon ghes record consumption

On 06/04/16 16:12, Tyler Baicar wrote:
> A RAS (Reliability, Availability, Serviceability) controller
> may be a separate processor running in parallel with OS
> execution, and may generate error records for consumption by
> the OS. If the RAS controller produces multiple error records,
> then they may be overwritten before the OS has consumed them.
>
> The Generic Hardware Error Source (GHES) v2 structure
> introduces the capability for the OS to acknowledge the
> consumption of the error record generated by the RAS
> controller. A RAS controller supporting GHESv2 shall wait for
> the acknowledgment before writing a new error record, thus
> eliminating the race condition.
>
> Signed-off-by: Jonathan (Zhixiong) Zhang <[email protected]>
> Signed-off-by: Richard Ruigrok <[email protected]>
> Signed-off-by: Tyler Baicar <[email protected]>
> Signed-off-by: Naveen Kaje <[email protected]>


> return ERR_PTR(-ENOMEM);
> +
> + hest_hdr = (struct acpi_hest_header *)generic;
> + if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR_V2) {
> + ghes->generic_v2 = (struct acpi_hest_generic_v2 *)generic;
> + rc = apei_map_generic_address(
> + &ghes->generic_v2->read_ack_register);
> + if (rc)
> + goto err_unmap;
> + } else
> + ghes->generic_v2 = NULL;

...

> err_unmap:
> apei_unmap_generic_address(&generic->error_status_address);
> + if (ghes->generic_v2)
> + apei_unmap_generic_address(
> + &ghes->generic_v2->read_ack_register);
> err_free:
> kfree(ghes);
> return ERR_PTR(rc);
> @@ -279,6 +295,9 @@ static void ghes_fini(struct ghes *ghes)
> {
> kfree(ghes->estatus);
> apei_unmap_generic_address(&ghes->generic->error_status_address);
> + if (ghes->generic_v2)
> + apei_unmap_generic_address(
> + &ghes->generic_v2->error_status_address);

I am not familiar with the APEI code, but is this error_status_address or
read_ack_register ? We don't seem to be mapping error_status_address in generic_v2 header
which is introduced in this patch ? Am I missing something ?

Suzuki

2016-04-06 20:43:12

by Tyler Baicar

[permalink] [raw]
Subject: Re: [PATCH V2 1/9] acpi: apei: read ack upon ghes record consumption

Hello Suzuki,

On 4/6/2016 9:53 AM, Suzuki K Poulose wrote:
> On 06/04/16 16:12, Tyler Baicar wrote:
>> + hest_hdr = (struct acpi_hest_header *)generic;
>> + if (hest_hdr->type == ACPI_HEST_TYPE_GENERIC_ERROR_V2) {
>> + ghes->generic_v2 = (struct acpi_hest_generic_v2 *)generic;
>> + rc = apei_map_generic_address(
>> + &ghes->generic_v2->read_ack_register);
>> + if (rc)
>> + goto err_unmap;
>> + } else
>> + ghes->generic_v2 = NULL;
> ...
>> err_unmap:
>> apei_unmap_generic_address(&generic->error_status_address);
>> + if (ghes->generic_v2)
>> + apei_unmap_generic_address(
>> + &ghes->generic_v2->read_ack_register);
>> err_free:
>> kfree(ghes);
>> return ERR_PTR(rc);
>> @@ -279,6 +295,9 @@ static void ghes_fini(struct ghes *ghes)
>> {
>> kfree(ghes->estatus);
>> apei_unmap_generic_address(&ghes->generic->error_status_address);
>> + if (ghes->generic_v2)
>> + apei_unmap_generic_address(
>> + &ghes->generic_v2->error_status_address);
>
> I am not familiar with the APEI code, but is this error_status_address or
> read_ack_register ? We don't seem to be mapping error_status_address
> in generic_v2 header
> which is introduced in this patch ? Am I missing something ?
>
> Suzuki

Thank you for your feedback. This does look like an error; it should be
&ghes->generic_v2->read_ack_register. The variable
&ghes->generic_v2->error_status_address is the same as
&ghes->generic->error_status_address which is unmapped on the line above
the if statement here.

Thanks,
Tyler

--
Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2016-04-06 21:36:07

by Tyler Baicar

[permalink] [raw]
Subject: Re: [PATCH V2 5/9] arm64: exception: handle instruction abort at current EL

Hello Marc,

On 4/6/2016 9:36 AM, Marc Zyngier wrote:
> On 06/04/16 16:12, Tyler Baicar wrote:
>> Add a handler for instruction aborts at the current EL
>> (ESR_ELx_EC_IABT_CUR) so they are no longer handled in el1_inv.
>> This allows firmware first handling for possible SEA
>> (Synchronous External Abort) caused instruction abort at
>> current EL.
>>
>> Signed-off-by: Tyler Baicar <[email protected]>
>> Signed-off-by: Naveen Kaje <[email protected]>
>> ---
>> arch/arm64/kernel/entry.S | 19 +++++++++++++++++++
>> 1 file changed, 19 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>> index 12e8d2b..f257856 100644
>> --- a/arch/arm64/kernel/entry.S
>> +++ b/arch/arm64/kernel/entry.S
>> @@ -336,6 +336,8 @@ el1_sync:
>> lsr x24, x1, #ESR_ELx_EC_SHIFT // exception class
>> cmp x24, #ESR_ELx_EC_DABT_CUR // data abort in EL1
>> b.eq el1_da
>> + cmp x24, #ESR_ELx_EC_IABT_CUR // instruction abort in EL1
>> + b.eq el1_ia
>> cmp x24, #ESR_ELx_EC_SYS64 // configurable trap
>> b.eq el1_undef
>> cmp x24, #ESR_ELx_EC_SP_ALIGN // stack alignment exception
>> @@ -363,6 +365,23 @@ el1_da:
>> // disable interrupts before pulling preserved data off the stack
>> disable_irq
>> kernel_exit 1
>> +el1_ia:
>> + /*
>> + * Instruction abort handling
>> + */
>> + mrs x0, far_el1
>> + enable_dbg
>> + // re-enable interrupts if they were enabled in the aborted context
>> + tbnz x23, #7, 1f // PSR_I_BIT
>> + enable_irq
>> +1:
>> + orr x1, x1, #1 << 24 // use reserved ISS bit for instruction aborts
>> + mov x2, sp // struct pt_regs
>> + bl do_mem_abort
>> +
>> + // disable interrupts before pulling preserved data off the stack
>> + disable_irq
>> + kernel_exit 1
>> el1_sp_pc:
>> /*
>> * Stack or PC alignment exception handling
>>
> What happens if you were running at EL2 when this faults gets injected?
> It looks like KVM needs something similar, doesn't it?
>
> Thanks,
>
> M.
Thank you for your comment. I don't think this case is possible, or at
least the current KVM code suggests that this case should never happen.
In the EL1 code, we get to this case via the vector:

ventry el1_sync // Synchronous EL1h

The EL2 KVM equivalent appears to be in arch/arm64/kvm/hyp-entry.S and is:

ventry el2h_sync_invalid // Synchronous EL2h

This vector is defined as an invalid_vector and has a comment suggesting
that it should never happen:

/* None of these should ever happen */
...
invalid_vector el2h_sync_invalid

Please correct me if I am wrong, but it looks like this case should not
be possible.

Thanks,
Tyler

--
Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2016-04-07 07:54:27

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH V2 5/9] arm64: exception: handle instruction abort at current EL

On Wed, 6 Apr 2016 15:36:00 -0600
"Baicar, Tyler" <[email protected]> wrote:

Hi Tyler,

> Hello Marc,
>
> On 4/6/2016 9:36 AM, Marc Zyngier wrote:
> > On 06/04/16 16:12, Tyler Baicar wrote:
> >> Add a handler for instruction aborts at the current EL
> >> (ESR_ELx_EC_IABT_CUR) so they are no longer handled in el1_inv.
> >> This allows firmware first handling for possible SEA
> >> (Synchronous External Abort) caused instruction abort at
> >> current EL.
> >>
> >> Signed-off-by: Tyler Baicar <[email protected]>
> >> Signed-off-by: Naveen Kaje <[email protected]>
> >> ---
> >> arch/arm64/kernel/entry.S | 19 +++++++++++++++++++
> >> 1 file changed, 19 insertions(+)
> >>
> >> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> >> index 12e8d2b..f257856 100644
> >> --- a/arch/arm64/kernel/entry.S
> >> +++ b/arch/arm64/kernel/entry.S
> >> @@ -336,6 +336,8 @@ el1_sync:
> >> lsr x24, x1, #ESR_ELx_EC_SHIFT // exception class
> >> cmp x24, #ESR_ELx_EC_DABT_CUR // data abort in EL1
> >> b.eq el1_da
> >> + cmp x24, #ESR_ELx_EC_IABT_CUR // instruction abort in EL1
> >> + b.eq el1_ia
> >> cmp x24, #ESR_ELx_EC_SYS64 // configurable trap
> >> b.eq el1_undef
> >> cmp x24, #ESR_ELx_EC_SP_ALIGN // stack alignment exception
> >> @@ -363,6 +365,23 @@ el1_da:
> >> // disable interrupts before pulling preserved data off the stack
> >> disable_irq
> >> kernel_exit 1
> >> +el1_ia:
> >> + /*
> >> + * Instruction abort handling
> >> + */
> >> + mrs x0, far_el1
> >> + enable_dbg
> >> + // re-enable interrupts if they were enabled in the aborted context
> >> + tbnz x23, #7, 1f // PSR_I_BIT
> >> + enable_irq
> >> +1:
> >> + orr x1, x1, #1 << 24 // use reserved ISS bit for instruction aborts
> >> + mov x2, sp // struct pt_regs
> >> + bl do_mem_abort
> >> +
> >> + // disable interrupts before pulling preserved data off the stack
> >> + disable_irq
> >> + kernel_exit 1
> >> el1_sp_pc:
> >> /*
> >> * Stack or PC alignment exception handling
> >>
> > What happens if you were running at EL2 when this faults gets injected?
> > It looks like KVM needs something similar, doesn't it?
> >
> > Thanks,
> >
> > M.
> Thank you for your comment. I don't think this case is possible, or at
> least the current KVM code suggests that this case should never happen.
> In the EL1 code, we get to this case via the vector:
>
> ventry el1_sync // Synchronous EL1h
>
> The EL2 KVM equivalent appears to be in arch/arm64/kvm/hyp-entry.S and is:
>
> ventry el2h_sync_invalid // Synchronous EL2h
>
> This vector is defined as an invalid_vector and has a comment suggesting
> that it should never happen:
>
> /* None of these should ever happen */
> ...
> invalid_vector el2h_sync_invalid
>
> Please correct me if I am wrong, but it looks like this case should not
> be possible.

This comments really means that we shouldn't ever take any of these
exception. If we do, we'll crash and burn (just like the kernel didn't
expect to take an instruction fault from the kernel itself, up until
this patch).

I expect that the firmware does inject the fault into the exception
level it has preempted. So let me turn the question the other way
around: what guarantees that we will never have to handle such a fault
at EL2?

As a corollary, what happens when the firmware injects a fault
triggered by a VM running at EL1, under the control of a hypervisor
running at EL2? There should be some form of exception delegation to
the hypervisor, which makes the lack of handling at EL2 even more
worrying.

Thanks,

M.
--
Jazz is not dead. It just smells funny.

2016-04-11 22:57:33

by Abdulhamid, Harb

[permalink] [raw]
Subject: Re: [PATCH V2 5/9] arm64: exception: handle instruction abort at current EL

On 4/7/2016 3:54 AM, Marc Zyngier wrote:
> On Wed, 6 Apr 2016 15:36:00 -0600
> "Baicar, Tyler" <[email protected]> wrote:
>
> Hi Tyler,
>
>> Hello Marc,
>>
>> On 4/6/2016 9:36 AM, Marc Zyngier wrote:
>>> On 06/04/16 16:12, Tyler Baicar wrote:
>>>> Add a handler for instruction aborts at the current EL
>>>> (ESR_ELx_EC_IABT_CUR) so they are no longer handled in el1_inv.
>>>> This allows firmware first handling for possible SEA
>>>> (Synchronous External Abort) caused instruction abort at
>>>> current EL.
>>>>
>>>> Signed-off-by: Tyler Baicar <[email protected]>
>>>> Signed-off-by: Naveen Kaje <[email protected]>
>>>> ---
>>>> arch/arm64/kernel/entry.S | 19 +++++++++++++++++++
>>>> 1 file changed, 19 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>>>> index 12e8d2b..f257856 100644
>>>> --- a/arch/arm64/kernel/entry.S
>>>> +++ b/arch/arm64/kernel/entry.S
>>>> @@ -336,6 +336,8 @@ el1_sync:
>>>> lsr x24, x1, #ESR_ELx_EC_SHIFT // exception class
>>>> cmp x24, #ESR_ELx_EC_DABT_CUR // data abort in EL1
>>>> b.eq el1_da
>>>> + cmp x24, #ESR_ELx_EC_IABT_CUR // instruction abort in EL1
>>>> + b.eq el1_ia
>>>> cmp x24, #ESR_ELx_EC_SYS64 // configurable trap
>>>> b.eq el1_undef
>>>> cmp x24, #ESR_ELx_EC_SP_ALIGN // stack alignment exception
>>>> @@ -363,6 +365,23 @@ el1_da:
>>>> // disable interrupts before pulling preserved data off the stack
>>>> disable_irq
>>>> kernel_exit 1
>>>> +el1_ia:
>>>> + /*
>>>> + * Instruction abort handling
>>>> + */
>>>> + mrs x0, far_el1
>>>> + enable_dbg
>>>> + // re-enable interrupts if they were enabled in the aborted context
>>>> + tbnz x23, #7, 1f // PSR_I_BIT
>>>> + enable_irq
>>>> +1:
>>>> + orr x1, x1, #1 << 24 // use reserved ISS bit for instruction aborts
>>>> + mov x2, sp // struct pt_regs
>>>> + bl do_mem_abort
>>>> +
>>>> + // disable interrupts before pulling preserved data off the stack
>>>> + disable_irq
>>>> + kernel_exit 1
>>>> el1_sp_pc:
>>>> /*
>>>> * Stack or PC alignment exception handling
>>>>
>>> What happens if you were running at EL2 when this faults gets injected?
>>> It looks like KVM needs something similar, doesn't it?
>>>
>>> Thanks,
>>>
>>> M.
>> Thank you for your comment. I don't think this case is possible, or at
>> least the current KVM code suggests that this case should never happen.
>> In the EL1 code, we get to this case via the vector:
>>
>> ventry el1_sync // Synchronous EL1h
>>
>> The EL2 KVM equivalent appears to be in arch/arm64/kvm/hyp-entry.S and is:
>>
>> ventry el2h_sync_invalid // Synchronous EL2h
>>
>> This vector is defined as an invalid_vector and has a comment suggesting
>> that it should never happen:
>>
>> /* None of these should ever happen */
>> ...
>> invalid_vector el2h_sync_invalid
>>
>> Please correct me if I am wrong, but it looks like this case should not
>> be possible.
>
> This comments really means that we shouldn't ever take any of these
> exception. If we do, we'll crash and burn (just like the kernel didn't
> expect to take an instruction fault from the kernel itself, up until
> this patch).
>
> I expect that the firmware does inject the fault into the exception
> level it has preempted. So let me turn the question the other way
> around: what guarantees that we will never have to handle such a fault
> at EL2?
>

It is definitely possible to take an external abort (instruction or
data) as well as SError interrupts in EL2. One would expect that they
would be trapped in EL2 when running guest VMs.

However, this patch was not intended to address KVM APEI support at EL2
(at this point). The aim here was to enable APEI (namely firmware first
error handling support) in the host/root kernel.

The general idea of how APEI would work with Hypervisors may vary
depending on the specific Hypervisor (e.g. KVM, Xen, HyperV, VMWare,
etc.).

For example, if the Hypervisor (i.e. code running at EL2) traps SEI/SEA
exceptions (either during EL2 code execution or an SEI/SEA exception
encountered during guest VM execution), the Hypervisor may not have
built-in APEI support, or the ability to handle such faults directly.
One option is for the Hypervisor to forward or "replay" SEA/SEI
exceptions to the host/root kernel for handling of such exceptions. If
the root/host kernel happens to support APEI, the kernel will attempt to
leverage GHES information to identify the severity of the error, and if
possible, may attempt to recover from the error. Essentially, the final
decision on how to handle SEA/SEI faults falls on the root/host kernel.

Extending APEI support to KVM should be addressed in a separate
patchset, as the implication would go beyond just the EL2 exception
handlers we are referencing here. There would be much more work and
validation needed.

> As a corollary, what happens when the firmware injects a fault
> triggered by a VM running at EL1, under the control of a hypervisor
> running at EL2? There should be some form of exception delegation to
> the hypervisor, which makes the lack of handling at EL2 even more
> worrying.
>
> Thanks,
>
> M.
>

See above example. The Hypervisor could forward/replay such faults to
the root/host kernel (or DOM0 in the case of Xen).

Just a clarification on firmware injecting faults: The firmware does
not inject faults directly into a particular exception level. If
hardware error injection is supported, it will be at a particular
physical address in memory, possibly a specific cache line, or other
specific hardware component. For example, one could target a specific
exception level by injecting an error at an instruction address that is
known to run at EL2, but the fault injection itself does not usually
target exception levels.

Thanks,
Harb
--
Qualcomm Technologies, Inc.
on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2016-04-12 14:17:34

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH V2 5/9] arm64: exception: handle instruction abort at current EL

On 11/04/16 23:57, Abdulhamid, Harb wrote:
> On 4/7/2016 3:54 AM, Marc Zyngier wrote:
>> On Wed, 6 Apr 2016 15:36:00 -0600
>> "Baicar, Tyler" <[email protected]> wrote:
>>
>> Hi Tyler,
>>
>>> Hello Marc,
>>>
>>> On 4/6/2016 9:36 AM, Marc Zyngier wrote:
>>>> On 06/04/16 16:12, Tyler Baicar wrote:
>>>>> Add a handler for instruction aborts at the current EL
>>>>> (ESR_ELx_EC_IABT_CUR) so they are no longer handled in el1_inv.
>>>>> This allows firmware first handling for possible SEA
>>>>> (Synchronous External Abort) caused instruction abort at
>>>>> current EL.
>>>>>
>>>>> Signed-off-by: Tyler Baicar <[email protected]>
>>>>> Signed-off-by: Naveen Kaje <[email protected]>
>>>>> ---
>>>>> arch/arm64/kernel/entry.S | 19 +++++++++++++++++++
>>>>> 1 file changed, 19 insertions(+)
>>>>>
>>>>> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
>>>>> index 12e8d2b..f257856 100644
>>>>> --- a/arch/arm64/kernel/entry.S
>>>>> +++ b/arch/arm64/kernel/entry.S
>>>>> @@ -336,6 +336,8 @@ el1_sync:
>>>>> lsr x24, x1, #ESR_ELx_EC_SHIFT // exception class
>>>>> cmp x24, #ESR_ELx_EC_DABT_CUR // data abort in EL1
>>>>> b.eq el1_da
>>>>> + cmp x24, #ESR_ELx_EC_IABT_CUR // instruction abort in EL1
>>>>> + b.eq el1_ia
>>>>> cmp x24, #ESR_ELx_EC_SYS64 // configurable trap
>>>>> b.eq el1_undef
>>>>> cmp x24, #ESR_ELx_EC_SP_ALIGN // stack alignment exception
>>>>> @@ -363,6 +365,23 @@ el1_da:
>>>>> // disable interrupts before pulling preserved data off the stack
>>>>> disable_irq
>>>>> kernel_exit 1
>>>>> +el1_ia:
>>>>> + /*
>>>>> + * Instruction abort handling
>>>>> + */
>>>>> + mrs x0, far_el1
>>>>> + enable_dbg
>>>>> + // re-enable interrupts if they were enabled in the aborted context
>>>>> + tbnz x23, #7, 1f // PSR_I_BIT
>>>>> + enable_irq
>>>>> +1:
>>>>> + orr x1, x1, #1 << 24 // use reserved ISS bit for instruction aborts
>>>>> + mov x2, sp // struct pt_regs
>>>>> + bl do_mem_abort
>>>>> +
>>>>> + // disable interrupts before pulling preserved data off the stack
>>>>> + disable_irq
>>>>> + kernel_exit 1
>>>>> el1_sp_pc:
>>>>> /*
>>>>> * Stack or PC alignment exception handling
>>>>>
>>>> What happens if you were running at EL2 when this faults gets injected?
>>>> It looks like KVM needs something similar, doesn't it?
>>>>
>>>> Thanks,
>>>>
>>>> M.
>>> Thank you for your comment. I don't think this case is possible, or at
>>> least the current KVM code suggests that this case should never happen.
>>> In the EL1 code, we get to this case via the vector:
>>>
>>> ventry el1_sync // Synchronous EL1h
>>>
>>> The EL2 KVM equivalent appears to be in arch/arm64/kvm/hyp-entry.S and is:
>>>
>>> ventry el2h_sync_invalid // Synchronous EL2h
>>>
>>> This vector is defined as an invalid_vector and has a comment suggesting
>>> that it should never happen:
>>>
>>> /* None of these should ever happen */
>>> ...
>>> invalid_vector el2h_sync_invalid
>>>
>>> Please correct me if I am wrong, but it looks like this case should not
>>> be possible.
>>
>> This comments really means that we shouldn't ever take any of these
>> exception. If we do, we'll crash and burn (just like the kernel didn't
>> expect to take an instruction fault from the kernel itself, up until
>> this patch).
>>
>> I expect that the firmware does inject the fault into the exception
>> level it has preempted. So let me turn the question the other way
>> around: what guarantees that we will never have to handle such a fault
>> at EL2?
>>
>
> It is definitely possible to take an external abort (instruction or
> data) as well as SError interrupts in EL2. One would expect that they
> would be trapped in EL2 when running guest VMs.
>
> However, this patch was not intended to address KVM APEI support at EL2
> (at this point). The aim here was to enable APEI (namely firmware first
> error handling support) in the host/root kernel.

The problem is that if you enable it on the host, then you cannot ignore
the EL2 code (i.e. KVM). We need to at least be able to pass the fault
down to the host kernel, where we have the infrastructure to handle it.

> The general idea of how APEI would work with Hypervisors may vary
> depending on the specific Hypervisor (e.g. KVM, Xen, HyperV, VMWare,
> etc.).
>
> For example, if the Hypervisor (i.e. code running at EL2) traps SEI/SEA
> exceptions (either during EL2 code execution or an SEI/SEA exception
> encountered during guest VM execution), the Hypervisor may not have
> built-in APEI support, or the ability to handle such faults directly.
> One option is for the Hypervisor to forward or "replay" SEA/SEI
> exceptions to the host/root kernel for handling of such exceptions. If
> the root/host kernel happens to support APEI, the kernel will attempt to
> leverage GHES information to identify the severity of the error, and if
> possible, may attempt to recover from the error. Essentially, the final
> decision on how to handle SEA/SEI faults falls on the root/host kernel.
>
> Extending APEI support to KVM should be addressed in a separate
> patchset, as the implication would go beyond just the EL2 exception
> handlers we are referencing here. There would be much more work and
> validation needed.

I wouldn't be keen on seeing this series being merged without at least a
minimum amount of support at EL2 (making sure we don't explode). Having
the infrastructure to report the fault to a guest is a different issue,
and should indeed be addressed separately. But dealing with the EL2 part
of the host kernel should be taken care at the same time as the EL1 code.

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2016-04-14 10:22:54

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH V2 2/9] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1

On 06/04/16 16:12, Tyler Baicar wrote:
> Currently when a RAS error is reported it is not timestamped.
> The ACPI 6.1 spec adds the timestamp field to the generic error
> data entry v3 structure. The timestamp of when the firmware
> generated the error is now being reported.
>
> Signed-off-by: Jonathan (Zhixiong) Zhang <[email protected]>
> Signed-off-by: Richard Ruigrok <[email protected]>
> Signed-off-by: Tyler Baicar <[email protected]>
> Signed-off-by: Naveen Kaje <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 35 ++++++++++++--
> drivers/firmware/efi/cper.c | 111 +++++++++++++++++++++++++++++++++++++-------
> 2 files changed, 126 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 9b0543e..a6848706 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -419,7 +419,15 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
> int flags = -1;
> int sec_sev = ghes_severity(gdata->error_severity);
> struct cper_sec_mem_err *mem_err;
> - mem_err = (struct cper_sec_mem_err *)(gdata + 1);
> + struct acpi_hest_generic_data_v300 *gdata_v3 = NULL;
> +
> + if ((gdata->revision >> 8) >= 0x03)

Could we please make that a macro ? We seem to be using the check everywhere.

> + gdata_v3 = (struct acpi_hest_generic_data_v300 *)gdata;
> +
> + if (gdata_v3)
> + mem_err = (struct cper_sec_mem_err *)(gdata_v3 + 1);
> + else
> + mem_err = (struct cper_sec_mem_err *)(gdata + 1);
>
> if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
> return;
> @@ -449,14 +457,27 @@ static void ghes_do_proc(struct ghes *ghes,
> {
> int sev, sec_sev;
> struct acpi_hest_generic_data *gdata;
> + struct acpi_hest_generic_data_v300 *gdata_v3 = NULL;
> + uuid_le sec_type;
>
> sev = ghes_severity(estatus->error_severity);
> apei_estatus_for_each_section(estatus, gdata) {
> sec_sev = ghes_severity(gdata->error_severity);
> - if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
> + sec_type = *(uuid_le *)gdata->section_type;
> +
> + if ((gdata->revision >> 8) >= 0x03)
> + gdata_v3 = (struct acpi_hest_generic_data_v300 *)gdata;
> +
> + if (!uuid_le_cmp(sec_type,
> CPER_SEC_PLATFORM_MEM)) {
> struct cper_sec_mem_err *mem_err;
> - mem_err = (struct cper_sec_mem_err *)(gdata+1);
> +
> + if (gdata_v3)
> + mem_err = (struct cper_sec_mem_err *)
> + (gdata_v3 + 1);
> + else
> + mem_err = (struct cper_sec_mem_err *)
> + (gdata + 1);
> ghes_edac_report_mem_error(ghes, sev, mem_err);
>
> arch_apei_report_mem_error(sev, mem_err);
> @@ -466,7 +487,13 @@ static void ghes_do_proc(struct ghes *ghes,
> else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
> CPER_SEC_PCIE)) {
> struct cper_sec_pcie *pcie_err;
> - pcie_err = (struct cper_sec_pcie *)(gdata+1);
> +
> + if (gdata_v3)
> + pcie_err = (struct cper_sec_pcie *)
> + (gdata_v3 + 1);
> + else
> + pcie_err = (struct cper_sec_pcie *)
> + (gdata + 1);
> if (sev == GHES_SEV_RECOVERABLE &&
> sec_sev == GHES_SEV_RECOVERABLE &&
> pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index d425374..23f62962 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -32,6 +32,8 @@
> #include <linux/acpi.h>
> #include <linux/pci.h>
> #include <linux/aer.h>
> +#include <linux/printk.h>
> +#include <linux/bcd.h>
>
> #define INDENT_SP " "
>
> @@ -392,6 +394,10 @@ static void cper_estatus_print_section(
> uuid_le *sec_type = (uuid_le *)gdata->section_type;
> __u16 severity;
> char newpfx[64];
> + struct acpi_hest_generic_data_v300 *gdata_v3 = NULL;
> +
> + if ((gdata->revision >> 8) >= 0x03)
> + gdata_v3 = (struct acpi_hest_generic_data_v300 *)gdata;
>
> severity = gdata->error_severity;
> printk("%s""Error %d, type: %s\n", pfx, sec_no,
> @@ -403,14 +409,24 @@ static void cper_estatus_print_section(
>
> snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
> if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
> - struct cper_sec_proc_generic *proc_err = (void *)(gdata + 1);
> + struct cper_sec_proc_generic *proc_err;
> +
> + if (gdata_v3)
> + proc_err = (void *)(gdata_v3 + 1);
> + else
> + proc_err = (void *)(gdata + 1);
> printk("%s""section_type: general processor error\n", newpfx);
> if (gdata->error_data_length >= sizeof(*proc_err))
> cper_print_proc_generic(newpfx, proc_err);
> else
> goto err_section_too_small;
> } else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
> - struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
> + struct cper_sec_mem_err *mem_err;
> +
> + if (gdata_v3)
> + mem_err = (void *)(gdata_v3 + 1);
> + else
> + mem_err = (void *)(gdata + 1);
> printk("%s""section_type: memory error\n", newpfx);
> if (gdata->error_data_length >=
> sizeof(struct cper_sec_mem_err_old))
> @@ -419,7 +435,12 @@ static void cper_estatus_print_section(
> else
> goto err_section_too_small;
> } else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
> - struct cper_sec_pcie *pcie = (void *)(gdata + 1);
> + struct cper_sec_pcie *pcie;
> +
> + if (gdata_v3)
> + pcie = (void *)(gdata_v3 + 1);
> + else
> + pcie = (void *)(gdata + 1);


The only use of the gdata_v3 above cases is to get the payload(or error_record).
So instead of spilling these checks all over could we use something like :

#define acpi_hest_generic_data_version(gdata) \
(gdata->revision >> 8)

static inline void *
acpi_hest_generic_data_payload(struct acpi_hest_generic_data *gdata)
{
return acpi_hest_generic_data_version(gdata) >= 3 ?
((struct acpi_hest_generic_data_v300 *)(gdata)) + 1 :
gdata + 1;
}

And then do :

void *payload = acpi_hest_generic_data_payload(gdata);


> printk("%s""section_type: PCIe error\n", newpfx);
> if (gdata->error_data_length >= sizeof(*pcie))
> cper_print_pcie(newpfx, pcie, gdata);
> @@ -434,10 +455,38 @@ err_section_too_small:
> pr_err(FW_WARN "error section length is too small\n");
> }
>
> +static void cper_estatus_print_section_v300(const char *pfx,
> + const struct acpi_hest_generic_data_v300 *gdata, int sec_no)
> +{
> + __u8 hour, min, sec, day, mon, year, century, *timestamp;
> +
> + if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
> + timestamp = (__u8 *)&(gdata->time_stamp);
> + memcpy(&sec, timestamp, 1);
> + memcpy(&min, timestamp + 1, 1);
> + memcpy(&hour, timestamp + 2, 1);
> + memcpy(&day, timestamp + 4, 1);
> + memcpy(&mon, timestamp + 5, 1);
> + memcpy(&year, timestamp + 6, 1);
> + memcpy(&century, timestamp + 7, 1);
> + printk("%stime: ", pfx);
> + printk("%7s", 0x01 & *(timestamp + 3) ? "precise" : "");
> + printk(" %02d:%02d:%02d %02d%02d-%02d-%02d\n",
> + bcd2bin(hour), bcd2bin(min), bcd2bin(sec),
> + bcd2bin(century), bcd2bin(year), bcd2bin(mon),
> + bcd2bin(day));
> + }
> +
> + cper_estatus_print_section(pfx,
> + (const struct acpi_hest_generic_data *)gdata,
> + sec_no);
> +}


Wouldn't it be better do the v3 header check from cper_erstatus_print_section() and call out
to cper_erstatus_print_section_v300() ? That way, we can leave the callers unaffected,
even for future changes.


> + if (gdata_v3) {
> + while (data_len >= sizeof(*gdata_v3)) {
> + gedata_len = gdata_v3->error_data_length;
> + cper_estatus_print_section_v300(newpfx, gdata_v3,
> + sec_no);
> + data_len -= gedata_len + sizeof(*gdata_v3);
> + gdata_v3 = (void *)(gdata_v3 + 1) + gedata_len;
> + sec_no++;
> + }
> + } else {
> + while (data_len >= sizeof(*gdata)) {
> + gedata_len = gdata->error_data_length;
> + cper_estatus_print_section(newpfx, gdata, sec_no);
> + data_len -= gedata_len + sizeof(*gdata);
> + gdata = (void *)(gdata + 1) + gedata_len;
> + sec_no++;
> + }

With the change mentioned above and storing the sizeof(), we could make this
hunk a bit more cleaner.


Suzuki

2016-04-20 21:25:30

by Tyler Baicar

[permalink] [raw]
Subject: Re: [PATCH V2 2/9] ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1

Hello Suzuki,

On 4/14/2016 4:22 AM, Suzuki K Poulose wrote:
> On 06/04/16 16:12, Tyler Baicar wrote:
>> Currently when a RAS error is reported it is not timestamped.
>> The ACPI 6.1 spec adds the timestamp field to the generic error
>> data entry v3 structure. The timestamp of when the firmware
>> generated the error is now being reported.
>>
>> Signed-off-by: Jonathan (Zhixiong) Zhang <[email protected]>
>> Signed-off-by: Richard Ruigrok <[email protected]>
>> Signed-off-by: Tyler Baicar <[email protected]>
>> Signed-off-by: Naveen Kaje <[email protected]>
>> ---
>> drivers/acpi/apei/ghes.c | 35 ++++++++++++--
>> drivers/firmware/efi/cper.c | 111
>> +++++++++++++++++++++++++++++++++++++-------
>> 2 files changed, 126 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 9b0543e..a6848706 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -419,7 +419,15 @@ static void ghes_handle_memory_failure(struct
>> acpi_hest_generic_data *gdata, int
>> int flags = -1;
>> int sec_sev = ghes_severity(gdata->error_severity);
>> struct cper_sec_mem_err *mem_err;
>> - mem_err = (struct cper_sec_mem_err *)(gdata + 1);
>> + struct acpi_hest_generic_data_v300 *gdata_v3 = NULL;
>> +
>> + if ((gdata->revision >> 8) >= 0x03)
>
> Could we please make that a macro ? We seem to be using the check
> everywhere.
Yes, I can make this a macro since we do this several times.
>
>> + gdata_v3 = (struct acpi_hest_generic_data_v300 *)gdata;
>> +
>> + if (gdata_v3)
>> + mem_err = (struct cper_sec_mem_err *)(gdata_v3 + 1);
>> + else
>> + mem_err = (struct cper_sec_mem_err *)(gdata + 1);
>>
>> if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
>> return;
>> @@ -449,14 +457,27 @@ static void ghes_do_proc(struct ghes *ghes,
>> {
>> int sev, sec_sev;
>> struct acpi_hest_generic_data *gdata;
>> + struct acpi_hest_generic_data_v300 *gdata_v3 = NULL;
>> + uuid_le sec_type;
>>
>> sev = ghes_severity(estatus->error_severity);
>> apei_estatus_for_each_section(estatus, gdata) {
>> sec_sev = ghes_severity(gdata->error_severity);
>> - if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>> + sec_type = *(uuid_le *)gdata->section_type;
>> +
>> + if ((gdata->revision >> 8) >= 0x03)
>> + gdata_v3 = (struct acpi_hest_generic_data_v300 *)gdata;
>> +
>> + if (!uuid_le_cmp(sec_type,
>> CPER_SEC_PLATFORM_MEM)) {
>> struct cper_sec_mem_err *mem_err;
>> - mem_err = (struct cper_sec_mem_err *)(gdata+1);
>> +
>> + if (gdata_v3)
>> + mem_err = (struct cper_sec_mem_err *)
>> + (gdata_v3 + 1);
>> + else
>> + mem_err = (struct cper_sec_mem_err *)
>> + (gdata + 1);
>> ghes_edac_report_mem_error(ghes, sev, mem_err);
>>
>> arch_apei_report_mem_error(sev, mem_err);
>> @@ -466,7 +487,13 @@ static void ghes_do_proc(struct ghes *ghes,
>> else if (!uuid_le_cmp(*(uuid_le *)gdata->section_type,
>> CPER_SEC_PCIE)) {
>> struct cper_sec_pcie *pcie_err;
>> - pcie_err = (struct cper_sec_pcie *)(gdata+1);
>> +
>> + if (gdata_v3)
>> + pcie_err = (struct cper_sec_pcie *)
>> + (gdata_v3 + 1);
>> + else
>> + pcie_err = (struct cper_sec_pcie *)
>> + (gdata + 1);
>> if (sev == GHES_SEV_RECOVERABLE &&
>> sec_sev == GHES_SEV_RECOVERABLE &&
>> pcie_err->validation_bits &
>> CPER_PCIE_VALID_DEVICE_ID &&
>> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
>> index d425374..23f62962 100644
>> --- a/drivers/firmware/efi/cper.c
>> +++ b/drivers/firmware/efi/cper.c
>> @@ -32,6 +32,8 @@
>> #include <linux/acpi.h>
>> #include <linux/pci.h>
>> #include <linux/aer.h>
>> +#include <linux/printk.h>
>> +#include <linux/bcd.h>
>>
>> #define INDENT_SP " "
>>
>> @@ -392,6 +394,10 @@ static void cper_estatus_print_section(
>> uuid_le *sec_type = (uuid_le *)gdata->section_type;
>> __u16 severity;
>> char newpfx[64];
>> + struct acpi_hest_generic_data_v300 *gdata_v3 = NULL;
>> +
>> + if ((gdata->revision >> 8) >= 0x03)
>> + gdata_v3 = (struct acpi_hest_generic_data_v300 *)gdata;
>>
>> severity = gdata->error_severity;
>> printk("%s""Error %d, type: %s\n", pfx, sec_no,
>> @@ -403,14 +409,24 @@ static void cper_estatus_print_section(
>>
>> snprintf(newpfx, sizeof(newpfx), "%s%s", pfx, INDENT_SP);
>> if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
>> - struct cper_sec_proc_generic *proc_err = (void *)(gdata + 1);
>> + struct cper_sec_proc_generic *proc_err;
>> +
>> + if (gdata_v3)
>> + proc_err = (void *)(gdata_v3 + 1);
>> + else
>> + proc_err = (void *)(gdata + 1);
>> printk("%s""section_type: general processor error\n", newpfx);
>> if (gdata->error_data_length >= sizeof(*proc_err))
>> cper_print_proc_generic(newpfx, proc_err);
>> else
>> goto err_section_too_small;
>> } else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
>> - struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
>> + struct cper_sec_mem_err *mem_err;
>> +
>> + if (gdata_v3)
>> + mem_err = (void *)(gdata_v3 + 1);
>> + else
>> + mem_err = (void *)(gdata + 1);
>> printk("%s""section_type: memory error\n", newpfx);
>> if (gdata->error_data_length >=
>> sizeof(struct cper_sec_mem_err_old))
>> @@ -419,7 +435,12 @@ static void cper_estatus_print_section(
>> else
>> goto err_section_too_small;
>> } else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
>> - struct cper_sec_pcie *pcie = (void *)(gdata + 1);
>> + struct cper_sec_pcie *pcie;
>> +
>> + if (gdata_v3)
>> + pcie = (void *)(gdata_v3 + 1);
>> + else
>> + pcie = (void *)(gdata + 1);
>
>
> The only use of the gdata_v3 above cases is to get the payload(or
> error_record).
> So instead of spilling these checks all over could we use something
> like :
>
> #define acpi_hest_generic_data_version(gdata) \
> (gdata->revision >> 8)
>
> static inline void *
> acpi_hest_generic_data_payload(struct acpi_hest_generic_data *gdata)
> {
> return acpi_hest_generic_data_version(gdata) >= 3 ?
> ((struct acpi_hest_generic_data_v300 *)(gdata)) + 1 :
> gdata + 1;
> }
>
> And then do :
>
> void *payload = acpi_hest_generic_data_payload(gdata);
>
>
Yes, this should simplify the the code to get the payload. This can also
be done in the ghes.c code above as well.
>> printk("%s""section_type: PCIe error\n", newpfx);
>> if (gdata->error_data_length >= sizeof(*pcie))
>> cper_print_pcie(newpfx, pcie, gdata);
>> @@ -434,10 +455,38 @@ err_section_too_small:
>> pr_err(FW_WARN "error section length is too small\n");
>> }
>>
>> +static void cper_estatus_print_section_v300(const char *pfx,
>> + const struct acpi_hest_generic_data_v300 *gdata, int sec_no)
>> +{
>> + __u8 hour, min, sec, day, mon, year, century, *timestamp;
>> +
>> + if (gdata->validation_bits & ACPI_HEST_GEN_VALID_TIMESTAMP) {
>> + timestamp = (__u8 *)&(gdata->time_stamp);
>> + memcpy(&sec, timestamp, 1);
>> + memcpy(&min, timestamp + 1, 1);
>> + memcpy(&hour, timestamp + 2, 1);
>> + memcpy(&day, timestamp + 4, 1);
>> + memcpy(&mon, timestamp + 5, 1);
>> + memcpy(&year, timestamp + 6, 1);
>> + memcpy(&century, timestamp + 7, 1);
>> + printk("%stime: ", pfx);
>> + printk("%7s", 0x01 & *(timestamp + 3) ? "precise" : "");
>> + printk(" %02d:%02d:%02d %02d%02d-%02d-%02d\n",
>> + bcd2bin(hour), bcd2bin(min), bcd2bin(sec),
>> + bcd2bin(century), bcd2bin(year), bcd2bin(mon),
>> + bcd2bin(day));
>> + }
>> +
>> + cper_estatus_print_section(pfx,
>> + (const struct acpi_hest_generic_data *)gdata,
>> + sec_no);
>> +}
>
>
> Wouldn't it be better do the v3 header check from
> cper_erstatus_print_section() and call out
> to cper_erstatus_print_section_v300() ? That way, we can leave the
> callers unaffected,
> even for future changes.
>
>
>> + if (gdata_v3) {
>> + while (data_len >= sizeof(*gdata_v3)) {
>> + gedata_len = gdata_v3->error_data_length;
>> + cper_estatus_print_section_v300(newpfx, gdata_v3,
>> + sec_no);
>> + data_len -= gedata_len + sizeof(*gdata_v3);
>> + gdata_v3 = (void *)(gdata_v3 + 1) + gedata_len;
>> + sec_no++;
>> + }
>> + } else {
>> + while (data_len >= sizeof(*gdata)) {
>> + gedata_len = gdata->error_data_length;
>> + cper_estatus_print_section(newpfx, gdata, sec_no);
>> + data_len -= gedata_len + sizeof(*gdata);
>> + gdata = (void *)(gdata + 1) + gedata_len;
>> + sec_no++;
>> + }
>
> With the change mentioned above and storing the sizeof(), we could
> make this
> hunk a bit more cleaner.
This should certainly clean up this code and I agree it is better to
leave the callers unaffected.

I'll add all these suggested changes in my next patch-set.

Thanks,
Tyler
>
>
> Suzuki

--
Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2016-04-20 21:38:36

by Tyler Baicar

[permalink] [raw]
Subject: Re: [PATCH V2 6/9] acpi: apei: handle SEA notification type for ARMv8

This patch currently breaks compilation of "allyesconfig" for x86
because HAVE_ACPI_APEI_SEA is really only supported for ARM64 at this
point. Is making this config dependent on ARM64 the correct way to go
about fixing this? Then, in the future when this support is added for
other architectures, whomever adds the support can say this config
depends on either ARM64 or whatever other architecture they add support for.

If this is not the correct way to go about this, please suggest how to
get around this compilation failure.

Thanks,
Tyler

On 4/6/2016 9:12 AM, Tyler Baicar wrote:
> ARM APEI extension proposal added SEA (Synchrounous External
> Abort) notification type for ARMv8.
> Add a new GHES error source handling function for SEA. If an error
> source's notification type is SEA, then this function can be registered
> into the SEA exception handler. That way GHES will parse and report
> SEA exceptions when they occur.
>
> Signed-off-by: Jonathan (Zhixiong) Zhang <[email protected]>
> Signed-off-by: Tyler Baicar <[email protected]>
> Signed-off-by: Naveen Kaje <[email protected]>
> ---
> arch/arm64/Kconfig | 1 +
> drivers/acpi/apei/Kconfig | 14 ++++++++
> drivers/acpi/apei/ghes.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 98 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 08952ec..6a7e166 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -4,6 +4,7 @@ config ARM64
> select ACPI_GENERIC_GSI if ACPI
> select ACPI_REDUCED_HARDWARE_ONLY if ACPI
> select HAVE_ACPI_APEI if ACPI
> + select HAVE_ACPI_APEI_SEA if ACPI
> select ARCH_HAS_DEVMEM_IS_ALLOWED
> select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
> select ARCH_HAS_ELF_RANDOMIZE
> diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
> index b0140c8..2d6cd9b 100644
> --- a/drivers/acpi/apei/Kconfig
> +++ b/drivers/acpi/apei/Kconfig
> @@ -4,6 +4,20 @@ config HAVE_ACPI_APEI
> config HAVE_ACPI_APEI_NMI
> bool
>
> +config HAVE_ACPI_APEI_SEA
> + bool "APEI Synchronous External Abort logging/recovering support"
> + help
> + This option should be enabled if the system supports
> + firmware first handling of SEA (Synchronous External Abort).
> + SEA happens with certain faults of data abort or instruction
> + abort synchronous exceptions on ARMv8 systems. If a system
> + supports firmware first handling of SEA, the platform analyzes
> + and handles hardware error notifications with SEA, and it may then
> + form a HW error record for the OS to parse and handle. This
> + option allows the OS to look for such HW error record, and
> + take appropriate action.
> +
> config ACPI_APEI
> bool "ACPI Platform Error Interface (APEI)"
> select MISC_FILESYSTEMS
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index a6848706..aae5ca0 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -50,6 +50,10 @@
> #include <acpi/apei.h>
> #include <asm/tlbflush.h>
>
> +#ifdef CONFIG_HAVE_ACPI_APEI_SEA
> +#include <asm/system_misc.h>
> +#endif
> +
> #include "apei-internal.h"
>
> #define GHES_PFX "GHES: "
> @@ -789,6 +793,62 @@ static struct notifier_block ghes_notifier_sci = {
> .notifier_call = ghes_notify_sci,
> };
>
> +#ifdef CONFIG_HAVE_ACPI_APEI_SEA
> +static LIST_HEAD(ghes_sea);
> +
> +static int ghes_notify_sea(struct notifier_block *this,
> + unsigned long event, void *data)
> +{
> + struct ghes *ghes;
> + int ret = NOTIFY_DONE;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(ghes, &ghes_sea, list) {
> + if (!ghes_proc(ghes))
> + ret = NOTIFY_OK;
> + }
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> +static struct notifier_block ghes_notifier_sea = {
> + .notifier_call = ghes_notify_sea,
> +};
> +
> +static int ghes_sea_add(struct ghes *ghes)
> +{
> + mutex_lock(&ghes_list_mutex);
> + if (list_empty(&ghes_sea))
> + sea_register_handler_chain(&ghes_notifier_sea);
> + list_add_rcu(&ghes->list, &ghes_sea);
> + mutex_unlock(&ghes_list_mutex);
> + return 0;
> +}
> +
> +static void ghes_sea_remove(struct ghes *ghes)
> +{
> + mutex_lock(&ghes_list_mutex);
> + list_del_rcu(&ghes->list);
> + if (list_empty(&ghes_sea))
> + sea_unregister_handler_chain(&ghes_notifier_sea);
> + mutex_unlock(&ghes_list_mutex);
> +}
> +#else /* CONFIG_HAVE_ACPI_APEI_SEA */
> +static inline int ghes_sea_add(struct ghes *ghes)
> +{
> + pr_err(GHES_PFX "ID: %d, trying to add SEA notification which is not supported\n",
> + ghes->generic->header.source_id);
> + return -ENOTSUPP;
> +}
> +
> +static inline void ghes_sea_remove(struct ghes *ghes)
> +{
> + pr_err(GHES_PFX "ID: %d, trying to remove SEA notification which is not supported\n",
> + ghes->generic->header.source_id);
> +}
> +#endif /* CONFIG_HAVE_ACPI_APEI_SEA */
> +
> #ifdef CONFIG_HAVE_ACPI_APEI_NMI
> /*
> * printk is not safe in NMI context. So in NMI handler, we allocate
> @@ -1033,6 +1093,14 @@ static int ghes_probe(struct platform_device *ghes_dev)
> case ACPI_HEST_NOTIFY_EXTERNAL:
> case ACPI_HEST_NOTIFY_SCI:
> break;
> + case ACPI_HEST_NOTIFY_SEA:
> + if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_SEA)) {
> + pr_warn(GHES_PFX "Generic hardware error source: %d notified via SEA is not supported\n",
> + generic->header.source_id);
> + rc = -ENOTSUPP;
> + goto err;
> + }
> + break;
> case ACPI_HEST_NOTIFY_NMI:
> if (!IS_ENABLED(CONFIG_HAVE_ACPI_APEI_NMI)) {
> pr_warn(GHES_PFX "Generic hardware error source: %d notified via NMI interrupt is not supported!\n",
> @@ -1044,6 +1112,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
> pr_warning(GHES_PFX "Generic hardware error source: %d notified via local interrupt is not supported!\n",
> generic->header.source_id);
> goto err;
> + case ACPI_HEST_NOTIFY_GPIO:
> + case ACPI_HEST_NOTIFY_SEI:
> + case ACPI_HEST_NOTIFY_GSIV:
> + pr_warn(GHES_PFX "Generic hardware error source: %d notified via notification type %u is not supported\n",
> + generic->header.source_id, generic->header.source_id);
> + rc = -ENOTSUPP;
> + goto err;
> default:
> pr_warning(FW_WARN GHES_PFX "Unknown notification type: %u for generic hardware error source: %d\n",
> generic->notify.type, generic->header.source_id);
> @@ -1098,6 +1173,11 @@ static int ghes_probe(struct platform_device *ghes_dev)
> list_add_rcu(&ghes->list, &ghes_sci);
> mutex_unlock(&ghes_list_mutex);
> break;
> + case ACPI_HEST_NOTIFY_SEA:
> + rc = ghes_sea_add(ghes);
> + if (rc)
> + goto err_edac_unreg;
> + break;
> case ACPI_HEST_NOTIFY_NMI:
> ghes_nmi_add(ghes);
> break;
> @@ -1140,6 +1220,9 @@ static int ghes_remove(struct platform_device *ghes_dev)
> unregister_acpi_hed_notifier(&ghes_notifier_sci);
> mutex_unlock(&ghes_list_mutex);
> break;
> + case ACPI_HEST_NOTIFY_SEA:
> + ghes_sea_remove(ghes);
> + break;
> case ACPI_HEST_NOTIFY_NMI:
> ghes_nmi_remove(ghes);
> break;

--
Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project