From: "Baicar, Tyler"
To: John Garry, marc.zyngier@arm.com, pbonzini@redhat.com,
    rkrcmar@redhat.com, linux@armlinux.org.uk, catalin.marinas@arm.com,
    will.deacon@arm.com, rjw@rjwysocki.net, lenb@kernel.org,
    matt@codeblueprint.co.uk, robert.moore@intel.com, lv.zheng@intel.com,
    nkaje@codeaurora.org, zjzhang@codeaurora.org, mark.rutland@arm.com,
    james.morse@arm.com, akpm@linux-foundation.org,
    eun.taik.lee@samsung.com, sandeepa.s.prabhu@gmail.com,
    shijie.huang@arm.com, rruigrok@codeaurora.org,
    paul.gortmaker@windriver.com, tomasz.nowicki@linaro.org,
    fu.wei@linaro.org, rostedt@goodmis.org, bristot@redhat.com,
    linux-arm-kernel@lists.infradead.org, kvmarm@lists.cs.columbia.edu,
    kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-acpi@vger.kernel.org, linux-efi@vger.kernel.org,
    Suzuki.Poulose@arm.com, punit.agrawal@arm.com, astone@redhat.com,
    harba@codeaurora.org, hanjun.guo@linaro.org, Shiju Jose, Linuxarm,
    Anurup M
Subject: Re: [PATCH V5 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64
Date: Tue, 22 Nov 2016 10:13:19 -0700
Message-ID: <1baa09bb-a42f-05bb-0523-4942d60c0619@codeaurora.org>
References: <1479767763-27532-1-git-send-email-tbaicar@codeaurora.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Thank you John! Let me know how it goes and if you have any questions :)

Tyler

On 11/22/2016 4:11 AM, John Garry wrote:
> +
>
> We'll try and test this on our platform.
>
> Cheers,
> John
>
> On 21/11/2016 22:35, Tyler Baicar wrote:
>> When a memory error, CPU error, PCIe error, or other type of hardware
>> error that's covered by RAS occurs, firmware should populate the shared
>> GHES memory location with the proper GHES structures to notify the OS of
>> the error. For example, platforms that implement firmware-first handling
>> may implement separate GHES sources for corrected errors and uncorrected
>> errors. If the error is uncorrectable, the firmware will notify the OS
>> immediately since the error needs to be handled ASAP. The OS will then be
>> able to take the appropriate action, such as offlining a page. If the
>> error is a corrected error, the firmware will not interrupt the OS
>> immediately. Instead, the OS will see and report the error the next time
>> its GHES timer expires. The kernel will first parse the GHES structures
>> and report the errors through the kernel logs and then notify user space
>> through RAS trace events. This allows user space applications such as RAS
>> Daemon to see the errors and report them however the user desires. This
>> patchset extends the kernel functionality for RAS errors based on updates
>> in the UEFI 2.6 and ACPI 6.1 specifications.
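To make the corrected/uncorrected split above concrete, here is a minimal user-space sketch of the decision the paragraph describes. Every name in it (err_class, notify_path, os_action, classify_error) is hypothetical and does not come from the patchset or the kernel. Tying page offlining to memory errors is also an assumption for illustration only.

```c
#include <stdbool.h>

/* All names below are hypothetical; they are not kernel identifiers. */
enum err_class { ERR_CORRECTED, ERR_UNCORRECTED };

enum notify_path {
	NOTIFY_POLLED,    /* OS finds the record when its GHES timer expires */
	NOTIFY_IMMEDIATE  /* firmware signals the OS right away */
};

struct os_action {
	enum notify_path path;
	bool offline_page;  /* example recovery step named in the text */
};

/* Map an error class to the delivery path and a possible OS reaction. */
static struct os_action classify_error(enum err_class cls, bool is_memory_error)
{
	struct os_action act = { NOTIFY_POLLED, false };

	if (cls == ERR_UNCORRECTED) {
		act.path = NOTIFY_IMMEDIATE;
		/* Assumption: only memory errors lead to page offlining here. */
		act.offline_page = is_memory_error;
	}
	return act;
}
```

With this sketch, classify_error(ERR_CORRECTED, true) stays on the polled path and requests no recovery action, which mirrors the "report on the next timer expiry" behaviour described above.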
>>
>> An example flow from firmware to user space could be:
>>
>>                  +---------------+
>>        +-------->|               |
>>        |         | GHES polling  |--+
>> +-------------+  |    source     |  |   +---------------+   +------------+
>> |             |  +---------------+  |   |  Kernel GHES  |   |            |
>> |  Firmware   |                     +-->|  CPER AER and |-->| RAS trace  |
>> |             |  +---------------+  |   |  EDAC drivers |   |   event    |
>> +-------------+  |               |  |   +---------------+   +------------+
>>        |         |   GHES sci    |--+
>>        +-------->|    source     |
>>                  +---------------+
>>
>> Add support for Generic Hardware Error Source (GHES) v2, which introduces
>> the capability for the OS to acknowledge the consumption of the error
>> record generated by the Reliability, Availability and Serviceability (RAS)
>> controller. This eliminates potential race conditions between the OS and
>> the RAS controller.
>>
>> Add support for the timestamp field added to the Generic Error Data Entry
>> v3, allowing the OS to log the time that the error is generated by the
>> firmware, rather than the time the error is consumed. This improves the
>> correctness of event sequences when analyzing error logs. The timestamp
>> was added in ACPI 6.1; see Table 18-343, Generic Error Data Entry.
>>
>> Add support for ARMv8 Common Platform Error Record (CPER) per the UEFI 2.6
>> specification. ARMv8-specific processor error information is reported as
>> part of the CPER records. This provides more detail for processor error
>> logs and can help describe ARMv8 cache, TLB, and bus errors.
>>
>> Synchronous External Abort (SEA) represents a specific processor error
>> condition in ARM systems. A handler is added to recognize SEA errors, and
>> a notifier is added to parse and report the errors before the process is
>> killed. Refer to section N.2.1.1 in the Common Platform Error Record
>> appendix of the UEFI 2.6 specification.
>>
>> Currently the kernel ignores CPER records that are unrecognized. On the
>> other hand, the UEFI spec allows for non-standard (e.g. vendor
>> proprietary) error section types in CPER (Common Platform Error Record),
>> as defined in section N2.3 of UEFI version 2.5. As a result, the user is
>> not able to see hardware error data from non-standard sections.
>>
>> If the Section Type field of a Generic Error Data Entry is unrecognized,
>> print the raw data to the dmesg buffer and add a tracepoint for reporting
>> such hardware errors.
>>
>> Currently, even if an error status block's severity is fatal, the kernel
>> does not honor the severity level and panic. With the firmware-first
>> model, the platform could inform the OS about a fatal hardware error
>> through the non-NMI GHES notification type. The OS should panic when a
>> hardware error record is received with this severity.
>>
>> Add support to handle SEAs that occur while a KVM guest kernel is
>> running. Currently these are unsupported by the guest abort handling.
>>
>> Depends on: [PATCH v14] acpi, apei, arm64: APEI initial support for
>>             aarch64.
>>             https://lkml.org/lkml/2016/8/10/231
>>
>> V5: Fix GHES goto logic for error conditions
>>     Change ghes_do_read_ack to ghes_ack_error
>>     Make sure data version check is >= 3
>>     Use CPER helper functions in print functions
>>     Make handle_guest_sea() dummy function static for arm
>>     Add arm to subject line for KVM patch
>>
>> V4: Add bit offset left shift to read_ack_write value
>>     Make HEST generic and generic_v2 structures a union in the ghes
>>       structure
>>     Move gdata v3 helper functions into ghes.h to avoid duplication
>>     Reorder the timestamp print and avoid memcpy
>>     Add helper functions for gdata size checking
>>     Rename the SEA functions
>>     Add helper function for GHES panics
>>     Set fru_id to NULL UUID at variable declaration
>>     Limit ARM trace event parameters to the needed structures
>>     Reorder the ARM trace event variables to save space
>>     Add comment for why we don't pass SEAs to the guest when it aborts
>>     Move ARM trace event call into GHES driver instead of CPER
>>
>> V3: Fix unmapped address to the read_ack_register in ghes.c
>>     Add helper function to get the proper payload based on generic data
>>       entry version
>>     Move timestamp print to avoid changing function calls in cper.c
>>     Remove patch "arm64: exception: handle instruction abort at current
>>       EL" since the el1_ia handler is already added in 4.8
>>     Add EFI and ARM64 dependencies for HAVE_ACPI_APEI_SEA
>>     Add a new trace event for ARM type errors
>>     Add support to handle KVM guest SEAs
>>
>> V2: Add PSCI state print for the ARMv8 error type.
>>     Separate timestamp year into year and century using BCD format.
>>     Rebase on top of ACPICA 20160318 release and remove header file
>>       changes in include/acpi/actbl1.h.
>>     Add panic OS with fatal error status block patch.
>>     Add processing of unrecognized CPER error section patches with
>>       updates from previous comments. Original patches:
>>       https://lkml.org/lkml/2015/9/8/646
>>
>> V1: https://lkml.org/lkml/2016/2/5/544
>>
>> Jonathan (Zhixiong) Zhang (1):
>>   acpi: apei: panic OS with fatal error status block
>>
>> Tyler Baicar (9):
>>   acpi: apei: read ack upon ghes record consumption
>>   ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
>>   efi: parse ARMv8 processor error
>>   arm64: exception: handle Synchronous External Abort
>>   acpi: apei: handle SEA notification type for ARMv8
>>   efi: print unrecognized CPER section
>>   ras: acpi / apei: generate trace event for unrecognized CPER section
>>   trace, ras: add ARM processor error trace event
>>   arm/arm64: KVM: add guest SEA support
>>
>>  arch/arm/include/asm/kvm_arm.h       |   1 +
>>  arch/arm/include/asm/system_misc.h   |   5 +
>>  arch/arm/kvm/mmu.c                   |  18 ++-
>>  arch/arm64/Kconfig                   |   1 +
>>  arch/arm64/include/asm/kvm_arm.h     |   1 +
>>  arch/arm64/include/asm/system_misc.h |  15 +++
>>  arch/arm64/mm/fault.c                |  71 ++++++++++--
>>  drivers/acpi/apei/Kconfig            |  14 +++
>>  drivers/acpi/apei/ghes.c             | 188 ++++++++++++++++++++++++++++---
>>  drivers/acpi/apei/hest.c             |   7 +-
>>  drivers/firmware/efi/cper.c          | 210 ++++++++++++++++++++++++++++++++---
>>  drivers/ras/ras.c                    |   2 +
>>  include/acpi/ghes.h                  |  15 ++-
>>  include/linux/cper.h                 |  84 ++++++++++++++
>>  include/ras/ras_event.h              | 100 +++++++++++++++++
>>  15 files changed, 688 insertions(+), 44 deletions(-)
>>
>
>

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
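
For anyone testing the timestamp handling, the V2 changelog note about keeping the year and century as separate BCD values can be sketched as below. This is a standalone user-space illustration, not code from the patchset. The helper names bcd_to_bin and bcd_full_year are hypothetical, and the sketch makes no claim about where these bytes sit in the Generic Error Data Entry.

```c
#include <stdint.h>

/* Convert one packed-BCD byte (two decimal digits) to binary. */
static unsigned int bcd_to_bin(uint8_t bcd)
{
	return (bcd >> 4) * 10u + (bcd & 0x0f);
}

/* Hypothetical helper: combine separate BCD century and year bytes. */
static unsigned int bcd_full_year(uint8_t century, uint8_t year)
{
	return bcd_to_bin(century) * 100u + bcd_to_bin(year);
}
```

For example, a century byte of 0x20 and a year byte of 0x16 combine to the year 2016.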