Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp1979396imu; Sat, 22 Dec 2018 09:23:18 -0800 (PST) X-Google-Smtp-Source: ALg8bN5hwM3ZgyefkcfhAT3vIqQ6uD6UwodjPHqMiwrYqqewazEgJ7BoKrkGyc8KyZJHbdn6OwAZ X-Received: by 2002:a17:902:ab92:: with SMTP id f18mr6877870plr.221.1545499398558; Sat, 22 Dec 2018 09:23:18 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1545499398; cv=none; d=google.com; s=arc-20160816; b=IU4L8s4cwSFdVkPSR2KmHMhWDoo/l0yglU96SqvUlBSC1knNpVV5tkvL2HA1yWOXX7 8vPT3RBqQ8Pfovb86yirwV46iCZAYUQLm4fDxpiXzXiSTtfnBr8FBJnW8TdNZTyyh463 eqUms99pI4fD9hbSJ5Q9eXt+JI0hzrKB3QwMM6G237bsBKN/o2Cf6D2JCZ34HDuJXmDr mLEMdCOjsB1i1jwNLXayLC2ueL+g2p0i2I5+lYRpgUMMCLLlWTLP5HtMA86um+0IynnH 8InlHDrZPOAcZsetYJ0Ns/OC3RQWerJ4NaQXhSCpEdCLbX+JFoRDRBZuJCT/LW/5DVS4 bV8w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:references:cc:to:subject:from; bh=VvUZk2+MhzVx2qcQw+JVfM+MskOOVXLg/VXuenn6uCs=; b=OAeB46LF/kiYFmYvBFjcjhIhF9e6i0bmJFwDrAeZswNZpAA7bIF6rIk9be6x40gn7z ++xhibxnkqkWnoUVThWgpg3f9uxESbHXN+htcjavqU7i3JMDncqvRcK/e/68wF/VDv0C JmhCmE5ipOOEqqedzNYxjhi+lidyL1DT1Sa2vwV3n0D2xoYpziHr3TpUhC/bgOOj9O7p iLueg5ppytG3/qWkf2pgcdhMPfPfRjjdVJo/dDHP9UIi8wG/bL4j5OstPgHZITUpHJUX vBR4jhIDIA8xsNPA7m5QCuQ2skIQ2Y+5MJuTt/QcmV5INIIPc4xcnVodBnPafo0LKgL4 z8jw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id d36si2812264pla.216.2018.12.22.09.23.02; Sat, 22 Dec 2018 09:23:18 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2388559AbeLUSRo (ORCPT + 99 others); Fri, 21 Dec 2018 13:17:44 -0500 Received: from usa-sjc-mx-foss1.foss.arm.com ([217.140.101.70]:56310 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1731660AbeLUSRo (ORCPT ); Fri, 21 Dec 2018 13:17:44 -0500 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 7B23EEBD; Fri, 21 Dec 2018 10:17:43 -0800 (PST) Received: from [10.1.196.105] (eglon.cambridge.arm.com [10.1.196.105]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 87D223F575; Fri, 21 Dec 2018 10:17:41 -0800 (PST) From: James Morse Subject: Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space To: Peter Maydell Cc: gengdongjiu , =?UTF-8?B?UmFkaW0gS3LEjW3DocWZ?= , Jonathan Corbet , Christoffer Dall , Marc Zyngier , Catalin Marinas , Will Deacon , kvm-devel , "open list:DOCUMENTATION" , lkml - Kernel Mailing List , arm-mail-list References: <0184EA26B2509940AA629AE1405DD7F201FFC21E@dggema523-mbx.china.huawei.com> Message-ID: <1d73ec5a-b58d-1e00-b681-53cd80cba999@arm.com> Date: Fri, 21 Dec 2018 18:17:40 +0000 User-Agent: Mozilla/5.0 (X11; Linux aarch64; rv:60.0) Gecko/20100101 Thunderbird/60.3.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-GB Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Peter, On 19/12/2018 19:02, Peter Maydell wrote: > On Mon, 17 Dec 2018 at 15:56, James Morse wrote: >> I don't think this really matters. Its only the NMIlike notifications that the >> guest doesn't have to register or poll. The ones we support today extend the >> architectures existing behaviour: you would have taken an external-abort on a >> real system, whether you know about the additional metadata doesn't matter to Qemu. > > Consider the case where we booted the guest using a DTB and no ACPI > table at all -- we certainly can't just call QEMU code that tries to > add entries to a nonexistent table. Sure, because you know which of the two sets of firmware-table you're providing. I'm taking the behaviour of physical machines as the template for what we should do here. I can boot a DT-only kernel on Seattle. Firmware has no idea I did this, it will still take DRAM uncorrected-error IRQs in firmware, and generate CPER records in the POLLed areas. But the kernel will never look, because it booted with DT. What happens if the kernel goes on to access the corrupt location? It either gets corrupt values back, or an external abort, depending on the design of the memory-controller. X-gene uses an IRQ for its firmware-first notification. Booted with DT that interrupt can be asserted, but as the OS has didn't know to register it, its never taken. We eventually get the same corrupt-values/external-abort behaviour. KVM/Linux is acting as the memory controller using stage2. When an error is detected by the host it unmaps the page from stage2, and refuses to map it again until its fixed up in Qemu's memory map (which can happen automatically). If the kernel can't fix it itself, the AO signal is like the DRAM-IRQ above, and the AR like the external abort. We don't have a parallel to the 'gets corrupt values back' behaviour as Linux will always unmap hwpoison pages from user-space/guests. If the host-kernel wasn't build with CONFIG_MEMORY_FAILURE, its like the memory controller doesn't support any of the above. I think knowing this is the closest to what you want. > My main point is that there > needs to be logic in Dongjiu's QEMU patches that checks more than > just "does this KVM feature exist". I'm not sufficiently familiar > with all this RAS stuff to be certain what those checks should > be and what the right choices are; I just know we need to check > *something*... I think this is the crux of where we don't see this the same way. The v8.2 RAS stuff is new, RAS support on arm64 is not. Kernel support arrived at roughly the same time, but not CPU support. There are v8.0 systems that support RAS. There are DT systems that can do the same with edac drivers. The physical v8.0 systems that do this, are doing it without any extra CPU support. I think x86's behaviour here includes some history, which we don't have. From the order of the HEST entries, it looks like the machine-check stuff came first, then firmware-first using a 'GHES' entry in that table. I think Qemu on x86 only supports the emulated machine check stuff, so it needs to check KVM has the widget to do this. If Qemu on x86 supported firmware-first, I don't think there would be anything to check. (details below) >>> Let us see the X86's QEMU logic: >>> 1. Before the vCPU created, it will set a default env->mcg_cap value with >> >>> MCE_CAP_DEF flag, MCG_SER_P means it expected the guest CPU model supports >>> RAS error recovery.[1] 2. when the vCPU initialize, it will check whether host >>> kernel support this feature[2]. Only when host kernel and default env->mcg_cap >>> value all expected this feature, then it will setup vCPU support RAS error >>> recovery[3]. >> >> This looks like KVM exposing a CPU capability to Qemu, which then configures the >> behaviour KVM gives to the guest. This doesn't tell you anything about what the >> guest supports. > > It tells you what the *guest CPU* supports, which for x86 is a combination > of (a) what did the user/machine model ask for and (b) what can KVM > actually implement. I don't much care whether the guest OS supports > anything or not, that's its business... but it does seem odd to me > that the equivalent Arm code is not similarly saying "what were we > asked for, and what can we do?". The flow is something like: For AO, generate CPER records, and notify the OS via NOTIFY_POLL (which isn't really a notification) or some flavour of IRQ. To do this, Qemu needs to be able to write to its reserved area of guest memory, and possibly trigger an interrupt. For AR, generate CPER records and notify the OS via external abort. (the presence of the CPER records makes this NOTIFY_SEA or NOTIFY_SEI). To do this, Qemu again needs to be able to write to guest memory, set guest registers (KVM_SET_ONE_REG()). If it wants to inject an SError-Interrupt/Asynchronous-external-abort while the guest has it masked, it needs KVM_SET_VCPU_EVENTS(). Nothing here depends on the CPU or kernel configuration. This is all ACPI stuff, so its the same on x86. (The only difference is external-abort becomes NMI, which is probably done through SET_VCPU_EVENTS()) What were we asked for? Qemu wants to know if it can write to guest memory, guest registers (for synchronous external abort) and trigger interrupts. It has always been able to do these things. > I think one question here which it would be good to answer is: > if we are modelling a guest and we haven't specifically provided > it an ACPI table to tell it about memory errors, what do we do > when we get a sigbus from the host? We have basically two choices: > (1) send the guest an SError (aka asynchronous external abort) > anyway (with no further info about what the memory error is) For an AR signal an external abort is valid. Its up to the implementation whether these are synchronous or asynchronous. Qemu can only take a signal for something that was synchronous, so you can choose between the two. Synchronous external abort is marginally better as an unaware OS knows its affects this thread, and may be able to kill it. SError with an imp-def ESR is indistinguishable from 'part of the soc fell out', and should always result in a panic(). > (2) just stop QEMU (as we would for a memory error in QEMU's > own memory) This is also valid. A machine may take external-abort to EL3 and then reboot/crash/burn. Just in case this is the deeper issue: I keep picking on memory-errors, what about CPU errors? Linux can't handle these at all, unless they are also memory errors. If we take an imprecise abort from a guest KVM can't tell Qemu using signals. We don't have any mechanism to tell user-space about imprecise exceptions. In this case KVM throws an imp-def SError back at the affected vcpu, these are allowed to be imprecise, as this is the closest thing we have. This does mean that any AO/AR signal Qemu gets is a memory error. Happy New Year, James