Subject: Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space
To: James Morse, Peter Maydell
CC: Radim Krčmář, Jonathan Corbet, Christoffer Dall, Marc Zyngier, Catalin Marinas, Will Deacon, kvm-devel, "open list:DOCUMENTATION", lkml - Kernel Mailing List, arm-mail-list
From: gengdongjiu
Date: Thu, 10 Jan 2019 20:09:34 +0800
Message-ID: <63a1ad58-b4db-c266-1077-0a3d0a3975d0@huawei.com>
In-Reply-To: <1d73ec5a-b58d-1e00-b681-53cd80cba999@arm.com>

Hi James/Peter,
   Thanks for this discussion, and sorry for my late response; I was on vacation.

On 2018/12/22 2:17, James Morse wrote:
> Hi Peter,
>
> On 19/12/2018 19:02, Peter Maydell wrote:
>> On Mon, 17 Dec 2018 at 15:56, James Morse wrote:
>>> I don't think this really matters. It's only the NMI-like notifications that the
>>> guest doesn't have to register or poll. The ones we support today extend the
>>> architecture's existing behaviour: you would have taken an external-abort on a
>>> real system, whether you know about the additional metadata doesn't matter to Qemu.
>>
>> Consider the case where we booted the guest using a DTB and no ACPI
>> table at all -- we certainly can't just call QEMU code that tries to
>> add entries to a nonexistent table.
>
> Sure, because you know which of the two sets of firmware-table you're providing.
>
> I'm taking the behaviour of physical machines as the template for what we should
> do here. I can boot a DT-only kernel on Seattle. Firmware has no idea I did
> this, it will still take DRAM uncorrected-error IRQs in firmware, and generate
> CPER records in the POLLed areas. But the kernel will never look, because it
> booted with DT.
> What happens if the kernel goes on to access the corrupt location? It either
> gets corrupt values back, or an external abort, depending on the design of the
> memory-controller.
>
> X-gene uses an IRQ for its firmware-first notification. Booted with DT that
> interrupt can be asserted, but as the OS didn't know to register it, it's
> never taken. We eventually get the same corrupt-values/external-abort behaviour.
>
> KVM/Linux is acting as the memory controller using stage2.
> When an error is detected by the host it unmaps the page from stage2, and
> refuses to map it again until it's fixed up in Qemu's memory map (which can
> happen automatically). If the kernel can't fix it itself, the AO signal is like
> the DRAM-IRQ above, and the AR like the external abort.
> We don't have a parallel to the 'gets corrupt values back' behaviour as Linux
> will always unmap hwpoison pages from user-space/guests.
>
> If the host-kernel wasn't built with CONFIG_MEMORY_FAILURE, it's like the memory
> controller doesn't support any of the above. I think knowing this is the closest
> to what you want.
>
>
>> My main point is that there
>> needs to be logic in Dongjiu's QEMU patches that checks more than
>> just "does this KVM feature exist". I'm not sufficiently familiar
>> with all this RAS stuff to be certain what those checks should
>> be and what the right choices are; I just know we need to check
>> *something*...
>
> I think this is the crux of where we don't see this the same way.
> The v8.2 RAS stuff is new, RAS support on arm64 is not. Kernel support arrived
> at roughly the same time, but not CPU support. There are v8.0 systems that
> support RAS. There are DT systems that can do the same with edac drivers.
> The physical v8.0 systems that do this, are doing it without any extra CPU support.
>
> I think x86's behaviour here includes some history, which we don't have.
> From the order of the HEST entries, it looks like the machine-check stuff came
> first, then firmware-first using a 'GHES' entry in that table.
> I think Qemu on x86 only supports the emulated machine check stuff, so it needs
> to check KVM has the widget to do this.
> If Qemu on x86 supported firmware-first, I don't think there would be anything
> to check. (details below)

Peter, to summarize James's main point: he thinks QEMU does not need to check
*something* if QEMU supports firmware-first. How should we handle your comments, then?

>
>
>>>> Let us see the X86's QEMU logic:
>>>> 1. Before the vCPU is created, it will set a default env->mcg_cap value with
>>>> the MCE_CAP_DEF flag; MCG_SER_P means it expects the guest CPU model to support
>>>> RAS error recovery.[1]
>>>> 2. When the vCPU initializes, it will check whether the host
>>>> kernel supports this feature[2]. Only when the host kernel and the default
>>>> env->mcg_cap value both expect this feature will it set up the vCPU to support
>>>> RAS error recovery[3].
>>>
>>> This looks like KVM exposing a CPU capability to Qemu, which then configures the
>>> behaviour KVM gives to the guest. This doesn't tell you anything about what the
>>> guest supports.
>>
>> It tells you what the *guest CPU* supports, which for x86 is a combination
>> of (a) what did the user/machine model ask for and (b) what can KVM
>> actually implement. I don't much care whether the guest OS supports
>> anything or not, that's its business... but it does seem odd to me
>> that the equivalent Arm code is not similarly saying "what were we
>> asked for, and what can we do?".
>
> The flow is something like:
> For AO, generate CPER records, and notify the OS via NOTIFY_POLL (which isn't
> really a notification) or some flavour of IRQ.
> To do this, Qemu needs to be able to write to its reserved area of guest memory,
> and possibly trigger an interrupt.
>
> For AR, generate CPER records and notify the OS via external abort. (the
> presence of the CPER records makes this NOTIFY_SEA or NOTIFY_SEI).
> To do this, Qemu again needs to be able to write to guest memory, set guest
> registers (KVM_SET_ONE_REG()). If it wants to inject an
> SError-Interrupt/Asynchronous-external-abort while the guest has it masked, it
> needs KVM_SET_VCPU_EVENTS().
>
> Nothing here depends on the CPU or kernel configuration. This is all ACPI stuff,
> so it's the same on x86. (The only difference is external-abort becomes NMI,
> which is probably done through SET_VCPU_EVENTS())
>
> What were we asked for?
> Qemu wants to know if it can write to guest memory,
> guest registers (for synchronous external abort) and trigger interrupts. It has
> always been able to do these things.
>
>
>> I think one question here which it would be good to answer is:
>> if we are modelling a guest and we haven't specifically provided
>> it an ACPI table to tell it about memory errors, what do we do
>> when we get a sigbus from the host? We have basically two choices:
>> (1) send the guest an SError (aka asynchronous external abort)
>> anyway (with no further info about what the memory error is)
>
> For an AR signal an external abort is valid. It's up to the implementation
> whether these are synchronous or asynchronous. Qemu can only take a signal for
> something that was synchronous, so you can choose between the two.
> Synchronous external abort is marginally better as an unaware OS knows it
> affects this thread, and may be able to kill it.
> SError with an imp-def ESR is indistinguishable from 'part of the soc fell out',
> and should always result in a panic().
>
>
>> (2) just stop QEMU (as we would for a memory error in QEMU's
>> own memory)
>
> This is also valid. A machine may take external-abort to EL3 and then
> reboot/crash/burn.
>
>
> Just in case this is the deeper issue: I keep picking on memory-errors, what
> about CPU errors?
> Linux can't handle these at all, unless they are also memory errors. If we take
> an imprecise abort from a guest KVM can't tell Qemu using signals. We don't have
> any mechanism to tell user-space about imprecise exceptions. In this case KVM
> throws an imp-def SError back at the affected vcpu, these are allowed to be
> imprecise, as this is the closest thing we have.
>
> This does mean that any AO/AR signal Qemu gets is a memory error.
>
>
> Happy New Year,
>
> James
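
To make sure I read the AO/AR flow above correctly, here is a rough model of it
as a small decision function. It is only a sketch in Python, not QEMU or KVM
code: the names plan_notification and Plan, the parameters and the step strings
are all made up for illustration. The only identifiers taken from this thread
are NOTIFY_POLL, NOTIFY_SEA, NOTIFY_SEI, KVM_SET_ONE_REG and
KVM_SET_VCPU_EVENTS. The "no table" branches follow the DT-boot analogy and the
external-abort option discussed above; stopping QEMU instead would be equally
valid.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Plan:
    """Illustrative summary of how one memory-error signal would be handled."""
    notification: str
    steps: List[str] = field(default_factory=list)


def plan_notification(kind: str, has_error_table: bool,
                      prefer_serror: bool = False,
                      serror_masked: bool = False) -> Plan:
    """Model the AO/AR flow from the thread. All names here are hypothetical."""
    steps: List[str] = []
    if has_error_table:
        # Firmware-first: the guest was given a table describing error sources.
        steps.append("write CPER records to the reserved guest memory area")

    if kind == "AO":
        if not has_error_table:
            # Like a DT-booted kernel: nobody registered for the notification.
            return Plan("none", steps)
        steps.append("leave the record to be polled, or raise an interrupt")
        return Plan("NOTIFY_POLL or IRQ", steps)

    if kind == "AR":
        if prefer_serror and serror_masked:
            steps.append("pend the SError with KVM_SET_VCPU_EVENTS")
        else:
            steps.append("set guest registers with KVM_SET_ONE_REG")
        if not has_error_table:
            return Plan("external abort without CPER records", steps)
        return Plan("NOTIFY_SEI" if prefer_serror else "NOTIFY_SEA", steps)

    raise ValueError(f"unknown signal kind: {kind!r}")


if __name__ == "__main__":
    for kind, table in (("AO", True), ("AO", False), ("AR", True), ("AR", False)):
        print(kind, table, plan_notification(kind, table))
```

Note that nothing in this model asks KVM for a capability: every step is a
guest-memory write, a register write or an interrupt, which matches James's
point that QEMU has always been able to do these things.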