Subject: Re: [PATCH v5 1/3] arm64/ras: support sea error recovery
To: James Morse
References: <1516969885-150532-1-git-send-email-xiexiuqi@huawei.com> <1516969885-150532-2-git-send-email-xiexiuqi@huawei.com> <5A70C536.7040208@arm.com> <5A7B4D87.9020207@arm.com>
From: Xie XiuQi
Message-ID: <7dacf375-4645-ba34-62d1-96d9f67dbcc2@huawei.com>
Date: Thu, 8 Feb 2018 16:35:44 +0800
In-Reply-To: <5A7B4D87.9020207@arm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi James,

Sorry for the late reply.

On 2018/2/8 3:03, James Morse wrote:
> Hi Xie XiuQi,
> 
> On 30/01/18 19:19, James Morse wrote:
>> On 26/01/18 12:31, Xie XiuQi wrote:
>>> With the ARM v8.2 RAS Extension, SEAs are usually triggered when memory
>>> errors are consumed. With the existing code, an error consumed in the
>>> kernel leads directly to a panic; if it is consumed in user-space, we
>>> just kill the process.
>>>
>>> But there is a class of error for which killing the process is not
>>> necessary: the process can recover and continue to run. For example,
>>> when corrupted instruction data is consumed, the memory page might be
>>> read-only and not have been modified, so the disk still holds the
>>> correct data; we can simply drop the page and reload it when necessary.
>>
>> With firmware-first support, we do all this...
>>
>>
>>> So this patchset tries to solve exactly that problem: if the error is
>>> consumed in user-space and occurs on a clean page, we can drop the
>>> memory page directly without killing the process.
>>>
>>> If the corrupted page is clean, it is simply dropped and we return to
>>> user-space without side effects. If the corrupted page is dirty,
>>> memory_failure() will send SIGBUS with code=BUS_MCEERR_AR. Without this
>>> patchset, do_sea() would just send SIGBUS, so the process is killed at
>>> the same place.
>>
>> ... but this happens too. I agree it's something we should fix, but I don't
>> think this is the best way to do it.
>>
>> This series is pulling the memory-failure-queue details back into the
>> arch code to build a second list, which gets processed as extra work when
>> we return to user-space.
>>
>>
>> The root of the issue is that ghes_notify_sea() claims the notification as
>> something APEI has dealt with, ... but it hasn't done it yet. The signals
>> will be generated by something currently stuck in a queue. (Evidently x86
>> doesn't handle synchronous errors like this using firmware-first.)
>>
>> I think a smaller fix is to give the queues that may be holding the
>> memory_failure() work a kick as part of the code that calls
>> ghes_notify_sea(). This means that by the time we return to do_sea(),
>> ghes_notify_sea()'s claim that APEI has dealt with it is true, as any
>> generated signals are pending. We can then skip the existing SIGBUS
>> generation code.
>>
>>
>>> Because memory_failure() may sleep, we can not call it directly in SEA
>>
>> (this one is more serious, I've attempted to fix it by moving all NMI-like
>> GHES notifications to use the estatus queue).
>>
>>
>>> exception context. So we save the faulting physical address associated
>>> with the process in the ghes handler and set __TIF_SEA_NOTIFY. When we
>>> return from the SEA exception context and get into do_notify_resume(),
>>> before the process runs again, we check the flag and call
>>> memory_failure() to do the recovery.
>>
>>> It's safe, because we are in process context.
>>
>> I think this is the trick. When we take a synchronous external abort out
>> of userspace, we're in process context too. We can add helpers to drain
>> the memory_failure queue which can be called from do_sea() when we know
>> we're preemptible and interrupts-et-al are unmasked.
>
> Something like... based on [0], in arch/arm64/kernel/acpi.c:

I am very glad that you are trying to solve this problem; it is very helpful.
I agree with your proposal, and I'll test it on my box later.
Indeed, we're in process context when we are in the SEA handler.
Before, I thought we couldn't call schedule() in the exception handler.

Thank you very much!

> -----------------%<-----------------
> int apei_claim_sea(struct pt_regs *regs)
> {
>         int cpu;
>         int err = -ENOENT;
>         unsigned long current_flags = arch_local_save_flags();
>         unsigned long interrupted_flags = current_flags;
> 
>         if (!IS_ENABLED(CONFIG_ACPI_APEI_SEA))
>                 return err;
> 
>         if (regs)
>                 interrupted_flags = regs->pstate;
> 
>         /*
>          * APEI expects an NMI-like notification to always be called
>          * in NMI context.
>          */
>         local_daif_restore(DAIF_ERRCTX);
>         nmi_enter();
>         err = ghes_notify_sea();
>         cpu = smp_processor_id();
>         nmi_exit();
> 
>         /*
>          * APEI NMI-like notifications are deferred to irq_work. Unless
>          * we interrupted irqs-masked code, we can do that now.
>          */
>         if (!err) {
>                 if (!arch_irqs_disabled_flags(interrupted_flags)) {
>                         local_daif_restore(DAIF_PROCCTX_NOIRQ);
>                         irq_work_run();
>                 } else {
>                         err = -EINPROGRESS;
>                 }
>         }
> 
>         local_daif_restore(current_flags);
> 
>         if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE) && !err) {
>                 /*
>                  * Memory failure work is scheduled on the local CPU.
>                  * If we interrupted userspace, or are in process context,
>                  * we can do that now.
>                  */
>                 if ((regs && !user_mode(regs)) || !preemptible())
>                         err = -EINPROGRESS;
>                 else
>                         memory_failure_queue_kick(cpu);
>         }
> 
>         return err;
> }
> -----------------%<-----------------
> 
> 
> and to mm/memory-failure.c:
> -----------------%<-----------------
> @@ -1355,7 +1355,7 @@ static void memory_failure_work_func(struct work_struct *work)
>          unsigned long proc_flags;
>          int gotten;
>  
> -        mf_cpu = this_cpu_ptr(&memory_failure_cpu);
> +        mf_cpu = container_of(work, struct memory_failure_cpu, work);
>          for (;;) {
>                  spin_lock_irqsave(&mf_cpu->lock, proc_flags);
>                  gotten = kfifo_get(&mf_cpu->fifo, &entry);
> 
> @@ -1369,6 +1369,22 @@ static void memory_failure_work_func(struct work_struct *work)
>          }
>  }
>  
> +/*
> + * Process memory_failure work queued on the specified CPU.
> + * Used to avoid return-to-userspace racing with the memory_failure workqueue.
> + */
> +void memory_failure_queue_kick(int cpu)
> +{
> +        unsigned long flags;
> +        struct memory_failure_cpu *mf_cpu;
> +
> +        might_sleep();
> +
> +        mf_cpu = &per_cpu(memory_failure_cpu, cpu);
> +        cancel_work_sync(&mf_cpu->work);
> +        memory_failure_work_func(&mf_cpu->work);
> +}
> +
>  static int __init memory_failure_init(void)
>  {
>          struct memory_failure_cpu *mf_cpu;
> -----------------%<-----------------
> 
> I've cooked up some NOTIFY_SEA-ing APEI firmware using kvmtool to test this.
> I haven't yet managed to hit irq-masked code with NOTIFY_SEA. I'll try and
> tidy this up and post a branch to make it easier to test...
> 
> I prefer this as it doesn't duplicate the state and then come back on a TIF
> flag. I'd like to move the kicking logic into ghes.c, as that is where the
> queueing happened, but the 'do-this, restore these flags, do-that' is
> somewhat tasteless, and it looks like arm64 has synchronous NMI-like
> notifications that must be handled before returning to user-space...
> 
> Thanks,
> 
> James
> 
> [0] https://www.spinics.net/lists/linux-acpi/msg80149.html
> 

-- 
Thanks,
Xie XiuQi
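
For reference, below is a rough sketch of the do_sea() call-site in
arch/arm64/mm/fault.c that apei_claim_sea() is meant to pair with. It is only
an illustration of the control flow discussed above, not code from the posted
patch or from James's branch; the siginfo setup and the arm64_notify_die()
signature are assumed from kernels of roughly that era and may differ in
detail.

-----------------%<-----------------
/*
 * Illustrative sketch only (not from the posted patch): how do_sea() could
 * consume apei_claim_sea()'s return value so that the generic SIGBUS below
 * is skipped once APEI has genuinely handled the error.
 */
static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
{
        struct siginfo info;
        const struct fault_info *inf = esr_to_fault_info(esr);

        /*
         * A return of 0 means APEI claimed the SEA and the queued GHES and
         * memory_failure() work has already run, so any SIGBUS for this
         * task is pending by the time we get back to user-space.
         */
        if (apei_claim_sea(regs) == 0)
                return 0;

        /* Unclaimed: keep the existing behaviour and signal the task. */
        info.si_signo = SIGBUS;
        info.si_errno = 0;
        info.si_code  = 0;
        info.si_addr  = (esr & ESR_ELx_FnV) ? NULL : (void __user *)addr;
        arm64_notify_die(inf->name, regs, &info, esr);

        return 0;
}
-----------------%<-----------------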