Message-ID: <5A7B4D87.9020207@arm.com>
Date: Wed, 07 Feb 2018 19:03:35 +0000
From: James Morse
To: Xie XiuQi
CC: catalin.marinas@arm.com, will.deacon@arm.com, mingo@redhat.com,
 mark.rutland@arm.com, ard.biesheuvel@linaro.org, Dave.Martin@arm.com,
 takahiro.akashi@linaro.org, tbaicar@codeaurora.org, stephen.boyd@linaro.org,
 bp@suse.de, julien.thierry@arm.com, shiju.jose@huawei.com,
 zjzhang@codeaurora.org, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
 wangxiongfeng2@huawei.com, zhengqiang10@huawei.com, gengdongjiu@huawei.com,
 huawei.libin@huawei.com, wangkefeng.wang@huawei.com, lijinyue@huawei.com,
 guohanjun@huawei.com, hanjun.guo@linaro.org, cj.chengjian@huawei.com
Subject: Re: [PATCH v5 1/3] arm64/ras: support sea error recovery
References: <1516969885-150532-1-git-send-email-xiexiuqi@huawei.com>
 <1516969885-150532-2-git-send-email-xiexiuqi@huawei.com>
 <5A70C536.7040208@arm.com>
In-Reply-To: <5A70C536.7040208@arm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Xie XiuQi,

On 30/01/18 19:19, James Morse wrote:
> On 26/01/18 12:31, Xie XiuQi wrote:
>> With ARM v8.2 RAS Extension, SEA are usually triggered when memory errors
>> are consumed. According to the existing process, errors occurred in the
>> kernel, leading to direct panic, if it occurred in user-space, we should
>> just kill process.
>>
>> But there is a class of error, in fact, is not necessary to kill
>> process, you can recover and continue to run the process. Such as
>> the instruction data corrupted, where the memory page might be
>> read-only, which has not been modified, the disk might have the
>> correct data, so you can directly drop the page, and reload it when
>> necessary.
>
> With firmware-first support, we do all this...
>
>
>> So this patchset is just try to solve such problem: if the error is
>> consumed in user-space and the error occurs on a clean page, you can
>> directly drop the memory page without killing process.
>>
>> If the corrupted page is clean, just dropped it and return to user-space
>> without side effects. And if corrupted page is dirty, memory_failure()
>> will send SIGBUS with code=BUS_MCEERR_AR. While without this patchset,
>> do_sea() will just send SIGBUS, so the process was killed in the same place.
>
> ... but this happens too. I agree it's something we should fix, but I don't think
> this is the best way to do it.
>
> This series is pulling the memory-failure-queue details back into the arch-code
> to build a second list, that gets processed as extra work when we return to
> user-space.
>
>
> The root of the issue is ghes_notify_sea() claims the notification as something
> APEI has dealt with, ... but it hasn't done it yet. The signals will be
> generated by something currently stuck in a queue. (Evidently x86 doesn't handle
> synchronous errors like this using firmware-first).
>
> I think a smaller fix is to give the queues that may be holding the
> memory_failure() work a kick as part of the code that calls ghes_notify_sea().
> This means that by the time we return to do_sea() ghes_notify_sea()'s claim that
> APEI has dealt with it is true as any generated signals are pending. We can then
> skip the existing SIGBUS generation code.
>
>
>> Because memory_failure() may sleep, we can not call it directly in SEA
>
> (this one is more serious, I've attempted to fix it by moving all NMI-like
> GHES-notifications to use the estatus queue).
>
>
>> exception context. So we saved faulting physical address associated with
>> a process in the ghes handler and set __TIF_SEA_NOTIFY. When we return
>> from SEA exception context and get into do_notify_resume() before the
>> process running, we could check it and call memory_failure() to do
>> recovery.
>
>> It's safe, because we are in process context.
>
> I think this is the trick. When we take a Synchronous-external-abort out of
> userspace, we're in process context too. We can add helpers to drain the
> memory_failure_queue which can be called from do_sea() when we know we're
> preemptible and interrupts-et-al are unmasked.

Something like...
based on [0], in arch/arm64/kernel/acpi.c:
-----------------%<-----------------
int apei_claim_sea(struct pt_regs *regs)
{
	int cpu;
	int err = -ENOENT;
	unsigned long current_flags = arch_local_save_flags();
	unsigned long interrupted_flags = current_flags;

	if (!IS_ENABLED(CONFIG_ACPI_APEI_SEA))
		return err;

	if (regs)
		interrupted_flags = regs->pstate;

	/*
	 * APEI expects an NMI-like notification to always be called
	 * in NMI context.
	 */
	local_daif_restore(DAIF_ERRCTX);
	nmi_enter();
	err = ghes_notify_sea();
	cpu = smp_processor_id();
	nmi_exit();

	/*
	 * APEI NMI-like notifications are deferred to irq_work. Unless
	 * we interrupted irqs-masked code, we can do that now.
	 */
	if (!err) {
		if (!arch_irqs_disabled_flags(interrupted_flags)) {
			local_daif_restore(DAIF_PROCCTX_NOIRQ);
			irq_work_run();
		} else {
			err = -EINPROGRESS;
		}
	}

	local_daif_restore(current_flags);

	if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE) && !err) {
		/*
		 * Memory failure work is scheduled on the local CPU.
		 * If we interrupted userspace, or are in process context
		 * we can do that now.
		 */
		if ((regs && !user_mode(regs)) || !preemptible())
			err = -EINPROGRESS;
		else
			memory_failure_queue_kick(cpu);
	}

	return err;
}
-----------------%<-----------------

and to mm/memory-failure.c:
-----------------%<-----------------
@@ -1355,7 +1355,7 @@ static void memory_failure_work_func(struct work_struct *work)
 	unsigned long proc_flags;
 	int gotten;
 
-	mf_cpu = this_cpu_ptr(&memory_failure_cpu);
+	mf_cpu = container_of(work, struct memory_failure_cpu, work);
 	for (;;) {
 		spin_lock_irqsave(&mf_cpu->lock, proc_flags);
 		gotten = kfifo_get(&mf_cpu->fifo, &entry);
@@ -1369,6 +1369,22 @@ static void memory_failure_work_func(struct work_struct *work)
 	}
 }
 
+/*
+ * Process memory_failure work queued on the specified CPU.
+ * Used to avoid return-to-userspace racing with the memory_failure workqueue.
+ */
+void memory_failure_queue_kick(int cpu)
+{
+	unsigned long flags;
+	struct memory_failure_cpu *mf_cpu;
+
+	might_sleep();
+
+	mf_cpu = &per_cpu(memory_failure_cpu, cpu);
+	cancel_work_sync(&mf_cpu->work);
+	memory_failure_work_func(&mf_cpu->work);
+}
+
 static int __init memory_failure_init(void)
 {
 	struct memory_failure_cpu *mf_cpu;
-----------------%<-----------------

I've cooked up some NOTIFY_SEA-ing APEI firmware using kvmtool to test this. I
haven't yet managed to hit irq-masked code with NOTIFY_SEA. I'll try and tidy
this up and post a branch to make it easier to test...

I prefer this as it doesn't duplicate the state then come back on a TIF flag.
I'd like to move the kicking logic into ghes.c, as that is where the queueing
happened, but the 'do-this, restore these flags, do-that' is somewhat tasteless,
and it looks like only arm64 has synchronous nmi-like notifications that must be
handled before returning to user-space...


Thanks,

James

[0] https://www.spinics.net/lists/linux-acpi/msg80149.html
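
A note on the first mm/memory-failure.c hunk, for anyone wondering why
this_cpu_ptr() becomes container_of(): as far as I can tell, once
memory_failure_queue_kick(cpu) calls the work function directly, the caller may
no longer be running on the CPU that queued the work, so the function has to
find its queue from the work pointer instead of from "whichever CPU I am on".
A minimal userspace sketch of that lookup follows. The structs are simplified
stand-ins, and the 'cpu' field, NR_CPUS value, init_queues() and
owner_of_work() are hypothetical names for illustration only, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(); no type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-in for the kernel's struct work_struct. */
struct work_struct {
	int pending;
};

/* Simplified stand-in; the 'cpu' field exists only for this illustration. */
struct memory_failure_cpu {
	int cpu;
	struct work_struct work;
};

#define NR_CPUS 4	/* hypothetical CPU count for the sketch */

struct memory_failure_cpu mf_cpus[NR_CPUS];

/* Label each per-CPU queue with its index. */
void init_queues(void)
{
	int i;

	for (i = 0; i < NR_CPUS; i++)
		mf_cpus[i].cpu = i;
}

/*
 * Recover the enclosing per-CPU structure from the embedded work item,
 * independent of which CPU the caller happens to be running on.
 */
int owner_of_work(struct work_struct *work)
{
	struct memory_failure_cpu *mf_cpu;

	assert(work != NULL);
	mf_cpu = container_of(work, struct memory_failure_cpu, work);
	return mf_cpu->cpu;
}
```

Passing &mf_cpus[2].work to owner_of_work() yields queue 2 regardless of the
caller, which is the property the kick helper appears to rely on.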