From: Linus Torvalds
Date: Mon, 20 Apr 2020 14:16:24 -0700
Subject: Re: [PATCH] x86/memcpy: Introduce memcpy_mcsafe_fast
To: "Luck, Tony"
Cc: "Williams, Dan J", Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
    X86 ML, stable, Borislav Petkov, "H. Peter Anvin", Peter Zijlstra,
    "Tsaur, Erwin", Linux Kernel Mailing List, linux-nvdimm
References: <67FF611B-D10E-4BAF-92EE-684C83C9107E@amacapital.net>
    <3908561D78D1C84285E8C5FCA982C28F7F5FB29E@ORSMSX115.amr.corp.intel.com>
In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F7F5FB29E@ORSMSX115.amr.corp.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Apr 20, 2020 at 1:57 PM Luck, Tony wrote:
>
> > (a) is a trap, not an exception - so the instruction has been done,
> > and you don't need to try to emulate it or anything to continue.
>
> Maybe for errors on the data side of the pipeline.
> On the instruction side we can usually recover from user space
> instruction fetches by just throwing away the page with the corrupted
> instructions and reading from disk into a new page. Then just point
> the page table to the new page, and hey presto, it's all transparently
> fixed (modulo time lost fixing things).

That's true for things like ECC on real RAM, with traditional
executables.

It's not so true of something like nvram that you execute out of
directly. There is not necessarily a disk to re-read things from.

But it's also not true of things like JITs. They are kind of a big
thing.

Asking the JIT to do "hey, I faulted at a random point, you need to
re-JIT" is no different from all the other "that's a _really_ painful
recovery point, please delay it".

Sure, the JIT environment will probably just have to kill that thread
anyway, but I do think this falls under the same "you're better off
giving the _option_ to just continue and hope for the best" rule,
rather than forcing a non-recoverable state.

For regular ECC, I literally would like the machine to just always
continue. I'd like to be informed that there's something bad going on
(because it might be RAM going bad, but it might also be a rowhammer
attack), but the decision to kill things or not should ultimately be
the *user's*, not the JIT's, not the kernel's.

So the basic rule should be that you should always have the _option_
to just continue. The corrupted state might not be critical - or it
might be the ECC bits themselves, not the data.

There are situations where stopping everything is worse than "let's
continue as best we can, and inform the user with a big red blinking
light".

ECC should not make things less reliable, even if it's another 10+% of
bits that can go wrong.

It should also be noted that even a good ECC pattern _can_ miss
corruption if you're unlucky with the corruption.
So the whole black-and-white model of "ECC means you need to stop
everything" is questionable to begin with, because the signal isn't
that absolute in the first place.

So when somebody brings up "what if I use corrupted data and make
things worse", they are making an intellectually dishonest argument.
What if you saw corrupted data and simply never caught it, because it
was an unlucky multi-bit failure?

There is no "absolute" thing about ECC. The only thing that is _never_
wrong is to report it and try to continue, and let some higher-level
entity decide what to do.

And that final decision might literally be "I ran this simulation for
2 days, I see that there's an error report, I will buy a new machine.
For now I'll use the data it generated, but I'll re-run to validate it
later".

                Linus
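
The point above about ECC missing an unlucky multi-bit failure can be
illustrated with a small sketch. It assumes a toy (8,4) extended
Hamming SECDED code (single-error-correct, double-error-detect), which
is far narrower than what a real memory controller uses. The names
ecc_encode, ecc_decode and enum ecc_status are hypothetical, invented
for this illustration; they are not kernel or hardware interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result codes for this toy decoder. */
enum ecc_status { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE };

/* Parity (XOR of all bits) of an 8-bit value. */
static unsigned parity8(uint8_t v)
{
	v ^= v >> 4;
	v ^= v >> 2;
	v ^= v >> 1;
	return v & 1u;
}

/* XOR of the Hamming positions (1..7) of all set bits in the codeword. */
static unsigned syndrome(uint8_t cw)
{
	unsigned syn = 0;

	for (unsigned pos = 1; pos < 8; pos++)
		if (cw & (1u << pos))
			syn ^= pos;
	return syn;
}

/*
 * Encode 4 data bits. Codeword bit i is Hamming position i:
 * data sits at positions 3, 5, 6, 7, check bits at 1, 2, 4,
 * and bit 0 is an overall parity bit.
 */
static uint8_t ecc_encode(uint8_t data)
{
	uint8_t cw = 0;
	unsigned syn;

	assert(data < 16);
	if (data & 1) cw |= 1u << 3;
	if (data & 2) cw |= 1u << 5;
	if (data & 4) cw |= 1u << 6;
	if (data & 8) cw |= 1u << 7;

	/* Set check bits so that the syndrome of the codeword is zero. */
	syn = syndrome(cw);
	cw |= (uint8_t)((syn & 1u) << 1);
	cw |= (uint8_t)(((syn >> 1) & 1u) << 2);
	cw |= (uint8_t)(((syn >> 2) & 1u) << 4);

	/* Bit 0 makes the overall parity even. */
	cw |= (uint8_t)parity8(cw);
	return cw;
}

/* Decode a codeword; *data is left untouched on ECC_UNCORRECTABLE. */
static enum ecc_status ecc_decode(uint8_t cw, uint8_t *data)
{
	unsigned syn = syndrome(cw);
	unsigned odd = parity8(cw);
	enum ecc_status st = ECC_OK;

	if (syn && !odd)
		return ECC_UNCORRECTABLE;	/* looks like a double-bit error */
	if (odd) {
		/* Looks like a single-bit error: flip the indicated bit. */
		cw ^= (uint8_t)(1u << syn);	/* syn == 0 means bit 0 itself */
		st = ECC_CORRECTED;
	}

	*data = (uint8_t)(((cw >> 3) & 1u) | (((cw >> 5) & 1u) << 1) |
			  (((cw >> 6) & 1u) << 2) | (((cw >> 7) & 1u) << 3));
	return st;
}
```

With this toy code, one flipped bit is corrected and two flipped bits
are reported as uncorrectable. Three flipped bits look like a single-bit
error, so the decoder reports ECC_CORRECTED while handing back the wrong
data. Four flipped bits that happen to form a valid codeword, for
example the mask ecc_encode(0x1), turn 0xA into 0xB and are reported as
ECC_OK. So the error signal is not absolute in either direction.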