From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Tony Luck, Borislav Petkov, Thomas Gleixner, Ashok Raj, Dan Williams, Qiuxu Zhuo, linux-edac
Subject: [PATCH 4.17 008/220] x86/mce: Fix incorrect "Machine check from unknown source" message
Date: Sun, 1 Jul 2018 18:20:32 +0200
Message-Id: <20180701160908.665706753@linuxfoundation.org>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20180701160908.272447118@linuxfoundation.org>
References: <20180701160908.272447118@linuxfoundation.org>
User-Agent: quilt/0.65
X-stable: review
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

4.17-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Tony Luck

commit 40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2 upstream.

Some injection testing resulted in the following console log:

  mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134
  mce: [Hardware Error]: RIP 10: {pmem_do_bvec+0x11d/0x330 [nd_pmem]}
  mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88
  mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043
  mce: [Hardware Error]: Run the above through 'mcelog --ascii'
  Kernel panic - not syncing: Machine check from unknown source

This confused everybody because the first line quite clearly shows
that we found a logged error in "Bank 1", while the last line says
"unknown source".

The problem is that the Linux code doesn't do the right thing
for a local machine check that results in a fatal error.

It turns out that we know very early in the handler whether the
machine check is fatal. The call to mce_no_way_out() has checked
all the banks for the CPU that took the local machine check. If
it says we must crash, we can do so right away with the right
messages.

We do scan all the banks again. This means that we might initially
not see a problem, but during the second scan find something fatal.
If this happens we print a slightly different message (so I can
see if it actually ever happens).

[ bp: Remove unneeded severity assignment. ]

Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Signed-off-by: Thomas Gleixner
Cc: Ashok Raj
Cc: Dan Williams
Cc: Qiuxu Zhuo
Cc: linux-edac
Cc: stable@vger.kernel.org # 4.2
Link: http://lkml.kernel.org/r/52e049a497e86fd0b71c529651def8871c804df0.1527283897.git.tony.luck@intel.com
Signed-off-by: Greg Kroah-Hartman

---
 arch/x86/kernel/cpu/mcheck/mce.c |   26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1205,13 +1205,18 @@ void do_machine_check(struct pt_regs *re
 		lmce = m.mcgstatus & MCG_STATUS_LMCES;
 
 	/*
+	 * Local machine check may already know that we have to panic.
+	 * Broadcast machine check begins rendezvous in mce_start()
 	 * Go through all banks in exclusion of the other CPUs. This way we
 	 * don't report duplicated events on shared banks because the first one
-	 * to see it will clear it. If this is a Local MCE, then no need to
-	 * perform rendezvous.
+	 * to see it will clear it.
 	 */
-	if (!lmce)
+	if (lmce) {
+		if (no_way_out)
+			mce_panic("Fatal local machine check", &m, msg);
+	} else {
 		order = mce_start(&no_way_out);
+	}
 
 	for (i = 0; i < cfg->banks; i++) {
 		__clear_bit(i, toclear);
@@ -1287,12 +1292,17 @@ void do_machine_check(struct pt_regs *re
 			no_way_out = worst >= MCE_PANIC_SEVERITY;
 	} else {
 		/*
-		 * Local MCE skipped calling mce_reign()
-		 * If we found a fatal error, we need to panic here.
+		 * If there was a fatal machine check we should have
+		 * already called mce_panic earlier in this function.
+		 * Since we re-read the banks, we might have found
+		 * something new. Check again to see if we found a
+		 * fatal error. We call "mce_severity()" again to
+		 * make sure we have the right "msg".
 		 */
-		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
-			mce_panic("Machine check from unknown source",
-				  NULL, NULL);
+		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
+			mce_severity(&m, cfg->tolerant, &msg, true);
+			mce_panic("Local fatal machine check!", &m, msg);
+		}
 	}
 
 	/*
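
For reviewers who want to reason about the two panic paths without booting a kernel, the local machine check flow after this patch can be modelled in userspace roughly as below. This is only a sketch under stated assumptions: struct mc_outcome, local_mce_flow() and SEV_PANIC are hypothetical names invented for illustration and do not exist in mce.c. The two summary strings and the "tolerant < 3" test are taken from the diff above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for MCE_PANIC_SEVERITY; not the kernel's value. */
enum { SEV_OK = 0, SEV_PANIC = 1 };

/* Hypothetical record of what the modelled handler decided to do. */
struct mc_outcome {
	bool panicked;
	const char *summary;	/* first argument the modelled mce_panic() got */
	const char *detail;	/* modelled "msg" argument, NULL if none */
};

/*
 * Model of the patched flow for a local machine check.
 *
 * early_fatal:  modelled no_way_out from the first bank scan
 * early_msg:    modelled msg produced by that first scan
 * rescan_worst: worst severity seen when the banks are read again
 * rescan_msg:   modelled msg from calling mce_severity() again
 * tolerant:     modelled mca_cfg.tolerant
 */
static struct mc_outcome local_mce_flow(bool early_fatal, const char *early_msg,
					int rescan_worst, const char *rescan_msg,
					int tolerant)
{
	struct mc_outcome out = { false, NULL, NULL };

	/* First hunk: the early scan already said we must crash. */
	if (early_fatal) {
		out.panicked = true;
		out.summary = "Fatal local machine check";
		out.detail = early_msg;
		return out;
	}

	/* Second hunk: the re-read of the banks found something new. */
	if (rescan_worst >= SEV_PANIC && tolerant < 3) {
		out.panicked = true;
		out.summary = "Local fatal machine check!";
		out.detail = rescan_msg;
	}
	return out;
}
```

The point of the model is that both paths now carry a bank-specific msg instead of the old NULL, NULL arguments, and the differing summary strings make it possible to tell from a console log which path fired.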