Date: Mon, 19 Jul 2021 15:00:41 -0400
From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: Dave Hansen, Andy Lutomirski, kernel-team@fb.com, Peter Zijlstra, Ingo Molnar, Borislav Petkov, x86@kernel.org
Subject: [PATCH] x86,mm: print likely CPU at segfault time
Message-ID: <20210719150041.3c719c94@imladris.surriel.com>

From 14d31a44a5186c94399dc9518ba80adf64c99772 Mon Sep 17 00:00:00 2001
From: Rik van Riel
Date: Mon, 19 Jul 2021 14:49:17 -0400
Subject: [PATCH] x86,mm: print likely CPU at segfault time

In a large enough fleet of computers, it is common to have a few bad CPUs.
Those can often be identified by seeing that some commonly run kernel code
(that runs fine everywhere else) keeps crashing on the same CPU core on a
particular bad system.

One of the failure modes observed is that either the instruction pointer,
or some register used to specify the address of data that needs to be
fetched gets corrupted, resulting in something like a kernel page fault,
null pointer dereference, NX violation, or similar.

Those kernel failures are often preceded by similar looking userspace
failures. It would be useful to know if those are also happening on the
same CPU cores, to get a little more confirmation that it is indeed a
hardware issue.

Adding a printk to show_signal_msg() achieves that purpose. It isn't
perfect since the task might get rescheduled on another CPU between when
the fault hit and when the message is printed, but it should be good
enough to show correlation between userspace and kernel errors when
dealing with a bad CPU.

$ ./segfault
Segmentation fault (core dumped)
$ dmesg | grep segfault
segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0

Signed-off-by: Rik van Riel
---
 arch/x86/mm/fault.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b2eefdefc108..dd6c89c23a3a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -777,6 +777,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
 
 	print_vma_addr(KERN_CONT " in ", regs->ip);
 
+	printk(KERN_CONT " on CPU %d", raw_smp_processor_id());
+
 	printk(KERN_CONT "\n");
 
 	show_opcodes(regs, loglvl);
-- 
2.24.1
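
To correlate userspace segfaults with a particular core, the " on CPU <n>"
suffix has to be read back out of the log. The following is a minimal
userspace sketch of that step, not part of the patch. It assumes the line
ends with the suffix exactly as in the dmesg sample in the commit message;
the helper name segfault_line_cpu() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return the CPU number from a segfault log line
 * that ends in " on CPU <n>", or -1 if the line has no such suffix.
 * The suffix format is assumed from the sample dmesg output.
 */
int segfault_line_cpu(const char *line)
{
	static const char tag[] = " on CPU ";
	const char *hit = NULL;
	const char *p;
	char *end;
	long cpu;

	assert(line != NULL);

	/* Keep the last match, since the suffix is appended at the end. */
	for (p = strstr(line, tag); p != NULL; p = strstr(p + 1, tag))
		hit = p;
	if (hit == NULL)
		return -1;

	/* Step past the tag and require a digit (no sign, no padding). */
	hit += sizeof(tag) - 1;
	if (*hit < '0' || *hit > '9')
		return -1;

	errno = 0;
	cpu = strtol(hit, &end, 10);
	if (errno == ERANGE || cpu > INT_MAX)
		return -1;

	/* Only trailing whitespace may follow the number. */
	while (*end == ' ' || *end == '\n' || *end == '\r')
		end++;
	if (*end != '\0')
		return -1;

	return (int)cpu;
}
```

Running each matching dmesg line through such a helper and counting the
results per CPU gives a tally that can be compared against the cores seen
in kernel crashes. As the commit message notes, the reported CPU is only
the likely one, so a single outlier does not prove much.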