From: Tony Luck
To: Borislav Petkov
Cc: Yazen Ghannam, Smita.KoralahalliChannabasappa@amd.com,
    dave.hansen@linux.intel.com, x86@kernel.org,
    linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
    patches@lists.linux.dev, Tony Luck
Subject: [PATCH v7 0/3] Handle corrected machine check interrupt storms
Date: Tue, 18 Jul 2023 14:08:10 -0700
Message-Id: <20230718210813.291190-1-tony.luck@intel.com>
In-Reply-To: <20230616182744.17632-1-tony.luck@intel.com>

Linux CMCI storm mitigation is a big hammer that just disables the CMCI
interrupt globally and switches to polling all banks.

There are two problems with this:

1) It really is a big hammer. It means that errors reported in other
   banks from different functional units are all subject to the same
   polling delay before being processed.

2) Intel systems signal some uncorrected errors using CMCI (e.g.
   memory controller patrol scrub on Icelake Xeon and newer).
   Delaying the processing of these error reports negates some of the
   benefit of the patrol scrubber providing early notice of errors
   before they are consumed and cause a machine check.

This series throws away the old storm implementation and replaces it
with one that keeps track of the weather on each separate machine check
bank. When a storm is detected from a bank on Intel, the storm is
mitigated by setting a very high threshold for corrected errors to
signal CMCI. This threshold does not affect signaling CMCI for
uncorrected errors.

Changes since last version:

0) Rebased to v6.5-rc2

1) Yazen & Boris - dropped AMD patch pending integration of AMD machine
   check bank scanning with the core machine_check_poll() function.

2) Boris - rename track_cmci_storm() as track_storm() in prep for the
   day when AMD joins in - they don't call the interrupt "CMCI". This
   function is now "static" and local to core.c.

3) Boris - Define new "struct storm_bank" for all the storm tracking
   arrays.

4) Move the storm_poll_mode per-CPU tracker into the storm_desc
   structure.

5) Define STORM_END_POLL_THRESHOLD as "29" instead of "30" with comment
   that it is used as high end of a bitmask that counts from zero. Drop
   the " - 1" where it is used.

6) Don't use kernel-doc format comments in mce/internal.h.

Suggested change NOT taken:

> +	 * If this is the first bank on this CPU to enter storm mode
> +	 * start polling
> +	 */
> +	if (++storm->stormy_bank_count == 1)

	if (++storm->stormy_bank_count)

> +		mce_timer_kick(true);

As the comment above this code says, we only want to "kick" the timer
when the first bank on a core goes into storm mode. If another bank
also goes into storm mode while the first storm is active, there is no
need to "start polling": that is already happening for the first storm.
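To make the difference between the two conditions concrete, here is a
minimal user-space sketch. It is not the kernel code: struct storm_state,
kick_timer(), bank_storm_begin(), bank_storm_begin_suggested() and
bank_storm_end() are hypothetical names invented for illustration, and
kick_timer() merely counts calls where the real code would call
mce_timer_kick(true).

```c
#include <assert.h>

/* Hypothetical per-CPU state, for illustration only. */
struct storm_state {
	int stormy_bank_count;	/* banks on this CPU currently in storm mode */
	int timer_kicks;	/* number of times polling was (re)started */
};

/* Stand-in for mce_timer_kick(true): only counts invocations. */
static void kick_timer(struct storm_state *s)
{
	s->timer_kicks++;
}

/* Check kept in the series: kick only on the 0 -> 1 transition. */
static void bank_storm_begin(struct storm_state *s)
{
	if (++s->stormy_bank_count == 1)
		kick_timer(s);
}

/*
 * Suggested variant: any non-zero count is true, so every bank that
 * enters storm mode kicks the timer again.
 */
static void bank_storm_begin_suggested(struct storm_state *s)
{
	if (++s->stormy_bank_count)
		kick_timer(s);
}

/* A bank leaves storm mode; the count must never go negative. */
static void bank_storm_end(struct storm_state *s)
{
	assert(s->stormy_bank_count > 0);
	s->stormy_bank_count--;
}
```

With two banks entering storm mode back to back, the "== 1" form kicks
the timer once while the suggested form kicks it twice. Once all storms
have ended, the next storm is again a "first" one and kicks the timer.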

Tony Luck (3):
  x86/mce: Remove old CMCI storm mitigation code
  x86/mce: Add per-bank CMCI storm mitigation
  x86/mce: Handle Intel threshold interrupt storms

 arch/x86/kernel/cpu/mce/internal.h |  49 ++++-
 arch/x86/kernel/cpu/mce/core.c     | 131 +++++++++---
 arch/x86/kernel/cpu/mce/intel.c    | 333 +++++++++++++----------------
 3 files changed, 290 insertions(+), 223 deletions(-)

base-commit: fdf0eaf11452d72945af31804e2a1048ee1b574c
-- 
2.40.1