Date: Fri, 18 Jul 2014 23:31:57 +0200
From: Borislav Petkov
To: Tony Luck
Cc: Havard Skinnemoen, Linux Kernel, Ewout van Bekkum
Subject: Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
Message-ID: <20140718213157.GB29366@pd.tnic>
References: <20140710184416.GE5603@pd.tnic> <20140710191224.GF5603@pd.tnic> <20140711092454.GA17083@pd.tnic> <20140711195200.GA18246@pd.tnic> <20140717105025.GA22549@pd.tnic>

On Fri, Jul 18, 2014 at 02:23:04PM -0700, Tony Luck wrote:
> On Thu, Jul 17, 2014 at 3:50 AM, Borislav Petkov wrote:
> > Well, maybe it is about time we tracked shared banks.
>
> For cpus that support CMCI and the MCi_CTL2 registers we do track
> sharing. Only one cpu gets to be the "owner" of a bank that supports
> CMCI (the first one to find it and set bit 30 in the CTL2 register).
>
> The test_bit() at the top of the loop in machine_check_poll() makes
> sure only the owner of a bank actually looks at it.
>
>         for (i = 0; i < mca_cfg.banks; i++) {
>                 if (!mce_banks[i].ctl || !test_bit(i, *b))
>                         continue;
>
> If we don't have CMCI, then we don't have the CTL2 registers, and
> so have no way to find out which banks are shared.

Ah, so Havard's corrected explanation was this:

"I don't think we got the description right here. I think the real issue
here was machine check polls happening on multiple CPUs with shared
banks, all reporting the same MCEs.
This is very reproducible when booting with mce=no_cmci, since all CPUs
will handle all banks, and there's AFAICT no good way to identify shared
banks without enabling CMCI."

Remind me, why would one boot with mce=no_cmci at all, on a CMCI
machine?

> I'd be surprised if it was a problem in practice. If we have CMCI,
> then we limit the banks that we look at (and if we see a high rate
> of interrupts, then we turn off interrupts and poll).
>
> If we don't have CMCI, then we are polling at a pretty low rate
> (current code adjusts the rate higher if we are finding errors to
> log, but we don't let that rate rise forever ... cap is ~ 1HZ).

Right, it would be interesting to see how a huuge machine (4 sockets with
lotsa memory) behaves under a CMCI storm...

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
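
For illustration only, the duplicate-report scenario discussed above can be
modelled in a few lines of user-space C. This is a toy sketch, not kernel
code: struct cpu_model, claim_banks_cmci(), claim_banks_no_cmci() and
poll_round() are hypothetical names, NR_CPUS and NR_BANKS are made-up
sizes, and every bank is assumed to be shared by all CPUs. It also assumes
that, in one polling round, every CPU reads a bank before any of them
clears it, which is the racy case that leads to duplicates.

```c
#define NR_CPUS  4   /* hypothetical: CPUs that all share the same banks */
#define NR_BANKS 8   /* hypothetical number of MCA banks */

/* Toy stand-in for the per-CPU "banks owned" bitmap: bit i = owns bank i. */
struct cpu_model {
	unsigned int owned;
};

/* CMCI case: the first CPU to find an unclaimed bank becomes its only owner. */
void claim_banks_cmci(struct cpu_model cpus[], int ncpus)
{
	unsigned int all = (1u << NR_BANKS) - 1;
	unsigned int claimed = 0;

	for (int c = 0; c < ncpus; c++) {
		cpus[c].owned = all & ~claimed;	/* take whatever is still free */
		claimed |= cpus[c].owned;
	}
}

/* no_cmci case: no ownership information, so every CPU handles every bank. */
void claim_banks_no_cmci(struct cpu_model cpus[], int ncpus)
{
	for (int c = 0; c < ncpus; c++)
		cpus[c].owned = (1u << NR_BANKS) - 1;
}

/*
 * One polling round over all CPUs. Returns how many times a single error
 * pending in 'bank' is reported, under the assumption stated above that
 * no CPU clears the bank before the others have read it.
 */
int poll_round(const struct cpu_model cpus[], int ncpus, int bank)
{
	int reports = 0;

	for (int c = 0; c < ncpus; c++) {
		for (int i = 0; i < NR_BANKS; i++) {
			if (!(cpus[c].owned & (1u << i)))
				continue;	/* not the owner: skip this bank */
			if (i == bank)
				reports++;
		}
	}
	return reports;
}
```

In this model, claiming with claim_banks_cmci() makes poll_round() report
the error once, while claim_banks_no_cmci() makes it report once per CPU.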