Date: Fri, 17 Apr 2009 15:09:44 +0200
From: Andi Kleen <andi@firstfloor.org>
To: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>, hpa@zytor.com,
       linux-kernel@vger.kernel.org, mingo@elte.hu, tglx@linutronix.de
Subject: Re: [PATCH] [20/28] x86: MCE: Switch x86 machine check handler to Monarch election.
Message-ID: <20090417130944.GL14687@one.firstfloor.org>
References: <20090407507.636692542@firstfloor.org> <20090407150803.1AF0C1D046E@basil.firstfloor.org> <49E866D3.1020003@jp.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <49E866D3.1020003@jp.fujitsu.com>
User-Agent: Mutt/1.4.2.1i
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2532
Lines: 80


Thanks for your review.

On Fri, Apr 17, 2009 at 08:24:03PM +0900, Hidetoshi Seto wrote:
> > +		goto out;
> > +	if ((s64)*t < SPINUNIT) {
> > +		/* CHECKME: Make panic default for 1 too? */
> > +		if (tolerant < 1)
> > +			mce_panic("Timeout synchronizing machine check over CPUs",
> > +				  NULL, NULL);
> 
> Assuming that if we came here from mce_start() and panic, then I suppose no mce
> log would be appeared on the console since no cpu have invoked mce_log(&m) yet.

Well it would be rather random if the CPU who detects the timeout
actually has something useful to report.
More likely the useful information is in some CPU's banks who doesn't
answer.

On the other hand the real log will come out to disk after reboot from
the CPU registers (especially together with the new mce panic=30 default)

> Is it expected behavior?

Kind of expected. We could probably fix it, adding a fallback path here,
but due to the reasons above I have my doubts it would improve things
in practice.

> > +	}
> > +	*t -= SPINUNIT;
> > +out:
> > +	touch_nmi_watchdog();
> > +	return 0;
> > +}
> (snip)
> > +/*
> > + * Start of Monarch synchronization. This waits until all CPUs have
> > + * entered the ecception handler and then determines if any of them
>                   ^^^^^^^^^
>                   exception

Thanks fixed. I did actually run the spell checker on the comments,
but that one must have slipped through somehow.

> 
> > +	while (atomic_read(&mce_callin) != cpus) {
> > +		if (mce_timed_out(&timeout)) {
> > +			atomic_set(&global_nwo, 0);
> > +			*order = -1;
> > +			return no_way_out;
> > +		}
> > +		ndelay(SPINUNIT);
> > +	}
> > +
> > +	/*
> > +	 * Cache the global no_way_out state.
> > +	 */
> > +	nwo = atomic_read(&global_nwo);
> > +
> > +	/*
> > +	 * Monarch starts executing now, the others wait.
> > +	 */
> > +	if (*order == 1) {
> > +		atomic_set(&global_nwo, 0);
> 
> Monarch should clear global_nwo after all Subjects have read it.

The subjects don't care about the global nwo state, it only matters to the 
Monarch who does the panic in mce_end(). The only exception would be timeout, 
but in this case all the decisions are local only anyways.
We ensure that all the subjects have written it first.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/