2006-05-03 20:23:57

by Tim Small

Subject: Re: Problems with EDAC coexisting with BIOS

Alan Cox wrote:

>On Mon, 2006-04-24 at 22:15 +0800, Ong, Soo Keong wrote:
>
>
>>To me, periodical is not a good design for error handling, it wastes
>>transaction bandwidth that should be used for other more productive
>>purposes.
>>
>>
>
>The periodical choice is mostly down to the brain damaged choice of NMI
>as the viable alternative, which is as good as 'not usable'
>
>
Hi,

As I believe that the majority of the bluesmoke/EDAC developers are
(were) operating under the assumption that it would be possible to do
something with NMI-signalled errors, I was wondering what the problems
with using NMI-signalled ECC errors were?

Are there some system states in which an incoming NMI throws a spanner
into the works in an unrecoverable way? If this is the case, is it so
on all x86/x86-64 systems, or just a subset, and is there no way to
implement some sort of top half / bottom half style NMI handler
cleanly? As I am certainly not an x86 architecture expert, I would
appreciate any input from the resident gurus ;o).

Quickly returning to the original problem - I know this isn't a proper
API by any stretch of the imagination, and would require changes to
existing BIOSs, but the EDAC module could reprogram the chipset
error-signalling registers, so that an ECC error no longer triggers an
SMI. The BIOS SMI handler could then read the signalling registers, and
leave the ECC registers well alone if ECC errors are not set to generate
an SMI.
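
To make the proposed handshake concrete, here is a minimal user-space C
sketch. Everything in it is an assumption for illustration: the
`fake_chipset` structure, the `ERRCMD_ECC_SMI` / `ERRSTS_ECC_ERR` bit
names and both functions are hypothetical, and no real chipset or BIOS
is claimed to work this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register layout, for illustration only. */
#define ERRCMD_ECC_SMI  (1u << 0)  /* route ECC errors to SMI */
#define ERRSTS_ECC_ERR  (1u << 0)  /* an ECC error has been latched */

struct fake_chipset {
    uint32_t errcmd;  /* error-signalling (routing) register */
    uint32_t errsts;  /* latched ECC error status */
};

/* OS side: the EDAC module claims ECC errors by switching off SMI routing. */
static void edac_claim_ecc(struct fake_chipset *cs)
{
    assert(cs != NULL);
    cs->errcmd &= ~ERRCMD_ECC_SMI;
}

/*
 * BIOS side: called on every SMI, whatever its source. Returns true if the
 * handler consumed (cleared) the ECC status, false if it left it alone.
 */
static bool bios_smi_handler(struct fake_chipset *cs)
{
    assert(cs != NULL);
    if (!(cs->errcmd & ERRCMD_ECC_SMI))
        return false;               /* OS owns ECC reporting: hands off */
    if (!(cs->errsts & ERRSTS_ECC_ERR))
        return false;               /* nothing latched */
    cs->errsts &= ~ERRSTS_ECC_ERR;  /* BIOS would log the error, then clear */
    return true;
}
```

Once `edac_claim_ecc()` has run, `bios_smi_handler()` returns false and
leaves `errsts` untouched, so a latched error is still there for the OS
to read.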

Cheers,

Tim.


2006-05-03 20:37:47

by Tim Hockin

Subject: Re: Problems with EDAC coexisting with BIOS

On Wed, May 03, 2006 at 09:25:01PM +0100, Tim Small wrote:
> existing BIOSs, but the EDAC module could reprogram the chipset
> error-signalling registers, so that an ECC error no longer triggers an

This is key, I think.

> SMI. The BIOS SMI handler could then read the signalling registers, and
> leave the ECC registers well alone if ECC errors are not set to generate
> an SMI.

The fundamental problem with SMI is that we CAN'T know what it is doing.
I've seen systems which trigger SMI from a GPIO toggled by a clock. I've
seen systems trigger SMI from a chipset-internal periodic timer. I've
seen chipsets route NMI->SMI or even MCE->SMI. If the BIOS is polling the
error status registers from a periodic SMI, we're GOING to lose data.

The big hammer - turn off SMI - is probably OK on some systems, but is not
a general solution. More and more hardware workarounds and features are
SMI based. There are some rather interesting things that can be done in
SMM, *iff* we could get the BIOS out of the way.

Tim (watching EDAC from time to time, quietly)

2006-05-03 21:33:25

by Alan

Subject: Re: Problems with EDAC coexisting with BIOS

On Wed, 2006-05-03 at 21:25 +0100, Tim Small wrote:
> something with NMI-signalled errors, I was wondering what the problems
> with using NMI-signalled ECC errors were?

The big problem with NMI is that it can occur *during* a PCI
configuration sequence (ie during pci_config_* functions). That means we
can't safely do some I/O, especially configuration space I/O in an NMI
handler. At best we could set a flag and catch it afterwards.
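
A minimal sketch of the "set a flag and catch it afterwards" pattern,
written as user-space C11 rather than kernel code. All names here
(`fake_nmi_handler`, `read_ecc_registers`, `ecc_deferred_check`) are
hypothetical stand-ins, and the `in_nmi` variable only models "we are
inside the NMI handler" for the sake of the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool ecc_nmi_pending;  /* set in "NMI", consumed later */
static bool in_nmi;                  /* models being inside the NMI handler */
static int ecc_events_handled;       /* counts deferred register reads */

/* Hypothetical placeholder for the config-space reads of the ECC registers. */
static void read_ecc_registers(void)
{
    assert(!in_nmi);  /* config-space I/O is not safe from NMI context */
    ecc_events_handled++;
}

/* Stand-in for the NMI handler: touches no hardware, only records the event. */
static void fake_nmi_handler(void)
{
    in_nmi = true;
    atomic_store(&ecc_nmi_pending, true);
    in_nmi = false;
}

/*
 * Runs later, from ordinary context. Returns true if an NMI had been
 * recorded and the registers were read, false if there was nothing to do.
 */
static bool ecc_deferred_check(void)
{
    if (!atomic_exchange(&ecc_nmi_pending, false))
        return false;
    read_ecc_registers();
    return true;
}
```

With a single flag, several NMIs that arrive before the deferred check
collapse into one register read, so the flag says "something happened"
rather than how many times.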

2006-05-04 09:01:24

by Tim Small

Subject: Re: Problems with EDAC coexisting with BIOS

Alan Cox wrote:

>On Wed, 2006-05-03 at 21:25 +0100, Tim Small wrote:
>
>
>>something with NMI-signalled errors, I was wondering what the problems
>>with using NMI-signalled ECC errors were?
>>
>>
>
>The big problem with NMI is that it can occur *during* a PCI
>configuration sequence (ie during pci_config_* functions). That means we
>can't safely do some I/O, especially configuration space I/O in an NMI
>handler. At best we could set a flag and catch it afterwards.
>
>
I was assuming this was the case - but I don't think that deferring the
work until after the NMI handler has returned is necessarily a big
disadvantage - at least as far as ECC register-status checking is
concerned - since none of the hardware that I've looked at makes any
sort of guarantee about the timeliness of ECC-error-triggered NMI
delivery anyway - so any of the really smart (and urgent) stuff that you
could potentially do as part of the ECC error handling (e.g. terminating
a process if one of its physical pages was mangled) is not possible to
do in a reliable manner anyway.

About the best thing it is possible to do is to try and arrange to take
the page(s) in which an uncorrectable error occurred out of further use
(maybe do the same for correctable errors, if the same physical page
sees repeated correctable errors), plus maybe give the option of
panicking if an uncorrectable page was in use by the kernel?
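
As a rough sketch of that policy in plain C (not kernel code): the
types, the `ecc_decide()` function and the `CE_RETIRE_THRESHOLD` value
of 3 are all assumptions made up for illustration, not anything EDAC
provides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ecc_kind   { ECC_CORRECTABLE, ECC_UNCORRECTABLE };
enum ecc_action { ACT_NONE, ACT_RETIRE_PAGE, ACT_PANIC };

/* Hypothetical: retire a page after this many correctable errors. */
#define CE_RETIRE_THRESHOLD 3

struct page_ecc_state {
    unsigned int ce_count;  /* correctable errors seen on this page */
    bool kernel_owned;      /* page was in use by the kernel */
};

/* Decide what to do about one reported error on one physical page. */
static enum ecc_action ecc_decide(struct page_ecc_state *pg,
                                  enum ecc_kind kind,
                                  bool panic_on_kernel_ue)
{
    assert(pg != NULL);
    if (kind == ECC_UNCORRECTABLE) {
        /* Optional panic when the damaged page belonged to the kernel. */
        if (pg->kernel_owned && panic_on_kernel_ue)
            return ACT_PANIC;
        return ACT_RETIRE_PAGE;
    }
    /* Correctable: only act once the same page keeps coming back. */
    pg->ce_count++;
    if (pg->ce_count >= CE_RETIRE_THRESHOLD)
        return ACT_RETIRE_PAGE;
    return ACT_NONE;
}
```

The panic is an option passed in by the caller, matching the idea that
it should be a choice rather than the default.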

My first thought was to schedule a tasklet as part of the ECC-specific
NMI handling, or are there any gotchas with doing this from within an
NMI handler?

Cheers,

Tim.

2006-05-04 09:44:35

by Tim Small

Subject: Re: Problems with EDAC coexisting with BIOS

Tim Hockin wrote:

>On Wed, May 03, 2006 at 09:25:01PM +0100, Tim Small wrote:
>
>
>>existing BIOSs, but the EDAC module could reprogram the chipset
>>error-signalling registers, so that an ECC error no longer triggers an
>>
>>
>
>This is key, I think.
>
>
>
>>SMI. The BIOS SMI handler could then read the signalling registers, and
>>leave the ECC registers well alone if ECC errors are not set to generate
>>an SMI.
>>
>>
>
>The fundamental problem with SMI is that we CAN'T know what it is doing.
>I've seen systems which trigger SMI from a GPIO toggled by a clock. I've
>seen systems trigger SMI from a chipset-internal periodic timer. I've
>seen chipsets route NMI->SMI or even MCE->SMI. If the BIOS is polling the
>error status registers from a periodic SMI, we're GOING to lose data.
>
>The big hammer - turn off SMI - is probably OK on some systems, but is not
>a general solution. More and more hardware workarounds and features are
>SMI based. There are some rather interesting things that can be done in
>SMM, *iff* we could get the BIOS out of the way.
>
>
Agreed - I have had experience of a system (Intel 855GME chipset based,
AMI BIOS) which emulates the i8042 in the BIOS at SMI time. Mmm nice.
When the Linux i8042 driver polled the (pretend) i8042, the system
spent ages in the BIOS, and general interrupt latency on the system fell
apart... Oh what a mess.

A limited solution is probably to modify the existing EDAC drivers so
that they ensure SMI generation is disabled (for the specific errors
that the EDAC drivers are designed to handle). The OS is then at least
doing the right thing, even if the BIOS isn't... This should improve
upon the current behaviour on some systems, and shouldn't (as far as I
can see) break any others. The EDAC code also probably needs to be
toughened up (at least on some chipsets) so that it doesn't fall over
when the BIOS steps on its toes.
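
A small C sketch of what I mean, under loud assumptions: the `SIG_*`
bit names and both functions are hypothetical, the register is just a
variable, and real chipsets will lay this out differently. The point is
only the read-modify-write that touches the EDAC-handled bits and
nothing else, plus a re-check in case the BIOS turns them back on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit assignments in an error-signalling register. */
#define SIG_SMI_ON_CE     (1u << 0)  /* SMI on correctable ECC error */
#define SIG_SMI_ON_UE     (1u << 1)  /* SMI on uncorrectable ECC error */
#define SIG_SMI_ON_OTHER  (1u << 2)  /* not ours: must be preserved */
#define SIG_EDAC_OWNED    (SIG_SMI_ON_CE | SIG_SMI_ON_UE)

/* Clear only the SMI-enable bits for errors the EDAC driver handles. */
static uint32_t edac_mask_smi(uint32_t sig)
{
    return sig & ~SIG_EDAC_OWNED;
}

/*
 * Called on each poll. If something (e.g. the BIOS) has re-enabled SMI for
 * our errors, mask it again. Returns true if a fix-up was needed.
 */
static bool edac_recheck_smi(uint32_t *sig_reg)
{
    assert(sig_reg != NULL);
    if (!(*sig_reg & SIG_EDAC_OWNED))
        return false;
    *sig_reg = edac_mask_smi(*sig_reg);
    return true;
}
```

A true return from `edac_recheck_smi()` would be worth logging, since it
means some error data may already have been eaten by the BIOS.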

Tim.