LinuxLists.cc - [RFC] How drivers notice a HW error?

2003-11-27 08:30:22

Subject: [RFC] How drivers notice a HW error?

Hi all,

This is a request for comments, especially comments from driver developers.

On some platforms, for example IA64, the chipset detects an error caused by
a driver's operation, such as an I/O read, and reports it to the kernel. The
Linux kernel analyzes the error and decides to kill the driver or, at worst,
reboot. I want to convey the error information to the offending driver and
enable the driver to recover from the failed operation.

So, just as a plan, I am thinking about a readb_check function that can
return an error value if an error occurs on the read. Drivers could use
readb_check instead of the usual readb, and could determine from its return
value whether a retry is required.

To realize this, I am considering the following two designs:

+ readb_check on driver (with Notifier)
[Outline]:
- The hardware error handler (for example, the MCA handler on IA64) has a
  Notifier as a hook point.
- A driver may register a hook function with the Notifier.
- The Notifier calls the registered functions when an error occurs.
- A called hook function checks the address of the error, and if the error
  seems to be related to its own driver, it raises an internal error flag
  and stops the Notifier by returning OK.
- The hardware error handler checks the state of the Notifier and decides
  whether the system should resume or not.
- The resumed driver may check the error flag after a read, and may retry
  the read if the flag is set.
[Issue]:
- Some interfaces, such as one for registering hooks, would be required.
- Writing a hook function would be a burden for developers.
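
The Notifier outline above can be sketched in user-space C as follows. This is
only an illustration under stated assumptions: every name here (hw_err_hook,
hw_err_register, hw_err_notify, demo_driver, the HW_ERR_* codes) is
hypothetical and is not an existing kernel interface, and the hardware error
handler is simulated by calling hw_err_notify() directly with an address.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook return codes; not taken from any real kernel API. */
enum hw_err_ret { HW_ERR_NOT_MINE = 0, HW_ERR_OK = 1 };

/* One entry in the hypothetical Notifier chain. */
struct hw_err_hook {
    enum hw_err_ret (*fn)(struct hw_err_hook *self, uintptr_t err_addr);
    struct hw_err_hook *next;
};

static struct hw_err_hook *hw_err_chain;

/* A driver registers its hook function with the Notifier. */
static void hw_err_register(struct hw_err_hook *hook)
{
    hook->next = hw_err_chain;
    hw_err_chain = hook;
}

/*
 * Called by the (simulated) hardware error handler. Walks the chain until
 * a hook claims the error. Returns true if the system may resume.
 */
static bool hw_err_notify(uintptr_t err_addr)
{
    for (struct hw_err_hook *h = hw_err_chain; h != NULL; h = h->next) {
        if (h->fn(h, err_addr) == HW_ERR_OK)
            return true;
    }
    return false; /* nobody claimed the error */
}

/* Example driver state. The hook is the first member so that the hook
 * pointer can be cast back to the owning driver. */
struct demo_driver {
    struct hw_err_hook hook;
    uintptr_t mmio_start;
    size_t mmio_len;
    volatile bool error_flag;
};

/* Hook: claim the error only if the address is inside this driver's range. */
static enum hw_err_ret demo_hook(struct hw_err_hook *self, uintptr_t err_addr)
{
    struct demo_driver *drv = (struct demo_driver *)self;

    if (err_addr < drv->mmio_start ||
        err_addr - drv->mmio_start >= drv->mmio_len)
        return HW_ERR_NOT_MINE;

    drv->error_flag = true; /* the driver checks this after its read */
    return HW_ERR_OK;
}
```

After a read, the driver would test and clear error_flag and retry if it was
set. The sketch ignores locking and the question of what a hook may safely do
in MCA context, both of which a real implementation would have to settle.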

+ readb_check on kernel
[Outline]:
- The kernel provides a readb_check function.
- Drivers may use readb_check instead of the usual readb.
- The hardware error handler checks the address of the error, and if it
  occurred in readb_check, changes the return value of readb_check and
  resumes the interrupted context.
- The driver may check the return value to notice an error in the last read
  procedure.
[Issue]:
- Some overhead would be involved. (Possibly negligible, since I/O reads
  are already horribly slow.)
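
From the driver's side, the second design could look roughly like the
user-space sketch below. The signature of readb_check (an out parameter for
the value and a negative errno on failure) is an assumption, since the
proposal only says that the return value reports the error. The
sim_register and sim_faults_pending variables are stand-ins for real
hardware and for the error handler patching the return value, and
demo_read_retry is a hypothetical helper.

```c
#include <errno.h>
#include <stdint.h>

/* Simulated device register and fault injection (stand-ins for hardware). */
static uint8_t sim_register = 0x5a;
static int sim_faults_pending; /* number of upcoming reads that will fail */

/*
 * Hypothetical readb_check(): stores the byte in *val and returns 0, or
 * returns -EIO when the (simulated) error handler flagged this read.
 */
static int readb_check(const volatile uint8_t *addr, uint8_t *val)
{
    if (sim_faults_pending > 0) {
        sim_faults_pending--;
        return -EIO;
    }
    *val = *addr;
    return 0;
}

/* Driver-side helper: retry a failed read a bounded number of times. */
static int demo_read_retry(const volatile uint8_t *addr, uint8_t *val,
                           int max_tries)
{
    int err = -EIO;

    for (int i = 0; i < max_tries; i++) {
        err = readb_check(addr, val);
        if (err == 0)
            break;
    }
    return err;
}
```

A bounded retry count keeps a persistently failing device from hanging the
driver; what to do once the retries are exhausted is left to the driver.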

IMO, this is a general-purpose function that should be available on many
platforms. I also hear that Solaris has a similar implementation.

If you have any comments about this feature, or a different idea,
please tell me.

Best regards,

------

H.Seto <[email protected]>

2003-11-27 11:38:09

by Andi Kleen

Subject: Re: [RFC] How drivers notice a HW error?

Hidetoshi Seto <[email protected]> writes:

> On some platforms, for example IA64, the chipset detects an error caused by
> a driver's operation, such as an I/O read, and reports it to the kernel. The
> Linux kernel analyzes the error and decides to kill the driver or, at worst,
> reboot. I want to convey the error information to the offending driver and
> enable the driver to recover from the failed operation.
>
> So, just as a plan, I am thinking about a readb_check function that can
> return an error value if an error occurs on the read. Drivers could use
> readb_check instead of the usual readb, and could determine from its return
> value whether a retry is required.

I don't think that's a good portable API. On many architectures it is hard to
associate an MCE with a specific instruction because the MCE
happens asynchronously. All the MCE handler gets is an address. Also
adding error checks to every read* would make the driver source quite
unreadable.

Also I think most drivers would not attempt to specially handle every
access but just implement a generic handler that shuts down the device
(otherwise it would be a testing nightmare).

So better would be:

Add a callback to the pci_dev/device. When an error occurs in an mmio
area associated with a driver, call that callback.

Add another function to register other memory areas (in case a driver
does mmio not visible in PCI config) for error handling.
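
As a rough user-space illustration of these two pieces, the sketch below
keeps a per-device error callback plus a small table of registered memory
areas, and dispatches on the faulting address alone. All names (demo_device,
demo_register_error_region, demo_dispatch_error, demo_shutdown_cb) and the
fixed table size are hypothetical; nothing here is an existing pci_dev
field or kernel function.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_REGIONS 4 /* arbitrary limit for the sketch */

struct demo_device;
typedef void (*demo_err_cb)(struct demo_device *dev, uintptr_t err_addr);

struct demo_region {
    uintptr_t start;
    size_t len;
};

/* Hypothetical device: one error callback and the areas it covers. */
struct demo_device {
    demo_err_cb error_cb;
    struct demo_region regions[DEMO_MAX_REGIONS];
    size_t nr_regions;
    int shut_down;
};

/* Register an additional memory area for error handling. */
static int demo_register_error_region(struct demo_device *dev,
                                      uintptr_t start, size_t len)
{
    if (dev->nr_regions >= DEMO_MAX_REGIONS)
        return -1;
    dev->regions[dev->nr_regions].start = start;
    dev->regions[dev->nr_regions].len = len;
    dev->nr_regions++;
    return 0;
}

/*
 * Called from the (simulated) error handler, which only knows an address.
 * Returns 1 if some device owns the address, 0 otherwise.
 */
static int demo_dispatch_error(struct demo_device **devs, size_t ndevs,
                               uintptr_t err_addr)
{
    for (size_t i = 0; i < ndevs; i++) {
        struct demo_device *dev = devs[i];

        for (size_t r = 0; r < dev->nr_regions; r++) {
            const struct demo_region *reg = &dev->regions[r];

            if (err_addr >= reg->start && err_addr - reg->start < reg->len) {
                if (dev->error_cb)
                    dev->error_cb(dev, err_addr);
                return 1;
            }
        }
    }
    return 0;
}

/* Generic handler of the kind described above: just shut the device down. */
static void demo_shutdown_cb(struct demo_device *dev, uintptr_t err_addr)
{
    (void)err_addr;
    dev->shut_down = 1;
}
```

Because the lookup needs only the address, it does not depend on tying the
error to a particular instruction, and the normal read* calls in the driver
stay unchanged.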

-Andi