2005-03-17 18:34:09

by Nguyen, Tom L

[permalink] [raw]
Subject: RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

On Wednesday, March 16, 2005 7:52 PM Paul Mackerras wrote:
>> We need some PCI
>> based error flows to understand the details of the flow so we can
>> develop an interface compatible with both.
>
>Here is a basic outline of what happens with EEH (Enhanced Error
>Handling) on IBM PPC64 platforms. This applies to PCI, PCI-X and
>PCI-Express devices.

Is EEH a PCI-SIG specification? Is EEH specs available in public?

>We have a PCI-PCI bridge per slot. The bridge (and the PCI fabric
>generally) look for errors such as address parity errors,
>out-of-bounds DMA accesses by the device, or anything that would
>normally cause SERR to be set. If such an error occurs, the bridge
>immediately isolates the device, meaning that writes by the CPU to the
>device are discarded, reads by the CPU are returned with all 1s data,
>and DMA accesses by the device are blocked.

It seems that a PCI-PCI bridge per slot is hardware implementation
specific. The fact that the PCI-PCI Bridge can isolate the slot is
hardware feature specific.

>What happens at the driver level depends on whether the driver is
>EEH-aware or not. (This description is more what we would like to
>have rather than what is necessarily implemented at present).

PCI Express AER driver uses similar concept of determining whether the
driver is AER-aware or not except that PCI Express AER is independent
from firmware support.

>If the driver is not EEH-aware but is hot-plug capable, then the
>platform code will notice that reads from the device are returning all
>1s and query firmware about the state of the slot. Firmware will
>indicate that the slot has been isolated. Platform code can obtain
>more specific information about the error from firmware and log it.
>Then, platform code will generate a hot-unplug event for the slot.
>After the driver has cleaned up and notified higher levels that its
>device has gone away, platform code will call firmware to reset and
>unisolate the slot, and then generate a hotplug event to tell the
>driver that it can use the device - but as far as the driver is
>concerned, it is a new device.

Where does the platform code reside and where does it log the error?

In PCI Express if the driver is not AER-aware the fatal error message is
reported by its upstream switch, the AER driver obtains comprehensive
error information from the upstream switch (like EEH platform code
obtains error information from the firmware). Since the driver is not
AER-aware, the fatal error is reported to user to make a policy decision
since the PCI Express does not have a hot-plug event for the slot like
EEH platform.

So it looks like the hot-plug capability of the driver is being used in
lieu of specific callbacks to freeze and thaw IO in the case of a
non-aware driver. If the driver does not support hot-plug then the
error is just logged. Do you leave the slot isolated or perform error
recovery anyway?

On a fatal error the interface is down. No matter what the driver
supports (AER aware, EEH aware, unaware) all IO is likely to fail.
Resetting a bus in a point-to-point environment like PCI Express or EEH
(as you describe) should have little adverse effect. The risk is the
bus reset will cause a card reset and the driver must understand to
re-initialize the card. A link reset in PCI Express will not cause a
card reset. We assume the driver will reset its card if necessary.

>If the driver is EEH-aware, then we use the API that Ben has
>proposed. Platform code can either reset the slot (by calling
>firmware) or not, depending on what the driver asks for, and also
>depending on any other information the platform code has available to
>it, such as specific information about the error that has occurred.
>Platform code then unisolates the slot and then informs the driver
>that it can reinitialize the device and restart any transfers that
>were in progress.

In PCI Express the AER driver obtains fatal error information from the
upstream switch driver. We can use the same API with message =
PCIERR_ERROR_RECOVER to notify the endpoint driver, which is maybe
unaware of the fatal error reported by its upstream device. Mostly the
driver will respond with PCIERR_RESULT_NEED_RESET.

>Ben's API is aimed at supporting the code flows that we need for EEH
>as well as those needed for recovery from errors on PCI Express. Part
>of the reason for not just requiring the driver to do everything
>itself is that a slot isolation event can affect multiple drivers,
>because the card in the slot could have a PCI-PCI bridge with multiple
>devices behind it. Thus the recovery process potentially requires a
>degree of coordination between multiple drivers, and Ben's API
>addresses that. The same coordination could be required on PCI
>Express, if I understand correctly, because a fault on an upstream
>link could affect many devices downstream of that link.

Yes the same case applies to PCI Express upstream links. So halting IO
is desired when other devices are affected.

Thanks,
Long


2005-03-17 22:58:54

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)


> On a fatal error the interface is down. No matter what the driver
> supports (AER aware, EEH aware, unaware) all IO is likely to fail.
> Resetting a bus in a point-to-point environment like PCI Express or EEH
> (as you describe) should have little adverse effect. The risk is the
> bus reset will cause a card reset and the driver must understand to
> re-initialize the card. A link reset in PCI Express will not cause a
> card reset. We assume the driver will reset its card if necessary.

Does the link side of PCIE provides a way to trigger a hard reset of the
rest of the card ? If not, then it's dodgy as there may be no way to
consistently "reset" the card if it's in a bad state. I have to double
check, but I suspect that IBM's implementation of EEH-compliant PCIE
will add a full hard reset not just a link reset.



2005-03-18 03:45:10

by Paul Mackerras

[permalink] [raw]
Subject: RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

Nguyen, Tom L writes:

> Is EEH a PCI-SIG specification? Is EEH specs available in public?

No and no (not yet anyway).

> It seems that a PCI-PCI bridge per slot is hardware implementation
> specific. The fact that the PCI-PCI Bridge can isolate the slot is
> hardware feature specific.

Well, it's a common feature across all current IBM PPC64 machines.

> PCI Express AER driver uses similar concept of determining whether the
> driver is AER-aware or not except that PCI Express AER is independent
> from firmware support.

Don't worry about the firmware; the driver won't have to interact with
firmware itself, that's the job of the ppc64-specific platform code.

> Where does the platform code reside and where does it log the error?

By platform code I meant the code under the arch directory that knows
the details of the I/O topology of the machine, how to access the PCI
host bridges, etc. How and where it logs the error is a platform
policy; on IBM ppc64 machines we have an error log daemon for this
purpose, which can do things like log the error to a file or send it
to another machine.

> In PCI Express if the driver is not AER-aware the fatal error message is
> reported by its upstream switch, the AER driver obtains comprehensive
> error information from the upstream switch (like EEH platform code
> obtains error information from the firmware). Since the driver is not
> AER-aware, the fatal error is reported to user to make a policy decision
> since the PCI Express does not have a hot-plug event for the slot like
> EEH platform.

If there is a permanent failure of an upstream link, then maybe
generating unplug events for the devices below it would be a useful
thing to do.

> So it looks like the hot-plug capability of the driver is being used in
> lieu of specific callbacks to freeze and thaw IO in the case of a
> non-aware driver. If the driver does not support hot-plug then the
> error is just logged. Do you leave the slot isolated or perform error
> recovery anyway?

The choice is really to leave the slot isolated or to panic the
system. Leaving the slot isolated risks having the driver loop in an
interrupt routine or deliver bad data to userspace, so we currently
panic the system.

> On a fatal error the interface is down. No matter what the driver

Which interface do you mean here?

> supports (AER aware, EEH aware, unaware) all IO is likely to fail.
> Resetting a bus in a point-to-point environment like PCI Express or EEH
> (as you describe) should have little adverse effect. The risk is the
> bus reset will cause a card reset and the driver must understand to
> re-initialize the card. A link reset in PCI Express will not cause a
> card reset. We assume the driver will reset its card if necessary.

How will the driver reset its card?

> In PCI Express the AER driver obtains fatal error information from the
> upstream switch driver. We can use the same API with message =
> PCIERR_ERROR_RECOVER to notify the endpoint driver, which is maybe
> unaware of the fatal error reported by its upstream device. Mostly the
> driver will respond with PCIERR_RESULT_NEED_RESET.

Sounds fine.

Paul.