2005-03-18 17:24:58

by Nguyen, Tom L

Subject: RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

On Thursday, March 17, 2005 8:01 PM Paul Mackerras wrote:
> Does the PCI Express AER specification define an API for drivers?

No. That is why we agree we need a general API that works for all platforms.

>Likewise, with EEH the device driver could take recovery action on its
>own. But we don't want to end up with multiple sets of recovery code
>in drivers, if possible. Also we want the recovery code to be as
>simple as possible, otherwise driver authors will get it wrong.

Drivers own their devices' register sets. Therefore if there are any
vendor-unique actions that can be taken by the driver to recover, we
expect the driver to do so. For example, the driver may see an "xyz"
error for which there is a known erratum and a workaround that involves
resetting some registers on the card. From our perspective the drivers
take care of their own cards, but the AER driver and your platform code
will take care of the bus/link interfaces.
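
A rough sketch of the kind of per-driver hook this implies is below;
the callback name, the xyz_* helpers and the return code are
illustrative only, not the actual proposed API:

/*
 * Sketch only: the driver is told an error was detected and may apply
 * any card-specific workaround it knows about before the platform
 * resets the link/slot.
 */
static int xyz_error_detected(struct pci_dev *pdev)
{
        struct xyz_adapter *adapter = pci_get_drvdata(pdev);

        /* Stop touching the hardware; MMIO may be unreliable now. */
        xyz_stop_io(adapter);

        /* Known erratum: reset a couple of on-card registers before
         * asking the platform to reset the link. */
        if (adapter->chip_rev == XYZ_REV_WITH_ERRATUM)
                xyz_apply_erratum_workaround(adapter);

        /* Ask the platform code for a link/slot reset. */
        return XYZ_ERROR_NEED_RESET;
}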

>I would see the AER driver as being included in the "platform" code.
>The AER driver would be closely involved in the recovery process.

Our goal is to have the AER driver be part of the general code base
because it is based on a PCI SIG specification that can be implemented
across all architectures.

>What is the state of a link during the time between when an error is
>detected and when a link reset is done? Is the link usable? What
>happens if you try to do a MMIO read from a device downstream of the
>link?

For a FATAL error the link is "unreliable". This means MMIO operations
may or may not succeed. That is why the reset is performed by the
upstream port driver; the interface to the upstream port is still
reliable. A reset of an upstream port will propagate to all downstream
links. So we need an interface to the bus/port driver to request a reset
on its downstream link. We don't want the AER driver writing the port
bus driver's bridge control registers. We are trying to keep the
ownership of a device's register reads/writes within the domain of that
device's driver, in our case the port bus driver.
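
To make the ownership split concrete, a hypothetical sketch (none of the
names below are from the actual proposal):

/*
 * The AER driver never touches the bridge control registers itself;
 * it asks the port (bridge) driver that owns them to reset the
 * downstream link.
 */
struct pcie_port_ops {
        /* Reset the link below this downstream port; 0 on success. */
        int (*reset_downstream_link)(struct pci_dev *port);
};

static int aer_request_link_reset(struct pci_dev *port,
                                  const struct pcie_port_ops *ops)
{
        if (!ops || !ops->reset_downstream_link)
                return -ENOSYS;

        /* The port's own config/MMIO interface is still reliable even
         * when the downstream link is not, so this call can be trusted. */
        return ops->reset_downstream_link(port);
}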

Thanks,
Long


2005-03-18 18:08:46

by Grant Grundler

Subject: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

On Fri, Mar 18, 2005 at 09:24:02AM -0800, Nguyen, Tom L wrote:
> >Likewise, with EEH the device driver could take recovery action on its
> >own. But we don't want to end up with multiple sets of recovery code
> >in drivers, if possible. Also we want the recovery code to be as
> >simple as possible, otherwise driver authors will get it wrong.
>
> Drivers own their devices' register sets. Therefore if there are any
> vendor-unique actions that can be taken by the driver to recover, we
> expect the driver to do so.
...

All drivers also need to clean up driver state if they can't
simply recover (and restart pending IOs), i.e. they need to release
DMA resources and return suitable errors for pending requests.
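
Roughly, the shape of that cleanup is sketched below; the xyz_* types
are made up, only the DMA-unmap and list-walk calls are the usual
kernel ones:

static void xyz_abort_pending_io(struct xyz_adapter *adapter)
{
        struct xyz_request *req, *tmp;

        list_for_each_entry_safe(req, tmp, &adapter->pending, list) {
                /* Release the DMA mapping held for this request... */
                pci_unmap_single(adapter->pdev, req->dma_addr,
                                 req->len, PCI_DMA_TODEVICE);
                list_del(&req->list);
                /* ...and complete it with an error so upper layers can
                 * retry or fail it cleanly. */
                req->complete(req, -EIO);
        }
}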


> >I would see the AER driver as being included in the "platform" code.
> >The AER driver would be closely involved in the recovery process.
>
> Our goal is to have the AER driver be part of the general code base
> because it is based on a PCI SIG specification that can be implemented
> across all architectures.

To the driver writer, it's all "platform" code.
Folks who maintain PCI (and other) services differentiate between
"generic" and "arch/platform" specific. Think first like a driver
writer and then worry about if/how that can be divided between platform
generic and platform/arch specific code.

Even PCI-Express has *some* arch specific component. At a minimum each
architecture has its own chipset and firmware to deal with
for PCI Express bus discovery and initialization. But driver writers
don't have to worry about that and they shouldn't for error
recovery either.

> For a FATAL error the link is "unreliable". This means MMIO operations
> may or may not succeed. That is why the reset is performed by the
> upstream port driver; the interface to the upstream port is still
> reliable. A reset of an upstream port will propagate to all downstream
> links. So we need an interface to the bus/port driver to request a reset
> on its downstream link. We don't want the AER driver writing the port
> bus driver's bridge control registers. We are trying to keep the
> ownership of a device's register reads/writes within the domain of that
> device's driver, in our case the port bus driver.

A port bus driver does NOT sound like a normal device driver.
If PCI Express defines a standard register set for a bridge
device (like PCI Config space for PCI-PCI Bridges), then I
don't see a problem with PCI-Express error handling code mucking
with those registers. Look at how PCI-PCI bridges are supported
today and which bits of code poke registers on PCI-PCI Bridges.
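
For instance, a secondary bus reset through the standard Bridge Control
register is plain config-space poking from generic code; a rough sketch
(the delays are illustrative, not taken from the spec):

static void bridge_secondary_bus_reset(struct pci_dev *bridge)
{
        u16 ctrl;

        pci_read_config_word(bridge, PCI_BRIDGE_CONTROL, &ctrl);
        pci_write_config_word(bridge, PCI_BRIDGE_CONTROL,
                              ctrl | PCI_BRIDGE_CTL_BUS_RESET);
        msleep(2);              /* keep the reset asserted briefly */
        pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, ctrl);
        msleep(200);            /* give devices below time to come back */
}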

hth,
grant

2005-03-18 23:14:37

by Benjamin Herrenschmidt

Subject: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

On Fri, 2005-03-18 at 11:10 -0700, Grant Grundler wrote:
> On Fri, Mar 18, 2005 at 09:24:02AM -0800, Nguyen, Tom L wrote:
> > >Likewise, with EEH the device driver could take recovery action on its
> > >own. But we don't want to end up with multiple sets of recovery code
> > >in drivers, if possible. Also we want the recovery code to be as
> > >simple as possible, otherwise driver authors will get it wrong.
> >
> > Drivers own their devices' register sets. Therefore if there are any
> > vendor-unique actions that can be taken by the driver to recover, we
> > expect the driver to do so.
> ...
>
> All drivers also need to clean up driver state if they can't
> simply recover (and restart pending IOs), i.e. they need to release
> DMA resources and return suitable errors for pending requests.

Additionally, in "real life", very few errors are caused by known errata.
If the drivers know about the errata, they usually already work around
them. Afaik, most of the errors are caused by transient conditions on
the bus or the device, like a bit being flipped, or thermal
conditions...

> To the driver writer, it's all "platform" code.
> Folks who maintain PCI (and other) services differentiate between
> "generic" and "arch/platform" specific. Think first like a driver
> writer and then worry about if/how that can be divided between platform
> generic and platform/arch specific code.
>
> Even PCI-Express has *some* arch specific component. At a minimum each
> architecture has its own chipset and firmware to deal with
> for PCI Express bus discovery and initialization. But driver writers
> don't have to worry about that and they shouldn't for error
> recovery either.

Exactly. A given platform could use Intel's code as-is, or may choose to
do things differently while still showing the same interface to drivers.
Eventually we may end up adding platform hooks to the generic PCIE code
like we have in the PCI code if some platforms require them.
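
(The existing PCI code does this with pcibios_*()-style hooks that the
generic code calls and an arch can override; a PCIE equivalent would
presumably look similar. Sketch only, the name below is made up:)

/*
 * Generic PCIE code calls the hook; the default (weak) version does
 * nothing, and an arch/platform can provide its own implementation
 * that overrides it at link time.
 */
int __attribute__((weak)) pcibios_pcie_link_reset(struct pci_dev *port)
{
        return 0;
}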



2005-03-19 00:35:48

by linas

Subject: Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to remark:
>
> Additionally, in "real life", very few errors are caused by known errata.
> If the drivers know about the errata, they usually already work around
> them. Afaik, most of the errors are caused by transient conditions on
> the bus or the device, like a bit being flipped, or thermal
> conditions...


Heh. Let me describe "real life" a bit more accurately.

We've been running with pci error detection enabled here for the last
two years. Based on this experience, the ballpark figures are:

90% of all detected errors were device driver bugs coupled to
pci card hardware errata

9% poorly seated pci cards (remove/reseat will make problem go away)

1% transient/other.


We've seen *EVERY* and I mean *EVERY* device driver that we've put
under stress tests (e.g. peak i/o rates for > 72 hours: massive tcp/nfs
traffic, massive disk i/o traffic, etc.) trip an EEH error detect that
was traced back to a device driver bug. Not to blame the drivers; a lot
of these were related to pci card hardware/firmware bugs. For example,
I think grepping for "split completion" and "NAPI" in the
patches/errata for e100 and e1000 for the last year will reveal
some of the stuff that was found. As far as I know,
for every bug found, a patch made it into mainline.

As a rule, it seems that finding these device driver bugs was
very hard; we had some people work on these for months, and in
the case of the e1000, we managed to get Intel engineers to fly
out here and stare at PCI bus traces for a few days. (Thanks Intel!)
Ditto for Emulex. For ipr, we had inhouse people.

So overall, PCI error detection did have the expected effect
(protecting the kernel from corruption, e.g. due to DMA's going
to wild addresses), but I don't think anybody expected that the
vast majority would be software/hardware bugs, instead of transient
effects.

What's ironic in all of this is that by adding error recovery,
device driver bugs will be able to hide more effectively ...
if there's a pci bus error due to a driver bug, the pci card
will get rebooted, the kernel will burp for 3 seconds, and
things will keep going, and most sysadmins won't notice or
won't care.

--linas

2005-03-19 01:25:49

by Benjamin Herrenschmidt

Subject: Re: Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

On Fri, 2005-03-18 at 18:35 -0600, Linas Vepstas wrote:
> On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to remark:
> >
> > Additionally, in "real life", very few errors are caused by known errata.
> > If the drivers know about the errata, they usually already work around
> > them. Afaik, most of the errors are caused by transient conditions on
> > the bus or the device, like a bit being flipped, or thermal
> > conditions...
>
>
> Heh. Let me describe "real life" a bit more accurately.
>
> We've been running with pci error detection enabled here for the last
> two years. Based on this experience, the ballpark figures are:
>
> 90% of all detected errors were device driver bugs coupled to
> pci card hardware errata

Well, this has been in-lab testing to fight driver bugs/errata on early
release kernels; I'm talking about the context of a released solution
with stable drivers/hw.

> 9% poorly seated pci cards (remove/reseat will make problem go away)
>
> 1% transient/other.

Ok.

> We've seen *EVERY* and I mean *EVERY* device driver that we've put
> under stress tests (e.g. peak i/o rates for > 72 hours: massive tcp/nfs
> traffic, massive disk i/o traffic, etc.) trip an EEH error detect that
> was traced back to a device driver bug. Not to blame the drivers; a lot
> of these were related to pci card hardware/firmware bugs. For example,
> I think grepping for "split completion" and "NAPI" in the
> patches/errata for e100 and e1000 for the last year will reveal
> some of the stuff that was found. As far as I know,
> for every bug found, a patch made it into mainline.

Yah, those are a pain. But then, it isn't the context described by
Nguyen where the driver "knows" about the errata and how to recover.
It's the context of a bug where the driver does not know what's going on
and/or doesn't have the proper workaround. My point was more that there
are very few cases where a driver will have to do recovery of a PCI
error in a known case where it actually expects an error to happen.

> As a rule, it seems that finding these device driver bugs was
> very hard; we had some people work on these for months, and in
> the case of the e1000, we managed to get Intel engineers to fly
> out here and stare at PCI bus traces for a few days. (Thanks Intel!)
> Ditto for Emulex. For ipr, we had inhouse people.
>
> So overall, PCI error detection did have the expected effect
> (protecting the kernel from corruption, e.g. due to DMA's going
> to wild addresses), but I don't think anybody expected that the
> vast majority would be software/hardware bugs, instead of transient
> effects.
>
> What's ironic in all of this is that by adding error recovery,
> device driver bugs will be able to hide more effectively ...
> if there's a pci bus error due to a driver bug, the pci card
> will get rebooted, the kernel will burp for 3 seconds, and
> things will keep going, and most sysadmins won't notice or
> won't care.

Yes, but it will be logged at least, so we'll spot a lot of these during
our tests.

Ben.