2006-09-01 21:25:51

by linas

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Fri, Sep 01, 2006 at 11:33:49AM +0800, Zhang, Yanmin wrote:
> On Fri, 2006-09-01 at 01:50, Linas Vepstas wrote:
> > On Thu, Aug 31, 2006 at 03:10:12PM +0800, Zhang, Yanmin wrote:
> > > Linas,
> > >
> > > I am reviewing the error handlers of e1000 driver and got some ideas. My
> > > startpoint is to simplify the err handler implementations for drivers, or
> > > driver developers are *not willing* to add it if it's too complicated.
> >
> > I don't see that its to complicated ...
> Originally, I didn't think so, but after I try to add err_handlers to some
> drivers, I feel it's too complicated.

Which drivers are you working on?

> > > 1) Callback mmio_enabled looks useless. Documentation/pci-error-recovery.txt
> > > says the current powerpc implementation does not implement this callback.
> >
> > I don't know if its useless or not. I have not needed it yet for the
> > symbios, ipr and e1000 drivers, but its possible that some more
> > sophisticated device may want it. I'm tempted to keep it a while
> > longer befoe discarding it.
> >
> > The scenario is this: the device driver decides that, rather than asking
> > for a full electical reset of the card, instead, it wants to perform
> > its own recovery. It can do this as follows:
> >
> > a) enable MMIO
> > b) issue reset command to adapter
> > c) enable DMA.
> >
> > If we enabled both DMA and MMIO at the same time, there are mnay cases
> > where the card will immediately trap again -- for example, if its
> > DMA'ing to some crazy address. Thus, typically, one wants DMA disabled
> > until after the card reset. Withouth the mmio_enabled() reset, there
> > is no way of doing this.
> The new error_resume, or the old slot_reset could take care of it. The specific
> device driver knows all the details about how to initiate the devices. The
> error_resume could call the step a) b) c) sequencially while doing checking among
> steps.

Again, consider the multi-function cards. On pSeries, I can only enable
DMA on a er-slot basis, not a per-function basis. So if one driver
enables DMA before some other driver has reset appropriately, everything
breaks.

> If there is really a device having specific requirement to reinitiate it (very rarely),
> it could use walkaround, such like schedule a WORKER. No need to provide a generic
> mmio_enabled.

I don't understand. Enabling MMIO and enabling DMA both require specific
commands to be sent to the PCI-host bridge. These commands are not a
part of the PCI spec.

> > > 2) Callback slot_reset could be merged with resume. The new resume could be:
> > > int (*error_resume)(struct pci_dev *dev); I checked e1000 and e100 drivers and
> > > think there is no actual reason to have both slot_reset and resume.
> >
> > The idea here was to handle multi-function cards. On a multi-function card,
> > *all* devices need to indicate that they were able to reset. Once all devices
> > have been successfuly reset, then operation can be resumed. If the reset
> > of one function fails, then operation is not resumed for any f the
> > functions.
> I don't think we need slot_reset to coordinate multi-function devices. The new
> error_resume could take care of multi-function card.

How?

> 'reset' here means driver
> need do I/O to detect if the device (function) still works well. If a function
> of a multi-function device couldn't reset while other functions could reset,
> other functions could just go on to reinitiate. In the end, the error recovery
> procedure (handle_eeh_events in PowerPC implementation) could check all the
> returning values of error_resume. If there is a failure value, then removes
> all the functions' pci_dev of the device from the bus.

I can only enable or disable an entire PCI slot, and not individual PCI
functions. If there are some pins that are shorted, or parity errors or
whatever, I can only turn off the whole card.

> > > During
> > > our last discussion on LKML, you said PowerPC will block further I/O if the platform captures
> > > a pci error, so the all I/O in e1000_down will be blocked. Later on, e1000_io_slot_reset
> > > will reenable pci device and initiate NIC. I guess late initiate might fail because prior
> > > e1000_down I/O don't reach NIC.
> >
> > Why would it fail? The e1000_down serves primarily to get the Linux
> > kernel into a known state. It doesn't matter what happens to the card,
> > since the next step will be to perform an electrical reset of the card.
> Who will perform the electrical reset of the card? Function e1000_reset or the platform?

The platform. By "electrical reset", I mean "dropping the #RST pin low
for 200mS". Only the platform can do this.

> If it's the platform, I agree with you, but if it's e1000_reset, it might not work because
> e1000_reset uses a e1000-specific approach to reset the card.

The driver has to choices: it can ask for the electrical reset, by
returning PCI_ERS_RESULT_NEED_RESET. But if the driver doesn't need
the electrical reset, then it can return PCI_ERS_RESULT_CAN_RECOVER,
and issue whatever device-specific commands it needs to reset.

> I'm not sure if the e1000_reset
> will restore the NIC to fresh system power-on state. At least, from the source codes, e1000_reset
> couldn't.

I have no idea. That's why this driver issues PCI_ERS_RESULT_NEED_RESET,
which will get it into a fresh system power-on state. Its easy, its
brute-force, it works.

--linas

--
VGER BF report: U 0.5


2006-09-04 05:49:30

by Yanmin Zhang

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Sat, 2006-09-02 at 05:25, Linas Vepstas wrote:
> On Fri, Sep 01, 2006 at 11:33:49AM +0800, Zhang, Yanmin wrote:
> > On Fri, 2006-09-01 at 01:50, Linas Vepstas wrote:
> > > On Thu, Aug 31, 2006 at 03:10:12PM +0800, Zhang, Yanmin wrote:
> > > > Linas,
> > > >
> > > > I am reviewing the error handlers of e1000 driver and got some ideas. My
> > > > startpoint is to simplify the err handler implementations for drivers, or
> > > > driver developers are *not willing* to add it if it's too complicated.
> > >
> > > I don't see that its to complicated ...
> > Originally, I didn't think so, but after I try to add err_handlers to some
> > drivers, I feel it's too complicated.
>
> Which drivers are you working on?
I worked out error handlers for tg3 NIC driver. I'm also checking e1000 driver
to try to move all I/O operations from e1000_io_error_detected to e1000_io_slot_reset.

>
> > > > 1) Callback mmio_enabled looks useless. Documentation/pci-error-recovery.txt
> > > > says the current powerpc implementation does not implement this callback.
> > >
> > > I don't know if its useless or not. I have not needed it yet for the
> > > symbios, ipr and e1000 drivers, but its possible that some more
> > > sophisticated device may want it. I'm tempted to keep it a while
> > > longer befoe discarding it.
> > >
> > > The scenario is this: the device driver decides that, rather than asking
> > > for a full electical reset of the card, instead, it wants to perform
> > > its own recovery. It can do this as follows:
> > >
> > > a) enable MMIO
> > > b) issue reset command to adapter
> > > c) enable DMA.
> > >
> > > If we enabled both DMA and MMIO at the same time, there are mnay cases
> > > where the card will immediately trap again -- for example, if its
> > > DMA'ing to some crazy address. Thus, typically, one wants DMA disabled
> > > until after the card reset.
I think most drivers' error_detected callbacks return PCI_ERS_RESULT_NEED_RESET,
so the slot will be reset. Then, the example that one wants DMA disabled
until after the card reset is not reasonable.


> Withouth the mmio_enabled() reset, there
> > > is no way of doing this.
> > The new error_resume, or the old slot_reset could take care of it. The specific
> > device driver knows all the details about how to initiate the devices. The
> > error_resume could call the step a) b) c) sequencially while doing checking among
> > steps.
>
> Again, consider the multi-function cards. On pSeries, I can only enable
> DMA on a er-slot basis, not a per-function basis. So if one driver
> enables DMA before some other driver has reset appropriately, everything
> breaks.
Does here 'reset' mean hardware slot reset? In error_detected, driver
needs cancel all pending request and don't start any I/O operations.
Then, if the slot is always reset, there will be no the problem. Function
handle_eeh_events always resets slot except hard failure. See beow more comments.

As you know, all functions of a device share the same bus number and 5 bit dev number.
They just have different 3 bit function number. We could deduce if functions are in the same
device (slot).

If mmio_enabled is not used currently, I think we could delete it firstly. Later on,
if a platform really need it, we could add it, so we could keep the simplied codes.

>
> > If there is really a device having specific requirement to reinitiate it (very rarely),
> > it could use walkaround, such like schedule a WORKER. No need to provide a generic
> > mmio_enabled.
>
> I don't understand. Enabling MMIO and enabling DMA both require specific
> commands to be sent to the PCI-host bridge. These commands are not a
> part of the PCI spec.
Thanks. Now I understand why you specified mmio_enabled and slot_reset. They are just
to map to pSeries platform hardware operation steps. I know little about pSeries hardware,
but is it possible to merge such hardware steps from software point of view?

I checked the source codes of pSeries eeh_driver. Function handle_eeh_events does nothing
between pci_walk_bus(frozen_bus, eeh_report_reset, NULL) and
pci_walk_bus(frozen_bus, eeh_report_resume, NULL), that is, it doesn't enable DMA after
slot_reset. handle_eeh_events always resets slot except hard failure. So, slot_reset
could be merged with resume.


>
>
> > > > 2) Callback slot_reset could be merged with resume. The new resume could be:
> > > > int (*error_resume)(struct pci_dev *dev); I checked e1000 and e100 drivers and
> > > > think there is no actual reason to have both slot_reset and resume.
> > >
> > > The idea here was to handle multi-function cards. On a multi-function card,
> > > *all* devices need to indicate that they were able to reset. Once all devices
> > > have been successfuly reset, then operation can be resumed. If the reset
> > > of one function fails, then operation is not resumed for any f the
> > > functions.
> > I don't think we need slot_reset to coordinate multi-function devices. The new
> > error_resume could take care of multi-function card.
>
> How?
>
> > 'reset' here means driver
> > need do I/O to detect if the device (function) still works well. If a function
> > of a multi-function device couldn't reset while other functions could reset,
> > other functions could just go on to reinitiate. In the end, the error recovery
> > procedure (handle_eeh_events in PowerPC implementation) could check all the
> > returning values of error_resume. If there is a failure value, then removes
> > all the functions' pci_dev of the device from the bus.
>
> I can only enable or disable an entire PCI slot, and not individual PCI
> functions. If there are some pins that are shorted, or parity errors or
> whatever, I can only turn off the whole card.
It doesn't matter with the simplification. I don't mean that a device function
should be disabled immediately after the error_resume of the function driver
returns.The disable operation could be delayed till all error_resume return.

>
>
> > > > During
> > > > our last discussion on LKML, you said PowerPC will block further I/O if the platform captures
> > > > a pci error, so the all I/O in e1000_down will be blocked. Later on, e1000_io_slot_reset
> > > > will reenable pci device and initiate NIC. I guess late initiate might fail because prior
> > > > e1000_down I/O don't reach NIC.
> > >
> > > Why would it fail? The e1000_down serves primarily to get the Linux
> > > kernel into a known state. It doesn't matter what happens to the card,
> > > since the next step will be to perform an electrical reset of the card.
> > Who will perform the electrical reset of the card? Function e1000_reset or the platform?
>
> The platform. By "electrical reset", I mean "dropping the #RST pin low
> for 200mS". Only the platform can do this.
Thanks for your explanation. I assume after the electrical reset, all device
functions of the device slot will go back to the initial status before
attaching their drivers.

>
> > If it's the platform, I agree with you, but if it's e1000_reset, it might not work because
> > e1000_reset uses a e1000-specific approach to reset the card.
>
> The driver has to choices: it can ask for the electrical reset, by
> returning PCI_ERS_RESULT_NEED_RESET. But if the driver doesn't need
> the electrical reset, then it can return PCI_ERS_RESULT_CAN_RECOVER,
> and issue whatever device-specific commands it needs to reset.
> > I'm not sure if the e1000_reset
> > will restore the NIC to fresh system power-on state. At least, from the source codes, e1000_reset
> > couldn't.
>
> I have no idea. That's why this driver issues PCI_ERS_RESULT_NEED_RESET,
> which will get it into a fresh system power-on state. Its easy, its
> brute-force, it works.
I found a problem of e1000 driver when testing its error handlers. After the NIC is resumed,
its RX/TX packets numbers are crazy. Now, I think it's a bug of function e1000_reset, not
the error handlers. Sorry for bothering you on e1000.

I copy another email below, so we could keep the discussion in one thread.
On Fri, Sep 01, 2006 at 05:04:09PM +0800, Zhang, Yanmin wrote:
> > One more comment: The second parameter of error_detected also could be deleted
> > because recovery procedures will save error to pci_dev->error_state.
>
> Yes, I beleive so.
Thanks.

>
> > So, the err_handler pci_error_handlers could be:
> > struct pci_error_handlers
> > {
> > pci_ers_result_t (*error_detected)(struct pci_dev *dev);
> > pci_ers_result_t (*error_resume)(struct pci_dev *dev);
> > };
>
> No, as per other email, we still need a multi-step process for
> multi-function cards,
As above discussion, reset slot could resolve it like you did for pSeries.

> and for cards that may not want to get
> a full electrical reset.
So I think slot is reset only when a error_detected returns
PCI_ERS_RESULT_NEED_RESET.

> Finally, there might be platforms
> that cannot perform a per-slot electrical reset, and would
> therefore require drivers that can recover on thier own.
The new pci_error_handlers could process it easily. The driver's error_resume just
need schedule a driver-specific worker and returns PCI_ERS_RESULT_RECOVERED. The
worker could do recover on the driver own later on.

By checking drivers who support err_handler in the latest kernel, we could find
they all returns PCI_ERS_RESULT_NEED_RESET. They all could be converted to
use the new simplified pci_error_handlers. The new pci_error_handlers also gives
drivers flexibility to have more control on error recovery.

It's hard to look for a perfect solution. I mean, it's a trade-off. As long as
it could finish most functionality, the simpler, the better.

Yanmin

--
VGER BF report: H 0.00323297

2006-09-04 09:03:39

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: pci error recovery procedure


> As you know, all functions of a device share the same bus number and 5 bit dev number.
> They just have different 3 bit function number. We could deduce if functions are in the same
> device (slot).

Until you have a P2P bridge ...

> Thanks. Now I understand why you specified mmio_enabled and slot_reset. They are just
> to map to pSeries platform hardware operation steps. I know little about pSeries hardware,
> but is it possible to merge such hardware steps from software point of view?

One of the ideas we had when defining those steps is to be precise
enough to let drivers who _can_ deal with those fine grained pSeries
step implement them, but also have the fallback to slot reset whenever
possible.

Now, if in practice, after actually implementing this in a number of
drivers, we see that slot reset is the only ever used path, then we
might want to simplify things a bit. I didn't want to impose that
restriction in the initial design though.

It's my understanding that doing no slot reset (hardware reset) but just
re-enabling MMIO, DMA and clearing pending error status in the PCI
config space is, as far as the driver is concerned, almost functionally
equivalent to a PCIe link reset. That is, the link reset might not (or
will not) actually reset the hardware beyond the PCIe link layer.

Thus we could simplify the split between link reset / hard reset. The
former is an attempt at recovery with only resetting the PCI path to the
device, which on PCIe becomes a link reset, and on old PCI, just
clearing of the various error bits along the path (and on pSeries,
re-enabling MMIO and DMA access). However, there is still the problem
that if you do that, on pSeries at least, you really want to 1- enable
MMIO, 2- soft reset the card using MMIO, that is make sure all pending
DMA is stopped, and 3- re-enable DMA. While if we collapse that into a
single 'link reset' type of operation, we'll end up re-enabling MMIO and
DMA before the driver has a chance to stop pending DMA's and thus
increase the chance that we crap out due to a pending DMA on the chip.

Ben.



--
VGER BF report: H 0.449348

2006-09-05 02:34:09

by Yanmin Zhang

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Mon, 2006-09-04 at 17:03, Benjamin Herrenschmidt wrote:
> > As you know, all functions of a device share the same bus number and 5 bit dev number.
> > They just have different 3 bit function number. We could deduce if functions are in the same
> > device (slot).
>
> Until you have a P2P bridge ...
> > Thanks. Now I understand why you specified mmio_enabled and slot_reset. They are just
> > to map to pSeries platform hardware operation steps. I know little about pSeries hardware,
> > but is it possible to merge such hardware steps from software point of view?
>
> One of the ideas we had when defining those steps is to be precise
> enough to let drivers who _can_ deal with those fine grained pSeries
> step implement them, but also have the fallback to slot reset whenever
> possible.
>
> Now, if in practice, after actually implementing this in a number of
> drivers, we see that slot reset is the only ever used path, then we
> might want to simplify things a bit. I didn't want to impose that
> restriction in the initial design though.
Thanks for your explanation. Now it's the time to delete mmio_enabled
and merge slot_reset with resume.

>
> It's my understanding that doing no slot reset (hardware reset) but just
> re-enabling MMIO, DMA and clearing pending error status in the PCI
> config space is, as far as the driver is concerned, almost functionally
> equivalent to a PCIe link reset. That is, the link reset might not (or
> will not) actually reset the hardware beyond the PCIe link layer.
I agree.

>
> Thus we could simplify the split between link reset / hard reset. The
> former is an attempt at recovery with only resetting the PCI path to the
> device, which on PCIe becomes a link reset, and on old PCI, just
> clearing of the various error bits along the path (and on pSeries,
> re-enabling MMIO and DMA access). However, there is still the problem
> that if you do that, on pSeries at least, you really want to 1- enable
> MMIO, 2- soft reset the card using MMIO, that is make sure all pending
> DMA is stopped, and 3- re-enable DMA. While if we collapse that into a
> single 'link reset' type of operation, we'll end up re-enabling MMIO and
> DMA before the driver has a chance to stop pending DMA's
Is it the exclusive reason to have multi-steps?

1) Here link reset and hard reset are hardware operations, not the
link_reset and slot_reset callback in pci_error_handlers.

2) Callback error_detected will notify drivers there is PCI errors. Drivers
shouldn't do any I/O in error_detected.

3) If both the link and slot are reset after all error_detected are called,
the device should go back to initial status and all DMA should be stopped
automatically. Why does the driver still need a chance to stop DMA? The
error_detected of the drivers in the latest kernel who support err handlers
always returns PCI_ERS_RESULT_NEED_RESET. They are typical examples.

> and thus
> increase the chance that we crap out due to a pending DMA on the chip.
>
> Ben.

Above discussion is only about if mmio_enabled is needed.
As for slot_reset, I think it could be merged with resume, because platforms
do nothing between calling slot_reset and resume. Any comment?

Yanmin

2006-09-05 18:50:24

by linas

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Mon, Sep 04, 2006 at 07:03:12PM +1000, Benjamin Herrenschmidt wrote:
>
> > As you know, all functions of a device share the same bus number and 5 bit dev number.
> > They just have different 3 bit function number. We could deduce if functions are in the same
> > device (slot).
>
> Until you have a P2P bridge ...

And this is not theoretical: for example, the matrox graphics cards:

0000:c8:01.0 PCI bridge: Hint Corp HB6 Universal PCI-PCI bridge (non-transparent mode) (rev 13)
0000:c9:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G400 AGP (rev 85)

Now, I could have sworn there was another device behind this bridge,
some serial or joystick controller or something, although this
particular card doesn't seem to have it.

------
It's not clear to me what hardware may show up in the future.
For example, someone may build a 32x PCI-E card that will act
as a bridge to a drawer with half-a-dozen ordinary PCI-X slots
in it. This is perhaps a bit hypothetical, but changing the API
will make it harder to implement eror recovery for such a system.

FWIW, there is at least one pSeries system in the lab which has
several hundred PCI slots attached to it, although I've never
done testing on it. Hmm. Maybe its time I did ...

--linas

2006-09-05 19:01:22

by linas

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Tue, Sep 05, 2006 at 10:32:08AM +0800, Zhang, Yanmin wrote:
> Is it the exclusive reason to have multi-steps?

I don't understand the question. A previous email explained the reason
to have mutiple steps.

> 1) Here link reset and hard reset are hardware operations, not the
> link_reset and slot_reset callback in pci_error_handlers.

I don't understand the comment.

> 2) Callback error_detected will notify drivers there is PCI errors. Drivers
> shouldn't do any I/O in error_detected.

It shouldn't matter. If it is truly important for a particular platform
to make sure that there is no i/o, then the low-level i/o routines
could be modified to drop any accidentally issued i/o on the floor.
This doesn't require a change to either the API or the policy.

> 3) If both the link and slot are reset after all error_detected are called,
> the device should go back to initial status and all DMA should be stopped
> automatically. Why does the driver still need a chance to stop DMA?

As explained previously, not all drivers may want to have a full
electrical device reset.

> The
> error_detected of the drivers in the latest kernel who support err handlers
> always returns PCI_ERS_RESULT_NEED_RESET. They are typical examples.

Just because the current drivers do it this way does not mean that this is
the best way to do things. A full reset is time-consuming. Some drivers
may want to implement a faster and quicker reset.

--linas

2006-09-05 19:17:43

by linas

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Mon, Sep 04, 2006 at 01:47:30PM +0800, Zhang, Yanmin wrote:
> >
> > Again, consider the multi-function cards. On pSeries, I can only enable
> > DMA on a per-slot basis, not a per-function basis. So if one driver
> > enables DMA before some other driver has reset appropriately, everything
> > breaks.
> Does here 'reset' mean hardware slot reset?

I should have said: If one driver of a multi-function card enables DMA before
another driver has stabilized its harware, then everything breaks.

> Then, if the slot is always reset, there will be no the problem.

But that assumes that a hardware #RST will always be done. The API
was designed to get away from this requirement.

> If mmio_enabled is not used currently, I think we could delete it firstly. Later on,
> if a platform really need it, we could add it, so we could keep the simplied codes.

It would be very difficult to add it later. And it would be especially
silly, given that someone would find this discussion in the mailing list
archives.

> Thanks. Now I understand why you specified mmio_enabled and slot_reset. They are just
> to map to pSeries platform hardware operation steps. I know little about pSeries hardware,

The hardware was designed that way because the hardware engineers
thought that this is what the device driver writers would need.
Thay are there to map to actual recovery steps that actual device
drivers might want to do.

> but is it possible to merge such hardware steps from software point of view?

The previous email explained why this would be a bad idea.

> > The platform. By "electrical reset", I mean "dropping the #RST pin low
> > for 200mS". Only the platform can do this.
> Thanks for your explanation. I assume after the electrical reset, all device
> functions of the device slot will go back to the initial status before
> attaching their drivers.

Maybe. Depends on what yur BIOS does. On pSeries, I also need to
set up the adress BARs

> I found a problem of e1000 driver when testing its error handlers. After the NIC is resumed,
> its RX/TX packets numbers are crazy.

Hmm. There is a patch to prevent this from happening. I thought
it was applied a long time ago. e1000_update_stats() should include the
lines:

if (pdev->error_state && pdev->error_state != pci_channel_io_normal)
return;

which is enough to prevent crazy stats on my machine.

--linas

2006-09-05 21:19:58

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Tue, 2006-09-05 at 13:50 -0500, Linas Vepstas wrote:
> On Mon, Sep 04, 2006 at 07:03:12PM +1000, Benjamin Herrenschmidt wrote:
> >
> > > As you know, all functions of a device share the same bus number and 5 bit dev number.
> > > They just have different 3 bit function number. We could deduce if functions are in the same
> > > device (slot).
> >
> > Until you have a P2P bridge ...
>
> And this is not theoretical: for example, the matrox graphics cards:
>
> 0000:c8:01.0 PCI bridge: Hint Corp HB6 Universal PCI-PCI bridge (non-transparent mode) (rev 13)
> 0000:c9:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G400 AGP (rev 85)

It's also very common with multiple ports network cards

Ben.


2006-09-06 01:30:00

by Yanmin Zhang

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Wed, 2006-09-06 at 03:01, Linas Vepstas wrote:
> On Tue, Sep 05, 2006 at 10:32:08AM +0800, Zhang, Yanmin wrote:
> > Is it the exclusive reason to have multi-steps?
>
> I don't understand the question. A previous email explained the reason
> to have mutiple steps.
The question is against Ben's comments. Pls. don't delete his comments
in your reply.

>
> > 1) Here link reset and hard reset are hardware operations, not the
> > link_reset and slot_reset callback in pci_error_handlers.
>
> I don't understand the comment.
I wanted to clarify that we need differentiate link/hard reset from
callback link_reset and slot_reset when discussing the API.

>
> > 2) Callback error_detected will notify drivers there is PCI errors. Drivers
> > shouldn't do any I/O in error_detected.
>
> It shouldn't matter. If it is truly important for a particular platform
> to make sure that there is no i/o, then the low-level i/o routines
> could be modified to drop any accidentally issued i/o on the floor.
> This doesn't require a change to either the API or the policy.
> > 3) If both the link and slot are reset after all error_detected are called,
> > the device should go back to initial status and all DMA should be stopped
> > automatically. Why does the driver still need a chance to stop DMA?
>
> As explained previously, not all drivers may want to have a full
> electrical device reset.
I need repeat my idea.
1) My new pci_error_handlers doesn't always choose to reset slot. It
still depends on the return value of error_detected.
2) As a matter of fact, most cases of specific device's error_detected callback
will choose to return PCI_ERS_RESULT_NEED_RESET. Like what you did for
e100/e1000/ipr.

>
> > The
> > error_detected of the drivers in the latest kernel who support err handlers
> > always returns PCI_ERS_RESULT_NEED_RESET. They are typical examples.
>
> Just because the current drivers do it this way does not mean that this is
> the best way to do things.
If it's not the best way, why did you choose to reset slot for e1000/e100/ipr
error handlers? They are typical widely-used devices. To make it easier to
add error handlers?


> A full reset is time-consuming. Some drivers
> may want to implement a faster and quicker reset.
>
> --linas

2006-09-06 01:37:20

by Yanmin Zhang

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Wed, 2006-09-06 at 02:50, Linas Vepstas wrote:
> On Mon, Sep 04, 2006 at 07:03:12PM +1000, Benjamin Herrenschmidt wrote:
> >
> > > As you know, all functions of a device share the same bus number and 5 bit dev number.
> > > They just have different 3 bit function number. We could deduce if functions are in the same
> > > device (slot).
> >
> > Until you have a P2P bridge ...
>
> And this is not theoretical: for example, the matrox graphics cards:
>
> 0000:c8:01.0 PCI bridge: Hint Corp HB6 Universal PCI-PCI bridge (non-transparent mode) (rev 13)
> 0000:c9:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G400 AGP (rev 85)
>
> Now, I could have sworn there was another device behind this bridge,
> some serial or joystick controller or something, although this
> particular card doesn't seem to have it.
Thanks. My comments above in this email is just to try to find a method
to judge if 2 or more functions belongs to the same device. If it's
not right, it still doesn't hurt the new API pci_error_handlers.

>
> ------
> It's not clear to me what hardware may show up in the future.
> For example, someone may build a 32x PCI-E card that will act
> as a bridge to a drawer with half-a-dozen ordinary PCI-X slots
> in it. This is perhaps a bit hypothetical, but changing the API
> will make it harder to implement eror recovery for such a system.
I agree that it's difficult to predict the future. At least the new API
could process the example.

> FWIW, there is at least one pSeries system in the lab which has
> several hundred PCI slots attached to it, although I've never
> done testing on it. Hmm. Maybe its time I did ...
>
> --linas

2006-09-06 02:06:19

by Yanmin Zhang

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Wed, 2006-09-06 at 03:17, Linas Vepstas wrote:
> On Mon, Sep 04, 2006 at 01:47:30PM +0800, Zhang, Yanmin wrote:
> > >
> > > Again, consider the multi-function cards. On pSeries, I can only enable
> > > DMA on a per-slot basis, not a per-function basis. So if one driver
> > > enables DMA before some other driver has reset appropriately, everything
> > > breaks.
> > Does here 'reset' mean hardware slot reset?
>
> I should have said: If one driver of a multi-function card enables DMA before
> another driver has stabilized its harware, then everything breaks.
What's another driver's hardware? A function of the previous multi-function
card? Or a function of another device?

Ok. now, I copy what you said before below for more discussion.
> If we enabled both DMA and MMIO at the same time, there are mnay cases
> where the card will immediately trap again -- for example, if its
> DMA'ing to some crazy address. Thus, typically, one wants DMA disabled
> until after the card reset. Withouth the mmio_enabled() reset, there
> is no way of doing this.
Did you asume the card reset is executed by callback mmio_enabled?


> Again, consider the multi-function cards. On pSeries, I can only enable
> DMA on a er-slot basis, not a per-function basis. So if one driver
> enables DMA before some other driver has reset appropriately, everything
> breaks.
What does 'I' above stand for? The platform error recovery procedure
or the error callbacks of drivers? I guess it means platform, that is,
only platform enables DMA for the whole slot. But why does the last sentence
become driver enables DMA? As you know, driver binds device function instead of
slot. Could driver enable DMA for a function?


>
> > Then, if the slot is always reset, there will be no the problem.
>
> But that assumes that a hardware #RST will always be done. The API
> was designed to get away from this requirement.
>
> > If mmio_enabled is not used currently, I think we could delete it firstly. Later on,
> > if a platform really need it, we could add it, so we could keep the simplied codes.
>
> It would be very difficult to add it later. And it would be especially
> silly, given that someone would find this discussion in the mailing list
> archives.
You stick to keep mmio_enabled which is not used currently, but if there will be
a new platform who uses a more fine-grained steps to recover pci/pci-e, would
you say 'it would be very difficut' and refuse add new callbacks?

>
> > Thanks. Now I understand why you specified mmio_enabled and slot_reset. They are just
> > to map to pSeries platform hardware operation steps. I know little about pSeries hardware,
>
> The hardware was designed that way because the hardware engineers
> thought that this is what the device driver writers would need.
> Thay are there to map to actual recovery steps that actual device
> drivers might want to do.
It doesn't prevent software from merging some steps. And, we want
to implement pci/pci-e error recovery for more platforms instead of just
pSeries.

>
> > but is it possible to merge such hardware steps from software point of view?
>
> The previous email explained why this would be a bad idea.
Obviously, such conclusion is too early.

>
> > > The platform. By "electrical reset", I mean "dropping the #RST pin low
> > > for 200mS". Only the platform can do this.
> > Thanks for your explanation. I assume after the electrical reset, all device
> > functions of the device slot will go back to the initial status before
> > attaching their drivers.
>
> Maybe. Depends on what yur BIOS does. On pSeries, I also need to
> set up the adress BARs
>
> > I found a problem of e1000 driver when testing its error handlers. After the NIC is resumed,
> > its RX/TX packets numbers are crazy.
>
> Hmm. There is a patch to prevent this from happening. I thought
> it was applied a long time ago. e1000_update_stats() should include the
> lines:
>
> if (pdev->error_state && pdev->error_state != pci_channel_io_normal)
> return;
>
> which is enough to prevent crazy stats on my machine.
Thanks a lot!

Yanmin

2006-09-06 20:02:08

by linas

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Wed, Sep 06, 2006 at 09:26:56AM +0800, Zhang, Yanmin wrote:
> > > The
> > > error_detected of the drivers in the latest kernel who support err handlers
> > > always returns PCI_ERS_RESULT_NEED_RESET. They are typical examples.
> >
> > Just because the current drivers do it this way does not mean that this is
> > the best way to do things.
>
> If it's not the best way, why did you choose to reset slot for e1000/e100/ipr
> error handlers? They are typical widely-used devices. To make it easier to
> add error handlers?

I did it that way just to get going, get something working. I do not
have hardware specs for any of these devices, and do not have much of
an idea of what they are capable of; the recovery code I wrote is of
"brute force, hit it with a hammer"-nature. Driver writers who
know thier hardware well, and are interested in a more refined
approach are encouraged to actualy use a more refined approach.

--linas

2006-09-06 20:39:42

by linas

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Wed, Sep 06, 2006 at 10:04:31AM +0800, Zhang, Yanmin wrote:
> On Wed, 2006-09-06 at 03:17, Linas Vepstas wrote:
> > On Mon, Sep 04, 2006 at 01:47:30PM +0800, Zhang, Yanmin wrote:
> > > >
> > > > Again, consider the multi-function cards. On pSeries, I can only enable
> > > > DMA on a per-slot basis, not a per-function basis. So if one driver
> > > > enables DMA before some other driver has reset appropriately, everything
> > > > breaks.
> > > Does here 'reset' mean hardware slot reset?
> >
> > I should have said: If one driver of a multi-function card enables DMA before
> > another driver has stabilized its harware, then everything breaks.
> What's another driver's hardware? A function of the previous multi-function
> card? Or a function of another device?

Yes. Either. Both. Doesn't matter. Enabling DMA is "granular" at a
different size scale than pci functions, and possibly even pci devices
or slots, dependeing on the architecture. Before DMA can be enabled,
*all* affected device drivers have to be approve, and have to be ready
for it.

> > If we enabled both DMA and MMIO at the same time, there are many cases
> > where the card will immediately trap again -- for example, if its
> > DMA'ing to some crazy address. Thus, typically, one wants DMA disabled
> > until after the card reset. Without the mmio_enabled() reset, there
> > is no way of doing this.
>
> Did you asume the card reset is executed by callback mmio_enabled?

I am assuming that, when a driver receives the mmio_enabled() callback,
it will perform some sort of register i/o. For example, I am currently
planning to modify the e1000 driver to do the following:

-- The error_occurred() callback returns PCI_ERS_RESULT_CAN_RECOVER
-- The arch enables mmio, and then calls the mmio_enabled() callback.
-- The mmio_enabled() callback in the driver takes a full dump of all
of the regsters on the card. It then returns PCI_ERS_RESULT_NEED_RESET
-- The arch performs the full electrical #RST of device. Recovery from
this point proceeds as before.

> > Again, consider the multi-function cards. On pSeries, I can only enable
> > DMA on a per-slot basis, not a per-function basis. So if one driver
> > enables DMA before some other driver has reset appropriately, everything
> > breaks.
>
> What does 'I' above stand for? The platform error recovery procedure

Yes. The pSeries platform error recovery procedure can only enable DMA
on a per-slot basis.

> I guess it means platform, that is,
> only platform enables DMA for the whole slot.

Yes.

> But why does the last sentence
> become driver enables DMA?

In your proposal, you were suggesting that MMIO and DMA be enabled with
one and the same routine, and I was attempting to explain why that can't
work.

> Could driver enable DMA for a function?

No, not on pSeries hardware.

> > > If mmio_enabled is not used currently, I think we could delete it firstly. Later on,
> > > if a platform really need it, we could add it, so we could keep the simplied codes.
> >
> > It would be very difficult to add it later. And it would be especially
> > silly, given that someone would find this discussion in the mailing list
> > archives.
> You stick to keep mmio_enabled which is not used currently, but if there will be
> a new platform who uses a more fine-grained steps to recover pci/pci-e, would
> you say 'it would be very difficut' and refuse add new callbacks?

Yes.

> It doesn't prevent software from merging some steps. And, we want
> to implement pci/pci-e error recovery for more platforms instead of just
> pSeries.

Yes. The API was designed so that it could be supported on every
current and future platform we could think of. You haven't yet
claimed that "pci-e can't be supported". Based on what
I understand, changing the API wouldn't make the implementation
any easier. (It is very easy to call a callback, and then
examine its return value. Removing a few callbacks does not
materially simplify the recovery mechanism. Managing these
callbacks is *not* the hard part of implementing this thing.)

--linas

2006-09-07 01:58:14

by Yanmin Zhang

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Thu, 2006-09-07 at 04:01, Linas Vepstas wrote:
> On Wed, Sep 06, 2006 at 09:26:56AM +0800, Zhang, Yanmin wrote:
> > > > The
> > > > error_detected of the drivers in the latest kernel who support err handlers
> > > > always returns PCI_ERS_RESULT_NEED_RESET. They are typical examples.
> > >
> > > Just because the current drivers do it this way does not mean that this is
> > > the best way to do things.
> >
> > If it's not the best way, why did you choose to reset slot for e1000/e100/ipr
> > error handlers? They are typical widely-used devices. To make it easier to
> > add error handlers?
>
> I did it that way just to get going, get something working. I do not
> have hardware specs for any of these devices, and do not have much of
> an idea of what they are capable of;
Yes, it's difficult to add fine-grained error handlers for guys who are not
the driver developers.

> the recovery code I wrote is of
> "brute force, hit it with a hammer"-nature. Driver writers who
> know thier hardware well, and are interested in a more refined
> approach are encouraged to actualy use a more refined approach.
I guess almost no driver developer is happy to spend lots of time to
add refined steps. They would like to focus on normal process (for achievement
feeling? :) ).
In addition, if they use fine-grained steps in error handlers, all these
steps might be rewritten when the device specs is upgraded. Fine-grained steps in
error handlers are more difficut to debug.

It's impossible for you to develop error handlers for all device drivers.

The error handlers look a little like suspend/resume. Of course, it's more
complicated. If we could keep it as simple as suspend/resume, it's more welcomed.

pci error shouldn't happen frequently. And when it happens, I think mostly it's
an endpoint device instead of bridge. When it happens, if we choose always
reset slot, performance could be degraded, but not too much. I just deduce, and
didn't test it on a machine with hundreds of devices.

2006-09-07 03:20:49

by Yanmin Zhang

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Thu, 2006-09-07 at 04:39, Linas Vepstas wrote:
> On Wed, Sep 06, 2006 at 10:04:31AM +0800, Zhang, Yanmin wrote:
> > On Wed, 2006-09-06 at 03:17, Linas Vepstas wrote:
> > > On Mon, Sep 04, 2006 at 01:47:30PM +0800, Zhang, Yanmin wrote:
> > > > >
> > > > > Again, consider the multi-function cards. On pSeries, I can only enable
> > > > > DMA on a per-slot basis, not a per-function basis. So if one driver
> > > > > enables DMA before some other driver has reset appropriately, everything
> > > > > breaks.
> > > > Does here 'reset' mean hardware slot reset?
> > >
> > > I should have said: If one driver of a multi-function card enables DMA before
> > > another driver has stabilized its harware, then everything breaks.
> > What's another driver's hardware? A function of the previous multi-function
> > card? Or a function of another device?
>
> Yes. Either. Both. Doesn't matter. Enabling DMA is "granular" at a
> different size scale than pci functions, and possibly even pci devices
> or slots, dependeing on the architecture. Before DMA can be enabled,
> *all* affected device drivers have to be approve, and have to be ready
> for it.
> > > If we enabled both DMA and MMIO at the same time, there are many cases
> > > where the card will immediately trap again -- for example, if its
> > > DMA'ing to some crazy address. Thus, typically, one wants DMA disabled
> > > until after the card reset. Without the mmio_enabled() reset, there
> > > is no way of doing this.
> >
> > Did you asume the card reset is executed by callback mmio_enabled?
>
> I am assuming that, when a driver receives the mmio_enabled() callback,
> it will perform some sort of register i/o. For example, I am currently
> planning to modify the e1000 driver to do the following:
>
> -- The error_occurred() callback returns PCI_ERS_RESULT_CAN_RECOVER
> -- The arch enables mmio, and then calls the mmio_enabled() callback.
> -- The mmio_enabled() callback in the driver takes a full dump of all
> of the regsters on the card. It then returns PCI_ERS_RESULT_NEED_RESET
Such dumping are random data and might be useless. The error recovery procedures
are to process pci hardware errors instead of device driver bug.

> -- The arch performs the full electrical #RST of device. Recovery from
> this point proceeds as before.
The steps are exquisite. Scenario:

The e1000 NIC and another device (maybe a function) are on the same bus. The
error_detected of the second device returns PCI_ERS_RESULT_NEED_RESET, so although
error_detected of e1000 returns PCI_ERS_RESULT_CAN_RECOVER, the slot will
be reset immediately, then error recovery will go to call slot_reset callback
directly. The mmio_enabled is not called.

My above scenario is just to say something is easy to be out of control if the steps
are complicated.

>
> > > Again, consider the multi-function cards. On pSeries, I can only enable
> > > DMA on a per-slot basis, not a per-function basis. So if one driver
> > > enables DMA before some other driver has reset appropriately, everything
> > > breaks.
> >
> > What does 'I' above stand for? The platform error recovery procedure
>
> Yes. The pSeries platform error recovery procedure can only enable DMA
> on a per-slot basis.
>
> > I guess it means platform, that is,
> > only platform enables DMA for the whole slot.
>
> Yes.
>
> > But why does the last sentence
> > become driver enables DMA?
>
> In your proposal, you were suggesting that MMIO and DMA be enabled with
> one and the same routine, and I was attempting to explain why that can't
> work.
Thanks for your explanations. My point is that if driver could enable DMA,
it could do so in the new error_resume. Driver should do more checking before
enabling DMA.

Your scenario only exists when:
1) Only platform could enable DMA and enable it per-slot instead of per-function.
2) And at least one device doesn't want a hard slot reset to recover while
all other impacted devices also don't want a hard slot; Because if one device want a
hard reset, mmio_enabled of all impacted drivers won't be called.
3) And at least one device's DMA is crazy.

If using my new API, I just need destroy one condition above. My requirement is:
Only if a device uses DMA and the driver is not sure or sure if DMA is pending,
its error_detected should return PCI_ERS_RESULT_NEED_RESET. Otherwise, error_detected
is allowed to return whatever.

>
> > Could driver enable DMA for a function?
>
> No, not on pSeries hardware.
>
> > > > If mmio_enabled is not used currently, I think we could delete it firstly. Later on,
> > > > if a platform really need it, we could add it, so we could keep the simplied codes.
> > >
> > > It would be very difficult to add it later. And it would be especially
> > > silly, given that someone would find this discussion in the mailing list
> > > archives.
> > You stick to keep mmio_enabled which is not used currently, but if there will be
> > a new platform who uses a more fine-grained steps to recover pci/pci-e, would
> > you say 'it would be very difficut' and refuse add new callbacks?
>
> Yes.
It's not fare to such other platforms although I have no such example now.

>
> > It doesn't prevent software from merging some steps. And, we want
> > to implement pci/pci-e error recovery for more platforms instead of just
> > pSeries.
>
> Yes. The API was designed so that it could be supported on every
> current and future platform we could think of. You haven't yet
> claimed that "pci-e can't be supported".
Current error handler infrastructure could support pci-e, but I want a better
solution to faciliate driver developers to add error handlers more easily. My
startpoint is driver developer. If they are not willing to add error handlers,
it's impossible to do so for all drivers by you and me.


> Based on what
> I understand, changing the API wouldn't make the implementation
> any easier. (It is very easy to call a callback, and then
> examine its return value.
It's not easy. Just like above scenario, mmio_enabled might be jumped over when
coordinating 2 more devices.
Checking current e100/e1000/ipr error handlers, they look ugly.

> Removing a few callbacks does not
> materially simplify the recovery mechanism. Managing these
> callbacks is *not* the hard part of implementing this thing.)
Above comments are totally from error recovery design point of view. No considering
for driver developers.

BTW, most discussion is about if mmio_enabled should be deleted. As for merging
slot_reset and resume, my reason is that there is no platform specific operation
between calling slot_reset and resume.

Yanmin

2006-09-12 19:39:01

by linas

[permalink] [raw]
Subject: Re: pci error recovery procedure

On Thu, Sep 07, 2006 at 11:18:56AM +0800, Zhang, Yanmin wrote:
> The error recovery procedures
> are to process pci hardware errors instead of device driver bug.

Over the last three years, we've uncovered (and fixed) dozens of
device driver bugs that were only detected because of the pci error
detection hardware. The ability to get device dumps is important,
because many of these bugs are hard to reproduce, require getting
PCI bus analyzers attached to the system, etc.

> Current error handler infrastructure could support pci-e, but I want a better
> solution to faciliate driver developers to add error handlers more easily. My
> startpoint is driver developer. If they are not willing to add error handlers,
> it's impossible to do so for all drivers by you and me.

Right. As a result, we only care about the products that we actually
sell to customers. PCI error recovery is not some "gee its nice" piece
of eye-candy or chrome: either one is serious about high-availability,
or one is not.

--linas