2006-03-10 17:03:56

by Doug Thompson

Subject: Re: [PATCH] EDAC: core EDAC support code

On Fri, 2006-03-10 at 11:40 +0000, Arjan van de Ven wrote:
> On Fri, 2006-03-10 at 11:06 +0000, Tim Small wrote:
> > Arjan van de Ven wrote:
> >
> > >>It depends on how many PCI devices in your machine you wish to
> > >>blacklist or whitelist. The motivation for this feature is that
> > >>certain known badly-designed devices report an endless stream of
> > >>spurious PCI bus parity errors. We want to skip such devices when
> > >>checking for PCI bus parity errors.
> > >>
> > >
> > >ok so this is actually a per pci device property!
> > >I would suggest moving this property to the pci device itself, not doing
> > >it inside an edac directory.
> > >
> > >
> > Yes, this seems more sensible to me. For one thing, I suspect that just
> > keying on vendor:device is probably too blunt for this and that
> > blacklisting a particular PCI device revision is a likely requirement,
> > as well as subsystem vendor/subsystem device.
>
> and maybe even something as funky as firmware version.
> So it for sure is a per device (not per ID) property, and something that
> needs a global quirk table kind of thing with the option to do per
> driver overrides

Very definitely, this non-conforming misfeature of PCI compliance is a
per PCI device attribute. At the very least it is tied to VENDOR:DEVICE
tuple, and probably a subsystem vendor/device tuple as well. As to
firmware, that is also likely. Mellanox promised a new firmware update
to their board that supposedly fixes this issue. Yet, I find no firmware
value in the PCI spec, just the Revision ID, which could be used as a
firmware identifier. This is up to the vendor.

So in order to be sure I understand, if this PARITY Non-Conformance
attribute was "moved" to the per device directory of sysfs
(/sys/devices/pci0000:00/0000:00:06.0 for an example), then we would
need a userland attribute file created here and then stored in the
'pci_dev' structure or the mentioned quirk structure. This field then
could be set by userland script(s), then EDAC-PCI could examine that
data in its iteration of pci devices. Is that correct?

I will admit I have heard of the "quirk" tables in the kernel, but I
don't fully understand them. From what I read here, a PCI device quirk
table would be a structure parallel to 'struct pci_dev' for a given PCI
device. So for every pci_dev structure created, a quirk table structure
would also be created, and that quirk entry would hold a PARITY data
item. That data item would be exposed in sysfs under /sys/devices/pci*,
as in the example above.

A new getter function would be needed so the EDAC PCI iterator could
'get' the current value of the attribute.

If the above is correct, then who would we need to contact for said
modification or approval for such? Is that you Greg KH, since you are
listed as the PCI SUBSYSTEM maintainer?

thanks

doug t




2006-03-10 17:11:59

by Arjan van de Ven

Subject: Re: [PATCH] EDAC: core EDAC support code


> > and maybe even something as funky as firmware version.
> > So it for sure is a per device (not per ID) property, and something that
> > needs a global quirk table kind of thing with the option to do per
> > driver overrides
>
> Very definitely, this non-conforming misfeature of PCI compliance is a
> per PCI device attribute. At the very least it is tied to VENDOR:DEVICE
> tuple, and probably a subsystem vendor/device tuple as well. As to
> firmware, that is also likely. Mellanox promised a new firmware update
> to their board that supposedly fixes this issue. Yet, I find no firmware
> value in the PCI spec, just the Revision ID, which could be used as a
> firmware identifier. This is up to the vendor.

Exactly. So this is why a device driver needs to be able to override.
E.g. for such a device, turn parity checking off with a global quirk,
and then let the driver say "eh, it's ok for THIS case".
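
The global-quirk-plus-driver-override idea can be sketched as a small
userspace model. This is only an illustration under stated assumptions:
'struct model_pci_dev', its 'broken_parity' flag, the quirk table, and
both helper functions are hypothetical names invented here (they are not
existing kernel interfaces), and the vendor/device IDs are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct pci_dev; only the
 * fields needed for this illustration are modelled. */
struct model_pci_dev {
	uint16_t vendor;
	uint16_t device;
	uint8_t  revision;
	bool     broken_parity;	/* hypothetical per-device flag */
};

/* One entry of the hypothetical global quirk table. */
struct parity_quirk {
	uint16_t vendor;
	uint16_t device;
	bool     match_revision;	/* false: entry covers every revision */
	uint8_t  revision;	/* only compared if match_revision is set */
};

/* Made-up IDs, for illustration only. */
static const struct parity_quirk parity_quirks[] = {
	{ 0x1234, 0x0001, false, 0x00 },	/* all revisions */
	{ 0x1234, 0x0002, true,  0x01 },	/* revision 0x01 only */
};

/* Global quirk pass: flag the device if it matches a table entry. */
static void apply_parity_quirks(struct model_pci_dev *dev)
{
	size_t n = sizeof(parity_quirks) / sizeof(parity_quirks[0]);

	assert(dev != NULL);
	for (size_t i = 0; i < n; i++) {
		const struct parity_quirk *q = &parity_quirks[i];

		if (q->vendor != dev->vendor || q->device != dev->device)
			continue;
		if (q->match_revision && q->revision != dev->revision)
			continue;
		dev->broken_parity = true;
		return;
	}
}

/* Per-driver override: a driver that knows this particular device is
 * fine (for example after checking its firmware) clears the flag. */
static void driver_mark_parity_ok(struct model_pci_dev *dev)
{
	assert(dev != NULL);
	dev->broken_parity = false;
}
```

In this model the quirk pass would run once when the device is
discovered, and a driver would call the override later, from its own
probe path, once it has decided the device is trustworthy.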


> So in order to be sure I understand, if this PARITY Non-Conformance
> attribute was "moved" to the per device directory of sysfs
> (/sys/devices/pci0000:00/0000:00:06.0 for an example), then we would
> need a userland attribute file created here and then stored in the
> 'pci_dev' structure

yes. Well to some degree I'm not even sure it needs to be exposed to
userland like this. At least normally the kernel should know this
internally and automatically. (after all the kernel has the job to
abstract the hardware for the rest of the system; dealing with broken
hardware is part of that)


> or the mentioned quirk structure. This field then
> could be set by userland script(s), then EDAC-PCI could examine that
> data in its iteration of pci devices. Is that correct?

that sounds way way way too complex. If this is "just" a field in the
pci device... why would userland need to get involved? Your kernel side
should be able to see that directly just fine.
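
A minimal sketch of that kernel-side check, again as a userspace model
rather than real EDAC code: the scanner reads a per-device flag directly
while iterating, with no userland step in between. The names
'struct sketch_pci_dev', 'broken_parity', 'parity_error_latched' and
'edac_pci_scan' are all assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev. */
struct sketch_pci_dev {
	const char *name;
	bool broken_parity;		/* hypothetical flag set by quirk code */
	bool parity_error_latched;	/* models the device's parity status bit */
};

/* Walk all devices and count latched parity errors, skipping devices
 * flagged as unreliable reporters. The latched bit is cleared on each
 * device that is actually checked; skipped devices are left untouched. */
static size_t edac_pci_scan(struct sketch_pci_dev *devs, size_t n)
{
	size_t errors = 0;

	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (devs[i].broken_parity)
			continue;	/* known spurious reports: ignore */
		if (devs[i].parity_error_latched) {
			errors++;
			devs[i].parity_error_latched = false;
		}
	}
	return errors;
}
```

The design point is that the flag lives with the device, so the
iterator needs no separate blacklist lookup and no sysfs round trip.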



> If the above is correct, then who would we need to contact for said
> modification or approval for such? Is that you Greg KH, since you are
> listed as the PCI SUBSYSTEM maintainer?

Greg needs to OK the addition to the pci struct, but I don't foresee a
problem personally since this is a more or less obvious and logical
thing to add, and useful for more than one architecture