PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. To provide more
robust error reporting, PCI Express defines the Advanced Error
Reporting (AER) capability, which is implemented as a PCI Express
extended capability structure. With this capability, a PCI Express
component that detects an error can send an error message to the
Root Port associated with its hierarchy.
The PCI Express Advanced Error Reporting driver is a PCI Express bus
service driver that handles Advanced Error Reporting on Root Ports.
The PCI Express AER Root driver provides the following functions:
- A mechanism to allow a driver of a PCI Express component to
register/unregister its AER-aware callback handler with the
PCI Express AER Root driver. This mechanism is optional; it
allows the PCI Express AER Root driver to query the PCI Express
component's device driver to determine more precisely which
error occurred and at what severity.
- A mechanism to process the error reporting message detected
by Root Ports, and
- Report the errors to user.
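
The register/unregister mechanism in the first item can be pictured
with a small userspace sketch. It is not the interface from the
patches: every name here (aer_register, aer_unregister, aer_dispatch,
aer_query_fn, enum aer_sev, demo_query) is hypothetical, and a fixed
table stands in for whatever bookkeeping the real driver uses. It only
shows the idea that the Root driver queries a component's driver when
one has registered, and otherwise falls back to what it already knows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical severity levels; the names are illustrative only. */
enum aer_sev { AER_SEV_CORRECTABLE, AER_SEV_NONFATAL, AER_SEV_FATAL };

/* Hypothetical callback: the device driver fills in what it observed. */
typedef int (*aer_query_fn)(uint16_t bdf, enum aer_sev *sev, uint32_t *status);

#define AER_MAX_HANDLERS 8

struct aer_handler {
    uint16_t bdf;       /* bus/device/function of the component */
    aer_query_fn fn;    /* NULL marks a free slot */
};

static struct aer_handler handlers[AER_MAX_HANDLERS];

/* Register a callback for a device. Returns 0 on success, -1 if the
 * device is already registered or the table is full. */
int aer_register(uint16_t bdf, aer_query_fn fn)
{
    assert(fn != NULL);
    size_t free_slot = AER_MAX_HANDLERS;

    for (size_t i = 0; i < AER_MAX_HANDLERS; i++) {
        if (handlers[i].fn && handlers[i].bdf == bdf)
            return -1;
        if (!handlers[i].fn && free_slot == AER_MAX_HANDLERS)
            free_slot = i;
    }
    if (free_slot == AER_MAX_HANDLERS)
        return -1;
    handlers[free_slot].bdf = bdf;
    handlers[free_slot].fn = fn;
    return 0;
}

/* Remove a callback. Returns 0 on success, -1 if none was registered. */
int aer_unregister(uint16_t bdf)
{
    for (size_t i = 0; i < AER_MAX_HANDLERS; i++) {
        if (handlers[i].fn && handlers[i].bdf == bdf) {
            handlers[i].fn = NULL;
            return 0;
        }
    }
    return -1;
}

/* Root-driver side: returns 1 if a registered driver answered the
 * query, 0 if no driver is registered, -1 if the callback failed. */
int aer_dispatch(uint16_t bdf, enum aer_sev *sev, uint32_t *status)
{
    for (size_t i = 0; i < AER_MAX_HANDLERS; i++) {
        if (handlers[i].fn && handlers[i].bdf == bdf)
            return handlers[i].fn(bdf, sev, status) == 0 ? 1 : -1;
    }
    return 0;
}

/* Example driver callback: always reports a non-fatal error. */
int demo_query(uint16_t bdf, enum aer_sev *sev, uint32_t *status)
{
    (void)bdf;
    *sev = AER_SEV_NONFATAL;
    *status = 0x00001000u;
    return 0;
}
```

A driver would call aer_register() at probe time and aer_unregister()
at remove time; aer_dispatch() returning 0 corresponds to the "option"
not being used by that component's driver.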
This patchset, which is based on Linux kernel 2.6.11-rc5, consists
of six patches, listed below in the order in which they should be
applied.
[PATCH 1/6] <- first patch to be applied
[PATCH 2/6] <- second patch to be applied
[PATCH 3/6] <- third patch to be applied
[PATCH 4/6] <- fourth patch to be applied
[PATCH 5/6] <- fifth patch to be applied
[PATCH 6/6] <- last patch to be applied
Please send us any suggestions, feedback, comments or alternative
designs.
Signed-off-by: T. Long Nguyen <[email protected]>
--------------------------------------------------------------------
On Fri, Mar 11, 2005 at 04:10:28PM -0800, long wrote:
>
> - Report the errors to user.
This is done through the syslog, right? Is that acceptable?
It looks like you are logging a lot of stuff, all without a kernel log
level, which is going to really mess up syslog parsers.
Have you thought about just providing userspace with access to the error
message, in binary form, from a sysfs file, and causing a kevent to wake
userspace up to know to read from the file? That way all of the parsing
of the error log can be done in userspace, and there is no formatting of
the messages from within the kernel.
thanks,
greg k-h
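
To make the "binary form, parsed in userspace" idea concrete, here is
a rough userspace sketch. The record layout is purely an assumption
for illustration (8 bytes, little-endian: 16-bit bus/device/function,
8-bit severity, one reserved byte, 32-bit status); nothing in this
thread defines such a format, and struct aer_record and parse_record
are hypothetical names. The bytes would come from whatever file the
kernel ends up exporting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded form of one error record. */
struct aer_record {
    uint16_t bdf;       /* bus/device/function of the reporting agent */
    uint8_t  severity;  /* assumed encoding: 0, 1, 2 by rising severity */
    uint32_t status;    /* raw error status register value */
};

#define AER_RECORD_SIZE 8u  /* assumed on-the-wire size in bytes */

/* Decode one record from a raw buffer. Fields are read byte by byte so
 * the result does not depend on host endianness or struct padding.
 * Returns 0 on success, -1 if the buffer is too short. */
int parse_record(const uint8_t *buf, size_t len, struct aer_record *out)
{
    assert(buf != NULL && out != NULL);
    if (len < AER_RECORD_SIZE)
        return -1;

    out->bdf = (uint16_t)(buf[0] | (buf[1] << 8));
    out->severity = buf[2];
    /* buf[3] is reserved in this assumed layout */
    out->status = (uint32_t)buf[4]
                | ((uint32_t)buf[5] << 8)
                | ((uint32_t)buf[6] << 16)
                | ((uint32_t)buf[7] << 24);
    return 0;
}
```

With this split the kernel only has to copy out raw register values,
and all formatting decisions stay in the userspace tool.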
On Friday, March 11, 2005 11:21 PM Greg KH wrote:
>>
>> - Report the errors to user.
>>
>This is done through the syslog, right? Is that acceptable?
The error reports can be written automatically to /var/log/messages
or be read manually through the syslog. I am not sure whether that is
acceptable, but I like your suggestion below.
>It looks like you are logging a lot of stuff, all without a kernel log
>level, which is going to really mess up syslog parsers.
>
>Have you thought about just providing userspace with access to the
>error message, in binary form, from a sysfs file, and causing a kevent
>to wake userspace up to know to read from the file? That way all of
>the parsing of the error log can be done in userspace, and there is no
>formatting of the messages from within the kernel.
Again, I like this suggestion.
Thanks,
Long
Tom,
A co-worker made the following observation (I'm paraphrasing):
...this proposal does not deal with the Error Reporting ECN.
For example, they do not show the advisory non-fatal bit in
the correctable error status register.
I believe he is referring to the "Error Clarifications ECN":
http://www.pcisig.com/specifications/pciexpress/ECN_-_Error_Clarifications.pdf
Looks like all PCI-E ECNs are available [just not the original docs :^( ]:
http://www.pcisig.com/specifications/pciexpress/specifications
hth,
grant
On Tuesday, March 15, 2005 12:12 PM Grant Grundler wrote:
>Tom,
>A co-worker made the following observation (I'm paraphrasing):
> ...this proposal does not deal with the Error Reporting ECN.
> For example, they do not show the advisory non-fatal bit in
> the correctable error status register.
Does he refer to the ECN update on the Received Error Bit[0] of the
Correctable Error Status Register and on the Training Error Bit[0] of
the Uncorrectable Error Status Register? If not, please clarify his
comments for us.
Thanks,
Long
On Tue, Mar 15, 2005 at 01:54:32PM -0800, Nguyen, Tom L wrote:
> On Tuesday, March 15, 2005 12:12 PM Grant Grundler wrote:
> >Tom,
> >A co-worker made the following observation (I'm paraphrasing):
> > ...this proposal does not deal with the Error Reporting ECN.
> > For example, they do not show the advisory non-fatal bit in
> > the correctable error status register.
>
> Does he refer to the ECN update on the Received Error Bit[0] of the
> Correctable Error Status Register and on the Training Error Bit[0] of
> the Uncorrectable Error Status Register? If not, please clarify his
> comments for us.
Yes - I believe so.
grant
On Tuesday, March 15, 2005 2:38 PM Grant Grundler wrote:
>> >A co-worker made the following observation (I'm paraphrasing):
>> > ...this proposal does not deal with the Error Reporting ECN.
>> > For example, they do not show the advisory non-fatal bit in
>> > the correctable error status register.
>>
>> Does he refer to the ECN update on the Received Error Bit[0] of the
>> Correctable Error Status Register and on the Training Error Bit[0] of
>> the Uncorrectable Error Status Register? If not, please clarify his
>> comments for us.
>Yes - I believe so.
Great! I will make changes to reflect this update. Thanks for pointing
it out.
Thanks,
Long
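
For reference, a minimal sketch of decoding the Correctable Error
Status register with the advisory non-fatal bit included. The bit
positions are the ones I understand the PCI Express base specification
plus the ECN to define (bit 13 for Advisory Non-Fatal Error); please
check them against the published documents. The table and the function
name cor_status_names are hypothetical and are not taken from the
patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>  /* only needed by callers that compare names */

struct cor_bit {
    uint32_t mask;
    const char *name;
};

/* Correctable Error Status bits, lowest bit first (assumed layout). */
static const struct cor_bit cor_bits[] = {
    { 1u << 0,  "Receiver Error" },
    { 1u << 6,  "Bad TLP" },
    { 1u << 7,  "Bad DLLP" },
    { 1u << 8,  "REPLAY_NUM Rollover" },
    { 1u << 12, "Replay Timer Timeout" },
    { 1u << 13, "Advisory Non-Fatal Error" },
};

/* Store the names of the bits set in 'status' into 'out' (at most
 * 'max' entries) and return how many were stored. */
size_t cor_status_names(uint32_t status, const char **out, size_t max)
{
    size_t n = 0;

    assert(out != NULL);
    for (size_t i = 0; i < sizeof(cor_bits) / sizeof(cor_bits[0]); i++) {
        if ((status & cor_bits[i].mask) && n < max)
            out[n++] = cor_bits[i].name;
    }
    return n;
}
```

A status value of 0x2001, for example, has bits 0 and 13 set, so it
decodes to a Receiver Error together with an Advisory Non-Fatal Error.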
On Tue, Mar 15, 2005 at 01:11:39PM -0700, Grant Grundler wrote:
> Tom,
> A co-worker made the following observation (I'm paraphrasing):
> ...this proposal does not deal with the Error Reporting ECN.
> For example, they do not show the advisory non-fatal bit in
> the correctable error status register.
>
> I believe he is referring to the "Error Clarifications ECN":
>
> http://www.pcisig.com/specifications/pciexpress/ECN_-_Error_Clarifications.pdf
Tom,
Sorry - I got this wrong.
He was referring to an unpublished draft "Error Reporting ECN".
You'll have to talk to Intel's PCI-SIG representative to get a copy.
[ Ugh. And everyone else is SOL - sorry ]
I'm annoyed he wanted me to raise this in a public forum without
having a public document to point at. And I'm annoyed at myself
for being lazy and not verifying that beforehand...
sorry,
grant
On Tue, Mar 15, 2005 at 07:12:07PM -0700, Grant Grundler wrote:
> On Tue, Mar 15, 2005 at 01:11:39PM -0700, Grant Grundler wrote:
> > Tom,
> > A co-worker made the following observation (I'm paraphrasing):
> > ...this proposal does not deal with the Error Reporting ECN.
> > For example, they do not show the advisory non-fatal bit in
> > the correctable error status register.
> >
> > I believe he is referring to the "Error Clarifications ECN":
> >
> > http://www.pcisig.com/specifications/pciexpress/ECN_-_Error_Clarifications.pdf
>
> Tom,
> Sorry - I got this wrong.
> He was referring to an unpublished draft "Error Reporting ECN".
> You'll have to talk to Intel's PCI-SIG representative to get a copy.
> [ Ugh. And everyone else is SOL - sorry ]
Then we have no obligation to be compliant with an unpublished spec :)
greg k-h
On Tue, Mar 15, 2005 at 07:12:07PM -0700, Grant Grundler wrote:
...
> He was referring to an unpublished draft "Error Reporting ECN".
> You'll have to talk to Intel's PCI-SIG representative to get a copy.
Good News: the "Error Reporting ECN" is now posted on the PCISIG website.
http://www.pcisig.com/specifications/pciexpress/specifications/ECN_Error_Reporting_050127_clean.pdf
Tom, please review and see if/how that changes your implementation.
thanks,
grant
On Friday, March 18, 2005 10:26 AM Grant Grundler wrote:
>> He was referring to an unpublished draft "Error Reporting ECN".
>> You'll have to talk to Intel's PCI-SIG representative to get a copy.
>
>Good News: the "Error Reporting ECN" is now posted on the PCISIG
>website.
>
>Tom, please review and see if/how that changes your implementation.
Agreed. Thanks for the update.
Thanks,
Long