2018-05-30 17:57:05

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH v1 0/2] PCI/AER: Clean up minor logging issues

AER logging printed the plain 16-bit Requester ID straight out of the TLP,
which is hard to interpret, e.g., id=00e4 corresponds to what we normally
see as 00:1c.4 in dmesg or lspci.

Also, there's no need to print the vendor/device ID of the root port
reporting an error; we can easily find that from dmesg or lspci.

---

Bjorn Helgaas (2):
PCI/AER: Decode Error Source Requester ID
PCI/AER: Stop printing vendor/device ID


drivers/pci/pcie/aer/aerdrv_errprint.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)


2018-05-30 17:55:17

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH v1 1/2] PCI/AER: Decode Error Source Requester ID

From: Bjorn Helgaas <[email protected]>

Decode the Requester ID from the AER Error Source Register into domain/
bus/device/function format to match other logging. In cases where the ID
matches the device used for pci_err(), drop the extra ID completely so we
don't print it twice.

Signed-off-by: Bjorn Helgaas <[email protected]>
---
drivers/pci/pcie/aer/aerdrv_errprint.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 21ca5e1b0ded..d7fde8368d81 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -163,17 +163,17 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
int id = ((dev->bus->number << 8) | dev->devfn);

if (!info->status) {
- pci_err(dev, "PCIe Bus Error: severity=%s, type=Unaccessible, id=%04x(Unregistered Agent ID)\n",
- aer_error_severity_string[info->severity], id);
+ pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
+ aer_error_severity_string[info->severity]);
goto out;
}

layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);

- pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, id=%04x(%s)\n",
+ pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
aer_error_severity_string[info->severity],
- aer_error_layer[layer], id, aer_agent_string[agent]);
+ aer_error_layer[layer], aer_agent_string[agent]);

pci_err(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
dev->vendor, dev->device,
@@ -186,7 +186,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)

out:
if (info->id && info->error_dev_num > 1 && info->id == id)
- pci_err(dev, " Error of this Agent(%04x) is reported first\n", id);
+ pci_err(dev, " Error of this Agent is reported first\n");

trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
info->severity, info->tlp_header_valid, &info->tlp);
@@ -194,9 +194,13 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)

void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
{
- pci_info(dev, "AER: %s%s error received: id=%04x\n",
+ u8 bus = info->id >> 8;
+ u8 devfn = info->id & 0xff;
+
+ pci_info(dev, "AER: %s%s error received: %04x:%02x:%02x.%d\n",
info->multi_error_valid ? "Multiple " : "",
- aer_error_severity_string[info->severity], info->id);
+ aer_error_severity_string[info->severity],
+ pci_domain_nr(dev->bus), bus, devfn >> 3, devfn & 0x7);
}

#ifdef CONFIG_ACPI_APEI_PCIEAER


2018-05-30 17:55:31

by Bjorn Helgaas

[permalink] [raw]
Subject: [PATCH v1 2/2] PCI/AER: Stop printing vendor/device ID

From: Bjorn Helgaas <[email protected]>

The Vendor and Device ID of the root port that raised an AER interrupt is
irrelevant and already available via normal enumeration dmesg logging or
lspci.

Remove the Vendor and Device ID from AER logging.

Signed-off-by: Bjorn Helgaas <[email protected]>
---
drivers/pci/pcie/aer/aerdrv_errprint.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index d7fde8368d81..16116844531c 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -175,9 +175,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
aer_error_severity_string[info->severity],
aer_error_layer[layer], aer_agent_string[agent]);

- pci_err(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
- dev->vendor, dev->device,
- info->status, info->mask);
+ pci_err(dev, " error status/mask=%08x/%08x\n", info->status,
+ info->mask);

__aer_print_error(dev, info);



2018-05-30 18:20:18

by Rajat Jain

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] PCI/AER: Stop printing vendor/device ID

On Wed, May 30, 2018 at 10:54 AM Bjorn Helgaas <[email protected]> wrote:

> From: Bjorn Helgaas <[email protected]>

> The Vendor and Device ID of the root port that raised an AER interrupt is
> irrelevant and already available via normal enumeration dmesg logging or
> lspci.

Er, what is getting printed is not the vendor/device id of the root port
but that of the AER source device (the one that root port got an ERR_*
message from). In case of fatal AERs, the end point device may become
inaccessible so lspci will not be available, and enumeration logs (from
boot) may have gotten rolled over. So I think it is still better to print
this information here.

Just my opinion :-)

Thanks,

Rajat


> Remove the Vendor and Device ID from AER logging.

> Signed-off-by: Bjorn Helgaas <[email protected]>
> ---
> drivers/pci/pcie/aer/aerdrv_errprint.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)

> diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c
b/drivers/pci/pcie/aer/aerdrv_errprint.c
> index d7fde8368d81..16116844531c 100644
> --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
> +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
> @@ -175,9 +175,8 @@ void aer_print_error(struct pci_dev *dev, struct
aer_err_info *info)
> aer_error_severity_string[info->severity],
> aer_error_layer[layer], aer_agent_string[agent]);

> - pci_err(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
> - dev->vendor, dev->device,
> - info->status, info->mask);
> + pci_err(dev, " error status/mask=%08x/%08x\n", info->status,
> + info->mask);

> __aer_print_error(dev, info);

2018-05-30 18:34:07

by Lukas Wunner

[permalink] [raw]
Subject: Re: [PATCH v1 1/2] PCI/AER: Decode Error Source Requester ID

On Wed, May 30, 2018 at 12:54:15PM -0500, Bjorn Helgaas wrote:
> void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
> {
> - pci_info(dev, "AER: %s%s error received: id=%04x\n",
> + u8 bus = info->id >> 8;
> + u8 devfn = info->id & 0xff;
> +
> + pci_info(dev, "AER: %s%s error received: %04x:%02x:%02x.%d\n",
> info->multi_error_valid ? "Multiple " : "",
> - aer_error_severity_string[info->severity], info->id);
> + aer_error_severity_string[info->severity],
> + pci_domain_nr(dev->bus), bus, devfn >> 3, devfn & 0x7);

I think PCI_SLOT(devfn), PCI_FUNC(devfn) is a bit more readable.

2018-05-30 18:42:51

by Rajat Jain

[permalink] [raw]
Subject: Re: [PATCH v1 1/2] PCI/AER: Decode Error Source Requester ID

On Wed, May 30, 2018 at 10:54 AM Bjorn Helgaas <[email protected]> wrote:

> From: Bjorn Helgaas <[email protected]>

> Decode the Requester ID from the AER Error Source Register into domain/
> bus/device/function format to match other logging. In cases where the ID
> matches the device used for pci_err(), drop the extra ID completely so we
> don't print it twice.

> Signed-off-by: Bjorn Helgaas <[email protected]>
> ---
> drivers/pci/pcie/aer/aerdrv_errprint.c | 18 +++++++++++-------
> 1 file changed, 11 insertions(+), 7 deletions(-)

> diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c
b/drivers/pci/pcie/aer/aerdrv_errprint.c
> index 21ca5e1b0ded..d7fde8368d81 100644
> --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
> +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
> @@ -163,17 +163,17 @@ void aer_print_error(struct pci_dev *dev, struct
aer_err_info *info)
> int id = ((dev->bus->number << 8) | dev->devfn);

> if (!info->status) {
> - pci_err(dev, "PCIe Bus Error: severity=%s,
type=Unaccessible, id=%04x(Unregistered Agent ID)\n",
> - aer_error_severity_string[info->severity], id);
> + pci_err(dev, "PCIe Bus Error: severity=%s,
type=Inaccessible, (Unregistered Agent ID)\n",
> + aer_error_severity_string[info->severity]);

Does this code path indicate that a requester id was decoded to a device
that is not registered with the kernel? If so, shouldn't we log the bad
requester ID for better debugging, specifically since there is not going to
be any subsequent print about this ID (since we return from this function
in this case)?

> goto out;
> }

> layer = AER_GET_LAYER_ERROR(info->severity, info->status);
> agent = AER_GET_AGENT(info->severity, info->status);

> - pci_err(dev, "PCIe Bus Error: severity=%s, type=%s,
id=%04x(%s)\n",
> + pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> aer_error_severity_string[info->severity],
> - aer_error_layer[layer], id, aer_agent_string[agent]);
> + aer_error_layer[layer], aer_agent_string[agent]);

> pci_err(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
> dev->vendor, dev->device,
> @@ -186,7 +186,7 @@ void aer_print_error(struct pci_dev *dev, struct
aer_err_info *info)

> out:
> if (info->id && info->error_dev_num > 1 && info->id == id)
> - pci_err(dev, " Error of this Agent(%04x) is reported
first\n", id);
> + pci_err(dev, " Error of this Agent is reported first\n");

> trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
> info->severity, info->tlp_header_valid,
&info->tlp);
> @@ -194,9 +194,13 @@ void aer_print_error(struct pci_dev *dev, struct
aer_err_info *info)

> void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
> {
> - pci_info(dev, "AER: %s%s error received: id=%04x\n",
> + u8 bus = info->id >> 8;
> + u8 devfn = info->id & 0xff;
> +
> + pci_info(dev, "AER: %s%s error received: %04x:%02x:%02x.%d\n",
> info->multi_error_valid ? "Multiple " : "",
> - aer_error_severity_string[info->severity], info->id);
> + aer_error_severity_string[info->severity],
> + pci_domain_nr(dev->bus), bus, devfn >> 3, devfn & 0x7);
> }

> #ifdef CONFIG_ACPI_APEI_PCIEAER

2018-05-31 00:29:15

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v1 2/2] PCI/AER: Stop printing vendor/device ID

On Wed, May 30, 2018 at 11:18:35AM -0700, Rajat Jain wrote:
> On Wed, May 30, 2018 at 10:54 AM Bjorn Helgaas <[email protected]> wrote:
>
> > From: Bjorn Helgaas <[email protected]>
>
> > The Vendor and Device ID of the root port that raised an AER interrupt is
> > irrelevant and already available via normal enumeration dmesg logging or
> > lspci.
>
> Er, what is getting printed is not the vendor/device id of the root port
> but that of the AER source device (the one that root port got an ERR_*
> message from). In case of fatal AERs, the end point device may become
> inaccessible so lspci will not be available, and enumeration logs (from
> boot) may have gotten rolled over. So I think it is still better to print
> this information here.

Thanks for looking this over!

You're right, "dev" here is not necessarily the Root Port, so this
changelog is bogus. "dev" came from e_info->dev[] from
aer_process_err_devices().

I think to be more precise, aer_irq() reads the Root Port's
PCI_ERR_ROOT_ERR_SRC register, which gives us the Requester ID from
the ERR_* message. Then find_source_device() walks the tree starting
with the Root Port, looking for:

- a device that matches the Requester ID, or
- a device that doesn't match the Requester ID (e.g., because a VMD
port clears the source ID) but has AER enabled and has logged an
error of the same type (ERR_COR vs ERR_FATAL/NONFATAL) we're
currently decoding

So there might be multiple "dev" pointers in e_info->dev[] because
several devices could have logged errors.

I'm not convinced the vendor/device ID is that useful because there
might be several devices with the same ID, so it doesn't really tell
you which one. The Requester ID (bus/device/function) is the
important thing.

The current code is not ideal because the find_source_device() path
depends on the pci_dev still being present and even accessible (so we
can read DEVCTL, ERR_COR_STATUS, etc), which might not be the case.

If find_source_device() fails, i.e., it can't find a matching pci_dev
and prints the "can't find device of ID%04x" message, we're in real
trouble because we don't call aer_process_err_devices(), which means
we don't clear PCI_ERR_COR_STATUS.

Anyway, I'll abandon this change for now since it's not a clear
improvement.

> > Remove the Vendor and Device ID from AER logging.
>
> > Signed-off-by: Bjorn Helgaas <[email protected]>
> > ---
> > drivers/pci/pcie/aer/aerdrv_errprint.c | 5 ++---
> > 1 file changed, 2 insertions(+), 3 deletions(-)
>
> > diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c
> b/drivers/pci/pcie/aer/aerdrv_errprint.c
> > index d7fde8368d81..16116844531c 100644
> > --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
> > +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
> > @@ -175,9 +175,8 @@ void aer_print_error(struct pci_dev *dev, struct
> aer_err_info *info)
> > aer_error_severity_string[info->severity],
> > aer_error_layer[layer], aer_agent_string[agent]);
>
> > - pci_err(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
> > - dev->vendor, dev->device,
> > - info->status, info->mask);
> > + pci_err(dev, " error status/mask=%08x/%08x\n", info->status,
> > + info->mask);
>
> > __aer_print_error(dev, info);

2018-05-31 04:44:13

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v1 1/2] PCI/AER: Decode Error Source Requester ID

On Wed, May 30, 2018 at 11:41:23AM -0700, Rajat Jain wrote:
> On Wed, May 30, 2018 at 10:54 AM Bjorn Helgaas <[email protected]> wrote:
>
> > From: Bjorn Helgaas <[email protected]>
>
> > Decode the Requester ID from the AER Error Source Register into domain/
> > bus/device/function format to match other logging. In cases where the ID
> > matches the device used for pci_err(), drop the extra ID completely so we
> > don't print it twice.
>
> > Signed-off-by: Bjorn Helgaas <[email protected]>
> > ---
> > drivers/pci/pcie/aer/aerdrv_errprint.c | 18 +++++++++++-------
> > 1 file changed, 11 insertions(+), 7 deletions(-)
>
> > diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c
> b/drivers/pci/pcie/aer/aerdrv_errprint.c
> > index 21ca5e1b0ded..d7fde8368d81 100644
> > --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
> > +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
> > @@ -163,17 +163,17 @@ void aer_print_error(struct pci_dev *dev, struct
> aer_err_info *info)
> > int id = ((dev->bus->number << 8) | dev->devfn);

> > if (!info->status) {
> > - pci_err(dev, "PCIe Bus Error: severity=%s,
> type=Unaccessible, id=%04x(Unregistered Agent ID)\n",
> > - aer_error_severity_string[info->severity], id);
> > + pci_err(dev, "PCIe Bus Error: severity=%s,
> type=Inaccessible, (Unregistered Agent ID)\n",
> > + aer_error_severity_string[info->severity]);
>
> Does this code path indicate that a requester id was decoded to a device
> that is not registered with the kernel? If so, shouldn't we log the bad
> requester ID for better debugging, specifically since there is not going to
> be any subsequent print about this ID (since we return from this function
> in this case)?

Previously we printed "id", which was constructed above from "dev":

id = ((dev->bus->number << 8) | dev->devfn);

so even if we print "id=%04x", it contains exactly the same
information as the bus/device/function printed using "dev".

So no, I don't think "Unregistered Agent ID" means a device not registered
with the kernel. At any rate, we do have a pci_dev for it.

I *think* "info->status == 0" means PCI_ERR_COR_STATUS (or
PCI_ERR_UNCOR_STATUS) was zero, i.e., we didn't find any error status
bits set for this device. I don't think "Unregistered Agent ID" is a
very good description of this situation.

Bjorn

2018-05-31 04:55:17

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v1 1/2] PCI/AER: Decode Error Source Requester ID

On Wed, May 30, 2018 at 08:32:41PM +0200, Lukas Wunner wrote:
> On Wed, May 30, 2018 at 12:54:15PM -0500, Bjorn Helgaas wrote:
> > void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
> > {
> > - pci_info(dev, "AER: %s%s error received: id=%04x\n",
> > + u8 bus = info->id >> 8;
> > + u8 devfn = info->id & 0xff;
> > +
> > + pci_info(dev, "AER: %s%s error received: %04x:%02x:%02x.%d\n",
> > info->multi_error_valid ? "Multiple " : "",
> > - aer_error_severity_string[info->severity], info->id);
> > + aer_error_severity_string[info->severity],
> > + pci_domain_nr(dev->bus), bus, devfn >> 3, devfn & 0x7);
>
> I think PCI_SLOT(devfn), PCI_FUNC(devfn) is a bit more readable.

I used those originally, but of course those definitions predate PCIe
so they aren't clearly related to a Requester ID.

I searched the PCIe spec for the specifics of the Requester ID
composition. It was surprisingly hard to find a clear statement. The
best I found was PCIe r4.0, sec 6.13, which says

Routing IDs, Requester IDs, and Completer IDs are 16-bit
identifiers traditionally composed of three fields: an 8-bit Bus
Number, a 5-bit Device Number, and a 3-bit Function Number.

Even that isn't specific about where the fields are,

But it's probably not worth obsessing over this and PCI_SLOT() and
PCI_FUNC() are definitely more readable, so I changed them.

2018-05-31 17:37:10

by Rajat Jain

[permalink] [raw]
Subject: Re: [PATCH v1 1/2] PCI/AER: Decode Error Source Requester ID

On Wed, May 30, 2018 at 9:42 PM Bjorn Helgaas <[email protected]> wrote:
>
> On Wed, May 30, 2018 at 11:41:23AM -0700, Rajat Jain wrote:
> > On Wed, May 30, 2018 at 10:54 AM Bjorn Helgaas <[email protected]> wrote:
> >
> > > From: Bjorn Helgaas <[email protected]>
> >
> > > Decode the Requester ID from the AER Error Source Register into domain/
> > > bus/device/function format to match other logging. In cases where the ID
> > > matches the device used for pci_err(), drop the extra ID completely so we
> > > don't print it twice.
> >
> > > Signed-off-by: Bjorn Helgaas <[email protected]>
> > > ---
> > > drivers/pci/pcie/aer/aerdrv_errprint.c | 18 +++++++++++-------
> > > 1 file changed, 11 insertions(+), 7 deletions(-)
> >
> > > diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c
> > b/drivers/pci/pcie/aer/aerdrv_errprint.c
> > > index 21ca5e1b0ded..d7fde8368d81 100644
> > > --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
> > > +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
> > > @@ -163,17 +163,17 @@ void aer_print_error(struct pci_dev *dev, struct
> > aer_err_info *info)
> > > int id = ((dev->bus->number << 8) | dev->devfn);
>
> > > if (!info->status) {
> > > - pci_err(dev, "PCIe Bus Error: severity=%s,
> > type=Unaccessible, id=%04x(Unregistered Agent ID)\n",
> > > - aer_error_severity_string[info->severity], id);
> > > + pci_err(dev, "PCIe Bus Error: severity=%s,
> > type=Inaccessible, (Unregistered Agent ID)\n",
> > > + aer_error_severity_string[info->severity]);
> >
> > Does this code path indicate that a requester id was decoded to a device
> > that is not registered with the kernel? If so, shouldn't we log the bad
> > requester ID for better debugging, specifically since there is not going to
> > be any subsequent print about this ID (since we return from this function
> > in this case)?
>
> Previously we printed "id", which was constructed above from "dev":
>
> id = ((dev->bus->number << 8) | dev->devfn);
>
> so even if we print "id=%04x", it contains exactly the same
> information as the bus/device/function printed using "dev".

Sorry, my bad, I missed it, despite it being right there in my face :-).

>
> So no, I don't think "Unregistered Agent ID" means a device not registered
> with the kernel. At any rate, we do have a pci_dev for it.
>
> I *think* "info->status == 0" means PCI_ERR_COR_STATUS (or
> PCI_ERR_UNCOR_STATUS) was zero, i.e., we didn't find any error status
> bits set for this device. I don't think "Unregistered Agent ID" is a
> very good description of this situation.

Agree, may be something along the lines of "Unknown Error Status"
might be better.

Thanks,

Rajat

>
> Bjorn

2018-06-05 22:23:23

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v1 0/2] PCI/AER: Clean up minor logging issues

On Wed, May 30, 2018 at 12:54:10PM -0500, Bjorn Helgaas wrote:
> AER logging printed the plain 16-bit Requester ID straight out of the TLP,
> which is hard to interpret, e.g., id=00e4 corresponds to what we normally
> see as 00:1c.4 in dmesg or lspci.
>
> Also, there's no need to print the vendor/device ID of the root port
> reporting an error; we can easily find that from dmesg or lspci.
>
> ---
>
> Bjorn Helgaas (2):
> PCI/AER: Decode Error Source Requester ID
> PCI/AER: Stop printing vendor/device ID
>
>
> drivers/pci/pcie/aer/aerdrv_errprint.c | 23 +++++++++++++----------
> 1 file changed, 13 insertions(+), 10 deletions(-)

I applied the first but not the second to pci/aer for v4.18.