2018-06-04 15:57:24

by Alexandru Gagniuc

Subject: [PATCH v3] PCI: Check for PCIe downtraining conditions

PCIe downtraining happens when both the device and PCIe port are
capable of a larger bus width or higher speed than negotiated.
Downtraining might be indicative of other problems in the system, and
identifying this from userspace is neither intuitive, nor straigh
forward.

The easiest way to detect this is with pcie_print_link_status(),
since the bottleneck is usually the link that is downtrained. It's not
a perfect solution, but it works extremely well in most cases.

Signed-off-by: Alexandru Gagniuc <[email protected]>
---

Changes since v2:
- Check dev->is_virtfn flag

Changes since v1:
- Use pcie_print_link_status() instead of reimplementing logic

drivers/pci/probe.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ac91b6fd0bcd..a88ec8c25dd5 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2146,6 +2146,25 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
 	return dev;
 }
 
+static void pcie_check_upstream_link(struct pci_dev *dev)
+{
+
+	if (!pci_is_pcie(dev))
+		return;
+
+	/* Look from the device up to avoid downstream ports with no devices. */
+	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
+	    (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
+	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
+		return;
+
+	/* Multi-function PCIe share the same link/status. */
+	if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
+		return;
+
+	pcie_print_link_status(dev);
+}
+
 static void pci_init_capabilities(struct pci_dev *dev)
 {
 	/* Enhanced Allocation */
@@ -2181,6 +2200,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
 	/* Advanced Error Reporting */
 	pci_aer_init(dev);
 
+	/* Check link and detect downtrain errors */
+	pcie_check_upstream_link(dev);
+
 	if (pci_probe_reset_function(dev) == 0)
 		dev->reset_fn = 1;
 }
--
2.14.4



2018-06-05 12:28:02

by Andy Shevchenko

Subject: Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

On Mon, Jun 4, 2018 at 6:55 PM, Alexandru Gagniuc <[email protected]> wrote:
> PCIe downtraining happens when both the device and PCIe port are
> capable of a larger bus width or higher speed than negotiated.
> Downtraining might be indicative of other problems in the system, and
> identifying this from userspace is neither intuitive, nor straigh
> forward.
>
> The easiest way to detect this is with pcie_print_link_status(),
> since the bottleneck is usually the link that is downtrained. It's not
> a perfect solution, but it works extremely well in most cases.

Have you seen any of my comments?
For your convenience, I'm repeating them below.

>
> Signed-off-by: Alexandru Gagniuc <[email protected]>
> ---
>
> Changes since v2:
> - Check dev->is_virtfn flag
>
> Changes since v1:
> - Use pcie_print_link_status() instead of reimplementing logic
>
> drivers/pci/probe.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index ac91b6fd0bcd..a88ec8c25dd5 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2146,6 +2146,25 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
> return dev;
> }
>
> +static void pcie_check_upstream_link(struct pci_dev *dev)
> +{

> +

This is a redundant blank line.

> + if (!pci_is_pcie(dev))
> + return;
> +
> + /* Look from the device up to avoid downstream ports with no devices. */
> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
> + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
> + return;

I looked briefly at the use of these calls and perhaps it might make
sense to introduce
pci_is_pcie_type(dev, type) which unifies pci_is_pcie() + pci_pcie_type().
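For illustration, such a helper might look like this (a sketch only;
pci_is_pcie_type() is hypothetical and not in the tree at this point):

static inline bool pci_is_pcie_type(struct pci_dev *dev, int type)
{
	/* Combines the PCIe capability check with the port-type check. */
	return pci_is_pcie(dev) && pci_pcie_type(dev) == type;
}

The test above would then collapse to:

	if (!pci_is_pcie_type(dev, PCI_EXP_TYPE_ENDPOINT) &&
	    !pci_is_pcie_type(dev, PCI_EXP_TYPE_LEG_END) &&
	    !pci_is_pcie_type(dev, PCI_EXP_TYPE_UPSTREAM))
		return;

(the separate !pci_is_pcie(dev) early return would be subsumed, since a
non-PCIe device fails all three checks)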

> +
> + /* Multi-function PCIe share the same link/status. */

> + if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)

One pair of parens is not needed.

> + return;
> +
> + pcie_print_link_status(dev);
> +}
> +
> static void pci_init_capabilities(struct pci_dev *dev)
> {
> /* Enhanced Allocation */
> @@ -2181,6 +2200,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
> /* Advanced Error Reporting */
> pci_aer_init(dev);
>
> + /* Check link and detect downtrain errors */
> + pcie_check_upstream_link(dev);
> +
> if (pci_probe_reset_function(dev) == 0)
> dev->reset_fn = 1;
> }
> --
> 2.14.4
>



--
With Best Regards,
Andy Shevchenko

2018-06-05 13:06:57

by Andy Shevchenko

Subject: Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

On Tue, Jun 5, 2018 at 3:27 PM, Andy Shevchenko
<[email protected]> wrote:
> On Mon, Jun 4, 2018 at 6:55 PM, Alexandru Gagniuc <[email protected]> wrote:
>> PCIe downtraining happens when both the device and PCIe port are
>> capable of a larger bus width or higher speed than negotiated.
>> Downtraining might be indicative of other problems in the system, and
>> identifying this from userspace is neither intuitive, nor straigh
>> forward.
>>
>> The easiest way to detect this is with pcie_print_link_status(),
>> since the bottleneck is usually the link that is downtrained. It's not
>> a perfect solution, but it works extremely well in most cases.
>
> Have you seen any of my comments?
> For your convenience, I'm repeating them below.

Ah, found the answer in a pile of emails. OK, I see your point about the
helper, though the rest is still applicable here.

--
With Best Regards,
Andy Shevchenko

2018-07-16 21:18:26

by Bjorn Helgaas

Subject: Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

[+cc maintainers of drivers that already use pcie_print_link_status()
and GPU folks]

On Mon, Jun 04, 2018 at 10:55:21AM -0500, Alexandru Gagniuc wrote:
> PCIe downtraining happens when both the device and PCIe port are
> capable of a larger bus width or higher speed than negotiated.
> Downtraining might be indicative of other problems in the system, and
> identifying this from userspace is neither intuitive, nor straigh
> forward.

s/straigh/straight/
In this context, I think "straightforward" should be closed up
(without the space).

> The easiest way to detect this is with pcie_print_link_status(),
> since the bottleneck is usually the link that is downtrained. It's not
> a perfect solution, but it works extremely well in most cases.

This is an interesting idea. I have two concerns:

Some drivers already do this on their own, and we probably don't want
duplicate output for those devices. In most cases (ixgbe and mlx* are
exceptions), the drivers do this unconditionally so we *could* remove
it from the driver if we add it to the core. The dmesg order would
change, and the message wouldn't be associated with the driver as it
now is.

Also, I think some of the GPU devices might come up at a lower speed,
then download firmware, then reset the device so it comes up at a
higher speed. I think this patch will make us complain about
the low initial speed, which might confuse users.

So I'm not sure whether it's better to do this in the core for all
devices, or if we should just add it to the high-performance drivers
that really care.

> Signed-off-by: Alexandru Gagniuc <[email protected]>
> ---
>
> Changes since v2:
> - Check dev->is_virtfn flag
>
> Changes since v1:
> - Use pcie_print_link_status() instead of reimplementing logic
>
> drivers/pci/probe.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index ac91b6fd0bcd..a88ec8c25dd5 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2146,6 +2146,25 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
> return dev;
> }
>
> +static void pcie_check_upstream_link(struct pci_dev *dev)
> +{
> +
> + if (!pci_is_pcie(dev))
> + return;
> +
> + /* Look from the device up to avoid downstream ports with no devices. */
> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
> + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
> + return;

Do we care about Upstream Ports here? I suspect that ultimately we
only care about the bandwidth to Endpoints, and if an Endpoint is
constrained by a slow link farther up the tree,
pcie_print_link_status() is supposed to identify that slow link.

I would find this test easier to read as

	if (!(type == PCI_EXP_TYPE_ENDPOINT || type == PCI_EXP_TYPE_LEG_END))
		return;

But maybe I'm the only one that finds the conjunction of inequalities
hard to read. No big deal either way.

> + /* Multi-function PCIe share the same link/status. */
> + if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
> + return;
> +
> + pcie_print_link_status(dev);
> +}
> +
> static void pci_init_capabilities(struct pci_dev *dev)
> {
> /* Enhanced Allocation */
> @@ -2181,6 +2200,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
> /* Advanced Error Reporting */
> pci_aer_init(dev);
>
> + /* Check link and detect downtrain errors */
> + pcie_check_upstream_link(dev);
> +
> if (pci_probe_reset_function(dev) == 0)
> dev->reset_fn = 1;
> }
> --
> 2.14.4
>

2018-07-16 22:29:49

by Alex_Gagniuc

Subject: Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

On 7/16/2018 4:17 PM, Bjorn Helgaas wrote:
> [+cc maintainers of drivers that already use pcie_print_link_status()
> and GPU folks]

Thanks for finding them!

[snip]
>> identifying this from userspace is neither intuitive, nor straigh
>> forward.
>
> s/straigh/straight/
> In this context, I think "straightforward" should be closed up
> (without the space).

That's a straightforward edit. Thanks for the feedback!

>> The easiest way to detect this is with pcie_print_link_status(),
>> since the bottleneck is usually the link that is downtrained. It's not
>> a perfect solution, but it works extremely well in most cases.
>
> This is an interesting idea. I have two concerns:
>
> Some drivers already do this on their own, and we probably don't want
> duplicate output for those devices. In most cases (ixgbe and mlx* are
> exceptions), the drivers do this unconditionally so we *could* remove
> it from the driver if we add it to the core. The dmesg order would
> change, and the message wouldn't be associated with the driver as it
> now is.

Oh, there are only 8 users of that. Even I could patch up the drivers to
remove the call, assuming we reach agreement about this change.

> Also, I think some of the GPU devices might come up at a lower speed,
> then download firmware, then reset the device so it comes up at a
> higher speed. I think this patch will make us complain about
> the low initial speed, which might confuse users.

I spoke to one of the PCIe spec writers. It's allowable for a device to
downtrain speed or width. It would also be extremely dumb to downtrain
with the intent to re-train at a higher speed later, but it's possible
devices do dumb stuff like that. That's why it's an informational
message, instead of a warning.

Another case: Some devices (lower-end GPUs) use silicon (and marketing)
that advertises x16, but they're only routed for x8. I'm okay with
seeing an informational message in this case. In fact, I didn't know
that the Quadro card I've had for three years is only wired for x8
until I tested this patch.

> So I'm not sure whether it's better to do this in the core for all
> devices, or if we should just add it to the high-performance drivers
> that really care.

You're thinking "do I really need that bandwidth" because I'm using a
function called "_bandwidth_". The point of the change is very far from
that: it is to help in system troubleshooting by detecting downtraining
conditions.

>> Signed-off-by: Alexandru Gagniuc <[email protected]>
[snip]
>> + /* Look from the device up to avoid downstream ports with no devices. */
>> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
>> + return;
>
> Do we care about Upstream Ports here?

YES! Switches. e.g. an x16 switch with 4x downstream ports could
downtrain at 8x and 4x, and we'd never catch it.
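For illustration, the topology in question looks roughly like this
(hypothetical widths):

	Root Port (x16)
	      |
	Upstream Port: capable of x16, trained at x4   <-- the downtrained link
	      |
	   [switch]
	  /   |   \
	 x4  x4  x4 ...  Downstream Ports, each endpoint happily trained at x4

Each endpoint gets the width it is capable of, so a per-endpoint check looks
fine, while the shared upstream link is the real bottleneck.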

> I suspect that ultimately we
> only care about the bandwidth to Endpoints, and if an Endpoint is
> constrained by a slow link farther up the tree,
> pcie_print_link_status() is supposed to identify that slow link.

See above.

> I would find this test easier to read as
>
> if (!(type == PCI_EXP_TYPE_ENDPOINT || type == PCI_EXP_TYPE_LEG_END))
> return;
>
> But maybe I'm the only one that finds the conjunction of inequalities
> hard to read. No big deal either way.
>
>> + /* Multi-function PCIe share the same link/status. */
>> + if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
>> + return;
>> +
>> + pcie_print_link_status(dev);
>> +}
>> +
>> static void pci_init_capabilities(struct pci_dev *dev)
>> {
>> /* Enhanced Allocation */
>> @@ -2181,6 +2200,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
>> /* Advanced Error Reporting */
>> pci_aer_init(dev);
>>
>> + /* Check link and detect downtrain errors */
>> + pcie_check_upstream_link(dev);
>> +
>> if (pci_probe_reset_function(dev) == 0)
>> dev->reset_fn = 1;
>> }
>> --
>> 2.14.4
>>
>


2018-07-18 13:40:06

by Tal Gilboa

Subject: Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

On 7/16/2018 5:17 PM, Bjorn Helgaas wrote:
> [+cc maintainers of drivers that already use pcie_print_link_status()
> and GPU folks]
>
> On Mon, Jun 04, 2018 at 10:55:21AM -0500, Alexandru Gagniuc wrote:
>> PCIe downtraining happens when both the device and PCIe port are
>> capable of a larger bus width or higher speed than negotiated.
>> Downtraining might be indicative of other problems in the system, and
>> identifying this from userspace is neither intuitive, nor straigh
>> forward.
>
> s/straigh/straight/
> In this context, I think "straightforward" should be closed up
> (without the space).
>
>> The easiest way to detect this is with pcie_print_link_status(),
>> since the bottleneck is usually the link that is downtrained. It's not
>> a perfect solution, but it works extremely well in most cases.
>
> This is an interesting idea. I have two concerns:
>
> Some drivers already do this on their own, and we probably don't want
> duplicate output for those devices. In most cases (ixgbe and mlx* are
> exceptions), the drivers do this unconditionally so we *could* remove
> it from the driver if we add it to the core. The dmesg order would
> change, and the message wouldn't be associated with the driver as it
> now is.
>
> Also, I think some of the GPU devices might come up at a lower speed,
> then download firmware, then reset the device so it comes up at a
>> higher speed. I think this patch will make us complain about
> the low initial speed, which might confuse users.
>
> So I'm not sure whether it's better to do this in the core for all
> devices, or if we should just add it to the high-performance drivers
> that really care.
>
>> Signed-off-by: Alexandru Gagniuc <[email protected]>
>> ---
>>
>> Changes since v2:
>> - Check dev->is_virtfn flag
>>
>> Changes since v1:
>> - Use pcie_print_link_status() instead of reimplementing logic
>>
>> drivers/pci/probe.c | 22 ++++++++++++++++++++++
>> 1 file changed, 22 insertions(+)
>>
>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>> index ac91b6fd0bcd..a88ec8c25dd5 100644
>> --- a/drivers/pci/probe.c
>> +++ b/drivers/pci/probe.c
>> @@ -2146,6 +2146,25 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
>> return dev;
>> }
>>
>> +static void pcie_check_upstream_link(struct pci_dev *dev)
>> +{
>> +
>> + if (!pci_is_pcie(dev))
>> + return;
>> +
>> + /* Look from the device up to avoid downstream ports with no devices. */
>> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
>> + return;
>
> Do we care about Upstream Ports here? I suspect that ultimately we
> only care about the bandwidth to Endpoints, and if an Endpoint is
> constrained by a slow link farther up the tree,
> pcie_print_link_status() is supposed to identify that slow link.
>
> I would find this test easier to read as
>
> if (!(type == PCI_EXP_TYPE_ENDPOINT || type == PCI_EXP_TYPE_LEG_END))
> return;
>
> But maybe I'm the only one that finds the conjunction of inequalities
> hard to read. No big deal either way.
>
>> + /* Multi-function PCIe share the same link/status. */
>> + if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
>> + return;
>> +
>> + pcie_print_link_status(dev);
>> +}

Is this function called by default for every PCIe device? What about
VFs? We make an exception for them on our driver since a VF doesn't have
access to the needed information in order to provide a meaningful message.

>> +
>> static void pci_init_capabilities(struct pci_dev *dev)
>> {
>> /* Enhanced Allocation */
>> @@ -2181,6 +2200,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
>> /* Advanced Error Reporting */
>> pci_aer_init(dev);
>>
>> + /* Check link and detect downtrain errors */
>> + pcie_check_upstream_link(dev);
>> +
>> if (pci_probe_reset_function(dev) == 0)
>> dev->reset_fn = 1;
>> }
>> --
>> 2.14.4
>>

2018-07-18 21:54:53

by Bjorn Helgaas

Subject: Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

[+cc Mike (hfi1)]

On Mon, Jul 16, 2018 at 10:28:35PM +0000, [email protected] wrote:
> On 7/16/2018 4:17 PM, Bjorn Helgaas wrote:
> >> ...
> >> The easiest way to detect this is with pcie_print_link_status(),
> >> since the bottleneck is usually the link that is downtrained. It's not
> >> a perfect solution, but it works extremely well in most cases.
> >
> > This is an interesting idea. I have two concerns:
> >
> > Some drivers already do this on their own, and we probably don't want
> > duplicate output for those devices. In most cases (ixgbe and mlx* are
> > exceptions), the drivers do this unconditionally so we *could* remove
> > it from the driver if we add it to the core. The dmesg order would
> > change, and the message wouldn't be associated with the driver as it
> > now is.
>
> Oh, there are only 8 users of that. Even I could patch up the drivers to
> remove the call, assuming we reach agreement about this change.
>
> > Also, I think some of the GPU devices might come up at a lower speed,
> > then download firmware, then reset the device so it comes up at a
> > higher speed. I think this patch will make us complain about
> > the low initial speed, which might confuse users.
>
> I spoke to one of the PCIe spec writers. It's allowable for a device to
> downtrain speed or width. It would also be extremely dumb to downtrain
> with the intent to re-train at a higher speed later, but it's possible
> devices do dumb stuff like that. That's why it's an informational
> message, instead of a warning.

FWIW, here's some of the discussion related to hfi1 from [1]:

> Btw, why is the driver configuring the PCIe link speed? Isn't
> this something we should be handling in the PCI core?

The device comes out of reset at the 5GT/s speed. The driver
downloads device firmware, programs PCIe registers, and co-ordinates
the transition to 8GT/s.

This recipe is device specific and is therefore implemented in the
hfi1 driver built on top of PCI core functions and macros.

Also several DRM drivers seem to do this (see cik_pcie_gen3_enable(),
si_pcie_gen3_enable()); from [2]:

My understanding was that some platforms only bring up the link in gen 1
mode for compatibility reasons.

[1] https://lkml.kernel.org/r/32E1700B9017364D9B60AED9960492BC627FF54C@fmsmsx120.amr.corp.intel.com
[2] https://lkml.kernel.org/r/BN6PR12MB1809BD30AA5B890C054F9832F7B50@BN6PR12MB1809.namprd12.prod.outlook.com

> Another case: Some devices (lower-end GPUs) use silicon (and marketing)
> that advertises x16, but they're only routed for x8. I'm okay with
> seeing an informational message in this case. In fact, I didn't know
> that the Quadro card I've had for three years is only wired for x8
> until I tested this patch.

Yeah, it's probably OK. I don't want bug reports from people who
think something's broken when it's really just a hardware limitation
of their system. But hopefully the message is not alarming.

> > So I'm not sure whether it's better to do this in the core for all
> > devices, or if we should just add it to the high-performance drivers
> > that really care.
>
> You're thinking "do I really need that bandwidth" because I'm using a
> function called "_bandwidth_". The point of the change is very far from
> that: it is to help in system troubleshooting by detecting downtraining
> conditions.

I'm not sure what you think I'm thinking :) My question is whether
it's worthwhile to print this extra information for *every* PCIe
device, given that your use case is the tiny percentage of broken
systems.

If we only printed the info in the "bw_avail < bw_cap" case, i.e.,
when the device is capable of more than it's getting, that would make
a lot of sense to me. The normal case line is more questionable. I
think the reason that's there is because the network drivers are very
performance sensitive and like to see that info all the time.

Maybe we need something like this:

pcie_print_link_status(struct pci_dev *dev, int verbose)
{
	...
	if (bw_avail >= bw_cap) {
		if (verbose)
			pci_info(dev, "... available PCIe bandwidth ...");
	} else
		pci_info(dev, "... available PCIe bandwidth, limited by ...");
}

So the core could print only the potential problems with:

pcie_print_link_status(dev, 0);

and drivers that really care even if there's no problem could do:

pcie_print_link_status(dev, 1);

> >> Signed-off-by: Alexandru Gagniuc <[email protected]>
> [snip]
> >> + /* Look from the device up to avoid downstream ports with no devices. */
> >> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
> >> + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
> >> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
> >> + return;
> >
> > Do we care about Upstream Ports here?
>
> YES! Switches. e.g. an x16 switch with 4x downstream ports could
> downtrain at 8x and 4x, and we'd never catch it.

OK, I think I see your point: if the upstream port *could* do 16x but
only trains to 4x, and two endpoints below it are both capable of 4x,
the endpoints *think* they're happy but in fact they have to share 4x
when they could use more.

Bjorn

2018-07-19 15:47:42

by Alexandru Gagniuc

Subject: Re: [PATCH v3] PCI: Check for PCIe downtraining conditions



On 07/18/2018 04:53 PM, Bjorn Helgaas wrote:
> [+cc Mike (hfi1)]
>
> On Mon, Jul 16, 2018 at 10:28:35PM +0000, [email protected] wrote:
>> On 7/16/2018 4:17 PM, Bjorn Helgaas wrote:
>>>> ...
>>>> The easiest way to detect this is with pcie_print_link_status(),
>>>> since the bottleneck is usually the link that is downtrained. It's not
>>>> a perfect solution, but it works extremely well in most cases.
>>>
>>> This is an interesting idea. I have two concerns:
>>>
>>> Some drivers already do this on their own, and we probably don't want
>>> duplicate output for those devices. In most cases (ixgbe and mlx* are
>>> exceptions), the drivers do this unconditionally so we *could* remove
>>> it from the driver if we add it to the core. The dmesg order would
>>> change, and the message wouldn't be associated with the driver as it
>>> now is.
>>
>> Oh, there are only 8 users of that. Even I could patch up the drivers to
>> remove the call, assuming we reach agreement about this change.
>>
>>> Also, I think some of the GPU devices might come up at a lower speed,
>>> then download firmware, then reset the device so it comes up at a
>>> higher speed. I think this patch will make us complain about
>>> the low initial speed, which might confuse users.
>>
>> I spoke to one of the PCIe spec writers. It's allowable for a device to
>> downtrain speed or width. It would also be extremely dumb to downtrain
>> with the intent to re-train at a higher speed later, but it's possible
>> devices do dumb stuff like that. That's why it's an informational
>> message, instead of a warning.
>
> FWIW, here's some of the discussion related to hfi1 from [1]:
>
> > Btw, why is the driver configuring the PCIe link speed? Isn't
> > this something we should be handling in the PCI core?
>
> The device comes out of reset at the 5GT/s speed. The driver
> downloads device firmware, programs PCIe registers, and co-ordinates
> the transition to 8GT/s.
>
> This recipe is device specific and is therefore implemented in the
> hfi1 driver built on top of PCI core functions and macros.
>
> Also several DRM drivers seem to do this (see cik_pcie_gen3_enable(),
> si_pcie_gen3_enable()); from [2]:
>
> My understanding was that some platforms only bring up the link in gen 1
> mode for compatibility reasons.
>
> [1] https://lkml.kernel.org/r/32E1700B9017364D9B60AED9960492BC627FF54C@fmsmsx120.amr.corp.intel.com
> [2] https://lkml.kernel.org/r/BN6PR12MB1809BD30AA5B890C054F9832F7B50@BN6PR12MB1809.namprd12.prod.outlook.com

Downtraining a link "for compatibility reasons" is one of those dumb
things that devices do. I'm SURPRISED AMD HW does it, although it is
perfectly permissible by PCIe spec.

>> Another case: Some devices (lower-end GPUs) use silicon (and marketing)
>> that advertises x16, but they're only routed for x8. I'm okay with
>> seeing an informational message in this case. In fact, I didn't know
>> that the Quadro card I've had for three years is only wired for x8
>> until I tested this patch.
>
> Yeah, it's probably OK. I don't want bug reports from people who
> think something's broken when it's really just a hardware limitation
> of their system. But hopefully the message is not alarming.

It looks fairly innocent:

[ 0.749415] pci 0000:18:00.0: 4.000 Gb/s available PCIe bandwidth,
limited by 5 GT/s x1 link at 0000:17:03.0 (capable of 15.752 Gb/s with 8
GT/s x2 link)
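(For reference, those figures include encoding overhead: at 5 GT/s the link
uses 8b/10b encoding, so an x1 link nets 5 * 8/10 = 4.000 Gb/s; at 8 GT/s it
uses 128b/130b encoding, so an x2 link nets roughly 2 * 8 * 128/130, about
15.75 Gb/s, matching the 15.752 shown above.)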

>>> So I'm not sure whether it's better to do this in the core for all
>>> devices, or if we should just add it to the high-performance drivers
>>> that really care.
>>
>> You're thinking "do I really need that bandwidth" because I'm using a
>> function called "_bandwidth_". The point of the change is very far from
>> that: it is to help in system troubleshooting by detecting downtraining
>> conditions.
>
> I'm not sure what you think I'm thinking :) My question is whether
> it's worthwhile to print this extra information for *every* PCIe
> device, given that your use case is the tiny percentage of broken
> systems.

I think this information is a lot more useful than a bunch of other info
that's printed. Is "type 00 class 0x088000" more valuable? What about
"reg 0x20: [mem 0x9d950000-0x9d95ffff 64bit pref]", which is also
available under /proc/iomem for those curious?

> If we only printed the info in the "bw_avail < bw_cap" case, i.e.,
> when the device is capable of more than it's getting, that would make
> a lot of sense to me. The normal case line is more questionable. I
> think the reason that's there is because the network drivers are very
> performance sensitive and like to see that info all the time.

I agree that can be an acceptable compromise.

> Maybe we need something like this:
>
> pcie_print_link_status(struct pci_dev *dev, int verbose)
> {
> 	...
> 	if (bw_avail >= bw_cap) {
> 		if (verbose)
> 			pci_info(dev, "... available PCIe bandwidth ...");
> 	} else
> 		pci_info(dev, "... available PCIe bandwidth, limited by ...");
> }
>
> So the core could print only the potential problems with:
>
> pcie_print_link_status(dev, 0);
>
> and drivers that really care even if there's no problem could do:
>
> pcie_print_link_status(dev, 1);

Sounds good. I'll try to push out an updated patch early next week.

>>>> Signed-off-by: Alexandru Gagniuc <[email protected]>
>> [snip]
>>>> + /* Look from the device up to avoid downstream ports with no devices. */
>>>> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
>>>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
>>>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
>>>> + return;
>>>
>>> Do we care about Upstream Ports here?
>>
>> YES! Switches. e.g. an x16 switch with 4x downstream ports could
>> downtrain at 8x and 4x, and we'd never catch it.
>
> OK, I think I see your point: if the upstream port *could* do 16x but
> only trains to 4x, and two endpoints below it are both capable of 4x,
> the endpoints *think* they're happy but in fact they have to share 4x
> when they could use more.
>
> Bjorn
>

2018-07-19 15:51:12

by Alexandru Gagniuc

Subject: Re: [PATCH v3] PCI: Check for PCIe downtraining conditions



On 07/18/2018 08:38 AM, Tal Gilboa wrote:
> On 7/16/2018 5:17 PM, Bjorn Helgaas wrote:
>> [+cc maintainers of drivers that already use pcie_print_link_status()
>> and GPU folks]
[snip]
>>
>>> +    /* Multi-function PCIe share the same link/status. */
>>> +    if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
>>> +        return;
>>> +
>>> +    pcie_print_link_status(dev);
>>> +}
>
> Is this function called by default for every PCIe device? What about
> VFs? We make an exception for them on our driver since a VF doesn't have
> access to the needed information in order to provide a meaningful message.

I'm assuming VF means virtual function. pcie_print_link_status() doesn't
care if it's passed a virtual function. It will try to do its job.
That's why I bail out three lines above, with 'dev->is_virtfn' check.

Alex

2018-07-23 05:22:53

by Tal Gilboa

Subject: Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

On 7/19/2018 6:49 PM, Alex G. wrote:
>
>
> On 07/18/2018 08:38 AM, Tal Gilboa wrote:
>> On 7/16/2018 5:17 PM, Bjorn Helgaas wrote:
>>> [+cc maintainers of drivers that already use pcie_print_link_status()
>>> and GPU folks]
> [snip]
>>>
>>>> +    /* Multi-function PCIe share the same link/status. */
>>>> +    if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
>>>> +        return;
>>>> +
>>>> +    pcie_print_link_status(dev);
>>>> +}
>>
>> Is this function called by default for every PCIe device? What about
>> VFs? We make an exception for them on our driver since a VF doesn't
>> have access to the needed information in order to provide a meaningful
>> message.
>
> I'm assuming VF means virtual function. pcie_print_link_status() doesn't
> care if it's passed a virtual function. It will try to do its job.
> That's why I bail out three lines above, with 'dev->is_virtfn' check.
>
> Alex

That's the point - we don't want to call pcie_print_link_status() for
virtual functions. We make the distinction in our driver. If you want to
change the code to call this function by default it shouldn't affect the
current usage.

2018-07-23 17:03:12

by Alexandru Gagniuc

Subject: Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

On 07/23/2018 12:21 AM, Tal Gilboa wrote:
> On 7/19/2018 6:49 PM, Alex G. wrote:
>>
>>
>> On 07/18/2018 08:38 AM, Tal Gilboa wrote:
>>> On 7/16/2018 5:17 PM, Bjorn Helgaas wrote:
>>>> [+cc maintainers of drivers that already use pcie_print_link_status()
>>>> and GPU folks]
>> [snip]
>>>>
>>>>> +    /* Multi-function PCIe share the same link/status. */
>>>>> +    if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
>>>>> +        return;
>>>>> +
>>>>> +    pcie_print_link_status(dev);
>>>>> +}
>>>
>>> Is this function called by default for every PCIe device? What about
>>> VFs? We make an exception for them on our driver since a VF doesn't
>>> have access to the needed information in order to provide a
>>> meaningful message.
>>
>> I'm assuming VF means virtual function. pcie_print_link_status()
>> doesn't care if it's passed a virtual function. It will try to do its
>> job. That's why I bail out three lines above, with 'dev->is_virtfn'
>> check.
>>
>> Alex
>
> That's the point - we don't want to call pcie_print_link_status() for
> virtual functions. We make the distinction in our driver. If you want to
> change the code to call this function by default it shouldn't affect the
> current usage.

I'm not quite following what you're asking. I understand you want to
avoid printing this message on virtual functions, and that's already
taken care of. I'm also not changing current behavior. Let's get v2 out
and start the discussion again based on that.

Alex

2018-07-23 20:02:43

by Alexandru Gagniuc

Subject: [PATCH v2] PCI/AER: Do not clear AER bits if we don't own AER

When we don't own AER, we shouldn't touch the AER error bits. Clearing
error bits willy-nilly might cause firmware to miss some errors. In
theory, these bits get cleared by FFS, or via the ACPI _HPX method.
Those mechanisms are not subject to the problem.

This race is mostly of theoretical significance, since I can't
reasonably demonstrate it in the lab.

On a side-note, pcie_aer_is_kernel_first() is created to alleviate the
need for two checks: aer_cap and get_firmware_first().

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/pci/pcie/aer.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..85c3e173c025 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -307,6 +307,12 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev)
 		aer_set_firmware_first(dev);
 	return dev->__aer_firmware_first;
 }
+
+static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
+{
+	return !!dev->aer_cap && !pcie_aer_get_firmware_first(dev);
+}
+
 #define PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
 				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
 
@@ -337,10 +343,7 @@ bool aer_acpi_firmware_first(void)
 
 int pci_enable_pcie_error_reporting(struct pci_dev *dev)
 {
-	if (pcie_aer_get_firmware_first(dev))
-		return -EIO;
-
-	if (!dev->aer_cap)
+	if (!pcie_aer_is_kernel_first(dev))
 		return -EIO;
 
 	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
@@ -349,7 +352,7 @@ EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
 
 int pci_disable_pcie_error_reporting(struct pci_dev *dev)
 {
-	if (pcie_aer_get_firmware_first(dev))
+	if (!pcie_aer_is_kernel_first(dev))
 		return -EIO;
 
 	return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
@@ -383,10 +386,10 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
 	if (!pci_is_pcie(dev))
 		return -ENODEV;
 
-	pos = dev->aer_cap;
-	if (!pos)
+	if (!pcie_aer_is_kernel_first(dev))
 		return -EIO;
 
+	pos = dev->aer_cap;
 	port_type = pci_pcie_type(dev);
 	if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
 		pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);
--
2.17.1


2018-07-23 20:04:53

by Alexandru Gagniuc

Subject: [PATCH v5] PCI: Check for PCIe downtraining conditions

PCIe downtraining happens when both the device and PCIe port are
capable of a larger bus width or higher speed than negotiated.
Downtraining might be indicative of other problems in the system, and
identifying this from userspace is neither intuitive, nor
straightforward.

The easiest way to detect this is with pcie_print_link_status(),
since the bottleneck is usually the link that is downtrained. It's not
a perfect solution, but it works extremely well in most cases.

Signed-off-by: Alexandru Gagniuc <[email protected]>
---

For the sake of review, I've created a __pcie_print_link_status() which
takes a 'verbose' argument. If we agree we want to go this route, and update
the users of pcie_print_link_status(), I can split this up into two patches.
I prefer just printing this information in the core functions, and letting
drivers not have to worry about this. Though there seems to be strong support
for not going that route, so here it goes:

Changes since v4:
- Use 'verbose' argument to print bandwidth under normal conditions
- Without verbose, only downtraining conditions are reported

Changes since v3:
- Remove extra newline and parentheses.

Changes since v2:
- Check dev->is_virtfn flag

Changes since v1:
- Use pcie_print_link_status() instead of reimplementing logic

 drivers/pci/pci.c   | 22 ++++++++++++++++++----
 drivers/pci/probe.c | 21 +++++++++++++++++++++
 include/linux/pci.h |  1 +
 3 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 316496e99da9..414ad7b3abdb 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5302,14 +5302,15 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
 }
 
 /**
- * pcie_print_link_status - Report the PCI device's link speed and width
+ * __pcie_print_link_status - Report the PCI device's link speed and width
  * @dev: PCI device to query
+ * @verbose: Be verbose -- print info even when enough bandwidth is available.
  *
  * Report the available bandwidth at the device. If this is less than the
  * device is capable of, report the device's maximum possible bandwidth and
  * the upstream link that limits its performance to less than that.
  */
-void pcie_print_link_status(struct pci_dev *dev)
+void __pcie_print_link_status(struct pci_dev *dev, bool verbose)
 {
 	enum pcie_link_width width, width_cap;
 	enum pci_bus_speed speed, speed_cap;
@@ -5319,11 +5320,11 @@ void pcie_print_link_status(struct pci_dev *dev)
 	bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
 	bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);
 
-	if (bw_avail >= bw_cap)
+	if (bw_avail >= bw_cap && verbose)
 		pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n",
 			 bw_cap / 1000, bw_cap % 1000,
 			 PCIE_SPEED2STR(speed_cap), width_cap);
-	else
+	else if (bw_avail < bw_cap)
 		pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
 			 bw_avail / 1000, bw_avail % 1000,
 			 PCIE_SPEED2STR(speed), width,
@@ -5331,6 +5332,19 @@ void pcie_print_link_status(struct pci_dev *dev)
 			 bw_cap / 1000, bw_cap % 1000,
 			 PCIE_SPEED2STR(speed_cap), width_cap);
 }
+
+/**
+ * pcie_print_link_status - Report the PCI device's link speed and width
+ * @dev: PCI device to query
+ *
+ * Report the available bandwidth at the device. If this is less than the
+ * device is capable of, report the device's maximum possible bandwidth and
+ * the upstream link that limits its performance to less than that.
+ */
+void pcie_print_link_status(struct pci_dev *dev)
+{
+	__pcie_print_link_status(dev, true);
+}
 EXPORT_SYMBOL(pcie_print_link_status);

/**
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ac876e32de4b..1f7336377c3b 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2205,6 +2205,24 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
 	return dev;
 }
 
+static void pcie_check_upstream_link(struct pci_dev *dev)
+{
+	if (!pci_is_pcie(dev))
+		return;
+
+	/* Look from the device up to avoid downstream ports with no devices. */
+	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
+	    (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
+	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
+		return;
+
+	/* Multi-function PCIe share the same link/status. */
+	if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn)
+		return;
+
+	__pcie_print_link_status(dev, false);
+}
+
 static void pci_init_capabilities(struct pci_dev *dev)
 {
 	/* Enhanced Allocation */
@@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
 	/* Advanced Error Reporting */
 	pci_aer_init(dev);
 
+	/* Check link and detect downtrain errors */
+	pcie_check_upstream_link(dev);
+
 	if (pci_probe_reset_function(dev) == 0)
 		dev->reset_fn = 1;
 }
diff --git a/include/linux/pci.h b/include/linux/pci.h
index abd5d5e17aee..15bfab8f7a1b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
 u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
 			     enum pci_bus_speed *speed,
 			     enum pcie_link_width *width);
+void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
 void pcie_print_link_status(struct pci_dev *dev);
 int pcie_flr(struct pci_dev *dev);
 int __pci_reset_function_locked(struct pci_dev *dev);
--
2.17.1


2018-07-23 21:03:32

by Jakub Kicinski

Subject: Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote:
> PCIe downtraining happens when both the device and PCIe port are
> capable of a larger bus width or higher speed than negotiated.
> Downtraining might be indicative of other problems in the system, and
> identifying this from userspace is neither intuitive, nor
> straightforward.
>
> The easiest way to detect this is with pcie_print_link_status(),
> since the bottleneck is usually the link that is downtrained. It's not
> a perfect solution, but it works extremely well in most cases.
>
> Signed-off-by: Alexandru Gagniuc <[email protected]>
> ---
>
> For the sake of review, I've created a __pcie_print_link_status() which
> takes a 'verbose' argument. If we agree we want to go this route, and update
> the users of pcie_print_link_status(), I can split this up into two patches.
> I prefer just printing this information in the core functions, and letting
> drivers not have to worry about this. Though there seems to be strong support
> for not going that route, so here it goes:

FWIW the networking drivers print PCIe BW because sometimes the network
bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps
card on a x8 link.

Sorry to bike shed, but currently the networking cards print the info
during probe. Would it make sense to move your message closer to probe
time? Rather than when device is added. If driver structure is
available, we could also consider adding a boolean to struct pci_driver
to indicate if driver wants the verbose message? This way we avoid
duplicated prints.
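A rough sketch of that idea (the field and helper below are hypothetical;
struct pci_driver has no such member today):

struct pci_driver {
	/* ... existing fields ... */
	bool verbose_link_status;	/* opt in to the "all is well" print */
};

/* The core could then derive verbosity from the bound driver: */
static void pci_report_link_status(struct pci_dev *dev)
{
	bool verbose = dev->driver && dev->driver->verbose_link_status;

	__pcie_print_link_status(dev, verbose);
}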

I have no objection to current patch, it LGTM. Just a thought.

> drivers/pci/pci.c | 22 ++++++++++++++++++----
> drivers/pci/probe.c | 21 +++++++++++++++++++++
> include/linux/pci.h | 1 +
> 3 files changed, 40 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 316496e99da9..414ad7b3abdb 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5302,14 +5302,15 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
> }
>
> /**
> - * pcie_print_link_status - Report the PCI device's link speed and width
> + * __pcie_print_link_status - Report the PCI device's link speed and width
> * @dev: PCI device to query
> + * @verbose: Be verbose -- print info even when enough bandwidth is available.
> *
> * Report the available bandwidth at the device. If this is less than the
> * device is capable of, report the device's maximum possible bandwidth and
> * the upstream link that limits its performance to less than that.
> */
> -void pcie_print_link_status(struct pci_dev *dev)
> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose)
> {
> enum pcie_link_width width, width_cap;
> enum pci_bus_speed speed, speed_cap;
> @@ -5319,11 +5320,11 @@ void pcie_print_link_status(struct pci_dev *dev)
> bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
> bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);
>
> - if (bw_avail >= bw_cap)
> + if (bw_avail >= bw_cap && verbose)
> pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n",
> bw_cap / 1000, bw_cap % 1000,
> PCIE_SPEED2STR(speed_cap), width_cap);
> - else
> + else if (bw_avail < bw_cap)
> pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
> bw_avail / 1000, bw_avail % 1000,
> PCIE_SPEED2STR(speed), width,
> @@ -5331,6 +5332,19 @@ void pcie_print_link_status(struct pci_dev *dev)
> bw_cap / 1000, bw_cap % 1000,
> PCIE_SPEED2STR(speed_cap), width_cap);
> }
> +
> +/**
> + * pcie_print_link_status - Report the PCI device's link speed and width
> + * @dev: PCI device to query
> + *
> + * Report the available bandwidth at the device. If this is less than the
> + * device is capable of, report the device's maximum possible bandwidth and
> + * the upstream link that limits its performance to less than that.
> + */
> +void pcie_print_link_status(struct pci_dev *dev)
> +{
> + __pcie_print_link_status(dev, true);
> +}
> EXPORT_SYMBOL(pcie_print_link_status);
>
> /**
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index ac876e32de4b..1f7336377c3b 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2205,6 +2205,24 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
> return dev;
> }
>
> +static void pcie_check_upstream_link(struct pci_dev *dev)
> +{
> + if (!pci_is_pcie(dev))
> + return;
> +
> + /* Look from the device up to avoid downstream ports with no devices. */
> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
> + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
> + return;
> +
> + /* Multi-function PCIe share the same link/status. */
> + if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn)
> + return;
> +
> + __pcie_print_link_status(dev, false);
> +}
> +
> static void pci_init_capabilities(struct pci_dev *dev)
> {
> /* Enhanced Allocation */
> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
> /* Advanced Error Reporting */
> pci_aer_init(dev);
>
> + /* Check link and detect downtrain errors */
> + pcie_check_upstream_link(dev);
> +
> if (pci_probe_reset_function(dev) == 0)
> dev->reset_fn = 1;
> }
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index abd5d5e17aee..15bfab8f7a1b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
> u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
> enum pci_bus_speed *speed,
> enum pcie_link_width *width);
> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
> void pcie_print_link_status(struct pci_dev *dev);
> int pcie_flr(struct pci_dev *dev);
> int __pci_reset_function_locked(struct pci_dev *dev);


2018-07-23 21:37:03

by Tal Gilboa

Subject: Re: [PATCH v3] PCI: Check for PCIe downtraining conditions

On 7/23/2018 8:01 PM, Alex G. wrote:
> On 07/23/2018 12:21 AM, Tal Gilboa wrote:
>> On 7/19/2018 6:49 PM, Alex G. wrote:
>>>
>>>
>>> On 07/18/2018 08:38 AM, Tal Gilboa wrote:
>>>> On 7/16/2018 5:17 PM, Bjorn Helgaas wrote:
>>>>> [+cc maintainers of drivers that already use pcie_print_link_status()
>>>>> and GPU folks]
>>> [snip]
>>>>>
>>>>>> +    /* Multi-function PCIe share the same link/status. */
>>>>>> +    if ((PCI_FUNC(dev->devfn) != 0) || dev->is_virtfn)
>>>>>> +        return;
>>>>>> +
>>>>>> +    pcie_print_link_status(dev);
>>>>>> +}
>>>>
>>>> Is this function called by default for every PCIe device? What about
>>>> VFs? We make an exception for them on our driver since a VF doesn't
>>>> have access to the needed information in order to provide a
>>>> meaningful message.
>>>
>>> I'm assuming VF means virtual function. pcie_print_link_status()
>>> doesn't care if it's passed a virtual function. It will try to do its
>>> job. That's why I bail out three lines above, with 'dev->is_virtfn'
>>> check.
>>>
>>> Alex
>>
>> That's the point - we don't want to call pcie_print_link_status() for
>> virtual functions. We make the distinction in our driver. If you want
>> to change the code to call this function by default it shouldn't
>> affect the current usage.
>
> I'm not quite following what you're asking. I understand you want to
> avoid printing this message on virtual functions, and that's already
> taken care of. I'm also not changing current behavior. Let's get v2 out
> and start the discussion again based on that.
>
> Alex

Oh, OK, I see. In this case, please remove the explicit call in the
mlx4/5 drivers so it won't be duplicated.

2018-07-23 21:54:17

by Tal Gilboa

Subject: Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

On 7/24/2018 12:01 AM, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote:
>> PCIe downtraining happens when both the device and PCIe port are
>> capable of a larger bus width or higher speed than negotiated.
>> Downtraining might be indicative of other problems in the system, and
>> identifying this from userspace is neither intuitive, nor
>> straightforward.
>>
>> The easiest way to detect this is with pcie_print_link_status(),
>> since the bottleneck is usually the link that is downtrained. It's not
>> a perfect solution, but it works extremely well in most cases.
>>
>> Signed-off-by: Alexandru Gagniuc <[email protected]>
>> ---
>>
>> For the sake of review, I've created a __pcie_print_link_status() which
>> takes a 'verbose' argument. If we agree we want to go this route, and update
>> the users of pcie_print_link_status(), I can split this up into two patches.
>> I prefer just printing this information in the core functions, and letting
>> drivers not have to worry about this. Though there seems to be strong support
>> for not going that route, so here it goes:
>
> FWIW the networking drivers print PCIe BW because sometimes the network
> bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps
> card on a x8 link.
>
> Sorry to bike shed, but currently the networking cards print the info
> during probe. Would it make sense to move your message closer to probe
> time? Rather than when device is added. If driver structure is
> available, we could also consider adding a boolean to struct pci_driver
> to indicate if driver wants the verbose message? This way we avoid
> duplicated prints.
>
> I have no objection to current patch, it LGTM. Just a thought.

I don't see the reason for having two functions. What's the problem with
adding the verbose argument to the original function?
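For comparison, folding the flag into the one exported function would change
its signature and touch every caller, roughly (a sketch, not what the patch
does):

-void pcie_print_link_status(struct pci_dev *dev);
+void pcie_print_link_status(struct pci_dev *dev, bool verbose);

plus a one-line update in each of the ~8 drivers that call it today, e.g.:

-	pcie_print_link_status(pdev);
+	pcie_print_link_status(pdev, true);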

>
>> drivers/pci/pci.c | 22 ++++++++++++++++++----
>> drivers/pci/probe.c | 21 +++++++++++++++++++++
>> include/linux/pci.h | 1 +
>> 3 files changed, 40 insertions(+), 4 deletions(-) >>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 316496e99da9..414ad7b3abdb 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -5302,14 +5302,15 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
>> }
>>
>> /**
>> - * pcie_print_link_status - Report the PCI device's link speed and width
>> + * __pcie_print_link_status - Report the PCI device's link speed and width
>> * @dev: PCI device to query
>> + * @verbose: Be verbose -- print info even when enough bandwidth is available.
>> *
>> * Report the available bandwidth at the device. If this is less than the
>> * device is capable of, report the device's maximum possible bandwidth and
>> * the upstream link that limits its performance to less than that.
>> */
>> -void pcie_print_link_status(struct pci_dev *dev)
>> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose)
>> {
>> enum pcie_link_width width, width_cap;
>> enum pci_bus_speed speed, speed_cap;
>> @@ -5319,11 +5320,11 @@ void pcie_print_link_status(struct pci_dev *dev)
>> bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
>> bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);
>>
>> - if (bw_avail >= bw_cap)
>> + if (bw_avail >= bw_cap && verbose)
>> pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n",
>> bw_cap / 1000, bw_cap % 1000,
>> PCIE_SPEED2STR(speed_cap), width_cap);
>> - else
>> + else if (bw_avail < bw_cap)
>> pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
>> bw_avail / 1000, bw_avail % 1000,
>> PCIE_SPEED2STR(speed), width,
>> @@ -5331,6 +5332,19 @@ void pcie_print_link_status(struct pci_dev *dev)
>> bw_cap / 1000, bw_cap % 1000,
>> PCIE_SPEED2STR(speed_cap), width_cap);
>> }
>> +
>> +/**
>> + * pcie_print_link_status - Report the PCI device's link speed and width
>> + * @dev: PCI device to query
>> + *
>> + * Report the available bandwidth at the device. If this is less than the
>> + * device is capable of, report the device's maximum possible bandwidth and
>> + * the upstream link that limits its performance to less than that.
>> + */
>> +void pcie_print_link_status(struct pci_dev *dev)
>> +{
>> + __pcie_print_link_status(dev, true);
>> +}
>> EXPORT_SYMBOL(pcie_print_link_status);
>>
>> /**
>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>> index ac876e32de4b..1f7336377c3b 100644
>> --- a/drivers/pci/probe.c
>> +++ b/drivers/pci/probe.c
>> @@ -2205,6 +2205,24 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
>> return dev;
>> }
>>
>> +static void pcie_check_upstream_link(struct pci_dev *dev)
>> +{
>> + if (!pci_is_pcie(dev))
>> + return;
>> +
>> + /* Look from the device up to avoid downstream ports with no devices. */
>> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
>> + return;
>> +
>> + /* Multi-function PCIe share the same link/status. */
>> + if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn)
>> + return;
>> +
>> + __pcie_print_link_status(dev, false);
>> +}
>> +
>> static void pci_init_capabilities(struct pci_dev *dev)
>> {
>> /* Enhanced Allocation */
>> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
>> /* Advanced Error Reporting */
>> pci_aer_init(dev);
>>
>> + /* Check link and detect downtrain errors */
>> + pcie_check_upstream_link(dev);

This is called for every PCIe device right? Won't there be a duplicated
print in case a device loads with lower PCIe bandwidth than needed?

>> +
>> if (pci_probe_reset_function(dev) == 0)
>> dev->reset_fn = 1;
>> }
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index abd5d5e17aee..15bfab8f7a1b 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
>> u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
>> enum pci_bus_speed *speed,
>> enum pcie_link_width *width);
>> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
>> void pcie_print_link_status(struct pci_dev *dev);
>> int pcie_flr(struct pci_dev *dev);
>> int __pci_reset_function_locked(struct pci_dev *dev);
>

2018-07-23 22:15:51

by Jakub Kicinski

Subject: Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

On Tue, 24 Jul 2018 00:52:22 +0300, Tal Gilboa wrote:
> On 7/24/2018 12:01 AM, Jakub Kicinski wrote:
> > On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote:
> >> PCIe downtraining happens when both the device and PCIe port are
> >> capable of a larger bus width or higher speed than negotiated.
> >> Downtraining might be indicative of other problems in the system, and
> >> identifying this from userspace is neither intuitive, nor
> >> straightforward.
> >>
> >> The easiest way to detect this is with pcie_print_link_status(),
> >> since the bottleneck is usually the link that is downtrained. It's not
> >> a perfect solution, but it works extremely well in most cases.
> >>
> >> Signed-off-by: Alexandru Gagniuc <[email protected]>
> >> ---
> >>
> >> For the sake of review, I've created a __pcie_print_link_status() which
> >> takes a 'verbose' argument. If we agree we want to go this route, and update
> >> the users of pcie_print_link_status(), I can split this up into two patches.
> >> I prefer just printing this information in the core functions, and letting
> >> drivers not have to worry about this. Though there seems to be strong support
> >> for not going that route, so here it goes:
> >
> > FWIW the networking drivers print PCIe BW because sometimes the network
> > bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps
> > card on a x8 link.
> >
> > Sorry to bike shed, but currently the networking cards print the info
> > during probe. Would it make sense to move your message closer to probe
> > time? Rather than when device is added. If driver structure is
> > available, we could also consider adding a boolean to struct pci_driver
> > to indicate if driver wants the verbose message? This way we avoid
> > duplicated prints.
> >
> > I have no objection to current patch, it LGTM. Just a thought.
>
> I don't see the reason for having two functions. What's the problem with
> adding the verbose argument to the original function?

IMHO it's reasonable to keep the default parameter to what 90% of users
want by means of a wrapper. The non-verbose output is provided by the
core already for all devices.

What do you think of my proposal above, Tal? That would make the extra
wrapper unnecessary since the verbose parameter would be part of the
driver structure, and it would avoid the duplicated output.
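
A rough sketch of that idea, for illustration only -- the
verbose_bw_report field is hypothetical and does not exist in struct
pci_driver, and the foo_* names are placeholders:

/* Hypothetical sketch: a driver opting in to the verbose print. */
static struct pci_driver foo_driver = {
	.name			= "foo",
	.id_table		= foo_pci_ids,
	.probe			= foo_probe,
	.verbose_bw_report	= true,		/* hypothetical opt-in flag */
};

/* The PCI core could then honor the flag when binding the driver: */
if (drv->verbose_bw_report)			/* hypothetical field */
	pcie_print_link_status(dev);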

2018-07-24 00:00:57

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [PATCH v5] PCI: Check for PCIe downtraining conditions



On 07/23/2018 05:14 PM, Jakub Kicinski wrote:
> On Tue, 24 Jul 2018 00:52:22 +0300, Tal Gilboa wrote:
>> On 7/24/2018 12:01 AM, Jakub Kicinski wrote:
>>> On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote:
>>>> PCIe downtraining happens when both the device and PCIe port are
>>>> capable of a larger bus width or higher speed than negotiated.
>>>> Downtraining might be indicative of other problems in the system, and
>>>> identifying this from userspace is neither intuitive, nor
>>>> straightforward.
>>>>
>>>> The easiest way to detect this is with pcie_print_link_status(),
>>>> since the bottleneck is usually the link that is downtrained. It's not
>>>> a perfect solution, but it works extremely well in most cases.
>>>>
>>>> Signed-off-by: Alexandru Gagniuc <[email protected]>
>>>> ---
>>>>
>>>> For the sake of review, I've created a __pcie_print_link_status() which
>>>> takes a 'verbose' argument. If we agree to go this route and update
>>>> the users of pcie_print_link_status(), I can split this up into two
>>>> patches. I prefer just printing this information in the core functions,
>>>> and letting drivers not have to worry about it. Though there seems to be
>>>> strong support for not going that route, so here it goes:
>>>
>>> FWIW the networking drivers print PCIe BW because sometimes the network
>>> bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps
>>> card on a x8 link.
>>>
>>> Sorry to bike shed, but currently the networking cards print the info
>>> during probe. Would it make sense to move your message closer to probe
>>> time, rather than when the device is added? If the driver structure is
>>> available, we could also consider adding a boolean to struct pci_driver
>>> to indicate whether the driver wants the verbose message. This way we
>>> avoid duplicated prints.
>>>
>>> I have no objection to the current patch, it LGTM. Just a thought.
>>
>> I don't see the reason for having two functions. What's the problem with
>> adding the verbose argument to the original function?
>
> IMHO it's reasonable to keep the default at what 90% of users
> want by means of a wrapper. The non-verbose output is provided by
> the core already for all devices.
>
> What do you think of my proposal above, Tal? That would make the extra
> wrapper unnecessary, since the verbose parameter would be part of the
> driver structure, and it would avoid the duplicated output.

I see how it might make sense to add another member to the driver
struct, but is it worth the extra learning curve? It seems to be
something with the potential to confuse new driver developers, while
offering only a very marginal benefit.
Although, if that's what people want...

Alex

2018-07-24 13:40:50

by Tal Gilboa

[permalink] [raw]
Subject: Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

On 7/24/2018 2:59 AM, Alex G. wrote:
>
>
> On 07/23/2018 05:14 PM, Jakub Kicinski wrote:
>> On Tue, 24 Jul 2018 00:52:22 +0300, Tal Gilboa wrote:
>>> On 7/24/2018 12:01 AM, Jakub Kicinski wrote:
>>>> On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote:
>>>>> PCIe downtraining happens when both the device and PCIe port are
>>>>> capable of a larger bus width or higher speed than negotiated.
>>>>> Downtraining might be indicative of other problems in the system, and
>>>>> identifying this from userspace is neither intuitive, nor
>>>>> straightforward.
>>>>>
>>>>> The easiest way to detect this is with pcie_print_link_status(),
>>>>> since the bottleneck is usually the link that is downtrained. It's not
>>>>> a perfect solution, but it works extremely well in most cases.
>>>>>
>>>>> Signed-off-by: Alexandru Gagniuc <[email protected]>
>>>>> ---
>>>>>
>>>>> For the sake of review, I've created a __pcie_print_link_status()
>>>>> which
>>>>> takes a 'verbose' argument. If we agree to go this route and update
>>>>> the users of pcie_print_link_status(), I can split this up into two
>>>>> patches. I prefer just printing this information in the core
>>>>> functions, and letting drivers not have to worry about it. Though
>>>>> there seems to be strong support for not going that route, so here
>>>>> it goes:
>>>>
>>>> FWIW the networking drivers print PCIe BW because sometimes the network
>>>> bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps
>>>> card on a x8 link.
>>>>
>>>> Sorry to bike shed, but currently the networking cards print the info
>>>> during probe. Would it make sense to move your message closer to probe
>>>> time, rather than when the device is added? If the driver structure is
>>>> available, we could also consider adding a boolean to struct pci_driver
>>>> to indicate whether the driver wants the verbose message. This way we
>>>> avoid duplicated prints.
>>>>
>>>> I have no objection to the current patch, it LGTM. Just a thought.
>>>
>>> I don't see the reason for having two functions. What's the problem with
>>> adding the verbose argument to the original function?
>>
>> IMHO it's reasonable to keep the default at what 90% of users
>> want by means of a wrapper. The non-verbose output is provided by
>> the core already for all devices.
>>
>> What do you think of my proposal above, Tal? That would make the extra
>> wrapper unnecessary, since the verbose parameter would be part of the
>> driver structure, and it would avoid the duplicated output.
>
> I see how it might make sense to add another member to the driver
> struct, but is it worth the extra learning curve? It seems to be
> something with the potential to confuse new driver developers, while
> offering only a very marginal benefit.
> Although, if that's what people want...

I prefer the wrapper function. Looking at struct pci_driver it would
seem strange for it to hold a field for controlling verbosity (IMO).
This is a very (very) specific field in a very general struct.

>
> Alex

2018-07-25 01:27:17

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2] PCI/AER: Do not clear AER bits if we don't own AER

Hi Alexandru,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on pci/next]
[also build test WARNING on v4.18-rc6 next-20180724]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Alexandru-Gagniuc/PCI-AER-Do-not-clear-AER-bits-if-we-don-t-own-AER/20180725-003708
base: https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
config: x86_64-randconfig-s0-07250049 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64

All warnings (new ones prefixed by >>):

In file included from include/linux/string.h:6:0,
from include/linux/uuid.h:20,
from include/linux/cper.h:24,
from drivers/pci/pcie/aer.c:15:
drivers/pci/pcie/aer.c: In function 'pci_enable_pcie_error_reporting':
drivers/pci/pcie/aer.c:371:7: error: implicit declaration of function 'pcie_aer_is_kernel_first' [-Werror=implicit-function-declaration]
if (!pcie_aer_is_kernel_first(dev))
^
include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
if (__builtin_constant_p(!!(cond)) ? !!(cond) : \
^~~~
>> drivers/pci/pcie/aer.c:371:2: note: in expansion of macro 'if'
if (!pcie_aer_is_kernel_first(dev))
^~
cc1: some warnings being treated as errors

vim +/if +371 drivers/pci/pcie/aer.c

365
366 #define PCI_EXP_AER_FLAGS (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
367 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
368
369 int pci_enable_pcie_error_reporting(struct pci_dev *dev)
370 {
> 371 if (!pcie_aer_is_kernel_first(dev))
372 return -EIO;
373
374 return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
375 }
376 EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
377

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2018-07-30 23:27:45

by Alex_Gagniuc

[permalink] [raw]
Subject: Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

On 07/24/2018 08:40 AM, Tal Gilboa wrote:
> On 7/24/2018 2:59 AM, Alex G. wrote:
>>
>>
>> On 07/23/2018 05:14 PM, Jakub Kicinski wrote:
>>> On Tue, 24 Jul 2018 00:52:22 +0300, Tal Gilboa wrote:
>>>> On 7/24/2018 12:01 AM, Jakub Kicinski wrote:
>>>>> On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote:
>>>>>> PCIe downtraining happens when both the device and PCIe port are
>>>>>> capable of a larger bus width or higher speed than negotiated.
>>>>>> Downtraining might be indicative of other problems in the system, and
>>>>>> identifying this from userspace is neither intuitive, nor
>>>>>> straightforward.
>>>>>>
>>>>>> The easiest way to detect this is with pcie_print_link_status(),
>>>>>> since the bottleneck is usually the link that is downtrained. It's not
>>>>>> a perfect solution, but it works extremely well in most cases.
>>>>>>
>>>>>> Signed-off-by: Alexandru Gagniuc <[email protected]>
>>>>>> ---
>>>>>>
>>>>>> For the sake of review, I've created a __pcie_print_link_status()
>>>>>> which
>>>>>> takes a 'verbose' argument. If we agree to go this route and update
>>>>>> the users of pcie_print_link_status(), I can split this up into two
>>>>>> patches. I prefer just printing this information in the core
>>>>>> functions, and letting drivers not have to worry about it. Though
>>>>>> there seems to be strong support for not going that route, so here
>>>>>> it goes:
>>>>>
>>>>> FWIW the networking drivers print PCIe BW because sometimes the network
>>>>> bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps
>>>>> card on a x8 link.
>>>>>
>>>>> Sorry to bike shed, but currently the networking cards print the info
>>>>> during probe. Would it make sense to move your message closer to probe
>>>>> time, rather than when the device is added? If the driver structure is
>>>>> available, we could also consider adding a boolean to struct pci_driver
>>>>> to indicate whether the driver wants the verbose message. This way we
>>>>> avoid duplicated prints.
>>>>>
>>>>> I have no objection to the current patch, it LGTM. Just a thought.
>>>>
>>>> I don't see the reason for having two functions. What's the problem with
>>>> adding the verbose argument to the original function?
>>>
>>> IMHO it's reasonable to keep the default at what 90% of users
>>> want by means of a wrapper. The non-verbose output is provided by
>>> the core already for all devices.
>>>
>>> What do you think of my proposal above, Tal? That would make the extra
>>> wrapper unnecessary, since the verbose parameter would be part of the
>>> driver structure, and it would avoid the duplicated output.
>>
>> I see how it might make sense to add another member to the driver
>> struct, but is it worth the extra learning curve? It seems to be
>> something with the potential to confuse new driver developers, while
>> offering only a very marginal benefit.
>> Although, if that's what people want...
>
> I prefer the wrapper function. Looking at struct pci_driver it would
> seem strange for it to hold a field for controlling verbosity (IMO).
> This is a very (very) specific field in a very general struct.

If people are okay with the wrapper, then I'm not going to update the patch.

Alex

2018-07-31 06:41:42

by Tal Gilboa

[permalink] [raw]
Subject: Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

On 7/24/2018 12:52 AM, Tal Gilboa wrote:
> On 7/24/2018 12:01 AM, Jakub Kicinski wrote:
>> On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote:
>>> PCIe downtraining happens when both the device and PCIe port are
>>> capable of a larger bus width or higher speed than negotiated.
>>> Downtraining might be indicative of other problems in the system, and
>>> identifying this from userspace is neither intuitive, nor
>>> straightforward.
>>>
>>> The easiest way to detect this is with pcie_print_link_status(),
>>> since the bottleneck is usually the link that is downtrained. It's not
>>> a perfect solution, but it works extremely well in most cases.
>>>
>>> Signed-off-by: Alexandru Gagniuc <[email protected]>
>>> ---
>>>
>>> For the sake of review, I've created a __pcie_print_link_status() which
>>> takes a 'verbose' argument. If we agree to go this route and update
>>> the users of pcie_print_link_status(), I can split this up into two
>>> patches. I prefer just printing this information in the core
>>> functions, and letting drivers not have to worry about it. Though
>>> there seems to be strong support for not going that route, so here
>>> it goes:
>>
>> FWIW the networking drivers print PCIe BW because sometimes the network
>> bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps
>> card on a x8 link.
>>
>> Sorry to bike shed, but currently the networking cards print the info
>> during probe. Would it make sense to move your message closer to probe
>> time, rather than when the device is added? If the driver structure is
>> available, we could also consider adding a boolean to struct pci_driver
>> to indicate whether the driver wants the verbose message. This way we
>> avoid duplicated prints.
>>
>> I have no objection to the current patch, it LGTM. Just a thought.
>
> I don't see the reason for having two functions. What's the problem with
> adding the verbose argument to the original function?
>
>>
>>>   drivers/pci/pci.c   | 22 ++++++++++++++++++----
>>>   drivers/pci/probe.c | 21 +++++++++++++++++++++
>>>   include/linux/pci.h |  1 +
>>>   3 files changed, 40 insertions(+), 4 deletions(-) >>
>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>> index 316496e99da9..414ad7b3abdb 100644
>>> --- a/drivers/pci/pci.c
>>> +++ b/drivers/pci/pci.c
>>> @@ -5302,14 +5302,15 @@ u32 pcie_bandwidth_capable(struct pci_dev
>>> *dev, enum pci_bus_speed *speed,
>>>   }
>>>   /**
>>> - * pcie_print_link_status - Report the PCI device's link speed and
>>> width
>>> + * __pcie_print_link_status - Report the PCI device's link speed and
>>> width
>>>    * @dev: PCI device to query
>>> + * @verbose: Be verbose -- print info even when enough bandwidth is
>>> available.
>>>    *
>>>    * Report the available bandwidth at the device.  If this is less
>>> than the
>>>    * device is capable of, report the device's maximum possible
>>> bandwidth and
>>>    * the upstream link that limits its performance to less than that.
>>>    */
>>> -void pcie_print_link_status(struct pci_dev *dev)
>>> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose)
>>>   {
>>>       enum pcie_link_width width, width_cap;
>>>       enum pci_bus_speed speed, speed_cap;
>>> @@ -5319,11 +5320,11 @@ void pcie_print_link_status(struct pci_dev *dev)
>>>       bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
>>>       bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed,
>>> &width);
>>> -    if (bw_avail >= bw_cap)
>>> +    if (bw_avail >= bw_cap && verbose)
>>>           pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s
>>> x%d link)\n",
>>>                bw_cap / 1000, bw_cap % 1000,
>>>                PCIE_SPEED2STR(speed_cap), width_cap);
>>> -    else
>>> +    else if (bw_avail < bw_cap)
>>>           pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth,
>>> limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d
>>> link)\n",
>>>                bw_avail / 1000, bw_avail % 1000,
>>>                PCIE_SPEED2STR(speed), width,
>>> @@ -5331,6 +5332,19 @@ void pcie_print_link_status(struct pci_dev *dev)
>>>                bw_cap / 1000, bw_cap % 1000,
>>>                PCIE_SPEED2STR(speed_cap), width_cap);
>>>   }
>>> +
>>> +/**
>>> + * pcie_print_link_status - Report the PCI device's link speed and
>>> width
>>> + * @dev: PCI device to query
>>> + *
>>> + * Report the available bandwidth at the device.  If this is less
>>> than the
>>> + * device is capable of, report the device's maximum possible
>>> bandwidth and
>>> + * the upstream link that limits its performance to less than that.
>>> + */
>>> +void pcie_print_link_status(struct pci_dev *dev)
>>> +{
>>> +    __pcie_print_link_status(dev, true);
>>> +}
>>>   EXPORT_SYMBOL(pcie_print_link_status);
>>>   /**
>>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>>> index ac876e32de4b..1f7336377c3b 100644
>>> --- a/drivers/pci/probe.c
>>> +++ b/drivers/pci/probe.c
>>> @@ -2205,6 +2205,24 @@ static struct pci_dev *pci_scan_device(struct
>>> pci_bus *bus, int devfn)
>>>       return dev;
>>>   }
>>> +static void pcie_check_upstream_link(struct pci_dev *dev)
>>> +{
>>> +    if (!pci_is_pcie(dev))
>>> +        return;
>>> +
>>> +    /* Look from the device up to avoid downstream ports with no
>>> devices. */
>>> +    if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
>>> +        (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
>>> +        (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
>>> +        return;
>>> +
>>> +    /* Multi-function PCIe share the same link/status. */
>>> +    if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn)
>>> +        return;
>>> +
>>> +    __pcie_print_link_status(dev, false);
>>> +}
>>> +
>>>   static void pci_init_capabilities(struct pci_dev *dev)
>>>   {
>>>       /* Enhanced Allocation */
>>> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct
>>> pci_dev *dev)
>>>       /* Advanced Error Reporting */
>>>       pci_aer_init(dev);
>>> +    /* Check link and detect downtrain errors */
>>> +    pcie_check_upstream_link(dev);
>
> This is called for every PCIe device, right? Won't there be a duplicate
> print in case a device loads with lower PCIe bandwidth than needed?

Alex, can you comment on this please?

>
>>> +
>>>       if (pci_probe_reset_function(dev) == 0)
>>>           dev->reset_fn = 1;
>>>   }
>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>> index abd5d5e17aee..15bfab8f7a1b 100644
>>> --- a/include/linux/pci.h
>>> +++ b/include/linux/pci.h
>>> @@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
>>>   u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev
>>> **limiting_dev,
>>>                    enum pci_bus_speed *speed,
>>>                    enum pcie_link_width *width);
>>> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
>>>   void pcie_print_link_status(struct pci_dev *dev);
>>>   int pcie_flr(struct pci_dev *dev);
>>>   int __pci_reset_function_locked(struct pci_dev *dev);
>>

2018-07-31 15:11:55

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

On 07/31/2018 01:40 AM, Tal Gilboa wrote:
[snip]
>>>> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct
>>>> pci_dev *dev)
>>>>       /* Advanced Error Reporting */
>>>>       pci_aer_init(dev);
>>>> +    /* Check link and detect downtrain errors */
>>>> +    pcie_check_upstream_link(dev);
>>
>> This is called for every PCIe device, right? Won't there be a
>> duplicate print in case a device loads with lower PCIe bandwidth than
>> needed?
>
> Alex, can you comment on this please?

Of course I can.

There's one print at probe() time, which happens if bandwidth < max. I
would think that's fine. There is a way to duplicate it, and that is if
the driver also calls pcie_print_link_status(). A few driver maintainers
who call it have indicated they'd be fine with removing it from the
driver and leaving it in the PCI core.

Should the device come back at low speed, go away, then come back at
full speed, we'd still only see one print, from the first probe. Again,
that assumes the driver doesn't also call the function.
There's the corner case where the device comes up at < max, goes away,
then comes back faster, but still < max. There will be two prints, but
they won't be true duplicates -- each one will indicate a different BW.

Alex

>>>> +
>>>>       if (pci_probe_reset_function(dev) == 0)
>>>>           dev->reset_fn = 1;
>>>>   }
>>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>>> index abd5d5e17aee..15bfab8f7a1b 100644
>>>> --- a/include/linux/pci.h
>>>> +++ b/include/linux/pci.h
>>>> @@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
>>>>   u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev
>>>> **limiting_dev,
>>>>                    enum pci_bus_speed *speed,
>>>>                    enum pcie_link_width *width);
>>>> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
>>>>   void pcie_print_link_status(struct pci_dev *dev);
>>>>   int pcie_flr(struct pci_dev *dev);
>>>>   int __pci_reset_function_locked(struct pci_dev *dev);
>>>

2018-08-05 07:06:35

by Tal Gilboa

[permalink] [raw]
Subject: Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

On 7/31/2018 6:10 PM, Alex G. wrote:
> On 07/31/2018 01:40 AM, Tal Gilboa wrote:
> [snip]
>>>>> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct
>>>>> pci_dev *dev)
>>>>>       /* Advanced Error Reporting */
>>>>>       pci_aer_init(dev);
>>>>> +    /* Check link and detect downtrain errors */
>>>>> +    pcie_check_upstream_link(dev);
>>>
>>> This is called for every PCIe device, right? Won't there be a
>>> duplicate print in case a device loads with lower PCIe bandwidth
>>> than needed?
>>
>> Alex, can you comment on this please?
>
> Of course I can.
>
> There's one print at probe() time, which happens if bandwidth < max. I
> would think that's fine. There is a way to duplicate it, and that is if
> the driver also calls pcie_print_link_status(). A few driver maintainers
> who call it have indicated they'd be fine with removing it from the
> driver and leaving it in the PCI core.

We would be fine with that as well. Please include the removal in your
patches.

>
> Should the device come back at low speed, go away, then come back at
> full speed, we'd still only see one print, from the first probe. Again,
> that assumes the driver doesn't also call the function.
> There's the corner case where the device comes up at < max, goes away,
> then comes back faster, but still < max. There will be two prints, but
> they won't be true duplicates -- each one will indicate a different BW.

This is fine.

>
> Alex
>
>>>>> +
>>>>>       if (pci_probe_reset_function(dev) == 0)
>>>>>           dev->reset_fn = 1;
>>>>>   }
>>>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>>>> index abd5d5e17aee..15bfab8f7a1b 100644
>>>>> --- a/include/linux/pci.h
>>>>> +++ b/include/linux/pci.h
>>>>> @@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
>>>>>   u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev
>>>>> **limiting_dev,
>>>>>                    enum pci_bus_speed *speed,
>>>>>                    enum pcie_link_width *width);
>>>>> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
>>>>>   void pcie_print_link_status(struct pci_dev *dev);
>>>>>   int pcie_flr(struct pci_dev *dev);
>>>>>   int __pci_reset_function_locked(struct pci_dev *dev);
>>>>

2018-08-06 20:10:49

by Alex_Gagniuc

[permalink] [raw]
Subject: Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

On 08/05/2018 02:06 AM, Tal Gilboa wrote:
> On 7/31/2018 6:10 PM, Alex G. wrote:
>> On 07/31/2018 01:40 AM, Tal Gilboa wrote:
>> [snip]
>>>>>> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct
>>>>>> pci_dev *dev)
>>>>>>       /* Advanced Error Reporting */
>>>>>>       pci_aer_init(dev);
>>>>>> +    /* Check link and detect downtrain errors */
>>>>>> +    pcie_check_upstream_link(dev);
>>>>
>>>> This is called for every PCIe device, right? Won't there be a
>>>> duplicate print in case a device loads with lower PCIe bandwidth
>>>> than needed?
>>>
>>> Alex, can you comment on this please?
>>
>> Of course I can.
>>
>> There's one print at probe() time, which happens if bandwidth < max. I
>> would think that's fine. There is a way to duplicate it, and that is if
>> the driver also calls pcie_print_link_status(). A few driver maintainers
>> who call it have indicated they'd be fine with removing it from the
>> driver and leaving it in the PCI core.
>
> We would be fine with that as well. Please include the removal in your
> patches.

What's the proper procedure? Do I wait for confirmation from Bjorn
before knocking on maintainers' doors, or do I William Wallace into
their trees and demand they merge the removal (pending Bjorn's approval
on the other side)?

Alex

>>
>> Should the device come back at low speed, go away, then come back at
>> full speed, we'd still only see one print, from the first probe. Again,
>> that assumes the driver doesn't also call the function.
>> There's the corner case where the device comes up at < max, goes away,
>> then comes back faster, but still < max. There will be two prints, but
>> they won't be true duplicates -- each one will indicate a different BW.
>
> This is fine.
>
>>
>> Alex
>>
>>>>>> +
>>>>>>       if (pci_probe_reset_function(dev) == 0)
>>>>>>           dev->reset_fn = 1;
>>>>>>   }
>>>>>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>>>>>> index abd5d5e17aee..15bfab8f7a1b 100644
>>>>>> --- a/include/linux/pci.h
>>>>>> +++ b/include/linux/pci.h
>>>>>> @@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
>>>>>>   u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev
>>>>>> **limiting_dev,
>>>>>>                    enum pci_bus_speed *speed,
>>>>>>                    enum pcie_link_width *width);
>>>>>> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
>>>>>>   void pcie_print_link_status(struct pci_dev *dev);
>>>>>>   int pcie_flr(struct pci_dev *dev);
>>>>>>   int __pci_reset_function_locked(struct pci_dev *dev);
>>>>>
>


2018-08-06 21:22:36

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v5] PCI: Check for PCIe downtraining conditions

On Mon, Aug 6, 2018 at 1:39 PM <[email protected]> wrote:
>
> On 08/05/2018 02:06 AM, Tal Gilboa wrote:
> > On 7/31/2018 6:10 PM, Alex G. wrote:
> >> On 07/31/2018 01:40 AM, Tal Gilboa wrote:
> >> [snip]
> >>>>>> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct
> >>>>>> pci_dev *dev)
> >>>>>> /* Advanced Error Reporting */
> >>>>>> pci_aer_init(dev);
> >>>>>> + /* Check link and detect downtrain errors */
> >>>>>> + pcie_check_upstream_link(dev);
> >>>>
> >>>> This is called for every PCIe device, right? Won't there be a
> >>>> duplicate print in case a device loads with lower PCIe bandwidth
> >>>> than needed?
> >>>
> >>> Alex, can you comment on this please?
> >>
> >> Of course I can.
> >>
> >> There's one print at probe() time, which happens if bandwidth < max. I
> >> would think that's fine. There is a way to duplicate it, and that is if
> >> the driver also calls pcie_print_link_status(). A few driver maintainers
> >> who call it have indicated they'd be fine with removing it from the
> >> driver and leaving it in the PCI core.
> >
> > We would be fine with that as well. Please include the removal in your
> > patches.
>
> What's the proper procedure? Do I wait for confirmation from Bjorn
> before knocking on maintainers' doors, or do I William Wallace into
> their trees and demand they merge the removal (pending Bjorn's approval
> on the other side)?

Post a v4 series that does the PCI core stuff as well as removing the
driver code.

Bjorn

2018-08-06 23:27:25

by Alexandru Gagniuc

[permalink] [raw]
Subject: [PATCH v6 1/9] PCI: Check for PCIe downtraining conditions

PCIe downtraining happens when both the device and PCIe port are
capable of a larger bus width or higher speed than negotiated.
Downtraining might be indicative of other problems in the system, and
identifying this from userspace is neither intuitive, nor
straightforward.

The easiest way to detect this is with pcie_print_link_status(),
since the bottleneck is usually the link that is downtrained. It's not
a perfect solution, but it works extremely well in most cases.

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/pci/pci.c | 22 ++++++++++++++++++----
drivers/pci/probe.c | 21 +++++++++++++++++++++
include/linux/pci.h | 1 +
3 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 316496e99da9..414ad7b3abdb 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5302,14 +5302,15 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
}

/**
- * pcie_print_link_status - Report the PCI device's link speed and width
+ * __pcie_print_link_status - Report the PCI device's link speed and width
* @dev: PCI device to query
+ * @verbose: Be verbose -- print info even when enough bandwidth is available.
*
* Report the available bandwidth at the device. If this is less than the
* device is capable of, report the device's maximum possible bandwidth and
* the upstream link that limits its performance to less than that.
*/
-void pcie_print_link_status(struct pci_dev *dev)
+void __pcie_print_link_status(struct pci_dev *dev, bool verbose)
{
enum pcie_link_width width, width_cap;
enum pci_bus_speed speed, speed_cap;
@@ -5319,11 +5320,11 @@ void pcie_print_link_status(struct pci_dev *dev)
bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);

- if (bw_avail >= bw_cap)
+ if (bw_avail >= bw_cap && verbose)
pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n",
bw_cap / 1000, bw_cap % 1000,
PCIE_SPEED2STR(speed_cap), width_cap);
- else
+ else if (bw_avail < bw_cap)
pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
bw_avail / 1000, bw_avail % 1000,
PCIE_SPEED2STR(speed), width,
@@ -5331,6 +5332,19 @@ void pcie_print_link_status(struct pci_dev *dev)
bw_cap / 1000, bw_cap % 1000,
PCIE_SPEED2STR(speed_cap), width_cap);
}
+
+/**
+ * pcie_print_link_status - Report the PCI device's link speed and width
+ * @dev: PCI device to query
+ *
+ * Report the available bandwidth at the device. If this is less than the
+ * device is capable of, report the device's maximum possible bandwidth and
+ * the upstream link that limits its performance to less than that.
+ */
+void pcie_print_link_status(struct pci_dev *dev)
+{
+ __pcie_print_link_status(dev, true);
+}
EXPORT_SYMBOL(pcie_print_link_status);

/**
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 611adcd9c169..1c8c26dd2cb2 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2205,6 +2205,24 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
return dev;
}

+static void pcie_check_upstream_link(struct pci_dev *dev)
+{
+ if (!pci_is_pcie(dev))
+ return;
+
+ /* Look from the device up to avoid downstream ports with no devices. */
+ if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
+ (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
+ (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
+ return;
+
+ /* Multi-function PCIe share the same link/status. */
+ if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn)
+ return;
+
+ __pcie_print_link_status(dev, false);
+}
+
static void pci_init_capabilities(struct pci_dev *dev)
{
/* Enhanced Allocation */
@@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
/* Advanced Error Reporting */
pci_aer_init(dev);

+ /* Check link and detect downtrain errors */
+ pcie_check_upstream_link(dev);
+
if (pci_probe_reset_function(dev) == 0)
dev->reset_fn = 1;
}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index c133ccfa002e..d212de231259 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1087,6 +1087,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
enum pci_bus_speed *speed,
enum pcie_link_width *width);
+void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
void pcie_print_link_status(struct pci_dev *dev);
int pcie_flr(struct pci_dev *dev);
int __pci_reset_function_locked(struct pci_dev *dev);
--
2.17.1


2018-08-06 23:27:40

by Alexandru Gagniuc

[permalink] [raw]
Subject: [PATCH v6 2/9] bnx2x: Do not call pcie_print_link_status()

This is now done by the PCI core to warn of sub-optimal bandwidth.

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 57348f2b49a3..3eadd6201dff 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -14100,7 +14100,6 @@ static int bnx2x_init_one(struct pci_dev *pdev,
board_info[ent->driver_data].name,
(CHIP_REV(bp) >> 12) + 'A', (CHIP_METAL(bp) >> 4),
dev->base_addr, bp->pdev->irq, dev->dev_addr);
- pcie_print_link_status(bp->pdev);

bnx2x_register_phc(bp);

--
2.17.1


2018-08-06 23:27:51

by Alexandru Gagniuc

[permalink] [raw]
Subject: [PATCH v6 3/9] bnxt_en: Do not call pcie_print_link_status()

This is now done by the PCI core to warn of sub-optimal bandwidth.

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 4394c1162be4..4b3928c89076 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -8909,7 +8909,6 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
netdev_info(dev, "%s found at mem %lx, node addr %pM\n",
board_info[ent->driver_data].name,
(long)pci_resource_start(pdev, 0), dev->dev_addr);
- pcie_print_link_status(pdev);

return 0;

--
2.17.1


2018-08-06 23:28:02

by Alexandru Gagniuc

[permalink] [raw]
Subject: [PATCH v6 4/9] cxgb4: Do not call pcie_print_link_status()

This is now done by the PCI core to warn of sub-optimal bandwidth.

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index a8926e97935e..049958898c17 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -5726,9 +5726,6 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
free_msix_info(adapter);
}

- /* check for PCI Express bandwidth capabiltites */
- pcie_print_link_status(pdev);
-
err = init_rss(adapter);
if (err)
goto out_free_dev;
--
2.17.1


2018-08-06 23:28:09

by Alexandru Gagniuc

[permalink] [raw]
Subject: [PATCH v6 5/9] fm10k: Do not call pcie_print_link_status()

This is now done by the PCI core to warn of sub-optimal bandwidth.

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
index 15071e4adb98..079fd3c884ea 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_pci.c
@@ -2224,9 +2224,6 @@ static int fm10k_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
/* kick off service timer now, even when interface is down */
mod_timer(&interface->service_timer, (HZ * 2) + jiffies);

- /* print warning for non-optimal configurations */
- pcie_print_link_status(interface->pdev);
-
/* report MAC address for logging */
dev_info(&pdev->dev, "%pM\n", netdev->dev_addr);

--
2.17.1


2018-08-06 23:28:19

by Alexandru Gagniuc

[permalink] [raw]
Subject: [PATCH v6 6/9] ixgbe: Do not call pcie_print_link_status()

This is now done by the PCI core to warn of sub-optimal bandwidth.

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 26 -------------------
1 file changed, 26 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 62e57b05a0ae..7ecdc6c03a66 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -241,28 +241,6 @@ static inline bool ixgbe_pcie_from_parent(struct ixgbe_hw *hw)
}
}

-static void ixgbe_check_minimum_link(struct ixgbe_adapter *adapter,
- int expected_gts)
-{
- struct ixgbe_hw *hw = &adapter->hw;
- struct pci_dev *pdev;
-
- /* Some devices are not connected over PCIe and thus do not negotiate
- * speed. These devices do not have valid bus info, and thus any report
- * we generate may not be correct.
- */
- if (hw->bus.type == ixgbe_bus_type_internal)
- return;
-
- /* determine whether to use the parent device */
- if (ixgbe_pcie_from_parent(&adapter->hw))
- pdev = adapter->pdev->bus->parent->self;
- else
- pdev = adapter->pdev;
-
- pcie_print_link_status(pdev);
-}
-
static void ixgbe_service_event_schedule(struct ixgbe_adapter *adapter)
{
if (!test_bit(__IXGBE_DOWN, &adapter->state) &&
@@ -10585,10 +10563,6 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
break;
}

- /* don't check link if we failed to enumerate functions */
- if (expected_gts > 0)
- ixgbe_check_minimum_link(adapter, expected_gts);
-
err = ixgbe_read_pba_string_generic(hw, part_str, sizeof(part_str));
if (err)
strlcpy(part_str, "Unknown", sizeof(part_str));
--
2.17.1


2018-08-06 23:28:24

by Alexandru Gagniuc

[permalink] [raw]
Subject: [PATCH v6 7/9] net/mlx4: Do not call pcie_print_link_status()

This is now done by the PCI core to warn of sub-optimal bandwidth.

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/net/ethernet/mellanox/mlx4/main.c | 7 -------
1 file changed, 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 872014702fc1..da4d54195853 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -3398,13 +3398,6 @@ static int mlx4_load_one(struct pci_dev *pdev, int pci_dev_data,
}
}

- /* check if the device is functioning at its maximum possible speed.
- * No return code for this call, just warn the user in case of PCI
- * express device capabilities are under-satisfied by the bus.
- */
- if (!mlx4_is_slave(dev))
- pcie_print_link_status(dev->persist->pdev);
-
/* In master functions, the communication channel must be initialized
* after obtaining its address from fw */
if (mlx4_is_master(dev)) {
--
2.17.1


2018-08-06 23:28:48

by Alexandru Gagniuc

[permalink] [raw]
Subject: [PATCH v6 9/9] nfp: Do not call pcie_print_link_status()

This is now done by the PCI core to warn of sub-optimal bandwidth.

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c
index 749655c329b2..0324f99bd1a7 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c
@@ -1324,7 +1324,6 @@ struct nfp_cpp *nfp_cpp_from_nfp6000_pcie(struct pci_dev *pdev)
/* Finished with card initialization. */
dev_info(&pdev->dev,
"Netronome Flow Processor NFP4000/NFP6000 PCIe Card Probe\n");
- pcie_print_link_status(pdev);

nfp = kzalloc(sizeof(*nfp), GFP_KERNEL);
if (!nfp) {
--
2.17.1


2018-08-06 23:29:07

by Alexandru Gagniuc

[permalink] [raw]
Subject: [PATCH v6 8/9] net/mlx5: Do not call pcie_print_link_status()

This is now done by the PCI core to warn of sub-optimal bandwidth.

Signed-off-by: Alexandru Gagniuc <[email protected]>
---
drivers/net/ethernet/mellanox/mlx5/core/main.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 615005e63819..e10f9c2bea3b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1045,10 +1045,6 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
dev_info(&pdev->dev, "firmware version: %d.%d.%d\n", fw_rev_maj(dev),
fw_rev_min(dev), fw_rev_sub(dev));

- /* Only PFs hold the relevant PCIe information for this query */
- if (mlx5_core_is_pf(dev))
- pcie_print_link_status(dev->pdev);
-
/* on load removing any previous indication of internal error, device is
* up
*/
--
2.17.1


2018-08-07 18:16:20

by Jeff Kirsher

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] ixgbe: Do not call pcie_print_link_status()

On Mon, 2018-08-06 at 18:25 -0500, Alexandru Gagniuc wrote:
> This is now done by the PCI core to warn of sub-optimal bandwidth.
>
> Signed-off-by: Alexandru Gagniuc <[email protected]>
> ---
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 26 ---------------
> ----
> 1 file changed, 26 deletions(-)

Acked-by: Jeff Kirsher <[email protected]>


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2018-08-07 18:16:39

by Jeff Kirsher

[permalink] [raw]
Subject: Re: [PATCH v6 5/9] fm10k: Do not call pcie_print_link_status()

On Mon, 2018-08-06 at 18:25 -0500, Alexandru Gagniuc wrote:
> This is now done by the PCI core to warn of sub-optimal bandwidth.
>
> Signed-off-by: Alexandru Gagniuc <[email protected]>
> ---
> drivers/net/ethernet/intel/fm10k/fm10k_pci.c | 3 ---
> 1 file changed, 3 deletions(-)

Acked-by: Jeff Kirsher <[email protected]>


Attachments:
signature.asc (849.00 B)
This is a digitally signed message part

2018-08-07 20:13:04

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v6 1/9] PCI: Check for PCIe downtraining conditions

From: Alexandru Gagniuc <[email protected]>
Date: Mon, 6 Aug 2018 18:25:35 -0500

> PCIe downtraining happens when both the device and PCIe port are
> capable of a larger bus width or higher speed than negotiated.
> Downtraining might be indicative of other problems in the system, and
> identifying this from userspace is neither intuitive, nor
> straightforward.
>
> The easiest way to detect this is with pcie_print_link_status(),
> since the bottleneck is usually the link that is downtrained. It's not
> a perfect solution, but it works extremely well in most cases.
>
> Signed-off-by: Alexandru Gagniuc <[email protected]>

Feel free to merge this entire series via the PCI tree.

For the series:

Acked-by: David S. Miller <[email protected]>

2018-08-07 21:42:49

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v6 1/9] PCI: Check for PCIe downtraining conditions

On Mon, Aug 06, 2018 at 06:25:35PM -0500, Alexandru Gagniuc wrote:
> PCIe downtraining happens when both the device and PCIe port are
> capable of a larger bus width or higher speed than negotiated.
> Downtraining might be indicative of other problems in the system, and
> identifying this from userspace is neither intuitive, nor
> straightforward.
>
> The easiest way to detect this is with pcie_print_link_status(),
> since the bottleneck is usually the link that is downtrained. It's not
> a perfect solution, but it works extremely well in most cases.

After this series, there are no callers of pcie_print_link_status(),
which means we *only* print something if a device is capable of more
bandwidth than the fabric can deliver.

ISTR some desire to have this information for NICs even if the device
isn't limited, so I'm just double-checking to make sure the driver
guys are OK with this change.

There are no callers of __pcie_print_link_status() outside the PCI
core, so I would move the declaration from include/linux/pci.h to
drivers/pci/pci.h.

If we agree that we *never* need to print anything unless a device is
constrained by the fabric, I would get rid of the "verbose" flag and
keep everything in pcie_print_link_status().
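
For illustration, a minimal sketch of that single-function form --
printing only when the device is constrained by the fabric -- might
look like the following. It reuses the helpers from the patch below;
this is a sketch, not a posted revision:

void pcie_print_link_status(struct pci_dev *dev)
{
	enum pcie_link_width width, width_cap;
	enum pci_bus_speed speed, speed_cap;
	struct pci_dev *limiting_dev = NULL;
	u32 bw_avail, bw_cap;

	bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
	bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);

	/* Stay quiet when the link already delivers full capability. */
	if (bw_avail >= bw_cap)
		return;

	pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
		 bw_avail / 1000, bw_avail % 1000,
		 PCIE_SPEED2STR(speed), width,
		 limiting_dev ? pci_name(limiting_dev) : "<unknown>",
		 bw_cap / 1000, bw_cap % 1000,
		 PCIE_SPEED2STR(speed_cap), width_cap);
}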

> Signed-off-by: Alexandru Gagniuc <[email protected]>
> ---
> drivers/pci/pci.c | 22 ++++++++++++++++++----
> drivers/pci/probe.c | 21 +++++++++++++++++++++
> include/linux/pci.h | 1 +
> 3 files changed, 40 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 316496e99da9..414ad7b3abdb 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5302,14 +5302,15 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
> }
>
> /**
> - * pcie_print_link_status - Report the PCI device's link speed and width
> + * __pcie_print_link_status - Report the PCI device's link speed and width
> * @dev: PCI device to query
> + * @verbose: Be verbose -- print info even when enough bandwidth is available.
> *
> * Report the available bandwidth at the device. If this is less than the
> * device is capable of, report the device's maximum possible bandwidth and
> * the upstream link that limits its performance to less than that.
> */
> -void pcie_print_link_status(struct pci_dev *dev)
> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose)
> {
> enum pcie_link_width width, width_cap;
> enum pci_bus_speed speed, speed_cap;
> @@ -5319,11 +5320,11 @@ void pcie_print_link_status(struct pci_dev *dev)
> bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
> bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);
>
> - if (bw_avail >= bw_cap)
> + if (bw_avail >= bw_cap && verbose)
> pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n",
> bw_cap / 1000, bw_cap % 1000,
> PCIE_SPEED2STR(speed_cap), width_cap);
> - else
> + else if (bw_avail < bw_cap)
> pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
> bw_avail / 1000, bw_avail % 1000,
> PCIE_SPEED2STR(speed), width,
> @@ -5331,6 +5332,19 @@ void pcie_print_link_status(struct pci_dev *dev)
> bw_cap / 1000, bw_cap % 1000,
> PCIE_SPEED2STR(speed_cap), width_cap);
> }
> +
> +/**
> + * pcie_print_link_status - Report the PCI device's link speed and width
> + * @dev: PCI device to query
> + *
> + * Report the available bandwidth at the device. If this is less than the
> + * device is capable of, report the device's maximum possible bandwidth and
> + * the upstream link that limits its performance to less than that.
> + */
> +void pcie_print_link_status(struct pci_dev *dev)
> +{
> + __pcie_print_link_status(dev, true);
> +}
> EXPORT_SYMBOL(pcie_print_link_status);
>
> /**
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 611adcd9c169..1c8c26dd2cb2 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2205,6 +2205,24 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
> return dev;
> }
>
> +static void pcie_check_upstream_link(struct pci_dev *dev)
> +{
> + if (!pci_is_pcie(dev))
> + return;
> +
> + /* Look from the device up to avoid downstream ports with no devices. */
> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
> + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
> + return;
> +
> + /* Multi-function PCIe share the same link/status. */
> + if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn)
> + return;
> +
> + __pcie_print_link_status(dev, false);
> +}
> +
> static void pci_init_capabilities(struct pci_dev *dev)
> {
> /* Enhanced Allocation */
> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct pci_dev *dev)
> /* Advanced Error Reporting */
> pci_aer_init(dev);
>
> + /* Check link and detect downtrain errors */
> + pcie_check_upstream_link(dev);
> +
> if (pci_probe_reset_function(dev) == 0)
> dev->reset_fn = 1;
> }
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index c133ccfa002e..d212de231259 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1087,6 +1087,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
> u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
> enum pci_bus_speed *speed,
> enum pcie_link_width *width);
> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose);
> void pcie_print_link_status(struct pci_dev *dev);
> int pcie_flr(struct pci_dev *dev);
> int __pci_reset_function_locked(struct pci_dev *dev);
> --
> 2.17.1
>

2018-08-08 06:09:57

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] net/mlx5: Do not call pcie_print_link_status()

On Mon, Aug 06, 2018 at 06:25:42PM -0500, Alexandru Gagniuc wrote:
> This is now done by the PCI core to warn of sub-optimal bandwidth.
>
> Signed-off-by: Alexandru Gagniuc <[email protected]>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/main.c | 4 ----
> 1 file changed, 4 deletions(-)
>

Thanks,
Reviewed-by: Leon Romanovsky <[email protected]>



2018-08-08 06:11:28

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v6 7/9] net/mlx4: Do not call pcie_print_link_status()

On Mon, Aug 06, 2018 at 06:25:41PM -0500, Alexandru Gagniuc wrote:
> This is now done by the PCI core to warn of sub-optimal bandwidth.
>
> Signed-off-by: Alexandru Gagniuc <[email protected]>
> ---
> drivers/net/ethernet/mellanox/mlx4/main.c | 7 -------
> 1 file changed, 7 deletions(-)
>

Thanks,
Reviewed-by: Leon Romanovsky <[email protected]>



2018-08-08 14:24:53

by Tal Gilboa

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] net/mlx5: Do not call pcie_print_link_status()

On 8/8/2018 9:08 AM, Leon Romanovsky wrote:
> On Mon, Aug 06, 2018 at 06:25:42PM -0500, Alexandru Gagniuc wrote:
>> This is now done by the PCI core to warn of sub-optimal bandwidth.
>>
>> Signed-off-by: Alexandru Gagniuc <[email protected]>
>> ---
>> drivers/net/ethernet/mellanox/mlx5/core/main.c | 4 ----
>> 1 file changed, 4 deletions(-)
>>
>
> Thanks,
> Reviewed-by: Leon Romanovsky <[email protected]>
>

Alex,
I loaded the mlx5 driver with and without this series. The report in
dmesg is now missing. From what I understood, the status should be
reported at least once, even if everything is in order. We need this
functionality to stay.

net-next (dmesg output for 07:00.0):
[270498.625351] mlx5_core 0000:07:00.0: firmware version: 14.22.4020
[270498.632130] mlx5_core 0000:07:00.0: 63.008 Gb/s available PCIe
bandwidth (8 GT/s x8 link)
[270499.169533] (0000:07:00.0): E-Switch: Total vports 9, per vport: max
uc(1024) max mc(16384)
[270499.182358] mlx5_core 0000:07:00.0: Port module event: module 0,
Cable plugged

net-next + patches (dmesg output for 07:00.0):
[ 331.608472] mlx5_core 0000:07:00.0: firmware version: 14.22.4020
[ 332.564938] (0000:07:00.0): E-Switch: Total vports 9, per vport: max
uc(1024) max mc(16384)
[ 332.616271] mlx5_core 0000:07:00.0: Port module event: module 0,
Cable plugged



2018-08-08 15:43:31

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] net/mlx5: Do not call pcie_print_link_status()

On Wed, Aug 08, 2018 at 05:23:12PM +0300, Tal Gilboa wrote:
> On 8/8/2018 9:08 AM, Leon Romanovsky wrote:
> > On Mon, Aug 06, 2018 at 06:25:42PM -0500, Alexandru Gagniuc wrote:
> > > This is now done by the PCI core to warn of sub-optimal bandwidth.
> > >
> > > Signed-off-by: Alexandru Gagniuc <[email protected]>
> > > ---
> > > drivers/net/ethernet/mellanox/mlx5/core/main.c | 4 ----
> > > 1 file changed, 4 deletions(-)
> > >
> >
> > Thanks,
> > Reviewed-by: Leon Romanovsky <[email protected]>
> >
>
> Alex,
> I loaded the mlx5 driver with and without this series. The report in
> dmesg is now missing. From what I understood, the status should be
> reported at least once, even if everything is in order.

That is not what this series does; it removes the print completely if
the fabric can deliver more than the card is capable of.

> We need this functionality to stay.

I'm not sure that you need this information in the driver's dmesg
output; most probably it should be something globally visible and
accessible per PCI device.

>
> net-next (dmesg output for 07:00.0):
> [270498.625351] mlx5_core 0000:07:00.0: firmware version: 14.22.4020
> [270498.632130] mlx5_core 0000:07:00.0: 63.008 Gb/s available PCIe bandwidth
> (8 GT/s x8 link)
> [270499.169533] (0000:07:00.0): E-Switch: Total vports 9, per vport: max
> uc(1024) max mc(16384)
> [270499.182358] mlx5_core 0000:07:00.0: Port module event: module 0, Cable
> plugged
>
> net-next + patches (dmesg output for 07:00.0):
> [ 331.608472] mlx5_core 0000:07:00.0: firmware version: 14.22.4020
> [ 332.564938] (0000:07:00.0): E-Switch: Total vports 9, per vport: max
> uc(1024) max mc(16384)
> [ 332.616271] mlx5_core 0000:07:00.0: Port module event: module 0, Cable
> plugged
>
>



2018-08-08 15:58:27

by Tal Gilboa

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] net/mlx5: Do not call pcie_print_link_status()

On 8/8/2018 6:41 PM, Leon Romanovsky wrote:
> On Wed, Aug 08, 2018 at 05:23:12PM +0300, Tal Gilboa wrote:
>> On 8/8/2018 9:08 AM, Leon Romanovsky wrote:
>>> On Mon, Aug 06, 2018 at 06:25:42PM -0500, Alexandru Gagniuc wrote:
>>>> This is now done by the PCI core to warn of sub-optimal bandwidth.
>>>>
>>>> Signed-off-by: Alexandru Gagniuc <[email protected]>
>>>> ---
>>>> drivers/net/ethernet/mellanox/mlx5/core/main.c | 4 ----
>>>> 1 file changed, 4 deletions(-)
>>>>
>>>
>>> Thanks,
>>> Reviewed-by: Leon Romanovsky <[email protected]>
>>>
>>
>> Alex,
>> I loaded the mlx5 driver with and without this series. The report in
>> dmesg is now missing. From what I understood, the status should be
>> reported at least once, even if everything is in order.
>
> That is not what this series does; it removes the print completely if
> the fabric can deliver more than the card is capable of.
>
>> We need this functionality to stay.
>
> I'm not sure that you need this information in the driver's dmesg
> output; most probably it should be something globally visible and
> accessible per PCI device.

Currently we have users that look for it. If we remove the dmesg print,
we need this to be reported elsewhere. Adding it to sysfs, for example,
would be a valid solution for our case.

>
>>
>> net-next (dmesg output for 07:00.0):
>> [270498.625351] mlx5_core 0000:07:00.0: firmware version: 14.22.4020
>> [270498.632130] mlx5_core 0000:07:00.0: 63.008 Gb/s available PCIe bandwidth
>> (8 GT/s x8 link)
>> [270499.169533] (0000:07:00.0): E-Switch: Total vports 9, per vport: max
>> uc(1024) max mc(16384)
>> [270499.182358] mlx5_core 0000:07:00.0: Port module event: module 0, Cable
>> plugged
>>
>> net-next + patches (dmesg output for 07:00.0):
>> [ 331.608472] mlx5_core 0000:07:00.0: firmware version: 14.22.4020
>> [ 332.564938] (0000:07:00.0): E-Switch: Total vports 9, per vport: max
>> uc(1024) max mc(16384)
>> [ 332.616271] mlx5_core 0000:07:00.0: Port module event: module 0, Cable
>> plugged
>>
>>

2018-08-08 16:35:01

by Alexandru Gagniuc

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] net/mlx5: Do not call pcie_print_link_status()



On 08/08/2018 10:56 AM, Tal Gilboa wrote:
> On 8/8/2018 6:41 PM, Leon Romanovsky wrote:
>> On Wed, Aug 08, 2018 at 05:23:12PM +0300, Tal Gilboa wrote:
>>> On 8/8/2018 9:08 AM, Leon Romanovsky wrote:
>>>> On Mon, Aug 06, 2018 at 06:25:42PM -0500, Alexandru Gagniuc wrote:
>>>>> This is now done by the PCI core to warn of sub-optimal bandwidth.
>>>>>
>>>>> Signed-off-by: Alexandru Gagniuc <[email protected]>
>>>>> ---
>>>>>   drivers/net/ethernet/mellanox/mlx5/core/main.c | 4 ----
>>>>>   1 file changed, 4 deletions(-)
>>>>>
>>>>
>>>> Thanks,
>>>> Reviewed-by: Leon Romanovsky <[email protected]>
>>>>
>>>
>>> Alex,
>>> I loaded the mlx5 driver with and without this series. The report in
>>> dmesg is now missing. From what I understood, the status should be
>>> reported at least once, even if everything is in order.
>>
>> That is not what this series does; it removes the print completely if
>> the fabric can deliver more than the card is capable of.
>>
>>> We need this functionality to stay.
>>
>> I'm not sure that you need this information in the driver's dmesg
>> output; most probably it should be something globally visible and
>> accessible per PCI device.
>
> Currently we have users that look for it. If we remove the dmesg print,
> we need this to be reported elsewhere. Adding it to sysfs, for example,
> would be a valid solution for our case.

I think a stop-gap measure is to leave the pcie_print_link_status() call
in drivers that really need it for whatever reason. Implementing
reliable reporting through sysfs might take some tinkering, and I don't
think that's a sufficient reason to block the heart of this series --
being able to detect bottlenecks and link downtraining.
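
For what it's worth, a sysfs attribute along those lines might be
sketched as below. The link_bandwidth name and its placement are
assumptions for illustration only; no such attribute has been proposed
in this thread:

/* Hypothetical sketch: expose available PCIe bandwidth as, e.g.,
 * /sys/bus/pci/devices/<bdf>/link_bandwidth. Not a posted patch.
 */
static ssize_t link_bandwidth_show(struct device *dev,
				   struct device_attribute *attr, char *buf)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	enum pcie_link_width width;
	enum pci_bus_speed speed;
	u32 bw_avail;

	/* Same helper the core print uses; NULL skips the limiting-device lookup. */
	bw_avail = pcie_bandwidth_available(pdev, NULL, &speed, &width);

	return sprintf(buf, "%u.%03u Gb/s\n", bw_avail / 1000, bw_avail % 1000);
}
static DEVICE_ATTR_RO(link_bandwidth);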

Alex

>>
>>>
>>> net-next (dmesg output for 07:00.0):
>>> [270498.625351] mlx5_core 0000:07:00.0: firmware version: 14.22.4020
>>> [270498.632130] mlx5_core 0000:07:00.0: 63.008 Gb/s available PCIe
>>> bandwidth
>>> (8 GT/s x8 link)
>>> [270499.169533] (0000:07:00.0): E-Switch: Total vports 9, per vport: max
>>> uc(1024) max mc(16384)
>>> [270499.182358] mlx5_core 0000:07:00.0: Port module event: module 0,
>>> Cable
>>> plugged
>>>
>>> net-next + patches (dmesg output for 07:00.0):
>>> [  331.608472] mlx5_core 0000:07:00.0: firmware version: 14.22.4020
>>> [  332.564938] (0000:07:00.0): E-Switch: Total vports 9, per vport: max
>>> uc(1024) max mc(16384)
>>> [  332.616271] mlx5_core 0000:07:00.0: Port module event: module 0,
>>> Cable
>>> plugged
>>>
>>>

2018-08-08 17:30:02

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] net/mlx5: Do not call pcie_print_link_status()

On Wed, Aug 08, 2018 at 11:33:51AM -0500, Alex G. wrote:
>
>
> On 08/08/2018 10:56 AM, Tal Gilboa wrote:
> > On 8/8/2018 6:41 PM, Leon Romanovsky wrote:
> > > On Wed, Aug 08, 2018 at 05:23:12PM +0300, Tal Gilboa wrote:
> > > > On 8/8/2018 9:08 AM, Leon Romanovsky wrote:
> > > > > On Mon, Aug 06, 2018 at 06:25:42PM -0500, Alexandru Gagniuc wrote:
> > > > > > This is now done by the PCI core to warn of sub-optimal bandwidth.
> > > > > >
> > > > > > Signed-off-by: Alexandru Gagniuc <[email protected]>
> > > > > > ---
> > > > > >  drivers/net/ethernet/mellanox/mlx5/core/main.c | 4 ----
> > > > > >  1 file changed, 4 deletions(-)
> > > > > >
> > > > >
> > > > > Thanks,
> > > > > Reviewed-by: Leon Romanovsky <[email protected]>
> > > > >
> > > >
> > > > Alex,
> > > > I loaded the mlx5 driver with and without this series. The report
> > > > in dmesg is now missing. From what I understood, the status should
> > > > be reported at least once, even if everything is in order.
> > >
> > > That is not what this series does; it removes the print completely
> > > if the fabric can deliver more than the card is capable of.
> > >
> > > > We need this functionality to stay.
> > >
> > > I'm not sure you need this information in the driver's dmesg output;
> > > most probably it should be something globally visible and accessible
> > > per PCI device.
> >
> > Currently we have users who look for it. If we remove the dmesg print,
> > this needs to be reported elsewhere. Adding it to sysfs, for example,
> > would be a valid solution for our case.
>
> I think a stop-gap measure is to leave the pcie_print_link_status() call in
> drivers that really need it for whatever reason. Implementing reliable
> reporting through sysfs might take some tinkering, and I don't think it's a
> sufficient reason to block the heart of this series -- being able to detect
> bottlenecks and link downtraining.

IMHO, you made the right change, and it would be better to replace this
print with a more generic solution now, while you are at it, rather than
leave leftovers behind.

Thanks

>
> Alex
>
> > >
> > > >
> > > > net-next (dmesg output for 07:00.0):
> > > > [270498.625351] mlx5_core 0000:07:00.0: firmware version: 14.22.4020
> > > > [270498.632130] mlx5_core 0000:07:00.0: 63.008 Gb/s available
> > > > PCIe bandwidth
> > > > (8 GT/s x8 link)
> > > > [270499.169533] (0000:07:00.0): E-Switch: Total vports 9, per vport: max
> > > > uc(1024) max mc(16384)
> > > > [270499.182358] mlx5_core 0000:07:00.0: Port module event:
> > > > module 0, Cable
> > > > plugged
> > > >
> > > > net-next + patches (dmesg output for 07:00.0):
> > > > [  331.608472] mlx5_core 0000:07:00.0: firmware version: 14.22.4020
> > > > [  332.564938] (0000:07:00.0): E-Switch: Total vports 9, per vport: max
> > > > uc(1024) max mc(16384)
> > > > [  332.616271] mlx5_core 0000:07:00.0: Port module event: module
> > > > 0, Cable
> > > > plugged
> > > >
> > > >



2018-08-09 14:04:11

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] net/mlx5: Do not call pcie_print_link_status()

On Wed, Aug 08, 2018 at 08:27:36PM +0300, Leon Romanovsky wrote:
> On Wed, Aug 08, 2018 at 11:33:51AM -0500, Alex G. wrote:
> >
> >
> > On 08/08/2018 10:56 AM, Tal Gilboa wrote:
> > > On 8/8/2018 6:41 PM, Leon Romanovsky wrote:
> > > > On Wed, Aug 08, 2018 at 05:23:12PM +0300, Tal Gilboa wrote:
> > > > > On 8/8/2018 9:08 AM, Leon Romanovsky wrote:
> > > > > > On Mon, Aug 06, 2018 at 06:25:42PM -0500, Alexandru Gagniuc wrote:
> > > > > > > This is now done by the PCI core to warn of sub-optimal bandwidth.
> > > > > > >
> > > > > > > Signed-off-by: Alexandru Gagniuc <[email protected]>
> > > > > > > ---
> > > > > > >  drivers/net/ethernet/mellanox/mlx5/core/main.c | 4 ----
> > > > > > >  1 file changed, 4 deletions(-)
> > > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Reviewed-by: Leon Romanovsky <[email protected]>
> > > > > >
> > > > >
> > > > > Alex,
> > > > > I loaded the mlx5 driver with and without this series. The
> > > > > report in dmesg is now missing. From what I understood, the
> > > > > status should be reported at least once, even if everything is
> > > > > in order.
> > > >
> > > > That is not what this series does; it removes the print completely
> > > > if the fabric can deliver more than the card is capable of.
> > > >
> > > > > We need this functionality to stay.
> > > >
> > > > I'm not sure you need this information in the driver's dmesg
> > > > output; most probably it should be something globally visible and
> > > > accessible per PCI device.
> > >
> > > Currently we have users who look for it. If we remove the dmesg
> > > print, this needs to be reported elsewhere. Adding it to sysfs, for
> > > example, would be a valid solution for our case.
> >
> > I think a stop-gap measure is to leave the pcie_print_link_status() call in
> > drivers that really need it for whatever reason. Implementing reliable
> > reporting through sysfs might take some tinkering, and I don't think it's a
> > sufficient reason to block the heart of this series -- being able to detect
> > bottlenecks and link downtraining.
>
> IMHO, you made the right change, and it would be better to replace this
> print with a more generic solution now, while you are at it, rather
> than leave leftovers behind.

I'd like to make forward progress on this, so I propose we merge only
the PCI core change (patch 1/9) and drop the individual driver
changes. That would mean:

- We'll get a message from every NIC driver that calls
pcie_print_link_status() as before.

- We'll get a new message from the core for every downtrained link.

- If a link leading to the NIC is downtrained, there will be
duplicate messages. Maybe that's overkill but it's not terrible.

I provisionally put the patch below on my pci/enumeration branch.
Objections?


commit c870cc8cbc4d79014f3daa74d1e412f32e42bf1b
Author: Alexandru Gagniuc <[email protected]>
Date: Mon Aug 6 18:25:35 2018 -0500

PCI: Check for PCIe Link downtraining

When both ends of a PCIe Link are capable of a higher bandwidth than is
currently in use, the Link is said to be "downtrained". A downtrained Link
may indicate hardware or configuration problems in the system, but it's
hard to identify such Links from userspace.

Refactor pcie_print_link_status() so it continues to always print PCIe
bandwidth information, as several NIC drivers desire.

Add a new internal __pcie_print_link_status() to emit a message only when a
device's bandwidth is constrained by the fabric and call it from the PCI
core for all devices, which identifies all downtrained Links. It also
emits messages for a few cases that are technically not downtrained, such
as a x4 device in an open-ended x1 slot.

Signed-off-by: Alexandru Gagniuc <[email protected]>
[bhelgaas: changelog, move __pcie_print_link_status() declaration to
drivers/pci/, rename pcie_check_upstream_link() to
pcie_report_downtraining()]
Signed-off-by: Bjorn Helgaas <[email protected]>

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 97acba712e4e..a84d341504a5 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5264,14 +5264,16 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
}

/**
- * pcie_print_link_status - Report the PCI device's link speed and width
+ * __pcie_print_link_status - Report the PCI device's link speed and width
* @dev: PCI device to query
+ * @verbose: Print info even when enough bandwidth is available
*
- * Report the available bandwidth at the device. If this is less than the
- * device is capable of, report the device's maximum possible bandwidth and
- * the upstream link that limits its performance to less than that.
+ * If the available bandwidth at the device is less than the device is
+ * capable of, report the device's maximum possible bandwidth and the
+ * upstream link that limits its performance. If @verbose, always print
+ * the available bandwidth, even if the device isn't constrained.
*/
-void pcie_print_link_status(struct pci_dev *dev)
+void __pcie_print_link_status(struct pci_dev *dev, bool verbose)
{
enum pcie_link_width width, width_cap;
enum pci_bus_speed speed, speed_cap;
@@ -5281,11 +5283,11 @@ void pcie_print_link_status(struct pci_dev *dev)
bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap);
bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width);

- if (bw_avail >= bw_cap)
+ if (bw_avail >= bw_cap && verbose)
pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n",
bw_cap / 1000, bw_cap % 1000,
PCIE_SPEED2STR(speed_cap), width_cap);
- else
+ else if (bw_avail < bw_cap)
pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n",
bw_avail / 1000, bw_avail % 1000,
PCIE_SPEED2STR(speed), width,
@@ -5293,6 +5295,17 @@ void pcie_print_link_status(struct pci_dev *dev)
bw_cap / 1000, bw_cap % 1000,
PCIE_SPEED2STR(speed_cap), width_cap);
}
+
+/**
+ * pcie_print_link_status - Report the PCI device's link speed and width
+ * @dev: PCI device to query
+ *
+ * Report the available bandwidth at the device.
+ */
+void pcie_print_link_status(struct pci_dev *dev)
+{
+ __pcie_print_link_status(dev, true);
+}
EXPORT_SYMBOL(pcie_print_link_status);

/**
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 70808c168fb9..ce880dab5bc8 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -263,6 +263,7 @@ enum pci_bus_speed pcie_get_speed_cap(struct pci_dev *dev);
enum pcie_link_width pcie_get_width_cap(struct pci_dev *dev);
u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed,
enum pcie_link_width *width);
+void __pcie_print_link_status(struct pci_dev *dev, bool verbose);

/* Single Root I/O Virtualization */
struct pci_sriov {
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index bc147c586643..387fc8ac54ec 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2231,6 +2231,25 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
return dev;
}

+static void pcie_report_downtraining(struct pci_dev *dev)
+{
+ if (!pci_is_pcie(dev))
+ return;
+
+ /* Look from the device up to avoid downstream ports with no devices */
+ if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) &&
+ (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) &&
+ (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM))
+ return;
+
+ /* Multi-function PCIe devices share the same link/status */
+ if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn)
+ return;
+
+ /* Print link status only if the device is constrained by the fabric */
+ __pcie_print_link_status(dev, false);
+}
+
static void pci_init_capabilities(struct pci_dev *dev)
{
/* Enhanced Allocation */
@@ -2266,6 +2285,8 @@ static void pci_init_capabilities(struct pci_dev *dev)
/* Advanced Error Reporting */
pci_aer_init(dev);

+ pcie_report_downtraining(dev);
+
if (pci_probe_reset_function(dev) == 0)
dev->reset_fn = 1;
}
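
To summarize the resulting split (sketch; pdev is any PCIe device):

/* In a driver that opts in: always prints the available bandwidth */
pcie_print_link_status(pdev);

/* In the PCI core, for function 0 of each newly enumerated device:
 * prints only when the fabric limits the device below its capability
 */
__pcie_print_link_status(pdev, false);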