I think we have a problem with link bandwidth change notifications
(see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
Here's a recent bug report where Jan reported "_tons_" of these
notifications on an nvme device:
https://bugzilla.kernel.org/show_bug.cgi?id=206197
There was similar discussion involving GPU drivers at
https://lore.kernel.org/r/[email protected]
The current solution is the CONFIG_PCIE_BW config option, which
disables the messages completely. That option defaults to "off" (no
messages), but even so, I think it's a little problematic.
Users are not really in a position to figure out whether it's safe to
enable. All they can do is experiment and see whether it works with
their current mix of devices and drivers.
I don't think it's currently useful for distros because it's a
compile-time switch, and distros cannot predict what system configs
will be used, so I don't think they can enable it.
Does anybody have proposals for making it smarter about distinguishing
real problems from intentional power management, or maybe interfaces
drivers could use to tell us when we should ignore bandwidth changes?
Bjorn
Hi Bjorn,
I'm no longer working on this, so my memory may not be up to speed. If
the endpoint is causing the bandwidth change, then we should get an
_autonomous_ link management interrupt instead. I don't think we report
those, and that shouldn't spam the logs.
If it's not a (non-autonomous) link management interrupt, then something
is causing the downstream port to do funny things. I don't think ASPM is
supposed to be causing this.
Do we know what's causing these swings?
For now, I suggest a boot-time parameter to disable link speed reporting
instead of a compile time option.
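Something along these lines is what I have in mind -- just a sketch, the
parameter name and the enable flag are made up, and the check would have
to be wired into the bandwidth notification service itself:

/* Sketch only: "pcie_bw_notification=" and pcie_bw_reporting are hypothetical. */
#include <linux/init.h>
#include <linux/string.h>
#include <linux/types.h>

static bool pcie_bw_reporting = true;

static int __init pcie_bw_notification_setup(char *str)
{
	/* "pcie_bw_notification=off" on the kernel command line disables it */
	if (str && !strcmp(str, "off"))
		pcie_bw_reporting = false;
	return 1;
}
__setup("pcie_bw_notification=", pcie_bw_notification_setup);

/* The interrupt handler / message path would then check pcie_bw_reporting
 * before printing anything. */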
Alex
On 1/15/20 4:10 PM, Bjorn Helgaas wrote:
> I think we have a problem with link bandwidth change notifications
> (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
>
> Here's a recent bug report where Jan reported "_tons_" of these
> notifications on an nvme device:
> https://bugzilla.kernel.org/show_bug.cgi?id=206197
>
> There was similar discussion involving GPU drivers at
> https://lore.kernel.org/r/[email protected]
>
> The current solution is the CONFIG_PCIE_BW config option, which
> disables the messages completely. That option defaults to "off" (no
> messages), but even so, I think it's a little problematic.
>
> Users are not really in a position to figure out whether it's safe to
> enable. All they can do is experiment and see whether it works with
> their current mix of devices and drivers.
>
> I don't think it's currently useful for distros because it's a
> compile-time switch, and distros cannot predict what system configs
> will be used, so I don't think they can enable it.
>
> Does anybody have proposals for making it smarter about distinguishing
> real problems from intentional power management, or maybe interfaces
> drivers could use to tell us when we should ignore bandwidth changes?
>
> Bjorn
>
On Wed, Jan 15, 2020 at 08:44:21PM -0600, Alex G wrote:
> Hi Bjorn,
>
> I'm no longer working on this, so my memory may not be up to speed. If the
> endpoint is causing the bandwidth change, then we should get an _autonomous_
> link management interrupt instead. I don't think we report those, and that
> shouldn't spam the logs
>
> If it's not a (non-autonomous) link management interrupt, then something is
> causing the downstream port to do funny things. I don't think ASPM is
> supposed to be causing this.
>
> Do we know what's causing these swings?
>
> For now, I suggest a boot-time parameter to disable link speed reporting
> instead of a compile time option.
If we add a parameter, it would have to be that reporting is disabled
by default, and the parameter would enable it. That way the default
will not spam the logs or cause more problem reports.
I don't have time to debug this and I don't like boot parameters in
general, so somebody else will have to step up to resolve this.
At the very least, we need a Kconfig update to warn about the
possibility, and we may need to consider reverting this if we don't
have a better solution.
> On 1/15/20 4:10 PM, Bjorn Helgaas wrote:
> > I think we have a problem with link bandwidth change notifications
> > (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
> >
> > Here's a recent bug report where Jan reported "_tons_" of these
> > notifications on an nvme device:
> > https://bugzilla.kernel.org/show_bug.cgi?id=206197
> >
> > There was similar discussion involving GPU drivers at
> > https://lore.kernel.org/r/[email protected]
> >
> > The current solution is the CONFIG_PCIE_BW config option, which
> > disables the messages completely. That option defaults to "off" (no
> > messages), but even so, I think it's a little problematic.
> >
> > Users are not really in a position to figure out whether it's safe to
> > enable. All they can do is experiment and see whether it works with
> > their current mix of devices and drivers.
> >
> > I don't think it's currently useful for distros because it's a
> > compile-time switch, and distros cannot predict what system configs
> > will be used, so I don't think they can enable it.
> >
> > Does anybody have proposals for making it smarter about distinguishing
> > real problems from intentional power management, or maybe interfaces
> > drivers could use to tell us when we should ignore bandwidth changes?
> >
> > Bjorn
> >
[+cc NVMe, GPU driver folks]
On Wed, Jan 15, 2020 at 04:10:08PM -0600, Bjorn Helgaas wrote:
> I think we have a problem with link bandwidth change notifications
> (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
>
> Here's a recent bug report where Jan reported "_tons_" of these
> notifications on an nvme device:
> https://bugzilla.kernel.org/show_bug.cgi?id=206197
>
> There was similar discussion involving GPU drivers at
> https://lore.kernel.org/r/[email protected]
>
> The current solution is the CONFIG_PCIE_BW config option, which
> disables the messages completely. That option defaults to "off" (no
> messages), but even so, I think it's a little problematic.
>
> Users are not really in a position to figure out whether it's safe to
> enable. All they can do is experiment and see whether it works with
> their current mix of devices and drivers.
>
> I don't think it's currently useful for distros because it's a
> compile-time switch, and distros cannot predict what system configs
> will be used, so I don't think they can enable it.
>
> Does anybody have proposals for making it smarter about distinguishing
> real problems from intentional power management, or maybe interfaces
> drivers could use to tell us when we should ignore bandwidth changes?
NVMe, GPU folks, do your drivers or devices change PCIe link
speed/width for power saving or other reasons? When CONFIG_PCIE_BW=y,
the PCI core interprets changes like that as problems that need to be
reported.
If drivers do change link speed/width, can you point me to where
that's done? Would it be feasible to add some sort of PCI core
interface so the driver could say "ignore" or "pay attention to"
subsequent link changes?
Or maybe there would even be a way to move the link change itself into
the PCI core, so the core would be aware of what's going on?
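
Roughly something like this is what I'm imagining -- purely a sketch,
none of these names or fields exist today:

/* Hypothetical interface; the helpers and the pci_dev bitfield are made up. */
#include <linux/pci.h>

/* A driver that knows its device changes link speed/width on purpose could
 * tell the core to stop reporting bandwidth changes on the upstream port: */
static inline void pcie_bw_notification_ignore(struct pci_dev *dev)
{
	struct pci_dev *port = pci_upstream_bridge(dev);

	if (port)
		port->ignore_bw_notification = 1;	/* hypothetical new field */
}

static inline void pcie_bw_notification_heed(struct pci_dev *dev)
{
	struct pci_dev *port = pci_upstream_bridge(dev);

	if (port)
		port->ignore_bw_notification = 0;
}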
Bjorn
On Sun, 19 Jan 2020 20:33:26 -0600
Bjorn Helgaas <[email protected]> wrote:
> [+cc NVMe, GPU driver folks]
>
> On Wed, Jan 15, 2020 at 04:10:08PM -0600, Bjorn Helgaas wrote:
> > I think we have a problem with link bandwidth change notifications
> > (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
> >
> > Here's a recent bug report where Jan reported "_tons_" of these
> > notifications on an nvme device:
> > https://bugzilla.kernel.org/show_bug.cgi?id=206197
> >
> > There was similar discussion involving GPU drivers at
> > https://lore.kernel.org/r/[email protected]
> >
> > The current solution is the CONFIG_PCIE_BW config option, which
> > disables the messages completely. That option defaults to "off" (no
> > messages), but even so, I think it's a little problematic.
> >
> > Users are not really in a position to figure out whether it's safe to
> > enable. All they can do is experiment and see whether it works with
> > their current mix of devices and drivers.
> >
> > I don't think it's currently useful for distros because it's a
> > compile-time switch, and distros cannot predict what system configs
> > will be used, so I don't think they can enable it.
> >
> > Does anybody have proposals for making it smarter about distinguishing
> > real problems from intentional power management, or maybe interfaces
> > drivers could use to tell us when we should ignore bandwidth changes?
>
> NVMe, GPU folks, do your drivers or devices change PCIe link
> speed/width for power saving or other reasons? When CONFIG_PCIE_BW=y,
> the PCI core interprets changes like that as problems that need to be
> reported.
>
> If drivers do change link speed/width, can you point me to where
> that's done? Would it be feasible to add some sort of PCI core
> interface so the driver could say "ignore" or "pay attention to"
> subsequent link changes?
>
> Or maybe there would even be a way to move the link change itself into
> the PCI core, so the core would be aware of what's going on?
One case where we previously saw sporadic link change messages was
vfio-pci-owned devices. If the transitions are based on config space
manipulation, then I can trap those accesses and wrap them in a PCI core
API, but I suspect that's not the exclusive (or potentially even
primary) mechanism for initiating link changes. So I think we'd
probably need a mechanism for a driver to opt out of link notification
for its devices (presumably the fn0 device per link would opt out the
entire link?). Thanks,
Alex
On 1/19/20 8:33 PM, Bjorn Helgaas wrote:
> [+cc NVMe, GPU driver folks]
>
> On Wed, Jan 15, 2020 at 04:10:08PM -0600, Bjorn Helgaas wrote:
>> I think we have a problem with link bandwidth change notifications
>> (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
>>
>> Here's a recent bug report where Jan reported "_tons_" of these
>> notifications on an nvme device:
>> https://bugzilla.kernel.org/show_bug.cgi?id=206197
>>
>> There was similar discussion involving GPU drivers at
>> https://lore.kernel.org/r/[email protected]
>>
>> The current solution is the CONFIG_PCIE_BW config option, which
>> disables the messages completely. That option defaults to "off" (no
>> messages), but even so, I think it's a little problematic.
>>
>> Users are not really in a position to figure out whether it's safe to
>> enable. All they can do is experiment and see whether it works with
>> their current mix of devices and drivers.
>>
>> I don't think it's currently useful for distros because it's a
>> compile-time switch, and distros cannot predict what system configs
>> will be used, so I don't think they can enable it.
>>
>> Does anybody have proposals for making it smarter about distinguishing
>> real problems from intentional power management, or maybe interfaces
>> drivers could use to tell us when we should ignore bandwidth changes?
>
> NVMe, GPU folks, do your drivers or devices change PCIe link
> speed/width for power saving or other reasons? When CONFIG_PCIE_BW=y,
> the PCI core interprets changes like that as problems that need to be
> reported.
>
> If drivers do change link speed/width, can you point me to where
> that's done? Would it be feasible to add some sort of PCI core
> interface so the driver could say "ignore" or "pay attention to"
> subsequent link changes?
>
> Or maybe there would even be a way to move the link change itself into
> the PCI core, so the core would be aware of what's going on?
Funny thing is, I was going to suggest an in-kernel API for this.
* Driver requests lower link speed 'X'
* Link management interrupt fires
* If link speed is at or above 'X' then do not report it.
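A rough sketch of that check (the expected_link_speed field and whatever
helper sets it are hypothetical; the real check would live in the
bandwidth notification interrupt handler):

/* Sketch only: "expected_link_speed" would be a new field that a driver
 * sets, via some new helper, before it requests a lower link speed. */
#include <linux/pci.h>

static bool link_change_is_expected(struct pci_dev *port)
{
	u16 lnksta;

	pcie_capability_read_word(port, PCI_EXP_LNKSTA, &lnksta);

	/* The Current Link Speed encoding is monotonic (1 = 2.5GT/s,
	 * 2 = 5GT/s, 3 = 8GT/s, ...), so a plain compare against the speed
	 * the driver asked for is enough: at or above it, stay quiet. */
	return port->expected_link_speed &&
	       (lnksta & PCI_EXP_LNKSTA_CLS) >= port->expected_link_speed;
}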
I think an "ignore" flag would defeat the purpose of having link
bandwidth reporting in the first place. If some drivers set it, and
others don't, then it would be inconsistent enough to not be useful.
A second suggestion: maybe there is a way to ratelimit these messages on
a per-downstream-port basis.
Alex
On Mo, 2020-01-20 at 10:01 -0600, Alex G. wrote:
>
> On 1/19/20 8:33 PM, Bjorn Helgaas wrote:
> > [+cc NVMe, GPU driver folks]
> >
> > On Wed, Jan 15, 2020 at 04:10:08PM -0600, Bjorn Helgaas wrote:
> > > I think we have a problem with link bandwidth change notifications
> > > (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
> > >
> > > Here's a recent bug report where Jan reported "_tons_" of these
> > > notifications on an nvme device:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=206197
> > >
> > > There was similar discussion involving GPU drivers at
> > > https://lore.kernel.org/r/[email protected]
> > >
> > > The current solution is the CONFIG_PCIE_BW config option, which
> > > disables the messages completely. That option defaults to "off" (no
> > > messages), but even so, I think it's a little problematic.
> > >
> > > Users are not really in a position to figure out whether it's safe to
> > > enable. All they can do is experiment and see whether it works with
> > > their current mix of devices and drivers.
> > >
> > > I don't think it's currently useful for distros because it's a
> > > compile-time switch, and distros cannot predict what system configs
> > > will be used, so I don't think they can enable it.
> > >
> > > Does anybody have proposals for making it smarter about distinguishing
> > > real problems from intentional power management, or maybe interfaces
> > > drivers could use to tell us when we should ignore bandwidth changes?
> >
> > NVMe, GPU folks, do your drivers or devices change PCIe link
> > speed/width for power saving or other reasons? When CONFIG_PCIE_BW=y,
> > the PCI core interprets changes like that as problems that need to be
> > reported.
> >
> > If drivers do change link speed/width, can you point me to where
> > that's done? Would it be feasible to add some sort of PCI core
> > interface so the driver could say "ignore" or "pay attention to"
> > subsequent link changes?
> >
> > Or maybe there would even be a way to move the link change itself into
> > the PCI core, so the core would be aware of what's going on?
>
> Funny thing is, I was going to suggest an in-kernel API for this.
> * Driver requests lower link speed 'X'
> * Link management interrupt fires
> * If link speed is at or above 'X' then do not report it.
> I think an "ignore" flag would defeat the purpose of having link
> bandwidth reporting in the first place. If some drivers set it, and
> others don't, then it would be inconsistent enough to not be useful.
>
> A second suggestion is, if there is a way to ratelimit these messages on
> a per-downstream port basis.
Both AMD and Nvidia GPUs have embedded controllers, which are
responsible for power management. IIRC those controllers can
autonomously initiate PCIe link speed changes depending on measured bus
load. So there is no way for the driver to signal the requested bus
speed to the PCIe core.
I guess for the GPU use case the best we can do is to have the driver
opt-out of the link bandwidth notifications, as the driver knows that
there is some autonomous entity on the endpoint mucking with the link
parameters.
Regards,
Lucas
On 1/21/20 5:10 AM, Lucas Stach wrote:
> On Mo, 2020-01-20 at 10:01 -0600, Alex G. wrote:
>>
>> On 1/19/20 8:33 PM, Bjorn Helgaas wrote:
>>> [+cc NVMe, GPU driver folks]
>>>
>>> On Wed, Jan 15, 2020 at 04:10:08PM -0600, Bjorn Helgaas wrote:
>>>> I think we have a problem with link bandwidth change notifications
>>>> (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
>>>>
>>>> Here's a recent bug report where Jan reported "_tons_" of these
>>>> notifications on an nvme device:
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=206197
>>>>
>>>> There was similar discussion involving GPU drivers at
>>>> https://lore.kernel.org/r/[email protected]
>>>>
>>>> The current solution is the CONFIG_PCIE_BW config option, which
>>>> disables the messages completely. That option defaults to "off" (no
>>>> messages), but even so, I think it's a little problematic.
>>>>
>>>> Users are not really in a position to figure out whether it's safe to
>>>> enable. All they can do is experiment and see whether it works with
>>>> their current mix of devices and drivers.
>>>>
>>>> I don't think it's currently useful for distros because it's a
>>>> compile-time switch, and distros cannot predict what system configs
>>>> will be used, so I don't think they can enable it.
>>>>
>>>> Does anybody have proposals for making it smarter about distinguishing
>>>> real problems from intentional power management, or maybe interfaces
>>>> drivers could use to tell us when we should ignore bandwidth changes?
>>>
>>> NVMe, GPU folks, do your drivers or devices change PCIe link
>>> speed/width for power saving or other reasons? When CONFIG_PCIE_BW=y,
>>> the PCI core interprets changes like that as problems that need to be
>>> reported.
>>>
>>> If drivers do change link speed/width, can you point me to where
>>> that's done? Would it be feasible to add some sort of PCI core
>>> interface so the driver could say "ignore" or "pay attention to"
>>> subsequent link changes?
>>>
>>> Or maybe there would even be a way to move the link change itself into
>>> the PCI core, so the core would be aware of what's going on?
>>
>> Funny thing is, I was going to suggest an in-kernel API for this.
>> * Driver requests lower link speed 'X'
>> * Link management interrupt fires
>> * If link speed is at or above 'X' then do not report it.
>> I think an "ignore" flag would defeat the purpose of having link
>> bandwidth reporting in the first place. If some drivers set it, and
>> others don't, then it would be inconsistent enough to not be useful.
>>
>> A second suggestion is, if there is a way to ratelimit these messages on
>> a per-downstream port basis.
>
> Both AMD and Nvidia GPUs have embedded controllers, which are
> responsible for the power management. IIRC those controllers can
> autonomously initiate PCIe link speed changes depending on measured bus
> load. So there is no way for the driver to signal the requested bus
> speed to the PCIe core.
>
> I guess for the GPU usecase the best we can do is to have the driver
> opt-out of the link bandwidth notifications, as the driver knows that
> there is some autonomous entity on the endpoint mucking with the link
> parameters.
I'm confused. Are you saying that the autonomous mechanism is causing a
link bandwidth notification?
Alex
On Sun, Jan 19, 2020 at 08:33:26PM -0600, Bjorn Helgaas wrote:
> NVMe, GPU folks, do your drivers or devices change PCIe link
> speed/width for power saving or other reasons? When CONFIG_PCIE_BW=y,
> the PCI core interprets changes like that as problems that need to be
> reported.
The NVMe driver doesn't. I don't know of any devices that do it on
their own, but I can't find anything in the spec that would forbid it.
On Tue, 21 Jan 2020 at 21:11, Lucas Stach <[email protected]> wrote:
>
> On Mo, 2020-01-20 at 10:01 -0600, Alex G. wrote:
> >
> > On 1/19/20 8:33 PM, Bjorn Helgaas wrote:
> > > [+cc NVMe, GPU driver folks]
> > >
> > > On Wed, Jan 15, 2020 at 04:10:08PM -0600, Bjorn Helgaas wrote:
> > > > I think we have a problem with link bandwidth change notifications
> > > > (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
> > > >
> > > > Here's a recent bug report where Jan reported "_tons_" of these
> > > > notifications on an nvme device:
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=206197
> > > >
> > > > There was similar discussion involving GPU drivers at
> > > > https://lore.kernel.org/r/[email protected]
> > > >
> > > > The current solution is the CONFIG_PCIE_BW config option, which
> > > > disables the messages completely. That option defaults to "off" (no
> > > > messages), but even so, I think it's a little problematic.
> > > >
> > > > Users are not really in a position to figure out whether it's safe to
> > > > enable. All they can do is experiment and see whether it works with
> > > > their current mix of devices and drivers.
> > > >
> > > > I don't think it's currently useful for distros because it's a
> > > > compile-time switch, and distros cannot predict what system configs
> > > > will be used, so I don't think they can enable it.
> > > >
> > > > Does anybody have proposals for making it smarter about distinguishing
> > > > real problems from intentional power management, or maybe interfaces
> > > > drivers could use to tell us when we should ignore bandwidth changes?
> > >
> > > NVMe, GPU folks, do your drivers or devices change PCIe link
> > > speed/width for power saving or other reasons? When CONFIG_PCIE_BW=y,
> > > the PCI core interprets changes like that as problems that need to be
> > > reported.
> > >
> > > If drivers do change link speed/width, can you point me to where
> > > that's done? Would it be feasible to add some sort of PCI core
> > > interface so the driver could say "ignore" or "pay attention to"
> > > subsequent link changes?
> > >
> > > Or maybe there would even be a way to move the link change itself into
> > > the PCI core, so the core would be aware of what's going on?
> >
> > Funny thing is, I was going to suggest an in-kernel API for this.
> > * Driver requests lower link speed 'X'
> > * Link management interrupt fires
> > * If link speed is at or above 'X' then do not report it.
> > I think an "ignore" flag would defeat the purpose of having link
> > bandwidth reporting in the first place. If some drivers set it, and
> > others don't, then it would be inconsistent enough to not be useful.
> >
> > A second suggestion is, if there is a way to ratelimit these messages on
> > a per-downstream port basis.
>
> Both AMD and Nvidia GPUs have embedded controllers, which are
> responsible for the power management. IIRC those controllers can
> autonomously initiate PCIe link speed changes depending on measured bus
> load. So there is no way for the driver to signal the requested bus
> speed to the PCIe core.
>
> I guess for the GPU usecase the best we can do is to have the driver
> opt-out of the link bandwidth notifications, as the driver knows that
> there is some autonomous entity on the endpoint mucking with the link
> parameters.
>
Adding Alex and Ben for AMD and NVIDIA info
Dave.
On Mon, 3 Feb 2020 at 11:56, Dave Airlie <[email protected]> wrote:
>
> On Tue, 21 Jan 2020 at 21:11, Lucas Stach <[email protected]> wrote:
> >
> > On Mo, 2020-01-20 at 10:01 -0600, Alex G. wrote:
> > >
> > > On 1/19/20 8:33 PM, Bjorn Helgaas wrote:
> > > > [+cc NVMe, GPU driver folks]
> > > >
> > > > On Wed, Jan 15, 2020 at 04:10:08PM -0600, Bjorn Helgaas wrote:
> > > > > I think we have a problem with link bandwidth change notifications
> > > > > (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
> > > > >
> > > > > Here's a recent bug report where Jan reported "_tons_" of these
> > > > > notifications on an nvme device:
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206197
> > > > >
> > > > > There was similar discussion involving GPU drivers at
> > > > > https://lore.kernel.org/r/[email protected]
> > > > >
> > > > > The current solution is the CONFIG_PCIE_BW config option, which
> > > > > disables the messages completely. That option defaults to "off" (no
> > > > > messages), but even so, I think it's a little problematic.
> > > > >
> > > > > Users are not really in a position to figure out whether it's safe to
> > > > > enable. All they can do is experiment and see whether it works with
> > > > > their current mix of devices and drivers.
> > > > >
> > > > > I don't think it's currently useful for distros because it's a
> > > > > compile-time switch, and distros cannot predict what system configs
> > > > > will be used, so I don't think they can enable it.
> > > > >
> > > > > Does anybody have proposals for making it smarter about distinguishing
> > > > > real problems from intentional power management, or maybe interfaces
> > > > > drivers could use to tell us when we should ignore bandwidth changes?
> > > >
> > > > NVMe, GPU folks, do your drivers or devices change PCIe link
> > > > speed/width for power saving or other reasons? When CONFIG_PCIE_BW=y,
> > > > the PCI core interprets changes like that as problems that need to be
> > > > reported.
> > > >
> > > > If drivers do change link speed/width, can you point me to where
> > > > that's done? Would it be feasible to add some sort of PCI core
> > > > interface so the driver could say "ignore" or "pay attention to"
> > > > subsequent link changes?
> > > >
> > > > Or maybe there would even be a way to move the link change itself into
> > > > the PCI core, so the core would be aware of what's going on?
> > >
> > > Funny thing is, I was going to suggest an in-kernel API for this.
> > > * Driver requests lower link speed 'X'
> > > * Link management interrupt fires
> > > * If link speed is at or above 'X' then do not report it.
> > > I think an "ignore" flag would defeat the purpose of having link
> > > bandwidth reporting in the first place. If some drivers set it, and
> > > others don't, then it would be inconsistent enough to not be useful.
> > >
> > > A second suggestion is, if there is a way to ratelimit these messages on
> > > a per-downstream port basis.
> >
> > Both AMD and Nvidia GPUs have embedded controllers, which are
> > responsible for the power management. IIRC those controllers can
> > autonomously initiate PCIe link speed changes depending on measured bus
> > load. So there is no way for the driver to signal the requested bus
> > speed to the PCIe core.
> >
> > I guess for the GPU usecase the best we can do is to have the driver
> > opt-out of the link bandwidth notifications, as the driver knows that
> > there is some autonomous entity on the endpoint mucking with the link
> > parameters.
> >
>
> Adding Alex and Ben for AMD and NVIDIA info (and Karol).
>
> Dave.
On Mon, 3 Feb 2020 at 12:04, Dave Airlie <[email protected]> wrote:
>
> On Mon, 3 Feb 2020 at 11:56, Dave Airlie <[email protected]> wrote:
> >
> > On Tue, 21 Jan 2020 at 21:11, Lucas Stach <[email protected]> wrote:
> > >
> > > On Mo, 2020-01-20 at 10:01 -0600, Alex G. wrote:
> > > >
> > > > On 1/19/20 8:33 PM, Bjorn Helgaas wrote:
> > > > > [+cc NVMe, GPU driver folks]
> > > > >
> > > > > On Wed, Jan 15, 2020 at 04:10:08PM -0600, Bjorn Helgaas wrote:
> > > > > > I think we have a problem with link bandwidth change notifications
> > > > > > (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
> > > > > >
> > > > > > Here's a recent bug report where Jan reported "_tons_" of these
> > > > > > notifications on an nvme device:
> > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206197
> > > > > >
> > > > > > There was similar discussion involving GPU drivers at
> > > > > > https://lore.kernel.org/r/[email protected]
> > > > > >
> > > > > > The current solution is the CONFIG_PCIE_BW config option, which
> > > > > > disables the messages completely. That option defaults to "off" (no
> > > > > > messages), but even so, I think it's a little problematic.
> > > > > >
> > > > > > Users are not really in a position to figure out whether it's safe to
> > > > > > enable. All they can do is experiment and see whether it works with
> > > > > > their current mix of devices and drivers.
> > > > > >
> > > > > > I don't think it's currently useful for distros because it's a
> > > > > > compile-time switch, and distros cannot predict what system configs
> > > > > > will be used, so I don't think they can enable it.
> > > > > >
> > > > > > Does anybody have proposals for making it smarter about distinguishing
> > > > > > real problems from intentional power management, or maybe interfaces
> > > > > > drivers could use to tell us when we should ignore bandwidth changes?
> > > > >
> > > > > NVMe, GPU folks, do your drivers or devices change PCIe link
> > > > > speed/width for power saving or other reasons? When CONFIG_PCIE_BW=y,
> > > > > the PCI core interprets changes like that as problems that need to be
> > > > > reported.
> > > > >
> > > > > If drivers do change link speed/width, can you point me to where
> > > > > that's done? Would it be feasible to add some sort of PCI core
> > > > > interface so the driver could say "ignore" or "pay attention to"
> > > > > subsequent link changes?
> > > > >
> > > > > Or maybe there would even be a way to move the link change itself into
> > > > > the PCI core, so the core would be aware of what's going on?
> > > >
> > > > Funny thing is, I was going to suggest an in-kernel API for this.
> > > > * Driver requests lower link speed 'X'
> > > > * Link management interrupt fires
> > > > * If link speed is at or above 'X' then do not report it.
> > > > I think an "ignore" flag would defeat the purpose of having link
> > > > bandwidth reporting in the first place. If some drivers set it, and
> > > > others don't, then it would be inconsistent enough to not be useful.
> > > >
> > > > A second suggestion is, if there is a way to ratelimit these messages on
> > > > a per-downstream port basis.
> > >
> > > Both AMD and Nvidia GPUs have embedded controllers, which are
> > > responsible for the power management. IIRC those controllers can
> > > autonomously initiate PCIe link speed changes depending on measured bus
> > > load. So there is no way for the driver to signal the requested bus
> > > speed to the PCIe core.
> > >
> > > I guess for the GPU usecase the best we can do is to have the driver
> > > opt-out of the link bandwidth notifications, as the driver knows that
> > > there is some autonomous entity on the endpoint mucking with the link
> > > parameters.
> > >
> >
> > Adding Alex and Ben for AMD and NVIDIA info (and Karol).
We don't attempt link speed changes by default in Nouveau currently;
however, I believe we will (if instructed by the VBIOS perf tables) if
you enable clock frequency changes. We do this from the driver code
currently, but I'm not 100% sure what this will look like if/when we
receive PMU firmware. More and more autonomy is being pushed into
there by NVIDIA with each generation; it's possible that even happens
on the Volta/Turing PMU already, but it's not an area I've looked into
recently.
Ben.
> >
> > Dave.
On Sun, Feb 2, 2020 at 9:04 PM Dave Airlie <[email protected]> wrote:
>
> On Mon, 3 Feb 2020 at 11:56, Dave Airlie <[email protected]> wrote:
> >
> > On Tue, 21 Jan 2020 at 21:11, Lucas Stach <[email protected]> wrote:
> > >
> > > On Mo, 2020-01-20 at 10:01 -0600, Alex G. wrote:
> > > >
> > > > On 1/19/20 8:33 PM, Bjorn Helgaas wrote:
> > > > > [+cc NVMe, GPU driver folks]
> > > > >
> > > > > On Wed, Jan 15, 2020 at 04:10:08PM -0600, Bjorn Helgaas wrote:
> > > > > > I think we have a problem with link bandwidth change notifications
> > > > > > (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
> > > > > >
> > > > > > Here's a recent bug report where Jan reported "_tons_" of these
> > > > > > notifications on an nvme device:
> > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=206197
> > > > > >
> > > > > > There was similar discussion involving GPU drivers at
> > > > > > https://lore.kernel.org/r/[email protected]
> > > > > >
> > > > > > The current solution is the CONFIG_PCIE_BW config option, which
> > > > > > disables the messages completely. That option defaults to "off" (no
> > > > > > messages), but even so, I think it's a little problematic.
> > > > > >
> > > > > > Users are not really in a position to figure out whether it's safe to
> > > > > > enable. All they can do is experiment and see whether it works with
> > > > > > their current mix of devices and drivers.
> > > > > >
> > > > > > I don't think it's currently useful for distros because it's a
> > > > > > compile-time switch, and distros cannot predict what system configs
> > > > > > will be used, so I don't think they can enable it.
> > > > > >
> > > > > > Does anybody have proposals for making it smarter about distinguishing
> > > > > > real problems from intentional power management, or maybe interfaces
> > > > > > drivers could use to tell us when we should ignore bandwidth changes?
> > > > >
> > > > > NVMe, GPU folks, do your drivers or devices change PCIe link
> > > > > speed/width for power saving or other reasons? When CONFIG_PCIE_BW=y,
> > > > > the PCI core interprets changes like that as problems that need to be
> > > > > reported.
> > > > >
> > > > > If drivers do change link speed/width, can you point me to where
> > > > > that's done? Would it be feasible to add some sort of PCI core
> > > > > interface so the driver could say "ignore" or "pay attention to"
> > > > > subsequent link changes?
> > > > >
> > > > > Or maybe there would even be a way to move the link change itself into
> > > > > the PCI core, so the core would be aware of what's going on?
> > > >
> > > > Funny thing is, I was going to suggest an in-kernel API for this.
> > > > * Driver requests lower link speed 'X'
> > > > * Link management interrupt fires
> > > > * If link speed is at or above 'X' then do not report it.
> > > > I think an "ignore" flag would defeat the purpose of having link
> > > > bandwidth reporting in the first place. If some drivers set it, and
> > > > others don't, then it would be inconsistent enough to not be useful.
> > > >
> > > > A second suggestion is, if there is a way to ratelimit these messages on
> > > > a per-downstream port basis.
> > >
> > > Both AMD and Nvidia GPUs have embedded controllers, which are
> > > responsible for the power management. IIRC those controllers can
> > > autonomously initiate PCIe link speed changes depending on measured bus
> > > load. So there is no way for the driver to signal the requested bus
> > > speed to the PCIe core.
> > >
> > > I guess for the GPU usecase the best we can do is to have the driver
> > > opt-out of the link bandwidth notifications, as the driver knows that
> > > there is some autonomous entity on the endpoint mucking with the link
> > > parameters.
> > >
> >
> > Adding Alex and Ben for AMD and NVIDIA info (and Karol).
AMD has had a micro-controller on the GPU handling PCIe link speeds
and widths dynamically (in addition to GPU clocks and voltages) for
about 12 years or so at this point, to save power when the GPU is idle
and improve performance when it's required. The micro-controller
changes the link parameters dynamically based on load, independent of
the driver. The driver can tweak the heuristics, or even disable the
dynamic changes, but by default it's enabled when the driver loads.
The ucode for this micro-controller is loaded by the driver, so you'll
see fixed clocks and widths prior to the driver loading. We'd need
some sort of opt-out, I suppose, for periods when the driver has enabled
dynamic PCIe power management in the micro-controller.
Alex
On Mon, Feb 03, 2020 at 04:16:36PM -0500, Alex Deucher wrote:
> AMD has had a micro-controller on the GPU handling pcie link speeds
> and widths dynamically (in addition to GPU clocks and voltages) for
> about 12 years or so at this point to save power when the GPU is idle
> and improve performance when it's required. The micro-controller
> changes the link parameters dynamically based on load independent of
> the driver. The driver can tweak the heuristics, or even disable the
> dynamic changes, but by default it's enabled when the driver loads.
> The ucode for this micro-controller is loaded by the driver so you'll
> see fixed clocks and widths prior to the driver loading. We'd need
> some sort of opt out I suppose for periods when the driver has enabled
> dynamic pcie power management in the micro-controller.
Note that there are *two* bits in the Link Status Register:
* Link Autonomous Bandwidth Status
"This bit is Set by hardware to indicate that hardware has
autonomously changed Link speed or width, without the Port
transitioning through DL_Down status, for reasons other than to
attempt to correct unreliable Link operation. This bit must be set if
the Physical Layer reports a speed or width change was initiated by
the Downstream component that was indicated as an autonomous change."
* Link Bandwidth Management Status
"This bit is Set by hardware to indicate that either of the
following has occurred without the Port transitioning through
DL_Down status. [...] Hardware has changed Link speed or width to
attempt to correct unreliable Link operation, either through an
LTSSM timeout or a higher level process."
See PCIe Base Spec 4.0 sec 7.8.8, 7.8.7, 4.2.6.3.3.1.
The two bits generate *separate* interrupts. We only enable the
interrupt for the latter.
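For reference, these are the corresponding bits as Linux spells them
(values from include/uapi/linux/pci_regs.h); if I remember the code
right, bw_notification.c sets only LBMIE and acks only LBMS:

/* Link Control: interrupt enables */
#define PCI_EXP_LNKCTL_LBMIE	0x0400	/* Link Bandwidth Management Int. Enable */
#define PCI_EXP_LNKCTL_LABIE	0x0800	/* Link Autonomous Bandwidth Int. Enable */
/* Link Status: the two status bits quoted above */
#define PCI_EXP_LNKSTA_LBMS	0x4000	/* Link Bandwidth Management Status */
#define PCI_EXP_LNKSTA_LABS	0x8000	/* Link Autonomous Bandwidth Status */

/* roughly what the service driver does when it enables notification: */
pcie_capability_set_word(port, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LBMIE);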
If AMD GPUs generate a Link Bandwidth Management Interrupt upon
autonomously changing bandwidth for power management reasons
(instead of to correct unreliability issues), that would be a
spec violation.
So the question is, do your GPUs violate the spec in this regard?
Thanks,
Lukas
On Mon, Feb 3, 2020 at 11:38 PM Lukas Wunner <[email protected]> wrote:
>
> On Mon, Feb 03, 2020 at 04:16:36PM -0500, Alex Deucher wrote:
> > AMD has had a micro-controller on the GPU handling pcie link speeds
> > and widths dynamically (in addition to GPU clocks and voltages) for
> > about 12 years or so at this point to save power when the GPU is idle
> > and improve performance when it's required. The micro-controller
> > changes the link parameters dynamically based on load independent of
> > the driver. The driver can tweak the heuristics, or even disable the
> > dynamic changes, but by default it's enabled when the driver loads.
> > The ucode for this micro-controller is loaded by the driver so you'll
> > see fixed clocks and widths prior to the driver loading. We'd need
> > some sort of opt out I suppose for periods when the driver has enabled
> > dynamic pcie power management in the micro-controller.
>
> Note that there are *two* bits in the Link Status Register:
>
> * Link Autonomous Bandwidth Status
> "This bit is Set by hardware to indicate that hardware has
> autonomously changed Link speed or width, without the Port
> transitioning through DL_Down status, for reasons other than to
> attempt to correct unreliable Link operation. This bit must be set if
> the Physical Layer reports a speed or width change was initiated by
> the Downstream component that was indicated as an autonomous change."
>
> * Link Bandwidth Management Status
> "This bit is Set by hardware to indicate that either of the
> following has occurred without the Port transitioning through
> DL_Down status. [...] Hardware has changed Link speed or width to
> attempt to correct unreliable Link operation, either through an
> LTSSM timeout or a higher level process."
>
> See PCIe Base Spec 4.0 sec 7.8.8, 7.8.7, 4.2.6.3.3.1.
>
> The two bits generate *separate* interrupts. We only enable the
> interrupt for the latter.
>
> If AMD GPUs generate a Link Bandwidth Management Interrupt upon
> autonomously changing bandwidth for power management reasons
> (instead of to correct unreliability issues), that would be a
> spec violation.
>
> So the question is, do your GPUs violate the spec in this regard?
I don't know offhand. I can ask the firmware team. That said, I
haven't seen any reports of problems with our GPUs with this code
enabled.
Alex
[+cc Christoph, Lucas, Dave, Ben, Alex, Myron]
On Wed, Jan 15, 2020 at 04:10:08PM -0600, Bjorn Helgaas wrote:
> I think we have a problem with link bandwidth change notifications
> (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
>
> Here's a recent bug report where Jan reported "_tons_" of these
> notifications on an nvme device:
> https://bugzilla.kernel.org/show_bug.cgi?id=206197
AFAICT, this thread petered out with no resolution.
If the bandwidth change notifications are important to somebody,
please speak up, preferably with a patch that makes the notifications
disabled by default and adds a parameter to enable them (or some other
strategy that makes sense).
I think these are potentially useful, so I don't really want to just
revert them, but if nobody thinks these are important enough to fix,
that's a possibility.
> There was similar discussion involving GPU drivers at
> https://lore.kernel.org/r/[email protected]
>
> The current solution is the CONFIG_PCIE_BW config option, which
> disables the messages completely. That option defaults to "off" (no
> messages), but even so, I think it's a little problematic.
>
> Users are not really in a position to figure out whether it's safe to
> enable. All they can do is experiment and see whether it works with
> their current mix of devices and drivers.
>
> I don't think it's currently useful for distros because it's a
> compile-time switch, and distros cannot predict what system configs
> will be used, so I don't think they can enable it.
>
> Does anybody have proposals for making it smarter about distinguishing
> real problems from intentional power management, or maybe interfaces
> drivers could use to tell us when we should ignore bandwidth changes?
>
> Bjorn
[+cc Atanas -- thank you very much for the bug report!]
On Sat, Feb 22, 2020 at 10:58:40AM -0600, Bjorn Helgaas wrote:
> On Wed, Jan 15, 2020 at 04:10:08PM -0600, Bjorn Helgaas wrote:
> > I think we have a problem with link bandwidth change notifications
> > (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
> >
> > Here's a recent bug report where Jan reported "_tons_" of these
> > notifications on an nvme device:
> > https://bugzilla.kernel.org/show_bug.cgi?id=206197
>
> AFAICT, this thread petered out with no resolution.
>
> If the bandwidth change notifications are important to somebody,
> please speak up, preferably with a patch that makes the notifications
> disabled by default and adds a parameter to enable them (or some other
> strategy that makes sense).
>
> I think these are potentially useful, so I don't really want to just
> revert them, but if nobody thinks these are important enough to fix,
> that's a possibility.
Atanas is also seeing this problem and went to the trouble of digging
up this bug report:
https://bugzilla.kernel.org/show_bug.cgi?id=206197#c8
I'm actually a little surprised that we haven't seen more reports of
this. I don't think distros enable CONFIG_PCIE_BW, but even so, I
would think more people running upstream kernels would trip over it.
But maybe people just haven't turned CONFIG_PCIE_BW on.
I don't have a suggestion; just adding Atanas to this old thread.
> > There was similar discussion involving GPU drivers at
> > https://lore.kernel.org/r/[email protected]
> >
> > The current solution is the CONFIG_PCIE_BW config option, which
> > disables the messages completely. That option defaults to "off" (no
> > messages), but even so, I think it's a little problematic.
> >
> > Users are not really in a position to figure out whether it's safe to
> > enable. All they can do is experiment and see whether it works with
> > their current mix of devices and drivers.
> >
> > I don't think it's currently useful for distros because it's a
> > compile-time switch, and distros cannot predict what system configs
> > will be used, so I don't think they can enable it.
> >
> > Does anybody have proposals for making it smarter about distinguishing
> > real problems from intentional power management, or maybe interfaces
> > drivers could use to tell us when we should ignore bandwidth changes?
> >
> > Bjorn
On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:
> AFAICT, this thread petered out with no resolution.
>
> If the bandwidth change notifications are important to somebody,
> please speak up, preferably with a patch that makes the notifications
> disabled by default and adds a parameter to enable them (or some other
> strategy that makes sense).
>
> I think these are potentially useful, so I don't really want to just
> revert them, but if nobody thinks these are important enough to fix,
> that's a possibility.
Hide it behind a debug or expert option by default? Or even mark it as
BROKEN until someone fixes it?
On 1/28/21 5:51 PM, Sinan Kaya wrote:
> On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:
>> AFAICT, this thread petered out with no resolution.
>>
>> If the bandwidth change notifications are important to somebody,
>> please speak up, preferably with a patch that makes the notifications
>> disabled by default and adds a parameter to enable them (or some other
>> strategy that makes sense).
>>
>> I think these are potentially useful, so I don't really want to just
>> revert them, but if nobody thinks these are important enough to fix,
>> that's a possibility.
>
> Hide behind debug or expert option by default? or even mark it as BROKEN
> until someone fixes it?
>
Instead of making it a config option, wouldn't it be better as a kernel
parameter? People encountering this seem quite competent in passing
kernel arguments, so having a "pcie_bw_notification=off" would solve
their problems.
As far as marking this as broken goes, I've seen no conclusive evidence
to tell whether it's a software bug or an actual hardware problem. Could
we have a sysfs knob to disable this on a per-downstream-port basis?
e.g.
echo 0 > /sys/bus/pci/devices/0000:00:04.0/bw_notification_enabled
This probably won't be ideal if there are many devices downtraining
their links ad hoc. At worst we'd have a way to silence those messages
if we do encounter such devices.
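
A minimal sketch of what that knob could look like (the attribute name
and the per-port flag are hypothetical, and it would need to be
registered for downstream ports only):

/* Sketch only: bw_notification_enabled and the disabled flag don't exist. */
#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/pci.h>

static ssize_t bw_notification_enabled_store(struct device *dev,
					     struct device_attribute *attr,
					     const char *buf, size_t count)
{
	struct pci_dev *port = to_pci_dev(dev);
	bool enable;

	if (kstrtobool(buf, &enable))
		return -EINVAL;

	port->bw_notification_disabled = !enable;	/* hypothetical field */
	return count;
}
static DEVICE_ATTR_WO(bw_notification_enabled);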
Alex
On Thu, Jan 28, 2021 at 6:39 PM Bjorn Helgaas <[email protected]> wrote:
>
> [+cc Atanas -- thank you very much for the bug report!]
>
> On Sat, Feb 22, 2020 at 10:58:40AM -0600, Bjorn Helgaas wrote:
> > On Wed, Jan 15, 2020 at 04:10:08PM -0600, Bjorn Helgaas wrote:
> > > I think we have a problem with link bandwidth change notifications
> > > (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
> > >
> > > Here's a recent bug report where Jan reported "_tons_" of these
> > > notifications on an nvme device:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=206197
> >
> > AFAICT, this thread petered out with no resolution.
> >
> > If the bandwidth change notifications are important to somebody,
> > please speak up, preferably with a patch that makes the notifications
> > disabled by default and adds a parameter to enable them (or some other
> > strategy that makes sense).
> >
> > I think these are potentially useful, so I don't really want to just
> > revert them, but if nobody thinks these are important enough to fix,
> > that's a possibility.
>
> Atanas is also seeing this problem and went to the trouble of digging
> up this bug report:
> https://bugzilla.kernel.org/show_bug.cgi?id=206197#c8
>
> I'm actually a little surprised that we haven't seen more reports of
> this. I don't think distros enable CONFIG_PCIE_BW, but even so, I
> would think more people running upstream kernels would trip over it.
> But maybe people just haven't turned CONFIG_PCIE_BW on.
>
> I don't have a suggestion; just adding Atanas to this old thread.
>
> > > There was similar discussion involving GPU drivers at
> > > https://lore.kernel.org/r/[email protected]
> > >
> > > The current solution is the CONFIG_PCIE_BW config option, which
> > > disables the messages completely. That option defaults to "off" (no
> > > messages), but even so, I think it's a little problematic.
> > >
> > > Users are not really in a position to figure out whether it's safe to
> > > enable. All they can do is experiment and see whether it works with
> > > their current mix of devices and drivers.
> > >
> > > I don't think it's currently useful for distros because it's a
> > > compile-time switch, and distros cannot predict what system configs
> > > will be used, so I don't think they can enable it.
> > >
> > > Does anybody have proposals for making it smarter about distinguishing
> > > real problems from intentional power management, or maybe interfaces
> > > drivers could use to tell us when we should ignore bandwidth changes?
There's also this recently filed bug:
https://gitlab.freedesktop.org/drm/amd/-/issues/1447
The root cause of it appears to be related to ASPM.
Alex
On Thu, Jan 28, 2021 at 06:07:36PM -0600, Alex G. wrote:
> On 1/28/21 5:51 PM, Sinan Kaya wrote:
> > On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:
> > > AFAICT, this thread petered out with no resolution.
> > >
> > > If the bandwidth change notifications are important to somebody,
> > > please speak up, preferably with a patch that makes the notifications
> > > disabled by default and adds a parameter to enable them (or some other
> > > strategy that makes sense).
> > >
> > > I think these are potentially useful, so I don't really want to just
> > > revert them, but if nobody thinks these are important enough to fix,
> > > that's a possibility.
> >
> > Hide behind debug or expert option by default? or even mark it as BROKEN
> > until someone fixes it?
> >
> Instead of making it a config option, wouldn't it be better as a kernel
> parameter? People encountering this seem quite competent in passing kernel
> arguments, so having a "pcie_bw_notification=off" would solve their
> problems.
I don't want people to have to discover a parameter to solve issues.
If there's a parameter, notification should default to off, and people
who want notification should supply a parameter to enable it. Same
thing for the sysfs idea.
I think we really just need to figure out what's going on. Then it
should be clearer how to handle it. I'm not really in a position to
debug the root cause since I don't have the hardware or the time. If
nobody can figure out what's going on, I think we'll have to make it
disabled by default.
> As far as marking this as broken, I've seen no conclusive evidence of to
> tell if its a sw bug or actual hardware problem. Could we have a sysfs to
> disable this on a per-downstream-port basis?
>
> e.g.
> echo 0 > /sys/bus/pci/devices/0000:00:04.0/bw_notification_enabled
>
> This probably won't be ideal if there are many devices downtraining their
> links ad-hoc. At worst we'd have a way to silence those messages if we do
> encounter such devices.
>
> Alex
On 1/29/21 3:56 PM, Bjorn Helgaas wrote:
> On Thu, Jan 28, 2021 at 06:07:36PM -0600, Alex G. wrote:
>> On 1/28/21 5:51 PM, Sinan Kaya wrote:
>>> On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:
>>>> AFAICT, this thread petered out with no resolution.
>>>>
>>>> If the bandwidth change notifications are important to somebody,
>>>> please speak up, preferably with a patch that makes the notifications
>>>> disabled by default and adds a parameter to enable them (or some other
>>>> strategy that makes sense).
>>>>
>>>> I think these are potentially useful, so I don't really want to just
>>>> revert them, but if nobody thinks these are important enough to fix,
>>>> that's a possibility.
>>>
>>> Hide behind debug or expert option by default? or even mark it as BROKEN
>>> until someone fixes it?
>>>
>> Instead of making it a config option, wouldn't it be better as a kernel
>> parameter? People encountering this seem quite competent in passing kernel
>> arguments, so having a "pcie_bw_notification=off" would solve their
>> problems.
>
> I don't want people to have to discover a parameter to solve issues.
> If there's a parameter, notification should default to off, and people
> who want notification should supply a parameter to enable it. Same
> thing for the sysfs idea.
I can imagine cases where a per-port flag would be useful. For example,
consider a machine with a NIC and a couple of PCIe storage drives. In
this example, the PCIe drives downtrain willy-nilly, so it's useful to
turn off their notifications, but the NIC absolutely must not downtrain.
It's debatable whether it should be default on or default off.
> I think we really just need to figure out what's going on. Then it
> should be clearer how to handle it. I'm not really in a position to
> debug the root cause since I don't have the hardware or the time.
I wonder
(a) if some PCIe devices are downtraining willy-nilly to save power
(b) if this willy-nilly downtraining somehow violates the PCIe spec
(c) what the official behavior is when downtraining is intentional
My theory is: YES, YES, ASPM. But I don't know how to figure this out
without having the problem hardware in hand.
> If nobody can figure out what's going on, I think we'll have to make it
> disabled by default.
I think most distros do "CONFIG_PCIE_BW is not set". Is that not true?
Alex
On 2/2/21 2:16 PM, Bjorn Helgaas wrote:
> On Tue, Feb 02, 2021 at 01:50:20PM -0600, Alex G. wrote:
>> On 1/29/21 3:56 PM, Bjorn Helgaas wrote:
>>> On Thu, Jan 28, 2021 at 06:07:36PM -0600, Alex G. wrote:
>>>> On 1/28/21 5:51 PM, Sinan Kaya wrote:
>>>>> On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:
>>>>>> AFAICT, this thread petered out with no resolution.
>>>>>>
>>>>>> If the bandwidth change notifications are important to somebody,
>>>>>> please speak up, preferably with a patch that makes the notifications
>>>>>> disabled by default and adds a parameter to enable them (or some other
>>>>>> strategy that makes sense).
>>>>>>
>>>>>> I think these are potentially useful, so I don't really want to just
>>>>>> revert them, but if nobody thinks these are important enough to fix,
>>>>>> that's a possibility.
>>>>>
>>>>> Hide behind debug or expert option by default? or even mark it as BROKEN
>>>>> until someone fixes it?
>>>>>
>>>> Instead of making it a config option, wouldn't it be better as a kernel
>>>> parameter? People encountering this seem quite competent in passing kernel
>>>> arguments, so having a "pcie_bw_notification=off" would solve their
>>>> problems.
>>>
>>> I don't want people to have to discover a parameter to solve issues.
>>> If there's a parameter, notification should default to off, and people
>>> who want notification should supply a parameter to enable it. Same
>>> thing for the sysfs idea.
>>
>> I can imagine cases where a per-port flag would be useful. For example, a
>> machine with a NIC and a couple of PCIe storage drives. In this example, the
>> PCIe drives downtrain willie-nillie, so it's useful to turn off their
>> notifications, but the NIC absolutely must not downtrain. It's debatable
>> whether it should be default on or default off.
>>
>>> I think we really just need to figure out what's going on. Then it
>>> should be clearer how to handle it. I'm not really in a position to
>>> debug the root cause since I don't have the hardware or the time.
>>
>> I wonder
>> (a) if some PCIe devices are downtraining willie-nillie to save power
>> (b) if this willie-nillie downtraining somehow violates PCIe spec
>> (c) what is the official behavior when downtraining is intentional
>>
>> My theory is: YES, YES, ASPM. But I don't know how to figure this out
>> without having the problem hardware in hand.
>>
>>> If nobody can figure out what's going on, I think we'll have to make it
>>> disabled by default.
>>
>> I think most distros do "CONFIG_PCIE_BW is not set". Is that not true?
>
> I think it *is* true that distros do not enable CONFIG_PCIE_BW.
>
> But it's perfectly reasonable for people building their own kernels to
> enable it. It should be safe to enable all config options. If they
> do enable CONFIG_PCIE_BW, I don't want them to waste time debugging
> messages they don't expect.
>
> If we understood why these happen and could filter out the expected
> ones, that would be great. But we don't. We've already wasted quite
> a bit of Jan's and Atanas' time, and no doubt others who haven't
> bothered to file bug reports.
>
> So I think I'll queue up a patch to remove the functionality for now.
> It's easily restored if somebody debugs the problem or adds a
> command-line switch or something.
I think it's best we make it a module (or kernel) parameter, default=off
for the time being.
Alex
On Tue, Feb 02, 2021 at 01:50:20PM -0600, Alex G. wrote:
> On 1/29/21 3:56 PM, Bjorn Helgaas wrote:
> > On Thu, Jan 28, 2021 at 06:07:36PM -0600, Alex G. wrote:
> > > On 1/28/21 5:51 PM, Sinan Kaya wrote:
> > > > On 1/28/2021 6:39 PM, Bjorn Helgaas wrote:
> > > > > AFAICT, this thread petered out with no resolution.
> > > > >
> > > > > If the bandwidth change notifications are important to somebody,
> > > > > please speak up, preferably with a patch that makes the notifications
> > > > > disabled by default and adds a parameter to enable them (or some other
> > > > > strategy that makes sense).
> > > > >
> > > > > I think these are potentially useful, so I don't really want to just
> > > > > revert them, but if nobody thinks these are important enough to fix,
> > > > > that's a possibility.
> > > >
> > > > Hide behind debug or expert option by default? or even mark it as BROKEN
> > > > until someone fixes it?
> > > >
> > > Instead of making it a config option, wouldn't it be better as a kernel
> > > parameter? People encountering this seem quite competent in passing kernel
> > > arguments, so having a "pcie_bw_notification=off" would solve their
> > > problems.
> >
> > I don't want people to have to discover a parameter to solve issues.
> > If there's a parameter, notification should default to off, and people
> > who want notification should supply a parameter to enable it. Same
> > thing for the sysfs idea.
>
> I can imagine cases where a per-port flag would be useful. For example, a
> machine with a NIC and a couple of PCIe storage drives. In this example, the
> PCIe drives downtrain willie-nillie, so it's useful to turn off their
> notifications, but the NIC absolutely must not downtrain. It's debatable
> whether it should be default on or default off.
>
> > I think we really just need to figure out what's going on. Then it
> > should be clearer how to handle it. I'm not really in a position to
> > debug the root cause since I don't have the hardware or the time.
>
> I wonder
> (a) if some PCIe devices are downtraining willie-nillie to save power
> (b) if this willie-nillie downtraining somehow violates PCIe spec
> (c) what is the official behavior when downtraining is intentional
>
> My theory is: YES, YES, ASPM. But I don't know how to figure this out
> without having the problem hardware in hand.
>
> > If nobody can figure out what's going on, I think we'll have to make it
> > disabled by default.
>
> I think most distros do "CONFIG_PCIE_BW is not set". Is that not true?
I think it *is* true that distros do not enable CONFIG_PCIE_BW.
But it's perfectly reasonable for people building their own kernels to
enable it. It should be safe to enable all config options. If they
do enable CONFIG_PCIE_BW, I don't want them to waste time debugging
messages they don't expect.
If we understood why these happen and could filter out the expected
ones, that would be great. But we don't. We've already wasted quite
a bit of Jan's and Atanas' time, and no doubt others who haven't
bothered to file bug reports.
So I think I'll queue up a patch to remove the functionality for now.
It's easily restored if somebody debugs the problem or adds a
command-line switch or something.
Bjorn