Dear Linux folks,
On the Lenovo X60t, during resume from ACPI suspend and during shutdown,
the message below is shown in the logs.
pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038
(issued 65284 msec ago)
If I am not mistaken, that device is the integrated Ethernet controller,
which is not hot pluggable.
$ lspci -t
-[0000:00]-+-00.0
+-02.0
+-02.1
+-1b.0
+-1c.0-[01]----00.0
+-1c.1-[02]----00.0
+-1c.2-[03]--
+-1c.3-[04]--
+-1d.0
+-1d.1
+-1d.2
+-1d.3
+-1d.7
+-1e.0-[05-06]--+-00.0
| +-00.1
| \-00.2
+-1f.0
+-1f.1
+-1f.2
\-1f.3
```
Here is the device.
01:00.0 Ethernet controller [0200]: Intel Corporation 82573L
Gigabit Ethernet Controller [8086:109a]
Please find the messages attached.
Might it be related to the connector to the docking station/port
replicator. As far as I remember, connecting the device to the port
replicator, only the Ethernet port of the port replicator works.
Kind regards,
Paul
[+cc Lukas, Sinan]
Hi Paul,
Thanks for the report!
On Thu, Apr 26, 2018 at 12:17:53PM +0200, Paul Menzel wrote:
> Dear Linux folks,
>
>
> On the Lenovo X60t, during resume from ACPI suspend and during shutdown, the
> message below is shown in the logs.
>
> pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued
> 65284 msec ago)
This is an Intel root port:
00:1c.0 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 1 (rev 02) (prog-if 00 [Normal decode])
and probably has the CF118 erratum (see
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3461a068661c
for details). I bet if you changed "msecs" in pcie_wait_cmd() to 30000
you'd see a 30 second delay during shutdown because we write a command to
tell the port not to generate any more hotplug interrupts, and we wait for
that command to complete, but the port never tells us it has completed.
Lukas reported a similar issue in
https://lkml.kernel.org/r/[email protected], which we sort
of worked around by assuming that Thunderbolt controllers never support
that "command complete" interrupt (see
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=493fb50e958c)
Sinan mooted the idea of using a "no-wait" path of sending the "don't
generate hotplug interrupts" command. I think we should work on this
idea a little more. If we're shutting down the whole system, I can't
believe there's much value in *anything* we do in the pciehp_remove()
path.
Maybe we should just get rid of pciehp_remove() (and probably
pcie_port_remove_service() and the other service driver remove methods)
completely. That dates from when the service drivers could be modules that
could be potentially unloaded, but unloading them hasn't been possible for
years.
As far as the resume path, my guess is that in pciehp_resume(), we
write a command to enable interrupts, then it looks like we get a
PCI_EXP_SLTSTA_DLLSC "Link Up" interrupt, and apparently we issue
another command. Not sure exactly what's going on here.
Could you try the following patch? The idea is to (1) do nothing on
shutdown, so you should see no message and no delay, and (2) collect
more information about the resume path.
diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index 332b723ff9e6..99751cc52968 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -260,14 +260,6 @@ static int pciehp_probe(struct pcie_device *dev)
return -ENODEV;
}
-static void pciehp_remove(struct pcie_device *dev)
-{
- struct controller *ctrl = get_service_data(dev);
-
- cleanup_slot(ctrl);
- pciehp_release_ctrl(ctrl);
-}
-
#ifdef CONFIG_PM
static int pciehp_suspend(struct pcie_device *dev)
{
@@ -305,7 +297,6 @@ static struct pcie_port_service_driver hpdriver_portdrv = {
.service = PCIE_PORT_SERVICE_HP,
.probe = pciehp_probe,
- .remove = pciehp_remove,
#ifdef CONFIG_PM
.suspend = pciehp_suspend,
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 18a42f8f5dc5..c3a9c47ed061 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -113,7 +113,7 @@ static int pcie_poll_cmd(struct controller *ctrl, int timeout)
return 0; /* timeout */
}
-static void pcie_wait_cmd(struct controller *ctrl)
+static void pcie_wait_cmd(struct controller *ctrl, u16 cmd, u16 mask)
{
unsigned int msecs = pciehp_poll_mode ? 2500 : 1000;
unsigned long duration = msecs_to_jiffies(msecs);
@@ -155,10 +155,13 @@ static void pcie_wait_cmd(struct controller *ctrl)
* don't change those bits, e.g., commands that merely enable
* interrupts.
*/
- if (!rc)
- ctrl_info(ctrl, "Timeout on hotplug command %#06x (issued %u msec ago)\n",
+ if (!rc) {
+ ctrl_info(ctrl, "Timeout on hotplug command %#06x (issued %u msec ago), new command %#06x/mask %#06x\n",
ctrl->slot_ctrl,
- jiffies_to_msecs(jiffies - ctrl->cmd_started));
+ jiffies_to_msecs(jiffies - ctrl->cmd_started),
+ cmd, mask);
+ dump_stack();
+ }
}
static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
@@ -172,7 +175,7 @@ static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
/*
* Always wait for any previous command that might still be in progress
*/
- pcie_wait_cmd(ctrl);
+ pcie_wait_cmd(ctrl, cmd, mask);
pcie_capability_read_word(pdev, PCI_EXP_SLTCTL, &slot_ctrl);
if (slot_ctrl == (u16) ~0) {
@@ -193,7 +196,7 @@ static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
* indicating completion of the above issued command.
*/
if (wait)
- pcie_wait_cmd(ctrl);
+ pcie_wait_cmd(ctrl, cmd, mask);
out:
mutex_unlock(&ctrl->ctrl_lock);
Hi Bjorn,
Thanks for adding me.
On 4/27/2018 3:22 PM, Bjorn Helgaas wrote:
> Sinan mooted the idea of using a "no-wait" path of sending the "don't
> generate hotplug interrupts" command. I think we should work on this
> idea a little more. If we're shutting down the whole system, I can't
> believe there's much value in *anything* we do in the pciehp_remove()
> path.
>
> Maybe we should just get rid of pciehp_remove() (and probably
> pcie_port_remove_service() and the other service driver remove methods)
> completely. That dates from when the service drivers could be modules that
> could be potentially unloaded, but unloading them hasn't been possible for
> years.
Shutdown path is also used for kexec. Leaving hotplug interrupts pending is
dangerous for the newly loaded kernel as it leaves spurious interrupts
during the new kernel boot.
I think we should always disable the hotplug interrupt on shutdown. We might
think of not waiting for command-completion as a middle-ground or go to polling
path instead of interrupts all the time.
Sinan
--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
[+cc Eric, Vivek, kexec list]
On Fri, Apr 27, 2018 at 03:34:30PM -0400, Sinan Kaya wrote:
> On 4/27/2018 3:22 PM, Bjorn Helgaas wrote:
> > Sinan mooted the idea of using a "no-wait" path of sending the "don't
> > generate hotplug interrupts" command. I think we should work on this
> > idea a little more. If we're shutting down the whole system, I can't
> > believe there's much value in *anything* we do in the pciehp_remove()
> > path.
> >
> > Maybe we should just get rid of pciehp_remove() (and probably
> > pcie_port_remove_service() and the other service driver remove methods)
> > completely. That dates from when the service drivers could be modules that
> > could be potentially unloaded, but unloading them hasn't been possible for
> > years.
>
> Shutdown path is also used for kexec. Leaving hotplug interrupts
> pending is dangerous for the newly loaded kernel as it leaves
> spurious interrupts during the new kernel boot.
>
> I think we should always disable the hotplug interrupt on shutdown.
> We might think of not waiting for command-completion as a
> middle-ground or go to polling path instead of interrupts all the
> time.
Ah, I forgot about the kexec path. The kexec path is used for
crashdump, too, so ideally the newly-loaded kernel would defend itself
when possible so it doesn't depend on the original kernel doing things
correctly.
Seems like this question of whether to do things in the original
kernel or the kexec-ed kernel comes up periodically, but I can never
remember a definitive answer. My initial reaction is that it'd be
nice if we didn't have to do *any* shutdown in the original kernel,
but I'm sure there are reasons that's not practical.
I copied Eric (kexec maintainer) and Vivek (contact listed in
Documentation/kdump/kdump.txt) in case they have suggestions or would
consider some sort of Documentation/ update.
Bjorn
On 04/27/18 at 04:12pm, Bjorn Helgaas wrote:
> [+cc Eric, Vivek, kexec list]
>
> On Fri, Apr 27, 2018 at 03:34:30PM -0400, Sinan Kaya wrote:
> > On 4/27/2018 3:22 PM, Bjorn Helgaas wrote:
> > > Sinan mooted the idea of using a "no-wait" path of sending the "don't
> > > generate hotplug interrupts" command. I think we should work on this
> > > idea a little more. If we're shutting down the whole system, I can't
> > > believe there's much value in *anything* we do in the pciehp_remove()
> > > path.
> > >
> > > Maybe we should just get rid of pciehp_remove() (and probably
> > > pcie_port_remove_service() and the other service driver remove methods)
> > > completely. That dates from when the service drivers could be modules that
> > > could be potentially unloaded, but unloading them hasn't been possible for
> > > years.
> >
> > Shutdown path is also used for kexec. Leaving hotplug interrupts
> > pending is dangerous for the newly loaded kernel as it leaves
> > spurious interrupts during the new kernel boot.
> >
> > I think we should always disable the hotplug interrupt on shutdown.
> > We might think of not waiting for command-completion as a
> > middle-ground or go to polling path instead of interrupts all the
> > time.
>
> Ah, I forgot about the kexec path. The kexec path is used for
> crashdump, too, so ideally the newly-loaded kernel would defend itself
> when possible so it doesn't depend on the original kernel doing things
> correctly.
It is true for kdump. But kexec needs device shutdown.
>
> Seems like this question of whether to do things in the original
> kernel or the kexec-ed kernel comes up periodically, but I can never
> remember a definitive answer. My initial reaction is that it'd be
> nice if we didn't have to do *any* shutdown in the original kernel,
> but I'm sure there are reasons that's not practical.
Devices sometimes assume it is in a good state initialized in firmware boot
phase, so we need a shutdown in 1st kernel so that kexec kernel can boot
correctly for those devices. For kdump since kernel already panicked
and it is not reliable so we do as less as we can in the 1st kernel
crash path, but there are some special handling for kdump in various drivers
to reset the devices in 2nd kernel, eg. when it see "reset_devices" kernel parameter.
>
> I copied Eric (kexec maintainer) and Vivek (contact listed in
> Documentation/kdump/kdump.txt) in case they have suggestions or would
> consider some sort of Documentation/ update.
>
> Bjorn
>
> _______________________________________________
> kexec mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/kexec
Thanks
Dave
On 04/28/18 at 08:56am, Dave Young wrote:
> On 04/27/18 at 04:12pm, Bjorn Helgaas wrote:
> > [+cc Eric, Vivek, kexec list]
> >
> > On Fri, Apr 27, 2018 at 03:34:30PM -0400, Sinan Kaya wrote:
> > > On 4/27/2018 3:22 PM, Bjorn Helgaas wrote:
> > > > Sinan mooted the idea of using a "no-wait" path of sending the "don't
> > > > generate hotplug interrupts" command. I think we should work on this
> > > > idea a little more. If we're shutting down the whole system, I can't
> > > > believe there's much value in *anything* we do in the pciehp_remove()
> > > > path.
> > > >
> > > > Maybe we should just get rid of pciehp_remove() (and probably
> > > > pcie_port_remove_service() and the other service driver remove methods)
> > > > completely. That dates from when the service drivers could be modules that
Hmm, if it is the remove() method then kexec does not use it. kexec use
the shutdown() method instead. I missed this details when I replied.
> > > > could be potentially unloaded, but unloading them hasn't been possible for
> > > > years.
> > >
> > > Shutdown path is also used for kexec. Leaving hotplug interrupts
> > > pending is dangerous for the newly loaded kernel as it leaves
> > > spurious interrupts during the new kernel boot.
> > >
> > > I think we should always disable the hotplug interrupt on shutdown.
> > > We might think of not waiting for command-completion as a
> > > middle-ground or go to polling path instead of interrupts all the
> > > time.
> >
> > Ah, I forgot about the kexec path. The kexec path is used for
> > crashdump, too, so ideally the newly-loaded kernel would defend itself
> > when possible so it doesn't depend on the original kernel doing things
> > correctly.
>
> It is true for kdump. But kexec needs device shutdown.
>
> >
> > Seems like this question of whether to do things in the original
> > kernel or the kexec-ed kernel comes up periodically, but I can never
> > remember a definitive answer. My initial reaction is that it'd be
> > nice if we didn't have to do *any* shutdown in the original kernel,
> > but I'm sure there are reasons that's not practical.
>
> Devices sometimes assume it is in a good state initialized in firmware boot
> phase, so we need a shutdown in 1st kernel so that kexec kernel can boot
> correctly for those devices. For kdump since kernel already panicked
> and it is not reliable so we do as less as we can in the 1st kernel
> crash path, but there are some special handling for kdump in various drivers
> to reset the devices in 2nd kernel, eg. when it see "reset_devices" kernel parameter.
>
> >
> > I copied Eric (kexec maintainer) and Vivek (contact listed in
> > Documentation/kdump/kdump.txt) in case they have suggestions or would
> > consider some sort of Documentation/ update.
> >
> > Bjorn
> >
> > _______________________________________________
> > kexec mailing list
> > [email protected]
> > http://lists.infradead.org/mailman/listinfo/kexec
>
> Thanks
> Dave
>
> _______________________________________________
> kexec mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/kexec
On 2018-04-27 21:18, Dave Young wrote:
> On 04/28/18 at 08:56am, Dave Young wrote:
>> On 04/27/18 at 04:12pm, Bjorn Helgaas wrote:
>> > [+cc Eric, Vivek, kexec list]
>> >
>> > On Fri, Apr 27, 2018 at 03:34:30PM -0400, Sinan Kaya wrote:
>> > > On 4/27/2018 3:22 PM, Bjorn Helgaas wrote:
>> > > > Sinan mooted the idea of using a "no-wait" path of sending the "don't
>> > > > generate hotplug interrupts" command. I think we should work on this
>> > > > idea a little more. If we're shutting down the whole system, I can't
>> > > > believe there's much value in *anything* we do in the pciehp_remove()
>> > > > path.
>> > > >
>> > > > Maybe we should just get rid of pciehp_remove() (and probably
>> > > > pcie_port_remove_service() and the other service driver remove methods)
>> > > > completely. That dates from when the service drivers could be modules that
>
> Hmm, if it is the remove() method then kexec does not use it. kexec
> use
> the shutdown() method instead. I missed this details when I replied.
Portdrv hooks up remove handler to shutdown. That's why remove is
getting called.
>
>> > > > could be potentially unloaded, but unloading them hasn't been possible for
>> > > > years.
>> > >
>> > > Shutdown path is also used for kexec. Leaving hotplug interrupts
>> > > pending is dangerous for the newly loaded kernel as it leaves
>> > > spurious interrupts during the new kernel boot.
>> > >
>> > > I think we should always disable the hotplug interrupt on shutdown.
>> > > We might think of not waiting for command-completion as a
>> > > middle-ground or go to polling path instead of interrupts all the
>> > > time.
>> >
>> > Ah, I forgot about the kexec path. The kexec path is used for
>> > crashdump, too, so ideally the newly-loaded kernel would defend itself
>> > when possible so it doesn't depend on the original kernel doing things
>> > correctly.
>>
>> It is true for kdump. But kexec needs device shutdown.
>>
>> >
>> > Seems like this question of whether to do things in the original
>> > kernel or the kexec-ed kernel comes up periodically, but I can never
>> > remember a definitive answer. My initial reaction is that it'd be
>> > nice if we didn't have to do *any* shutdown in the original kernel,
>> > but I'm sure there are reasons that's not practical.
>>
>> Devices sometimes assume it is in a good state initialized in firmware
>> boot
>> phase, so we need a shutdown in 1st kernel so that kexec kernel can
>> boot
>> correctly for those devices. For kdump since kernel already panicked
>> and it is not reliable so we do as less as we can in the 1st kernel
>> crash path, but there are some special handling for kdump in various
>> drivers
>> to reset the devices in 2nd kernel, eg. when it see "reset_devices"
>> kernel parameter.
>>
>> >
>> > I copied Eric (kexec maintainer) and Vivek (contact listed in
>> > Documentation/kdump/kdump.txt) in case they have suggestions or would
>> > consider some sort of Documentation/ update.
>> >
>> > Bjorn
>> >
>> > _______________________________________________
>> > kexec mailing list
>> > [email protected]
>> > http://lists.infradead.org/mailman/listinfo/kexec
>>
>> Thanks
>> Dave
>>
>> _______________________________________________
>> kexec mailing list
>> [email protected]
>> http://lists.infradead.org/mailman/listinfo/kexec
Bjorn,
On 4/28/2018 9:03 AM, [email protected] wrote:
>> Hmm, if it is the remove() method then kexec does not use it. kexec use
>> the shutdown() method instead. I missed this details when I replied.
>
> Portdrv hooks up remove handler to shutdown. That's why remove is getting called.
What should we do about this?
Since there is an actual HW errata involved, should we quirk this root port and
not wait as if remove/shutdown doesn't exist?
Paul,
You might want to file a bugzilla so that we can keep our debug efforts out of this
list.
Sinan
--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
On Mon, Apr 30, 2018 at 04:48:15PM -0400, Sinan Kaya wrote:
> Bjorn,
>
> On 4/28/2018 9:03 AM, [email protected] wrote:
> >> Hmm, if it is the remove() method then kexec does not use it.? kexec use
> >> the shutdown() method instead.? I missed this details when I replied.
> >
> > Portdrv hooks up remove handler to shutdown. That's why remove is getting called.
>
> What should we do about this?
>
> Since there is an actual HW errata involved, should we quirk this
> root port and not wait as if remove/shutdown doesn't exist?
I was hoping to avoid a quirk because AFAIK all Intel parts have this
issue so it will be an ongoing maintenance issue. I tried to avoid
the timeout delays, e.g., with 40b960831cfa ("PCI: pciehp: Compute
timeout from hotplug command start time").
But we still see the alarming messages, so we should probably add a
quirk to get rid of those.
But I haven't given up on the idea of getting rid of the
pciehp_remove() path. I'm not convinced yet that we actually need to
do anything to shut this device down. I don't like the assumption
that kexec requires this. The kexec is fundamentally just a branch,
and anything we do before the branch (i.e., in the old kernel), we
should also be able to do after the branch (i.e., in the kexec-ed
kernel).
> Paul,
> You might want to file a bugzilla so that we can keep our debug
> efforts out of this list.
On 4/30/2018 5:17 PM, Bjorn Helgaas wrote:
>> What should we do about this?
>>
>> Since there is an actual HW errata involved, should we quirk this
>> root port and not wait as if remove/shutdown doesn't exist?
> I was hoping to avoid a quirk because AFAIK all Intel parts have this
> issue so it will be an ongoing maintenance issue. I tried to avoid
> the timeout delays, e.g., with 40b960831cfa ("PCI: pciehp: Compute
> timeout from hotplug command start time").
>
> But we still see the alarming messages, so we should probably add a
> quirk to get rid of those.
>
> But I haven't given up on the idea of getting rid of the
> pciehp_remove() path. I'm not convinced yet that we actually need to
> do anything to shut this device down. I don't like the assumption
> that kexec requires this. The kexec is fundamentally just a branch,
> and anything we do before the branch (i.e., in the old kernel), we
> should also be able to do after the branch (i.e., in the kexec-ed
> kernel).
>
In my experience with kexec, MSI type edge interrupts are harmless.
You might just see a few unhandled interrupt messages during boot
if something is pending from the first kernel.
It is the level interrupts that are more concerning. It remains pending
until the interrupt source is cleared. CPU never returns from the
interrupt handler to actually continue booting the second kernel.
Execution doesn't reach to PCIe hp driver initialization for
acknowledging the interrupt.
How about remove() only if MSI is disabled? Most root port interrupts
are MSI based anyhow.
--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
On Mon, Apr 30, 2018 at 05:27:32PM -0400, Sinan Kaya wrote:
> On 4/30/2018 5:17 PM, Bjorn Helgaas wrote:
> How about remove() only if MSI is disabled? Most root port interrupts
> are MSI based anyhow.
I think the code is perfectly fine as it is and doesn't need to nor
should be modified. We should just add a quirk for this particular
chipset that it doesn't support Command Completed interrupts (if that's
what's broken). In the case of Thunderbolt controllers, the first few
had the NoCompl bit set incorrectly in config space and Intel hardware
engineers confirmed the error. These or similar oddities may occur
with more chipsets, so I propose moving existing quirks to a
separate file, thereby avoiding that they clutter generic code.
The blinkenlights feature added with commit 576243b3f9ea is a similar
quirk that could be moved to a separate pciehp_quirks.c file.
Thanks,
Lukas
+Marc,
On 4/30/2018 5:27 PM, Sinan Kaya wrote:
> On 4/30/2018 5:17 PM, Bjorn Helgaas wrote:
>>> What should we do about this?
>>>
>>> Since there is an actual HW errata involved, should we quirk this
>>> root port and not wait as if remove/shutdown doesn't exist?
>> I was hoping to avoid a quirk because AFAIK all Intel parts have this
>> issue so it will be an ongoing maintenance issue. I tried to avoid
>> the timeout delays, e.g., with 40b960831cfa ("PCI: pciehp: Compute
>> timeout from hotplug command start time").
>>
>> But we still see the alarming messages, so we should probably add a
>> quirk to get rid of those.
>>
>> But I haven't given up on the idea of getting rid of the
>> pciehp_remove() path. I'm not convinced yet that we actually need to
>> do anything to shut this device down. I don't like the assumption
>> that kexec requires this. The kexec is fundamentally just a branch,
>> and anything we do before the branch (i.e., in the old kernel), we
>> should also be able to do after the branch (i.e., in the kexec-ed
>> kernel).
>>
>
> In my experience with kexec, MSI type edge interrupts are harmless.
> You might just see a few unhandled interrupt messages during boot
> if something is pending from the first kernel.
>
> It is the level interrupts that are more concerning. It remains pending
> until the interrupt source is cleared. CPU never returns from the
> interrupt handler to actually continue booting the second kernel.
This makes me wonder why kexec doesn't disable all interrupt sources by
itself instead of relying on the drivers shutdown routine. Some drivers
don't even have a shutdown callback. Kexec could have done both as another
example. Something like.
1. Call shutdown for all drivers if available.
2. Disable all interrupt sources in the interrupt controller
3. Start the new kernel.
>
> Execution doesn't reach to PCIe hp driver initialization for
> acknowledging the interrupt.
>
> How about remove() only if MSI is disabled? Most root port interrupts
> are MSI based anyhow.
>
--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
On 01/05/18 13:38, Sinan Kaya wrote:
> +Marc,
>
> On 4/30/2018 5:27 PM, Sinan Kaya wrote:
>> On 4/30/2018 5:17 PM, Bjorn Helgaas wrote:
>>>> What should we do about this?
>>>>
>>>> Since there is an actual HW errata involved, should we quirk this
>>>> root port and not wait as if remove/shutdown doesn't exist?
>>> I was hoping to avoid a quirk because AFAIK all Intel parts have this
>>> issue so it will be an ongoing maintenance issue. I tried to avoid
>>> the timeout delays, e.g., with 40b960831cfa ("PCI: pciehp: Compute
>>> timeout from hotplug command start time").
>>>
>>> But we still see the alarming messages, so we should probably add a
>>> quirk to get rid of those.
>>>
>>> But I haven't given up on the idea of getting rid of the
>>> pciehp_remove() path. I'm not convinced yet that we actually need to
>>> do anything to shut this device down. I don't like the assumption
>>> that kexec requires this. The kexec is fundamentally just a branch,
>>> and anything we do before the branch (i.e., in the old kernel), we
>>> should also be able to do after the branch (i.e., in the kexec-ed
>>> kernel).
>>>
>>
>> In my experience with kexec, MSI type edge interrupts are harmless.
>> You might just see a few unhandled interrupt messages during boot
>> if something is pending from the first kernel.
Unfortunately, that's not always the case.
A number of GICv3/v4 implementations (a very common interrupt controller
on ARM servers) cannot be disabled, which means they will keep writing
to their pending tables long after kexec will have started the new
kernel. And since we don't track memory allocation across kexec, you
end-up with significant chances of observing single bit corruption as
interrupts carry on being delivered. Oh, and you won't actually be able
to take MSIs because you can't even reprogram the damn thing.
Yes, this can be considered a HW bug.
>> It is the level interrupts that are more concerning. It remains pending
>> until the interrupt source is cleared. CPU never returns from the
>> interrupt handler to actually continue booting the second kernel.
>
> This makes me wonder why kexec doesn't disable all interrupt sources by
> itself instead of relying on the drivers shutdown routine. Some drivers
> don't even have a shutdown callback. Kexec could have done both as another
> example. Something like.
>
> 1. Call shutdown for all drivers if available.
> 2. Disable all interrupt sources in the interrupt controller
> 3. Start the new kernel.
See above. Although you can shut off the end-point and to some extent
mask interrupts before jumping into the payload, it is not always
possible to go back to a reasonable state where you can take actually MSIs.
Thanks,
M.
--
Jazz is not dead. It just smells funny...
On Tue, May 01, 2018 at 01:59:20PM +0100, Marc Zyngier wrote:
> On 01/05/18 13:38, Sinan Kaya wrote:
> > +Marc,
> >
> > On 4/30/2018 5:27 PM, Sinan Kaya wrote:
> >> On 4/30/2018 5:17 PM, Bjorn Helgaas wrote:
> >>>> What should we do about this?
> >>>>
> >>>> Since there is an actual HW errata involved, should we quirk this
> >>>> root port and not wait as if remove/shutdown doesn't exist?
> >>> I was hoping to avoid a quirk because AFAIK all Intel parts have this
> >>> issue so it will be an ongoing maintenance issue. I tried to avoid
> >>> the timeout delays, e.g., with 40b960831cfa ("PCI: pciehp: Compute
> >>> timeout from hotplug command start time").
> >>>
> >>> But we still see the alarming messages, so we should probably add a
> >>> quirk to get rid of those.
> >>>
> >>> But I haven't given up on the idea of getting rid of the
> >>> pciehp_remove() path. I'm not convinced yet that we actually need to
> >>> do anything to shut this device down. I don't like the assumption
> >>> that kexec requires this. The kexec is fundamentally just a branch,
> >>> and anything we do before the branch (i.e., in the old kernel), we
> >>> should also be able to do after the branch (i.e., in the kexec-ed
> >>> kernel).
> >>>
> >>
> >> In my experience with kexec, MSI type edge interrupts are harmless.
> >> You might just see a few unhandled interrupt messages during boot
> >> if something is pending from the first kernel.
>
> Unfortunately, that's not always the case.
>
> A number of GICv3/v4 implementations (a very common interrupt controller
> on ARM servers) cannot be disabled, which means they will keep writing
> to their pending tables long after kexec will have started the new
> kernel. And since we don't track memory allocation across kexec, you
> end-up with significant chances of observing single bit corruption as
> interrupts carry on being delivered. Oh, and you won't actually be able
> to take MSIs because you can't even reprogram the damn thing.
>
> Yes, this can be considered a HW bug.
>
> >> It is the level interrupts that are more concerning. It remains pending
> >> until the interrupt source is cleared. CPU never returns from the
> >> interrupt handler to actually continue booting the second kernel.
> >
> > This makes me wonder why kexec doesn't disable all interrupt sources by
> > itself instead of relying on the drivers shutdown routine. Some drivers
> > don't even have a shutdown callback. Kexec could have done both as another
> > example. Something like.
> >
> > 1. Call shutdown for all drivers if available.
> > 2. Disable all interrupt sources in the interrupt controller
> > 3. Start the new kernel.
>
> See above. Although you can shut off the end-point and to some extent
> mask interrupts before jumping into the payload, it is not always
> possible to go back to a reasonable state where you can take actually MSIs.
This is exactly the sort of thing it would be nice to collect and
document as part of the background of "why kexec works the way it
does." It certainly helps explain things that are far from obvious if
you don't have the background.
On Tue, 01 May 2018 14:25:54 +0100,
Bjorn Helgaas wrote:
Hi Bjorn,
> On Tue, May 01, 2018 at 01:59:20PM +0100, Marc Zyngier wrote:
> > On 01/05/18 13:38, Sinan Kaya wrote:
> > > +Marc,
> > >
> > > On 4/30/2018 5:27 PM, Sinan Kaya wrote:
> > >> On 4/30/2018 5:17 PM, Bjorn Helgaas wrote:
> > >>>> What should we do about this?
> > >>>>
> > >>>> Since there is an actual HW errata involved, should we quirk this
> > >>>> root port and not wait as if remove/shutdown doesn't exist?
> > >>> I was hoping to avoid a quirk because AFAIK all Intel parts have this
> > >>> issue so it will be an ongoing maintenance issue. I tried to avoid
> > >>> the timeout delays, e.g., with 40b960831cfa ("PCI: pciehp: Compute
> > >>> timeout from hotplug command start time").
> > >>>
> > >>> But we still see the alarming messages, so we should probably add a
> > >>> quirk to get rid of those.
> > >>>
> > >>> But I haven't given up on the idea of getting rid of the
> > >>> pciehp_remove() path. I'm not convinced yet that we actually need to
> > >>> do anything to shut this device down. I don't like the assumption
> > >>> that kexec requires this. The kexec is fundamentally just a branch,
> > >>> and anything we do before the branch (i.e., in the old kernel), we
> > >>> should also be able to do after the branch (i.e., in the kexec-ed
> > >>> kernel).
> > >>>
> > >>
> > >> In my experience with kexec, MSI type edge interrupts are harmless.
> > >> You might just see a few unhandled interrupt messages during boot
> > >> if something is pending from the first kernel.
> >
> > Unfortunately, that's not always the case.
> >
> > A number of GICv3/v4 implementations (a very common interrupt controller
> > on ARM servers) cannot be disabled, which means they will keep writing
> > to their pending tables long after kexec will have started the new
> > kernel. And since we don't track memory allocation across kexec, you
> > end-up with significant chances of observing single bit corruption as
> > interrupts carry on being delivered. Oh, and you won't actually be able
> > to take MSIs because you can't even reprogram the damn thing.
> >
> > Yes, this can be considered a HW bug.
> >
> > >> It is the level interrupts that are more concerning. It remains pending
> > >> until the interrupt source is cleared. CPU never returns from the
> > >> interrupt handler to actually continue booting the second kernel.
> > >
> > > This makes me wonder why kexec doesn't disable all interrupt sources by
> > > itself instead of relying on the drivers shutdown routine. Some drivers
> > > don't even have a shutdown callback. Kexec could have done both as another
> > > example. Something like.
> > >
> > > 1. Call shutdown for all drivers if available.
> > > 2. Disable all interrupt sources in the interrupt controller
> > > 3. Start the new kernel.
> >
> > See above. Although you can shut off the end-point and to some extent
> > mask interrupts before jumping into the payload, it is not always
> > possible to go back to a reasonable state where you can take actually MSIs.
>
> This is exactly the sort of thing it would be nice to collect and
> document as part of the background of "why kexec works the way it
> does." It certainly helps explain things that are far from obvious if
> you don't have the background.
I'd certainly be happy to help with it if someone was willing to
kickstart such a document. kexec/kdump is a huge bag of "interesting"
tricks, and it has driven me mad over the past couple of months (I'm
typing this from a laptop that uses kexec as its bootloader, and it is
*not fun*).
M.
--
Jazz is not dead, it just smell funny.
Marc Zyngier <[email protected]> writes:
> On Tue, 01 May 2018 14:25:54 +0100,
> Bjorn Helgaas wrote:
>
> Hi Bjorn,
>
>> On Tue, May 01, 2018 at 01:59:20PM +0100, Marc Zyngier wrote:
>> > On 01/05/18 13:38, Sinan Kaya wrote:
>> > > +Marc,
>> > >
>> > > On 4/30/2018 5:27 PM, Sinan Kaya wrote:
>> > >> On 4/30/2018 5:17 PM, Bjorn Helgaas wrote:
>> > >>>> What should we do about this?
>> > >>>>
>> > >>>> Since there is an actual HW errata involved, should we quirk this
>> > >>>> root port and not wait as if remove/shutdown doesn't exist?
>> > >>> I was hoping to avoid a quirk because AFAIK all Intel parts have this
>> > >>> issue so it will be an ongoing maintenance issue. I tried to avoid
>> > >>> the timeout delays, e.g., with 40b960831cfa ("PCI: pciehp: Compute
>> > >>> timeout from hotplug command start time").
>> > >>>
>> > >>> But we still see the alarming messages, so we should probably add a
>> > >>> quirk to get rid of those.
>> > >>>
>> > >>> But I haven't given up on the idea of getting rid of the
>> > >>> pciehp_remove() path. I'm not convinced yet that we actually need to
>> > >>> do anything to shut this device down. I don't like the assumption
>> > >>> that kexec requires this. The kexec is fundamentally just a branch,
>> > >>> and anything we do before the branch (i.e., in the old kernel), we
>> > >>> should also be able to do after the branch (i.e., in the kexec-ed
>> > >>> kernel).
>> > >>>
>> > >>
>> > >> In my experience with kexec, MSI type edge interrupts are harmless.
>> > >> You might just see a few unhandled interrupt messages during boot
>> > >> if something is pending from the first kernel.
>> >
>> > Unfortunately, that's not always the case.
>> >
>> > A number of GICv3/v4 implementations (a very common interrupt controller
>> > on ARM servers) cannot be disabled, which means they will keep writing
>> > to their pending tables long after kexec will have started the new
>> > kernel. And since we don't track memory allocation across kexec, you
>> > end-up with significant chances of observing single bit corruption as
>> > interrupts carry on being delivered. Oh, and you won't actually be able
>> > to take MSIs because you can't even reprogram the damn thing.
>> >
>> > Yes, this can be considered a HW bug.
>> >
>> > >> It is the level interrupts that are more concerning. It remains pending
>> > >> until the interrupt source is cleared. CPU never returns from the
>> > >> interrupt handler to actually continue booting the second kernel.
>> > >
>> > > This makes me wonder why kexec doesn't disable all interrupt sources by
>> > > itself instead of relying on the drivers shutdown routine. Some drivers
>> > > don't even have a shutdown callback. Kexec could have done both as another
>> > > example. Something like.
>> > >
>> > > 1. Call shutdown for all drivers if available.
>> > > 2. Disable all interrupt sources in the interrupt controller
>> > > 3. Start the new kernel.
>> >
>> > See above. Although you can shut off the end-point and to some extent
>> > mask interrupts before jumping into the payload, it is not always
>> > possible to go back to a reasonable state where you can take actually MSIs.
>>
>> This is exactly the sort of thing it would be nice to collect and
>> document as part of the background of "why kexec works the way it
>> does." It certainly helps explain things that are far from obvious if
>> you don't have the background.
>
> I'd certainly be happy to help with it if someone was willing to
> kickstart such a document. kexec/kdump is a huge bag of "interesting"
> tricks, and it has driven me mad over the past couple of months (I'm
> typing this from a laptop that uses kexec as its bootloader, and it is
> *not fun*).
I don't know if it helps documentation wise but here is my memory of why
things are the way they are.
Case 1) kexec-on-panic.
In this case we run the new kernel in memory reserved since boot of the
previous kernel in memory has never been used by any device driver.
This means on-going DMA transactions that we don't manage to shut off
are harmless.
In actual execution a bare minimum of hardware is shutdown on the
kexec-on-panic path. Ideally it would be nothing. The crashing kernel
simply can not be trusted to shut things down itself.
The kernel that is executing in the after the crash loads a bare minimum
of drivers and does it's best to initialize the hardware. Ideally if
something goes wrong the kernel will hang before we write to hardware
and mess anything up.
With this we get something like a 50% or a 60% success rate of capture
crashdump in practice in the field.
Everything else that has been tried relies more on the crashing kernel
and looks great in testing and then turns out to not have a measurable
success rate in practice.
Using lkdtm you can setup tests of various kinds of kernel corruption
and failure and see some approximation of the success rate of kexec will
see in practice.
I forget where we are with iommus, but the principles remain and iommus
tend to tricky just because they get in the middle of everything.
If someone stares hard enough we are probably at the point on x86 where
we can remove the irq shutdown code.
The kexec on panic case tends to be tested more on enterprise kernels
than on normal ones.
Case 2) Ordinary kexec.
The goal is to have a fully functionaly uncompromised system (unlike
kexec on panic). Hardware bugs mean that in the general case the only
place we can shutdown hardware reliably is the drivers themselves.
All devices doing DMA must be shutdown in the kexec'ing kernel. In part
because there is no guarantee that we will even load a driver for that
hardware.
The presence of DMA drove most of the decisions. But from this thread I
see that irq handling follows the same pattern. The best place to shut
anything down is in the driver where there is full knowledge of how
things work.
One of the more annoying things that have been discovered is the generic
pci dma disable bit doesn't work uniformly acrosss hardware. Which
means there is no known generic way to shut down dma across the board.
In the prototypes there was only the "remove" method of drivers and that
worked well. When it came time to merge the original kexec
implementation the maintainer of the power mananagement subsystem
insisted we add a new "shutdown" method instead, because while it is
necessary to shutdown the hardware you should not need to clean up the
data structures.
In practice that idea flopped. The most reliable way I know to run
kexec is to remmod all of the drivers before runing sys_reboot(...,
LINUX_REBOT_CMD_KEXEC, ...) so that the shutdown methods get run.
It has been asked and I have given my approval to anyone who wants to do
the work to switch form the "shutdown" methods to "remove" on the kexec
path. But so far it is a big enough project that no one has done that
yet.
It has been suggested that hardware does not need to be shutdown at the
end of the kernel before returning to a a firmware method. Which is
incorrect. Most firmware when it regains control triggers a system
reset to get the hardware back into a usable state, and be able to
reboot the system. There is a magic register for this on x86. On older
x86 systems and others that transfer control to firmware without doing a
soft hardware reset of the system and all of the devices. Without
shutting down the devices they will work about as well as kexec does
when you don't remove the devices. That is why I merged the reboot
and the kexec code paths. Well that and so that there is a little
more testing.
In practice it still seems that rmmod is the only testing that reliably
happens to drivers. So not sharing that code path makes kexec more
fragile than necessary.
Hopefully this helps put things into perspective and can help with your
docuement.
Eric
Dear Bjorn,
On 04/27/18 21:22, Bjorn Helgaas wrote:
> [+cc Lukas, Sinan]
> On Thu, Apr 26, 2018 at 12:17:53PM +0200, Paul Menzel wrote:
>> On the Lenovo X60t, during resume from ACPI suspend and during shutdown, the
>> message below is shown in the logs.
>>
>> pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued
>> 65284 msec ago)
>
> This is an Intel root port:
>
> 00:1c.0 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 1 (rev 02) (prog-if 00 [Normal decode])
>
> and probably has the CF118 erratum (see
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3461a068661c
> for details). I bet if you changed "msecs" in pcie_wait_cmd() to 30000
> you'd see a 30 second delay during shutdown because we write a command to
> tell the port not to generate any more hotplug interrupts, and we wait for
> that command to complete, but the port never tells us it has completed.
>
> Lukas reported a similar issue in
> https://lkml.kernel.org/r/[email protected], which we sort
> of worked around by assuming that Thunderbolt controllers never support
> that "command complete" interrupt (see
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=493fb50e958c)
>
> Sinan mooted the idea of using a "no-wait" path of sending the "don't
> generate hotplug interrupts" command. I think we should work on this
> idea a little more. If we're shutting down the whole system, I can't
> believe there's much value in *anything* we do in the pciehp_remove()
> path.
>
> Maybe we should just get rid of pciehp_remove() (and probably
> pcie_port_remove_service() and the other service driver remove methods)
> completely. That dates from when the service drivers could be modules that
> could be potentially unloaded, but unloading them hasn't been possible for
> years.
>
> As far as the resume path, my guess is that in pciehp_resume(), we
> write a command to enable interrupts, then it looks like we get a
> PCI_EXP_SLTSTA_DLLSC "Link Up" interrupt, and apparently we issue
> another command. Not sure exactly what's going on here.
>
> Could you try the following patch? The idea is to (1) do nothing on
> shutdown, so you should see no message and no delay, and (2) collect
> more information about the resume path.
>
>
> diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
> index 332b723ff9e6..99751cc52968 100644
> --- a/drivers/pci/hotplug/pciehp_core.c
> +++ b/drivers/pci/hotplug/pciehp_core.c
> @@ -260,14 +260,6 @@ static int pciehp_probe(struct pcie_device *dev)
> return -ENODEV;
> }
>
> -static void pciehp_remove(struct pcie_device *dev)
> -{
> - struct controller *ctrl = get_service_data(dev);
> -
> - cleanup_slot(ctrl);
> - pciehp_release_ctrl(ctrl);
> -}
> -
> #ifdef CONFIG_PM
> static int pciehp_suspend(struct pcie_device *dev)
> {
> @@ -305,7 +297,6 @@ static struct pcie_port_service_driver hpdriver_portdrv = {
> .service = PCIE_PORT_SERVICE_HP,
>
> .probe = pciehp_probe,
> - .remove = pciehp_remove,
>
> #ifdef CONFIG_PM
> .suspend = pciehp_suspend,
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 18a42f8f5dc5..c3a9c47ed061 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -113,7 +113,7 @@ static int pcie_poll_cmd(struct controller *ctrl, int timeout)
> return 0; /* timeout */
> }
>
> -static void pcie_wait_cmd(struct controller *ctrl)
> +static void pcie_wait_cmd(struct controller *ctrl, u16 cmd, u16 mask)
> {
> unsigned int msecs = pciehp_poll_mode ? 2500 : 1000;
> unsigned long duration = msecs_to_jiffies(msecs);
> @@ -155,10 +155,13 @@ static void pcie_wait_cmd(struct controller *ctrl)
> * don't change those bits, e.g., commands that merely enable
> * interrupts.
> */
> - if (!rc)
> - ctrl_info(ctrl, "Timeout on hotplug command %#06x (issued %u msec ago)\n",
> + if (!rc) {
> + ctrl_info(ctrl, "Timeout on hotplug command %#06x (issued %u msec ago), new command %#06x/mask %#06x\n",
> ctrl->slot_ctrl,
> - jiffies_to_msecs(jiffies - ctrl->cmd_started));
> + jiffies_to_msecs(jiffies - ctrl->cmd_started),
> + cmd, mask);
> + dump_stack();
> + }
> }
>
> static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
> @@ -172,7 +175,7 @@ static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
> /*
> * Always wait for any previous command that might still be in progress
> */
> - pcie_wait_cmd(ctrl);
> + pcie_wait_cmd(ctrl, cmd, mask);
>
> pcie_capability_read_word(pdev, PCI_EXP_SLTCTL, &slot_ctrl);
> if (slot_ctrl == (u16) ~0) {
> @@ -193,7 +196,7 @@ static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
> * indicating completion of the above issued command.
> */
> if (wait)
> - pcie_wait_cmd(ctrl);
> + pcie_wait_cmd(ctrl, cmd, mask);
>
> out:
> mutex_unlock(&ctrl->ctrl_lock);
>
Thank you for the quick reply and sorry for only being able to test it
now. Please find the relevant bits from the ACPI S3 suspend “action”
below. The full log is attached.
```
[ 190.600060] pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command
0x1038 (issued 190084 msec ago), new command 0x1038/mask 0x103b
[ 190.600065] CPU: 0 PID: 1290 Comm: kworker/u4:37 Not tainted
4.17.0-rc3+ #20
[ 190.600067] Hardware name: LENOVO 636338U/636338U, BIOS CBET4000
TIMELESS 01/01/1970
[ 190.600077] Workqueue: events_unbound async_run_entry_fn
[ 190.600079] Call Trace:
[ 190.600087] dump_stack+0x66/0xa6
[ 190.600092] pcie_wait_cmd+0x153/0x2b0
[ 190.600097] ? prepare_to_wait+0x190/0x190
[ 190.600100] pcie_do_write_cmd+0x54/0x130
[ 190.600104] ? radix_tree_lookup+0x14/0x20
[ 190.600108] ? suspend_iter+0x80/0x80
[ 190.600111] pcie_enable_notification+0x64/0x150
[ 190.600115] ? irq_set_irq_wake+0x6b/0x170
[ 190.600117] ? suspend_iter+0x80/0x80
[ 190.600121] pciehp_resume+0x28/0xa0
[ 190.600124] ? klist_next+0x2d/0x170
[ 190.600127] resume_iter+0x4b/0x80
[ 190.600131] device_for_each_child+0x61/0xb0
[ 190.600134] pcie_port_device_resume+0x14/0x20
[ 190.600139] pci_pm_resume+0x75/0x100
[ 190.600143] dpm_run_callback+0x47/0x1b0
[ 190.600146] ? pci_pm_thaw+0xd0/0xd0
[ 190.600149] device_resume+0x97/0x190
[ 190.600152] ? device_resume+0x190/0x190
[ 190.600155] async_resume+0x1e/0x50
[ 190.600158] async_run_entry_fn+0x61/0x3a0
[ 190.600162] ? try_to_wake_up+0x4d/0x790
[ 190.600166] ? __switch_to_asm+0x33/0x4c
[ 190.600171] process_one_work+0x235/0x690
[ 190.600175] worker_thread+0x19d/0x6a0
[ 190.600179] kthread+0x14a/0x1f0
[ 190.600182] ? process_one_work+0x690/0x690
[ 190.600185] ? kthread_create_worker_on_cpu+0x30/0x30
[ 190.600187] ret_from_fork+0x2e/0x38
[ 190.664162] call 0000:05:00.1+ returned 0 after 67611 usecs
[ 190.700252] call usb1+ returned 0 after 102522 usecs
[ 190.700259] call usb2+ returned 0 after 102481 usecs
[ 190.704238] call usb3+ returned 0 after 106349 usecs
[ 190.704433] call ACPI0003:00+ returned 0 after 109837 usecs
```
Kind regards,
Paul
On Thu, May 03, 2018 at 10:49:24AM +0200, Paul Menzel wrote:
> On 04/27/18 21:22, Bjorn Helgaas wrote:
> > [+cc Lukas, Sinan]
>
> > On Thu, Apr 26, 2018 at 12:17:53PM +0200, Paul Menzel wrote:
>
> > > On the Lenovo X60t, during resume from ACPI suspend and during shutdown, the
> > > message below is shown in the logs.
> > >
> > > pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued
> > > 65284 msec ago)
> >
> > This is an Intel root port:
> >
> > 00:1c.0 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 1 (rev 02) (prog-if 00 [Normal decode])
> >
> > and probably has the CF118 erratum (see
> > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3461a068661c
> > for details). I bet if you changed "msecs" in pcie_wait_cmd() to 30000
> > you'd see a 30 second delay during shutdown because we write a command to
> > tell the port not to generate any more hotplug interrupts, and we wait for
> > that command to complete, but the port never tells us it has completed.
> >
> > Lukas reported a similar issue in
> > https://lkml.kernel.org/r/[email protected], which we sort
> > of worked around by assuming that Thunderbolt controllers never support
> > that "command complete" interrupt (see
> > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=493fb50e958c)
> >
> > Sinan mooted the idea of using a "no-wait" path of sending the "don't
> > generate hotplug interrupts" command. I think we should work on this
> > idea a little more. If we're shutting down the whole system, I can't
> > believe there's much value in *anything* we do in the pciehp_remove()
> > path.
> >
> > Maybe we should just get rid of pciehp_remove() (and probably
> > pcie_port_remove_service() and the other service driver remove methods)
> > completely. That dates from when the service drivers could be modules that
> > could be potentially unloaded, but unloading them hasn't been possible for
> > years.
> >
> > As far as the resume path, my guess is that in pciehp_resume(), we
> > write a command to enable interrupts, then it looks like we get a
> > PCI_EXP_SLTSTA_DLLSC "Link Up" interrupt, and apparently we issue
> > another command. Not sure exactly what's going on here.
> Thank you for the quick reply and sorry for only being able to test it now.
> Please find the relevant bits from the ACPI S3 suspend “action” below. The
> full log is attached.
No problem. I think we need to bite the bullet and just do a quirk
for the Intel erratum. I tried to avoid it by waiting for command
completion lazily, but I think that ended up being unnecessarily
clever and it didn't even solve the whole problem.
Can you try the patch below? I think it should solve the problem
you're seeing.
commit ec48a1e0b91ce68903c8ea4dce659d4fdf17ad06
Author: Bjorn Helgaas <[email protected]>
Date: Thu May 3 18:39:38 2018 -0500
PCI: pciehp: Add quirk for Intel Command Completed erratum
The Intel CF118 erratum means the controller does not set the Command
Completed bit unless writes to the Slot Command register change "Control"
bits. Command Completed is never set for writes that only change software
notification "Enable" bits. This results in timeouts like this:
pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued 65284 msec ago)
When this erratum is present, avoid these timeouts by marking commands
"completed" immediately unless they change the "Control" bits.
Here's the text of the erratum from the Intel document:
CF118 PCIe Slot Status Register Command Completed bit not always
updated on any configuration write to the Slot Control
Register
Problem: For PCIe root ports (devices 0 - 10) supporting hot-plug,
the Slot Status Register (offset AAh) Command Completed
(bit[4]) status is updated under the following condition:
IOH will set Command Completed bit after delivering the new
commands written in the Slot Controller register (offset
A8h) to VPP. The IOH detects new commands written in Slot
Control register by checking the change of value for Power
Controller Control (bit[10]), Power Indicator Control
(bits[9:8]), Attention Indicator Control (bits[7:6]), or
Electromechanical Interlock Control (bit[11]) fields. Any
other configuration writes to the Slot Control register
without changing the values of these fields will not cause
Command Completed bit to be set.
The PCIe Base Specification Revision 2.0 or later describes
the “Slot Control Register” in section 7.8.10, as follows
(Reference section 7.8.10, Slot Control Register, Offset
18h). In hot-plug capable Downstream Ports, a write to the
Slot Control register must cause a hot-plug command to be
generated (see Section 6.7.3.2 for details on hot-plug
commands). A write to the Slot Control register in a
Downstream Port that is not hotplug capable must not cause a
hot-plug command to be executed.
The PCIe Spec intended that every write to the Slot Control
Register is a command and expected a command complete status
to abstract the VPP implementation specific nuances from the
OS software. IOH PCIe Slot Control Register implementation
is not fully conforming to the PCIe Specification in this
respect.
Implication: Software checking on the Command Completed status after
writing to the Slot Control register may time out.
Workaround: Software can read the Slot Control register and compare the
existing and new values to determine if it should check the
Command Completed status after writing to the Slot Control
register.
Link: http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-v2-spec-update.html
Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Paul Menzel <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index c27aab8e25d7..eefffff8e403 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -83,6 +83,7 @@ struct controller {
struct timer_list poll_timer;
unsigned long cmd_started; /* jiffies */
unsigned int cmd_busy:1;
+ unsigned int cc_erratum:1;
unsigned int link_active_reporting:1;
unsigned int notification_enabled:1;
unsigned int power_fault_detected:1;
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 18a42f8f5dc5..aba67d16484a 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -10,7 +10,6 @@
* All rights reserved.
*
* Send feedback to <[email protected]>,<[email protected]>
- *
*/
#include <linux/kernel.h>
@@ -147,25 +146,27 @@ static void pcie_wait_cmd(struct controller *ctrl)
else
rc = pcie_poll_cmd(ctrl, jiffies_to_msecs(timeout));
- /*
- * Controllers with errata like Intel CF118 don't generate
- * completion notifications unless the power/indicator/interlock
- * control bits are changed. On such controllers, we'll emit this
- * timeout message when we wait for completion of commands that
- * don't change those bits, e.g., commands that merely enable
- * interrupts.
- */
if (!rc)
ctrl_info(ctrl, "Timeout on hotplug command %#06x (issued %u msec ago)\n",
ctrl->slot_ctrl,
jiffies_to_msecs(jiffies - ctrl->cmd_started));
}
+/*
+ * The Intel CF118 erratum means the Command Completed bit is only set if a
+ * Slot Control write changes PCI_EXP_SLTCTL_PCC, PCI_EXP_SLTCTL_PIC,
+ * PCI_EXP_SLTCTL_AIC, or PCI_EXP_SLTCTL_EIC.
+ */
+#define CC_ERRATUM_MASK (PCI_EXP_SLTCTL_PCC | \
+ PCI_EXP_SLTCTL_PIC | \
+ PCI_EXP_SLTCTL_AIC | \
+ PCI_EXP_SLTCTL_EIC)
+
static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
u16 mask, bool wait)
{
struct pci_dev *pdev = ctrl_dev(ctrl);
- u16 slot_ctrl;
+ u16 slot_ctrl_orig, slot_ctrl;
mutex_lock(&ctrl->ctrl_lock);
@@ -180,6 +181,7 @@ static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
goto out;
}
+ slot_ctrl_orig = slot_ctrl;
slot_ctrl &= ~mask;
slot_ctrl |= (cmd & mask);
ctrl->cmd_busy = 1;
@@ -188,6 +190,10 @@ static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
ctrl->cmd_started = jiffies;
ctrl->slot_ctrl = slot_ctrl;
+ if (ctrl->cc_erratum &&
+ (slot_ctrl_orig & CC_ERRATUM_MASK) == (slot_ctrl & CC_ERRATUM_MASK))
+ ctrl->cmd_busy = 0;
+
/*
* Optionally wait for the hardware to be ready for a new command,
* indicating completion of the above issued command.
@@ -840,6 +846,10 @@ struct controller *pcie_init(struct pcie_device *dev)
if (pdev->is_thunderbolt)
slot_cap |= PCI_EXP_SLTCAP_NCCS;
+ /* Assume all Intel controllers have erratum CF118 */
+ if (pdev->vendor == PCI_VENDOR_ID_INTEL)
+ ctrl->cc_erratum = 1;
+
ctrl->slot_cap = slot_cap;
mutex_init(&ctrl->ctrl_lock);
init_waitqueue_head(&ctrl->queue);
@@ -861,7 +871,7 @@ struct controller *pcie_init(struct pcie_device *dev)
PCI_EXP_SLTSTA_MRLSC | PCI_EXP_SLTSTA_CC |
PCI_EXP_SLTSTA_DLLSC);
- ctrl_info(ctrl, "Slot #%d AttnBtn%c PwrCtrl%c MRL%c AttnInd%c PwrInd%c HotPlug%c Surprise%c Interlock%c NoCompl%c LLActRep%c\n",
+ ctrl_info(ctrl, "Slot #%d AttnBtn%c PwrCtrl%c MRL%c AttnInd%c PwrInd%c HotPlug%c Surprise%c Interlock%c NoCompl%c%s LLActRep%c\n",
(slot_cap & PCI_EXP_SLTCAP_PSN) >> 19,
FLAG(slot_cap, PCI_EXP_SLTCAP_ABP),
FLAG(slot_cap, PCI_EXP_SLTCAP_PCP),
@@ -872,6 +882,7 @@ struct controller *pcie_init(struct pcie_device *dev)
FLAG(slot_cap, PCI_EXP_SLTCAP_HPS),
FLAG(slot_cap, PCI_EXP_SLTCAP_EIP),
FLAG(slot_cap, PCI_EXP_SLTCAP_NCCS),
+ ctrl->cc_erratum? " (with CC erratum)" : "",
FLAG(link_cap, PCI_EXP_LNKCAP_DLLLARC));
if (pcie_init_slot(ctrl))
On 2018-05-04 03:45, Bjorn Helgaas wrote:
> On Thu, May 03, 2018 at 10:49:24AM +0200, Paul Menzel wrote:
>> On 04/27/18 21:22, Bjorn Helgaas wrote:
>> > [+cc Lukas, Sinan]
>>
>> > On Thu, Apr 26, 2018 at 12:17:53PM +0200, Paul Menzel wrote:
>>
>> > > On the Lenovo X60t, during resume from ACPI suspend and during shutdown, the
>> > > message below is shown in the logs.
>> > >
>> > > pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued
>> > > 65284 msec ago)
>> >
>> > This is an Intel root port:
>> >
>> > 00:1c.0 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 1 (rev 02) (prog-if 00 [Normal decode])
>> >
>> > and probably has the CF118 erratum (see
>> > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3461a068661c
>> > for details). I bet if you changed "msecs" in pcie_wait_cmd() to 30000
>> > you'd see a 30 second delay during shutdown because we write a command to
>> > tell the port not to generate any more hotplug interrupts, and we wait for
>> > that command to complete, but the port never tells us it has completed.
>> >
>> > Lukas reported a similar issue in
>> > https://lkml.kernel.org/r/[email protected], which we sort
>> > of worked around by assuming that Thunderbolt controllers never support
>> > that "command complete" interrupt (see
>> > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=493fb50e958c)
>> >
>> > Sinan mooted the idea of using a "no-wait" path of sending the "don't
>> > generate hotplug interrupts" command. I think we should work on this
>> > idea a little more. If we're shutting down the whole system, I can't
>> > believe there's much value in *anything* we do in the pciehp_remove()
>> > path.
>> >
>> > Maybe we should just get rid of pciehp_remove() (and probably
>> > pcie_port_remove_service() and the other service driver remove methods)
>> > completely. That dates from when the service drivers could be modules that
>> > could be potentially unloaded, but unloading them hasn't been possible for
>> > years.
>> >
>> > As far as the resume path, my guess is that in pciehp_resume(), we
>> > write a command to enable interrupts, then it looks like we get a
>> > PCI_EXP_SLTSTA_DLLSC "Link Up" interrupt, and apparently we issue
>> > another command. Not sure exactly what's going on here.
>
>> Thank you for the quick reply and sorry for only being able to test it
>> now.
>> Please find the relevant bits from the ACPI S3 suspend “action” below.
>> The
>> full log is attached.
>
> No problem. I think we need to bite the bullet and just do a quirk
> for the Intel erratum. I tried to avoid it by waiting for command
> completion lazily, but I think that ended up being unnecessarily
> clever and it didn't even solve the whole problem.
>
> Can you try the patch below? I think it should solve the problem
> you're seeing.
>
>
> commit ec48a1e0b91ce68903c8ea4dce659d4fdf17ad06
> Author: Bjorn Helgaas <[email protected]>
> Date: Thu May 3 18:39:38 2018 -0500
>
> PCI: pciehp: Add quirk for Intel Command Completed erratum
>
> The Intel CF118 erratum means the controller does not set the
> Command
> Completed bit unless writes to the Slot Command register change
> "Control"
> bits. Command Completed is never set for writes that only change
> software
> notification "Enable" bits. This results in timeouts like this:
>
> pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038
> (issued 65284 msec ago)
>
> When this erratum is present, avoid these timeouts by marking
> commands
> "completed" immediately unless they change the "Control" bits.
>
> Here's the text of the erratum from the Intel document:
>
> CF118 PCIe Slot Status Register Command Completed bit not
> always
> updated on any configuration write to the Slot
> Control
> Register
>
> Problem: For PCIe root ports (devices 0 - 10) supporting
> hot-plug,
> the Slot Status Register (offset AAh) Command
> Completed
> (bit[4]) status is updated under the following
> condition:
> IOH will set Command Completed bit after delivering
> the new
> commands written in the Slot Controller register
> (offset
> A8h) to VPP. The IOH detects new commands written in
> Slot
> Control register by checking the change of value for
> Power
> Controller Control (bit[10]), Power Indicator
> Control
> (bits[9:8]), Attention Indicator Control
> (bits[7:6]), or
> Electromechanical Interlock Control (bit[11])
> fields. Any
> other configuration writes to the Slot Control
> register
> without changing the values of these fields will not
> cause
> Command Completed bit to be set.
>
> The PCIe Base Specification Revision 2.0 or later
> describes
> the “Slot Control Register” in section 7.8.10, as
> follows
> (Reference section 7.8.10, Slot Control Register,
> Offset
> 18h). In hot-plug capable Downstream Ports, a write
> to the
> Slot Control register must cause a hot-plug command
> to be
> generated (see Section 6.7.3.2 for details on
> hot-plug
> commands). A write to the Slot Control register in a
> Downstream Port that is not hotplug capable must not
> cause a
> hot-plug command to be executed.
>
> The PCIe Spec intended that every write to the Slot
> Control
> Register is a command and expected a command
> complete status
> to abstract the VPP implementation specific nuances
> from the
> OS software. IOH PCIe Slot Control Register
> implementation
> is not fully conforming to the PCIe Specification in
> this
> respect.
>
> Implication: Software checking on the Command Completed status
> after
> writing to the Slot Control register may time out.
>
> Workaround: Software can read the Slot Control register and
> compare the
> existing and new values to determine if it should
> check the
> Command Completed status after writing to the Slot
> Control
> register.
>
> Link:
> http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-v2-spec-update.html
> Link:
> https://lkml.kernel.org/r/[email protected]
> Reported-by: Paul Menzel <[email protected]>
> Signed-off-by: Bjorn Helgaas <[email protected]>
>
> diff --git a/drivers/pci/hotplug/pciehp.h
> b/drivers/pci/hotplug/pciehp.h
> index c27aab8e25d7..eefffff8e403 100644
> --- a/drivers/pci/hotplug/pciehp.h
> +++ b/drivers/pci/hotplug/pciehp.h
> @@ -83,6 +83,7 @@ struct controller {
> struct timer_list poll_timer;
> unsigned long cmd_started; /* jiffies */
> unsigned int cmd_busy:1;
> + unsigned int cc_erratum:1;
> unsigned int link_active_reporting:1;
> unsigned int notification_enabled:1;
> unsigned int power_fault_detected:1;
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c
> b/drivers/pci/hotplug/pciehp_hpc.c
> index 18a42f8f5dc5..aba67d16484a 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -10,7 +10,6 @@
> * All rights reserved.
> *
> * Send feedback to <[email protected]>,<[email protected]>
> - *
> */
>
> #include <linux/kernel.h>
> @@ -147,25 +146,27 @@ static void pcie_wait_cmd(struct controller
> *ctrl)
> else
> rc = pcie_poll_cmd(ctrl, jiffies_to_msecs(timeout));
>
> - /*
> - * Controllers with errata like Intel CF118 don't generate
> - * completion notifications unless the power/indicator/interlock
> - * control bits are changed. On such controllers, we'll emit this
> - * timeout message when we wait for completion of commands that
> - * don't change those bits, e.g., commands that merely enable
> - * interrupts.
> - */
> if (!rc)
> ctrl_info(ctrl, "Timeout on hotplug command %#06x (issued %u msec
> ago)\n",
> ctrl->slot_ctrl,
> jiffies_to_msecs(jiffies - ctrl->cmd_started));
> }
>
> +/*
> + * The Intel CF118 erratum means the Command Completed bit is only set
> if a
> + * Slot Control write changes PCI_EXP_SLTCTL_PCC, PCI_EXP_SLTCTL_PIC,
> + * PCI_EXP_SLTCTL_AIC, or PCI_EXP_SLTCTL_EIC.
> + */
> +#define CC_ERRATUM_MASK (PCI_EXP_SLTCTL_PCC | \
> + PCI_EXP_SLTCTL_PIC | \
> + PCI_EXP_SLTCTL_AIC | \
> + PCI_EXP_SLTCTL_EIC)
> +
> static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
> u16 mask, bool wait)
> {
> struct pci_dev *pdev = ctrl_dev(ctrl);
> - u16 slot_ctrl;
> + u16 slot_ctrl_orig, slot_ctrl;
>
> mutex_lock(&ctrl->ctrl_lock);
>
> @@ -180,6 +181,7 @@ static void pcie_do_write_cmd(struct controller
> *ctrl, u16 cmd,
> goto out;
> }
>
> + slot_ctrl_orig = slot_ctrl;
> slot_ctrl &= ~mask;
> slot_ctrl |= (cmd & mask);
> ctrl->cmd_busy = 1;
> @@ -188,6 +190,10 @@ static void pcie_do_write_cmd(struct controller
> *ctrl, u16 cmd,
> ctrl->cmd_started = jiffies;
> ctrl->slot_ctrl = slot_ctrl;
>
> + if (ctrl->cc_erratum &&
> + (slot_ctrl_orig & CC_ERRATUM_MASK) == (slot_ctrl &
> CC_ERRATUM_MASK))
> + ctrl->cmd_busy = 0;
> +
> /*
> * Optionally wait for the hardware to be ready for a new command,
> * indicating completion of the above issued command.
> @@ -840,6 +846,10 @@ struct controller *pcie_init(struct pcie_device
> *dev)
> if (pdev->is_thunderbolt)
> slot_cap |= PCI_EXP_SLTCAP_NCCS;
>
> + /* Assume all Intel controllers have erratum CF118 */
> + if (pdev->vendor == PCI_VENDOR_ID_INTEL)
> + ctrl->cc_erratum = 1;
> +
Can we build a table like quirks.c?
Qdf2400 root ports have the same problem. I will do a follow up patch
once this finds its way in.
> ctrl->slot_cap = slot_cap;
> mutex_init(&ctrl->ctrl_lock);
> init_waitqueue_head(&ctrl->queue);
> @@ -861,7 +871,7 @@ struct controller *pcie_init(struct pcie_device
> *dev)
> PCI_EXP_SLTSTA_MRLSC | PCI_EXP_SLTSTA_CC |
> PCI_EXP_SLTSTA_DLLSC);
>
> - ctrl_info(ctrl, "Slot #%d AttnBtn%c PwrCtrl%c MRL%c AttnInd%c
> PwrInd%c HotPlug%c Surprise%c Interlock%c NoCompl%c LLActRep%c\n",
> + ctrl_info(ctrl, "Slot #%d AttnBtn%c PwrCtrl%c MRL%c AttnInd%c
> PwrInd%c HotPlug%c Surprise%c Interlock%c NoCompl%c%s LLActRep%c\n",
> (slot_cap & PCI_EXP_SLTCAP_PSN) >> 19,
> FLAG(slot_cap, PCI_EXP_SLTCAP_ABP),
> FLAG(slot_cap, PCI_EXP_SLTCAP_PCP),
> @@ -872,6 +882,7 @@ struct controller *pcie_init(struct pcie_device
> *dev)
> FLAG(slot_cap, PCI_EXP_SLTCAP_HPS),
> FLAG(slot_cap, PCI_EXP_SLTCAP_EIP),
> FLAG(slot_cap, PCI_EXP_SLTCAP_NCCS),
> + ctrl->cc_erratum? " (with CC erratum)" : "",
> FLAG(link_cap, PCI_EXP_LNKCAP_DLLLARC));
>
> if (pcie_init_slot(ctrl))
On Fri, May 04, 2018 at 07:37:40AM +0100, [email protected] wrote:
> On 2018-05-04 03:45, Bjorn Helgaas wrote:
> > On Thu, May 03, 2018 at 10:49:24AM +0200, Paul Menzel wrote:
> > > On 04/27/18 21:22, Bjorn Helgaas wrote:
> > > > [+cc Lukas, Sinan]
> > >
> > > > On Thu, Apr 26, 2018 at 12:17:53PM +0200, Paul Menzel wrote:
> > >
> > > > > On the Lenovo X60t, during resume from ACPI suspend and during shutdown, the
> > > > > message below is shown in the logs.
> > > > >
> > > > > pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued
> > > > > 65284 msec ago)
> > > >
> > > > This is an Intel root port:
> > > >
> > > > 00:1c.0 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 1 (rev 02) (prog-if 00 [Normal decode])
> > > >
> > > > and probably has the CF118 erratum (see
> > > > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3461a068661c
> > > > for details). I bet if you changed "msecs" in pcie_wait_cmd() to 30000
> > > > you'd see a 30 second delay during shutdown because we write a command to
> > > > tell the port not to generate any more hotplug interrupts, and we wait for
> > > > that command to complete, but the port never tells us it has completed.
> > > >
> > > > Lukas reported a similar issue in
> > > > https://lkml.kernel.org/r/[email protected], which we sort
> > > > of worked around by assuming that Thunderbolt controllers never support
> > > > that "command complete" interrupt (see
> > > > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=493fb50e958c)
> > > >
> > > > Sinan mooted the idea of using a "no-wait" path of sending the "don't
> > > > generate hotplug interrupts" command. I think we should work on this
> > > > idea a little more. If we're shutting down the whole system, I can't
> > > > believe there's much value in *anything* we do in the pciehp_remove()
> > > > path.
> > > >
> > > > Maybe we should just get rid of pciehp_remove() (and probably
> > > > pcie_port_remove_service() and the other service driver remove methods)
> > > > completely. That dates from when the service drivers could be modules that
> > > > could be potentially unloaded, but unloading them hasn't been possible for
> > > > years.
> > > >
> > > > As far as the resume path, my guess is that in pciehp_resume(), we
> > > > write a command to enable interrupts, then it looks like we get a
> > > > PCI_EXP_SLTSTA_DLLSC "Link Up" interrupt, and apparently we issue
> > > > another command. Not sure exactly what's going on here.
> >
> > > Thank you for the quick reply and sorry for only being able to test
> > > it now.
> > > Please find the relevant bits from the ACPI S3 suspend “action”
> > > below. The
> > > full log is attached.
> >
> > No problem. I think we need to bite the bullet and just do a quirk
> > for the Intel erratum. I tried to avoid it by waiting for command
> > completion lazily, but I think that ended up being unnecessarily
> > clever and it didn't even solve the whole problem.
> >
> > Can you try the patch below? I think it should solve the problem
> > you're seeing.
> > ...
> > + /* Assume all Intel controllers have erratum CF118 */
> > + if (pdev->vendor == PCI_VENDOR_ID_INTEL)
> > + ctrl->cc_erratum = 1;
> > +
>
> Can we build a table like quirks.c?
>
> Qdf2400 root ports have the same problem. I will do a follow up patch once
> this finds its way in.
Yes, definitely. I intended to do that but got a little lazy. What
do you think about the following? Paul, if you haven't tested the
first patch, can you try this one instead? The logic is pretty much
the same.
3461a068661c ("PCI: pciehp: Wait for hotplug command completion
lazily") mentions AMD and Nvidia devices with the same issue, but
unfortunately doesn't include any specifics.
commit b0d6f2230e12c85ae3b65a854a53c67c7c1f6406
Author: Bjorn Helgaas <[email protected]>
Date: Thu May 3 18:39:38 2018 -0500
PCI: pciehp: Add quirk for Intel Command Completed erratum
The Intel CF118 erratum means the controller does not set the Command
Completed bit unless writes to the Slot Command register change "Control"
bits. Command Completed is never set for writes that only change software
notification "Enable" bits. This results in timeouts like this:
pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued 65284 msec ago)
When this erratum is present, avoid these timeouts by marking commands
"completed" immediately unless they change the "Control" bits.
Here's the text of the erratum from the Intel document:
CF118 PCIe Slot Status Register Command Completed bit not always
updated on any configuration write to the Slot Control
Register
Problem: For PCIe root ports (devices 0 - 10) supporting hot-plug,
the Slot Status Register (offset AAh) Command Completed
(bit[4]) status is updated under the following condition:
IOH will set Command Completed bit after delivering the new
commands written in the Slot Controller register (offset
A8h) to VPP. The IOH detects new commands written in Slot
Control register by checking the change of value for Power
Controller Control (bit[10]), Power Indicator Control
(bits[9:8]), Attention Indicator Control (bits[7:6]), or
Electromechanical Interlock Control (bit[11]) fields. Any
other configuration writes to the Slot Control register
without changing the values of these fields will not cause
Command Completed bit to be set.
The PCIe Base Specification Revision 2.0 or later describes
the “Slot Control Register” in section 7.8.10, as follows
(Reference section 7.8.10, Slot Control Register, Offset
18h). In hot-plug capable Downstream Ports, a write to the
Slot Control register must cause a hot-plug command to be
generated (see Section 6.7.3.2 for details on hot-plug
commands). A write to the Slot Control register in a
Downstream Port that is not hotplug capable must not cause a
hot-plug command to be executed.
The PCIe Spec intended that every write to the Slot Control
Register is a command and expected a command complete status
to abstract the VPP implementation specific nuances from the
OS software. IOH PCIe Slot Control Register implementation
is not fully conforming to the PCIe Specification in this
respect.
Implication: Software checking on the Command Completed status after
writing to the Slot Control register may time out.
Workaround: Software can read the Slot Control register and compare the
existing and new values to determine if it should check the
Command Completed status after writing to the Slot Control
register.
Link: http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-v2-spec-update.html
Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Paul Menzel <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 18a42f8f5dc5..e70eba5ea906 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -10,7 +10,6 @@
* All rights reserved.
*
* Send feedback to <[email protected]>,<[email protected]>
- *
*/
#include <linux/kernel.h>
@@ -147,25 +146,22 @@ static void pcie_wait_cmd(struct controller *ctrl)
else
rc = pcie_poll_cmd(ctrl, jiffies_to_msecs(timeout));
- /*
- * Controllers with errata like Intel CF118 don't generate
- * completion notifications unless the power/indicator/interlock
- * control bits are changed. On such controllers, we'll emit this
- * timeout message when we wait for completion of commands that
- * don't change those bits, e.g., commands that merely enable
- * interrupts.
- */
if (!rc)
ctrl_info(ctrl, "Timeout on hotplug command %#06x (issued %u msec ago)\n",
ctrl->slot_ctrl,
jiffies_to_msecs(jiffies - ctrl->cmd_started));
}
+#define CC_ERRATUM_MASK (PCI_EXP_SLTCTL_PCC | \
+ PCI_EXP_SLTCTL_PIC | \
+ PCI_EXP_SLTCTL_AIC | \
+ PCI_EXP_SLTCTL_EIC)
+
static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
u16 mask, bool wait)
{
struct pci_dev *pdev = ctrl_dev(ctrl);
- u16 slot_ctrl;
+ u16 slot_ctrl_orig, slot_ctrl;
mutex_lock(&ctrl->ctrl_lock);
@@ -180,6 +176,7 @@ static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
goto out;
}
+ slot_ctrl_orig = slot_ctrl;
slot_ctrl &= ~mask;
slot_ctrl |= (cmd & mask);
ctrl->cmd_busy = 1;
@@ -188,6 +185,17 @@ static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
ctrl->cmd_started = jiffies;
ctrl->slot_ctrl = slot_ctrl;
+ /*
+ * Controllers with the Intel CF118 and similar errata advertise
+ * Command Completed support, but they only set Command Completed
+ * if we change the "Control" bits for power, power indicator,
+ * attention indicator, or interlock. If we only change the
+ * "Enable" bits, they never set the Command Completed bit.
+ */
+ if (pdev->broken_cmd_compl &&
+ (slot_ctrl_orig & CC_ERRATUM_MASK) == (slot_ctrl & CC_ERRATUM_MASK))
+ ctrl->cmd_busy = 0;
+
/*
* Optionally wait for the hardware to be ready for a new command,
* indicating completion of the above issued command.
@@ -861,7 +869,7 @@ struct controller *pcie_init(struct pcie_device *dev)
PCI_EXP_SLTSTA_MRLSC | PCI_EXP_SLTSTA_CC |
PCI_EXP_SLTSTA_DLLSC);
- ctrl_info(ctrl, "Slot #%d AttnBtn%c PwrCtrl%c MRL%c AttnInd%c PwrInd%c HotPlug%c Surprise%c Interlock%c NoCompl%c LLActRep%c\n",
+ ctrl_info(ctrl, "Slot #%d AttnBtn%c PwrCtrl%c MRL%c AttnInd%c PwrInd%c HotPlug%c Surprise%c Interlock%c NoCompl%c LLActRep%c%s\n",
(slot_cap & PCI_EXP_SLTCAP_PSN) >> 19,
FLAG(slot_cap, PCI_EXP_SLTCAP_ABP),
FLAG(slot_cap, PCI_EXP_SLTCAP_PCP),
@@ -872,7 +880,8 @@ struct controller *pcie_init(struct pcie_device *dev)
FLAG(slot_cap, PCI_EXP_SLTCAP_HPS),
FLAG(slot_cap, PCI_EXP_SLTCAP_EIP),
FLAG(slot_cap, PCI_EXP_SLTCAP_NCCS),
- FLAG(link_cap, PCI_EXP_LNKCAP_DLLLARC));
+ FLAG(link_cap, PCI_EXP_LNKCAP_DLLLARC),
+ pdev->broken_cmd_compl ? " (with Cmd Compl erratum)" : "");
if (pcie_init_slot(ctrl))
goto abort_ctrl;
@@ -891,3 +900,17 @@ void pciehp_release_ctrl(struct controller *ctrl)
pcie_cleanup_slot(ctrl);
kfree(ctrl);
}
+
+static void quirk_cmd_compl(struct pci_dev *pdev)
+{
+ u32 slot_cap;
+
+ if (pci_is_pcie(pdev)) {
+ pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP, &slot_cap);
+ if (slot_cap & PCI_EXP_SLTCAP_HPC &&
+ !(slot_cap & PCI_EXP_SLTCAP_NCCS))
+ pdev->broken_cmd_compl = 1;
+ }
+}
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
+ PCI_CLASS_BRIDGE_PCI, 8, quirk_cmd_compl);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 73178a2fcee0..60cb5350ad28 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -406,6 +406,9 @@ struct pci_dev {
struct bin_attribute *res_attr[DEVICE_COUNT_RESOURCE]; /* sysfs file for resources */
struct bin_attribute *res_attr_wc[DEVICE_COUNT_RESOURCE]; /* sysfs file for WC mapping of resources */
+#ifdef CONFIG_HOTPLUG_PCI_PCIE
+ unsigned int broken_cmd_compl:1; /* Command Complete broken */
+#endif
#ifdef CONFIG_PCIE_PTM
unsigned int ptm_root:1;
unsigned int ptm_enabled:1;
On 2018-05-04 14:33, Bjorn Helgaas wrote:
> On Fri, May 04, 2018 at 07:37:40AM +0100, [email protected] wrote:
>> On 2018-05-04 03:45, Bjorn Helgaas wrote:
>> > On Thu, May 03, 2018 at 10:49:24AM +0200, Paul Menzel wrote:
>> > > On 04/27/18 21:22, Bjorn Helgaas wrote:
>> > > > [+cc Lukas, Sinan]
>> > >
>> > > > On Thu, Apr 26, 2018 at 12:17:53PM +0200, Paul Menzel wrote:
>> > >
>> > > > > On the Lenovo X60t, during resume from ACPI suspend and during shutdown, the
>> > > > > message below is shown in the logs.
>> > > > >
>> > > > > pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued
>> > > > > 65284 msec ago)
>> > > >
>> > > > This is an Intel root port:
>> > > >
>> > > > 00:1c.0 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 1 (rev 02) (prog-if 00 [Normal decode])
>> > > >
>> > > > and probably has the CF118 erratum (see
>> > > > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3461a068661c
>> > > > for details). I bet if you changed "msecs" in pcie_wait_cmd() to 30000
>> > > > you'd see a 30 second delay during shutdown because we write a command to
>> > > > tell the port not to generate any more hotplug interrupts, and we wait for
>> > > > that command to complete, but the port never tells us it has completed.
>> > > >
>> > > > Lukas reported a similar issue in
>> > > > https://lkml.kernel.org/r/[email protected], which we sort
>> > > > of worked around by assuming that Thunderbolt controllers never support
>> > > > that "command complete" interrupt (see
>> > > > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=493fb50e958c)
>> > > >
>> > > > Sinan mooted the idea of using a "no-wait" path of sending the "don't
>> > > > generate hotplug interrupts" command. I think we should work on this
>> > > > idea a little more. If we're shutting down the whole system, I can't
>> > > > believe there's much value in *anything* we do in the pciehp_remove()
>> > > > path.
>> > > >
>> > > > Maybe we should just get rid of pciehp_remove() (and probably
>> > > > pcie_port_remove_service() and the other service driver remove methods)
>> > > > completely. That dates from when the service drivers could be modules that
>> > > > could be potentially unloaded, but unloading them hasn't been possible for
>> > > > years.
>> > > >
>> > > > As far as the resume path, my guess is that in pciehp_resume(), we
>> > > > write a command to enable interrupts, then it looks like we get a
>> > > > PCI_EXP_SLTSTA_DLLSC "Link Up" interrupt, and apparently we issue
>> > > > another command. Not sure exactly what's going on here.
>> >
>> > > Thank you for the quick reply and sorry for only being able to test
>> > > it now.
>> > > Please find the relevant bits from the ACPI S3 suspend “action”
>> > > below. The
>> > > full log is attached.
>> >
>> > No problem. I think we need to bite the bullet and just do a quirk
>> > for the Intel erratum. I tried to avoid it by waiting for command
>> > completion lazily, but I think that ended up being unnecessarily
>> > clever and it didn't even solve the whole problem.
>> >
>> > Can you try the patch below? I think it should solve the problem
>> > you're seeing.
>> > ...
>
>> > + /* Assume all Intel controllers have erratum CF118 */
>> > + if (pdev->vendor == PCI_VENDOR_ID_INTEL)
>> > + ctrl->cc_erratum = 1;
>> > +
>>
>> Can we build a table like quirks.c?
>>
>> Qdf2400 root ports have the same problem. I will do a follow up patch
>> once
>> this finds its way in.
>
> Yes, definitely. I intended to do that but got a little lazy. What
> do you think about the following? Paul, if you haven't tested the
> first patch, can you try this one instead? The logic is pretty much
> the same.
>
Yes, this works for me.
> 3461a068661c ("PCI: pciehp: Wait for hotplug command completion
> lazily") mentions AMD and Nvidia devices with the same issue, but
> unfortunately doesn't include any specifics.
>
>
> commit b0d6f2230e12c85ae3b65a854a53c67c7c1f6406
> Author: Bjorn Helgaas <[email protected]>
> Date: Thu May 3 18:39:38 2018 -0500
>
> PCI: pciehp: Add quirk for Intel Command Completed erratum
>
> The Intel CF118 erratum means the controller does not set the
> Command
> Completed bit unless writes to the Slot Command register change
> "Control"
> bits. Command Completed is never set for writes that only change
> software
> notification "Enable" bits. This results in timeouts like this:
>
> pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038
> (issued 65284 msec ago)
>
> When this erratum is present, avoid these timeouts by marking
> commands
> "completed" immediately unless they change the "Control" bits.
>
> Here's the text of the erratum from the Intel document:
>
> CF118 PCIe Slot Status Register Command Completed bit not
> always
> updated on any configuration write to the Slot
> Control
> Register
>
> Problem: For PCIe root ports (devices 0 - 10) supporting
> hot-plug,
> the Slot Status Register (offset AAh) Command
> Completed
> (bit[4]) status is updated under the following
> condition:
> IOH will set Command Completed bit after delivering
> the new
> commands written in the Slot Controller register
> (offset
> A8h) to VPP. The IOH detects new commands written in
> Slot
> Control register by checking the change of value for
> Power
> Controller Control (bit[10]), Power Indicator
> Control
> (bits[9:8]), Attention Indicator Control
> (bits[7:6]), or
> Electromechanical Interlock Control (bit[11])
> fields. Any
> other configuration writes to the Slot Control
> register
> without changing the values of these fields will not
> cause
> Command Completed bit to be set.
>
> The PCIe Base Specification Revision 2.0 or later
> describes
> the “Slot Control Register” in section 7.8.10, as
> follows
> (Reference section 7.8.10, Slot Control Register,
> Offset
> 18h). In hot-plug capable Downstream Ports, a write
> to the
> Slot Control register must cause a hot-plug command
> to be
> generated (see Section 6.7.3.2 for details on
> hot-plug
> commands). A write to the Slot Control register in a
> Downstream Port that is not hotplug capable must not
> cause a
> hot-plug command to be executed.
>
> The PCIe Spec intended that every write to the Slot
> Control
> Register is a command and expected a command
> complete status
> to abstract the VPP implementation specific nuances
> from the
> OS software. IOH PCIe Slot Control Register
> implementation
> is not fully conforming to the PCIe Specification in
> this
> respect.
>
> Implication: Software checking on the Command Completed status
> after
> writing to the Slot Control register may time out.
>
> Workaround: Software can read the Slot Control register and
> compare the
> existing and new values to determine if it should
> check the
> Command Completed status after writing to the Slot
> Control
> register.
>
> Link:
> http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-v2-spec-update.html
> Link:
> https://lkml.kernel.org/r/[email protected]
> Reported-by: Paul Menzel <[email protected]>
> Signed-off-by: Bjorn Helgaas <[email protected]>
>
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c
> b/drivers/pci/hotplug/pciehp_hpc.c
> index 18a42f8f5dc5..e70eba5ea906 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -10,7 +10,6 @@
> * All rights reserved.
> *
> * Send feedback to <[email protected]>,<[email protected]>
> - *
> */
>
> #include <linux/kernel.h>
> @@ -147,25 +146,22 @@ static void pcie_wait_cmd(struct controller
> *ctrl)
> else
> rc = pcie_poll_cmd(ctrl, jiffies_to_msecs(timeout));
>
> - /*
> - * Controllers with errata like Intel CF118 don't generate
> - * completion notifications unless the power/indicator/interlock
> - * control bits are changed. On such controllers, we'll emit this
> - * timeout message when we wait for completion of commands that
> - * don't change those bits, e.g., commands that merely enable
> - * interrupts.
> - */
> if (!rc)
> ctrl_info(ctrl, "Timeout on hotplug command %#06x (issued %u msec
> ago)\n",
> ctrl->slot_ctrl,
> jiffies_to_msecs(jiffies - ctrl->cmd_started));
> }
>
> +#define CC_ERRATUM_MASK (PCI_EXP_SLTCTL_PCC | \
> + PCI_EXP_SLTCTL_PIC | \
> + PCI_EXP_SLTCTL_AIC | \
> + PCI_EXP_SLTCTL_EIC)
> +
> static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
> u16 mask, bool wait)
> {
> struct pci_dev *pdev = ctrl_dev(ctrl);
> - u16 slot_ctrl;
> + u16 slot_ctrl_orig, slot_ctrl;
>
> mutex_lock(&ctrl->ctrl_lock);
>
> @@ -180,6 +176,7 @@ static void pcie_do_write_cmd(struct controller
> *ctrl, u16 cmd,
> goto out;
> }
>
> + slot_ctrl_orig = slot_ctrl;
> slot_ctrl &= ~mask;
> slot_ctrl |= (cmd & mask);
> ctrl->cmd_busy = 1;
> @@ -188,6 +185,17 @@ static void pcie_do_write_cmd(struct controller
> *ctrl, u16 cmd,
> ctrl->cmd_started = jiffies;
> ctrl->slot_ctrl = slot_ctrl;
>
> + /*
> + * Controllers with the Intel CF118 and similar errata advertise
> + * Command Completed support, but they only set Command Completed
> + * if we change the "Control" bits for power, power indicator,
> + * attention indicator, or interlock. If we only change the
> + * "Enable" bits, they never set the Command Completed bit.
> + */
> + if (pdev->broken_cmd_compl &&
> + (slot_ctrl_orig & CC_ERRATUM_MASK) == (slot_ctrl &
> CC_ERRATUM_MASK))
> + ctrl->cmd_busy = 0;
> +
> /*
> * Optionally wait for the hardware to be ready for a new command,
> * indicating completion of the above issued command.
> @@ -861,7 +869,7 @@ struct controller *pcie_init(struct pcie_device
> *dev)
> PCI_EXP_SLTSTA_MRLSC | PCI_EXP_SLTSTA_CC |
> PCI_EXP_SLTSTA_DLLSC);
>
> - ctrl_info(ctrl, "Slot #%d AttnBtn%c PwrCtrl%c MRL%c AttnInd%c
> PwrInd%c HotPlug%c Surprise%c Interlock%c NoCompl%c LLActRep%c\n",
> + ctrl_info(ctrl, "Slot #%d AttnBtn%c PwrCtrl%c MRL%c AttnInd%c
> PwrInd%c HotPlug%c Surprise%c Interlock%c NoCompl%c LLActRep%c%s\n",
> (slot_cap & PCI_EXP_SLTCAP_PSN) >> 19,
> FLAG(slot_cap, PCI_EXP_SLTCAP_ABP),
> FLAG(slot_cap, PCI_EXP_SLTCAP_PCP),
> @@ -872,7 +880,8 @@ struct controller *pcie_init(struct pcie_device
> *dev)
> FLAG(slot_cap, PCI_EXP_SLTCAP_HPS),
> FLAG(slot_cap, PCI_EXP_SLTCAP_EIP),
> FLAG(slot_cap, PCI_EXP_SLTCAP_NCCS),
> - FLAG(link_cap, PCI_EXP_LNKCAP_DLLLARC));
> + FLAG(link_cap, PCI_EXP_LNKCAP_DLLLARC),
> + pdev->broken_cmd_compl ? " (with Cmd Compl erratum)" : "");
>
> if (pcie_init_slot(ctrl))
> goto abort_ctrl;
> @@ -891,3 +900,17 @@ void pciehp_release_ctrl(struct controller *ctrl)
> pcie_cleanup_slot(ctrl);
> kfree(ctrl);
> }
> +
> +static void quirk_cmd_compl(struct pci_dev *pdev)
> +{
> + u32 slot_cap;
> +
> + if (pci_is_pcie(pdev)) {
> + pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP, &slot_cap);
> + if (slot_cap & PCI_EXP_SLTCAP_HPC &&
> + !(slot_cap & PCI_EXP_SLTCAP_NCCS))
> + pdev->broken_cmd_compl = 1;
> + }
> +}
> +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
> + PCI_CLASS_BRIDGE_PCI, 8, quirk_cmd_compl);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 73178a2fcee0..60cb5350ad28 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -406,6 +406,9 @@ struct pci_dev {
> struct bin_attribute *res_attr[DEVICE_COUNT_RESOURCE]; /* sysfs file
> for resources */
> struct bin_attribute *res_attr_wc[DEVICE_COUNT_RESOURCE]; /* sysfs
> file for WC mapping of resources */
>
> +#ifdef CONFIG_HOTPLUG_PCI_PCIE
> + unsigned int broken_cmd_compl:1; /* Command Complete broken */
> +#endif
> #ifdef CONFIG_PCIE_PTM
> unsigned int ptm_root:1;
> unsigned int ptm_enabled:1;
Dear Bjorn,
Am 04.05.2018 um 15:33 schrieb Bjorn Helgaas:
[…]
> Yes, definitely. I intended to do that but got a little lazy. What
> do you think about the following? Paul, if you haven't tested the
> first patch, can you try this one instead? The logic is pretty much
> the same.
>
> 3461a068661c ("PCI: pciehp: Wait for hotplug command completion
> lazily") mentions AMD and Nvidia devices with the same issue, but
> unfortunately doesn't include any specifics.
>
>
> commit b0d6f2230e12c85ae3b65a854a53c67c7c1f6406
> Author: Bjorn Helgaas <[email protected]>
> Date: Thu May 3 18:39:38 2018 -0500
>
> PCI: pciehp: Add quirk for Intel Command Completed erratum
>
> The Intel CF118 erratum means the controller does not set the Command
> Completed bit unless writes to the Slot Command register change "Control"
> bits. Command Completed is never set for writes that only change software
> notification "Enable" bits. This results in timeouts like this:
>
> pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued 65284 msec ago)
>
> When this erratum is present, avoid these timeouts by marking commands
> "completed" immediately unless they change the "Control" bits.
>
> Here's the text of the erratum from the Intel document:
>
> CF118 PCIe Slot Status Register Command Completed bit not always
> updated on any configuration write to the Slot Control
> Register
>
> Problem: For PCIe root ports (devices 0 - 10) supporting hot-plug,
> the Slot Status Register (offset AAh) Command Completed
> (bit[4]) status is updated under the following condition:
> IOH will set Command Completed bit after delivering the new
> commands written in the Slot Controller register (offset
> A8h) to VPP. The IOH detects new commands written in Slot
> Control register by checking the change of value for Power
> Controller Control (bit[10]), Power Indicator Control
> (bits[9:8]), Attention Indicator Control (bits[7:6]), or
> Electromechanical Interlock Control (bit[11]) fields. Any
> other configuration writes to the Slot Control register
> without changing the values of these fields will not cause
> Command Completed bit to be set.
>
> The PCIe Base Specification Revision 2.0 or later describes
> the “Slot Control Register” in section 7.8.10, as follows
> (Reference section 7.8.10, Slot Control Register, Offset
> 18h). In hot-plug capable Downstream Ports, a write to the
> Slot Control register must cause a hot-plug command to be
> generated (see Section 6.7.3.2 for details on hot-plug
> commands). A write to the Slot Control register in a
> Downstream Port that is not hotplug capable must not cause a
> hot-plug command to be executed.
>
> The PCIe Spec intended that every write to the Slot Control
> Register is a command and expected a command complete status
> to abstract the VPP implementation specific nuances from the
> OS software. IOH PCIe Slot Control Register implementation
> is not fully conforming to the PCIe Specification in this
> respect.
>
> Implication: Software checking on the Command Completed status after
> writing to the Slot Control register may time out.
>
> Workaround: Software can read the Slot Control register and compare the
> existing and new values to determine if it should check the
> Command Completed status after writing to the Slot Control
> register.
>
> Link: http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-v2-spec-update.html
> Link: https://lkml.kernel.org/r/[email protected]
> Reported-by: Paul Menzel <[email protected]>
> Signed-off-by: Bjorn Helgaas <[email protected]>
>
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 18a42f8f5dc5..e70eba5ea906 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -10,7 +10,6 @@
> * All rights reserved.
> *
> * Send feedback to <[email protected]>,<[email protected]>
> - *
> */
>
> #include <linux/kernel.h>
> @@ -147,25 +146,22 @@ static void pcie_wait_cmd(struct controller *ctrl)
> else
> rc = pcie_poll_cmd(ctrl, jiffies_to_msecs(timeout));
>
> - /*
> - * Controllers with errata like Intel CF118 don't generate
> - * completion notifications unless the power/indicator/interlock
> - * control bits are changed. On such controllers, we'll emit this
> - * timeout message when we wait for completion of commands that
> - * don't change those bits, e.g., commands that merely enable
> - * interrupts.
> - */
> if (!rc)
> ctrl_info(ctrl, "Timeout on hotplug command %#06x (issued %u msec ago)\n",
> ctrl->slot_ctrl,
> jiffies_to_msecs(jiffies - ctrl->cmd_started));
> }
>
> +#define CC_ERRATUM_MASK (PCI_EXP_SLTCTL_PCC | \
> + PCI_EXP_SLTCTL_PIC | \
> + PCI_EXP_SLTCTL_AIC | \
> + PCI_EXP_SLTCTL_EIC)
> +
> static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
> u16 mask, bool wait)
> {
> struct pci_dev *pdev = ctrl_dev(ctrl);
> - u16 slot_ctrl;
> + u16 slot_ctrl_orig, slot_ctrl;
>
> mutex_lock(&ctrl->ctrl_lock);
>
> @@ -180,6 +176,7 @@ static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
> goto out;
> }
>
> + slot_ctrl_orig = slot_ctrl;
> slot_ctrl &= ~mask;
> slot_ctrl |= (cmd & mask);
> ctrl->cmd_busy = 1;
> @@ -188,6 +185,17 @@ static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
> ctrl->cmd_started = jiffies;
> ctrl->slot_ctrl = slot_ctrl;
>
> + /*
> + * Controllers with the Intel CF118 and similar errata advertise
> + * Command Completed support, but they only set Command Completed
> + * if we change the "Control" bits for power, power indicator,
> + * attention indicator, or interlock. If we only change the
> + * "Enable" bits, they never set the Command Completed bit.
> + */
> + if (pdev->broken_cmd_compl &&
> + (slot_ctrl_orig & CC_ERRATUM_MASK) == (slot_ctrl & CC_ERRATUM_MASK))
> + ctrl->cmd_busy = 0;
> +
> /*
> * Optionally wait for the hardware to be ready for a new command,
> * indicating completion of the above issued command.
> @@ -861,7 +869,7 @@ struct controller *pcie_init(struct pcie_device *dev)
> PCI_EXP_SLTSTA_MRLSC | PCI_EXP_SLTSTA_CC |
> PCI_EXP_SLTSTA_DLLSC);
>
> - ctrl_info(ctrl, "Slot #%d AttnBtn%c PwrCtrl%c MRL%c AttnInd%c PwrInd%c HotPlug%c Surprise%c Interlock%c NoCompl%c LLActRep%c\n",
> + ctrl_info(ctrl, "Slot #%d AttnBtn%c PwrCtrl%c MRL%c AttnInd%c PwrInd%c HotPlug%c Surprise%c Interlock%c NoCompl%c LLActRep%c%s\n",
> (slot_cap & PCI_EXP_SLTCAP_PSN) >> 19,
> FLAG(slot_cap, PCI_EXP_SLTCAP_ABP),
> FLAG(slot_cap, PCI_EXP_SLTCAP_PCP),
> @@ -872,7 +880,8 @@ struct controller *pcie_init(struct pcie_device *dev)
> FLAG(slot_cap, PCI_EXP_SLTCAP_HPS),
> FLAG(slot_cap, PCI_EXP_SLTCAP_EIP),
> FLAG(slot_cap, PCI_EXP_SLTCAP_NCCS),
> - FLAG(link_cap, PCI_EXP_LNKCAP_DLLLARC));
> + FLAG(link_cap, PCI_EXP_LNKCAP_DLLLARC),
> + pdev->broken_cmd_compl ? " (with Cmd Compl erratum)" : "");
>
> if (pcie_init_slot(ctrl))
> goto abort_ctrl;
> @@ -891,3 +900,17 @@ void pciehp_release_ctrl(struct controller *ctrl)
> pcie_cleanup_slot(ctrl);
> kfree(ctrl);
> }
> +
> +static void quirk_cmd_compl(struct pci_dev *pdev)
> +{
> + u32 slot_cap;
> +
> + if (pci_is_pcie(pdev)) {
> + pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP, &slot_cap);
> + if (slot_cap & PCI_EXP_SLTCAP_HPC &&
> + !(slot_cap & PCI_EXP_SLTCAP_NCCS))
> + pdev->broken_cmd_compl = 1;
> + }
> +}
> +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
> + PCI_CLASS_BRIDGE_PCI, 8, quirk_cmd_compl);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 73178a2fcee0..60cb5350ad28 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -406,6 +406,9 @@ struct pci_dev {
> struct bin_attribute *res_attr[DEVICE_COUNT_RESOURCE]; /* sysfs file for resources */
> struct bin_attribute *res_attr_wc[DEVICE_COUNT_RESOURCE]; /* sysfs file for WC mapping of resources */
>
> +#ifdef CONFIG_HOTPLUG_PCI_PCIE
> + unsigned int broken_cmd_compl:1; /* Command Complete broken */
> +#endif
> #ifdef CONFIG_PCIE_PTM
> unsigned int ptm_root:1;
> unsigned int ptm_enabled:1;
>
With this change, the message is also not shown anymore on the Lenovo
X60. Thank you.
Tested-by: Paul Menzel <[email protected]>
Kind regards,
Paul
On Fri, May 04, 2018 at 08:33:27AM -0500, Bjorn Helgaas wrote:
> commit b0d6f2230e12c85ae3b65a854a53c67c7c1f6406
> Author: Bjorn Helgaas <[email protected]>
> Date: Thu May 3 18:39:38 2018 -0500
>
> PCI: pciehp: Add quirk for Intel Command Completed erratum
>
> The Intel CF118 erratum means the controller does not set the Command
> Completed bit unless writes to the Slot Command register change "Control"
> bits. Command Completed is never set for writes that only change software
> notification "Enable" bits. This results in timeouts like this:
>
> pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued 65284 msec ago)
>
> When this erratum is present, avoid these timeouts by marking commands
> "completed" immediately unless they change the "Control" bits.
>
> Here's the text of the erratum from the Intel document:
>
> CF118 PCIe Slot Status Register Command Completed bit not always
> updated on any configuration write to the Slot Control
> Register
>
> Problem: For PCIe root ports (devices 0 - 10) supporting hot-plug,
> the Slot Status Register (offset AAh) Command Completed
> (bit[4]) status is updated under the following condition:
> IOH will set Command Completed bit after delivering the new
> commands written in the Slot Controller register (offset
> A8h) to VPP. The IOH detects new commands written in Slot
> Control register by checking the change of value for Power
> Controller Control (bit[10]), Power Indicator Control
> (bits[9:8]), Attention Indicator Control (bits[7:6]), or
> Electromechanical Interlock Control (bit[11]) fields. Any
> other configuration writes to the Slot Control register
> without changing the values of these fields will not cause
> Command Completed bit to be set.
>
> The PCIe Base Specification Revision 2.0 or later describes
> the “Slot Control Register” in section 7.8.10, as follows
> (Reference section 7.8.10, Slot Control Register, Offset
> 18h). In hot-plug capable Downstream Ports, a write to the
> Slot Control register must cause a hot-plug command to be
> generated (see Section 6.7.3.2 for details on hot-plug
> commands). A write to the Slot Control register in a
> Downstream Port that is not hotplug capable must not cause a
> hot-plug command to be executed.
>
> The PCIe Spec intended that every write to the Slot Control
> Register is a command and expected a command complete status
> to abstract the VPP implementation specific nuances from the
> OS software. IOH PCIe Slot Control Register implementation
> is not fully conforming to the PCIe Specification in this
> respect.
>
> Implication: Software checking on the Command Completed status after
> writing to the Slot Control register may time out.
>
> Workaround: Software can read the Slot Control register and compare the
> existing and new values to determine if it should check the
> Command Completed status after writing to the Slot Control
> register.
>
> Link: http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-v2-spec-update.html
> Link: https://lkml.kernel.org/r/[email protected]
> Reported-by: Paul Menzel <[email protected]>
> Signed-off-by: Bjorn Helgaas <[email protected]>
I applied this with Paul's tested-by on pci/hotplug for v4.18.
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 18a42f8f5dc5..e70eba5ea906 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -10,7 +10,6 @@
> * All rights reserved.
> *
> * Send feedback to <[email protected]>,<[email protected]>
> - *
> */
>
> #include <linux/kernel.h>
> @@ -147,25 +146,22 @@ static void pcie_wait_cmd(struct controller *ctrl)
> else
> rc = pcie_poll_cmd(ctrl, jiffies_to_msecs(timeout));
>
> - /*
> - * Controllers with errata like Intel CF118 don't generate
> - * completion notifications unless the power/indicator/interlock
> - * control bits are changed. On such controllers, we'll emit this
> - * timeout message when we wait for completion of commands that
> - * don't change those bits, e.g., commands that merely enable
> - * interrupts.
> - */
> if (!rc)
> ctrl_info(ctrl, "Timeout on hotplug command %#06x (issued %u msec ago)\n",
> ctrl->slot_ctrl,
> jiffies_to_msecs(jiffies - ctrl->cmd_started));
> }
>
> +#define CC_ERRATUM_MASK (PCI_EXP_SLTCTL_PCC | \
> + PCI_EXP_SLTCTL_PIC | \
> + PCI_EXP_SLTCTL_AIC | \
> + PCI_EXP_SLTCTL_EIC)
> +
> static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
> u16 mask, bool wait)
> {
> struct pci_dev *pdev = ctrl_dev(ctrl);
> - u16 slot_ctrl;
> + u16 slot_ctrl_orig, slot_ctrl;
>
> mutex_lock(&ctrl->ctrl_lock);
>
> @@ -180,6 +176,7 @@ static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
> goto out;
> }
>
> + slot_ctrl_orig = slot_ctrl;
> slot_ctrl &= ~mask;
> slot_ctrl |= (cmd & mask);
> ctrl->cmd_busy = 1;
> @@ -188,6 +185,17 @@ static void pcie_do_write_cmd(struct controller *ctrl, u16 cmd,
> ctrl->cmd_started = jiffies;
> ctrl->slot_ctrl = slot_ctrl;
>
> + /*
> + * Controllers with the Intel CF118 and similar errata advertise
> + * Command Completed support, but they only set Command Completed
> + * if we change the "Control" bits for power, power indicator,
> + * attention indicator, or interlock. If we only change the
> + * "Enable" bits, they never set the Command Completed bit.
> + */
> + if (pdev->broken_cmd_compl &&
> + (slot_ctrl_orig & CC_ERRATUM_MASK) == (slot_ctrl & CC_ERRATUM_MASK))
> + ctrl->cmd_busy = 0;
> +
> /*
> * Optionally wait for the hardware to be ready for a new command,
> * indicating completion of the above issued command.
> @@ -861,7 +869,7 @@ struct controller *pcie_init(struct pcie_device *dev)
> PCI_EXP_SLTSTA_MRLSC | PCI_EXP_SLTSTA_CC |
> PCI_EXP_SLTSTA_DLLSC);
>
> - ctrl_info(ctrl, "Slot #%d AttnBtn%c PwrCtrl%c MRL%c AttnInd%c PwrInd%c HotPlug%c Surprise%c Interlock%c NoCompl%c LLActRep%c\n",
> + ctrl_info(ctrl, "Slot #%d AttnBtn%c PwrCtrl%c MRL%c AttnInd%c PwrInd%c HotPlug%c Surprise%c Interlock%c NoCompl%c LLActRep%c%s\n",
> (slot_cap & PCI_EXP_SLTCAP_PSN) >> 19,
> FLAG(slot_cap, PCI_EXP_SLTCAP_ABP),
> FLAG(slot_cap, PCI_EXP_SLTCAP_PCP),
> @@ -872,7 +880,8 @@ struct controller *pcie_init(struct pcie_device *dev)
> FLAG(slot_cap, PCI_EXP_SLTCAP_HPS),
> FLAG(slot_cap, PCI_EXP_SLTCAP_EIP),
> FLAG(slot_cap, PCI_EXP_SLTCAP_NCCS),
> - FLAG(link_cap, PCI_EXP_LNKCAP_DLLLARC));
> + FLAG(link_cap, PCI_EXP_LNKCAP_DLLLARC),
> + pdev->broken_cmd_compl ? " (with Cmd Compl erratum)" : "");
>
> if (pcie_init_slot(ctrl))
> goto abort_ctrl;
> @@ -891,3 +900,17 @@ void pciehp_release_ctrl(struct controller *ctrl)
> pcie_cleanup_slot(ctrl);
> kfree(ctrl);
> }
> +
> +static void quirk_cmd_compl(struct pci_dev *pdev)
> +{
> + u32 slot_cap;
> +
> + if (pci_is_pcie(pdev)) {
> + pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP, &slot_cap);
> + if (slot_cap & PCI_EXP_SLTCAP_HPC &&
> + !(slot_cap & PCI_EXP_SLTCAP_NCCS))
> + pdev->broken_cmd_compl = 1;
> + }
> +}
> +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
> + PCI_CLASS_BRIDGE_PCI, 8, quirk_cmd_compl);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 73178a2fcee0..60cb5350ad28 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -406,6 +406,9 @@ struct pci_dev {
> struct bin_attribute *res_attr[DEVICE_COUNT_RESOURCE]; /* sysfs file for resources */
> struct bin_attribute *res_attr_wc[DEVICE_COUNT_RESOURCE]; /* sysfs file for WC mapping of resources */
>
> +#ifdef CONFIG_HOTPLUG_PCI_PCIE
> + unsigned int broken_cmd_compl:1; /* Command Complete broken */
> +#endif
> #ifdef CONFIG_PCIE_PTM
> unsigned int ptm_root:1;
> unsigned int ptm_enabled:1;
Dear Bjorn,
Am 07.05.2018 um 23:33 schrieb Bjorn Helgaas:
> On Fri, May 04, 2018 at 08:33:27AM -0500, Bjorn Helgaas wrote:
>> commit b0d6f2230e12c85ae3b65a854a53c67c7c1f6406
>> Author: Bjorn Helgaas <[email protected]>
>> Date: Thu May 3 18:39:38 2018 -0500
>>
>> PCI: pciehp: Add quirk for Intel Command Completed erratum
>>
>> The Intel CF118 erratum means the controller does not set the Command
>> Completed bit unless writes to the Slot Command register change "Control"
>> bits. Command Completed is never set for writes that only change software
>> notification "Enable" bits. This results in timeouts like this:
>>
>> pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued 65284 msec ago)
>>
>> When this erratum is present, avoid these timeouts by marking commands
>> "completed" immediately unless they change the "Control" bits.
>>
>> Here's the text of the erratum from the Intel document:
>>
>> CF118 PCIe Slot Status Register Command Completed bit not always
>> updated on any configuration write to the Slot Control
>> Register
>>
>> Problem: For PCIe root ports (devices 0 - 10) supporting hot-plug,
>> the Slot Status Register (offset AAh) Command Completed
>> (bit[4]) status is updated under the following condition:
>> IOH will set Command Completed bit after delivering the new
>> commands written in the Slot Controller register (offset
>> A8h) to VPP. The IOH detects new commands written in Slot
>> Control register by checking the change of value for Power
>> Controller Control (bit[10]), Power Indicator Control
>> (bits[9:8]), Attention Indicator Control (bits[7:6]), or
>> Electromechanical Interlock Control (bit[11]) fields. Any
>> other configuration writes to the Slot Control register
>> without changing the values of these fields will not cause
>> Command Completed bit to be set.
>>
>> The PCIe Base Specification Revision 2.0 or later describes
>> the “Slot Control Register” in section 7.8.10, as follows
>> (Reference section 7.8.10, Slot Control Register, Offset
>> 18h). In hot-plug capable Downstream Ports, a write to the
>> Slot Control register must cause a hot-plug command to be
>> generated (see Section 6.7.3.2 for details on hot-plug
>> commands). A write to the Slot Control register in a
>> Downstream Port that is not hotplug capable must not cause a
>> hot-plug command to be executed.
>>
>> The PCIe Spec intended that every write to the Slot Control
>> Register is a command and expected a command complete status
>> to abstract the VPP implementation specific nuances from the
>> OS software. IOH PCIe Slot Control Register implementation
>> is not fully conforming to the PCIe Specification in this
>> respect.
>>
>> Implication: Software checking on the Command Completed status after
>> writing to the Slot Control register may time out.
>>
>> Workaround: Software can read the Slot Control register and compare the
>> existing and new values to determine if it should check the
>> Command Completed status after writing to the Slot Control
>> register.
>>
>> Link: http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-v2-spec-update.html
>> Link: https://lkml.kernel.org/r/[email protected]
>> Reported-by: Paul Menzel <[email protected]>
>> Signed-off-by: Bjorn Helgaas <[email protected]>
>
> I applied this with Paul's tested-by on pci/hotplug for v4.18.
Thank you very much. Will this also be picked up by the stable Linux
kernel series?
[…]
Kind regards,
Paul
On Tue, May 08, 2018 at 08:59:34AM +0200, Paul Menzel wrote:
> Dear Bjorn,
>
>
> Am 07.05.2018 um 23:33 schrieb Bjorn Helgaas:
> > On Fri, May 04, 2018 at 08:33:27AM -0500, Bjorn Helgaas wrote:
> > > commit b0d6f2230e12c85ae3b65a854a53c67c7c1f6406
> > > Author: Bjorn Helgaas <[email protected]>
> > > Date: Thu May 3 18:39:38 2018 -0500
> > >
> > > PCI: pciehp: Add quirk for Intel Command Completed erratum
> > > The Intel CF118 erratum means the controller does not set the Command
> > > Completed bit unless writes to the Slot Command register change "Control"
> > > bits. Command Completed is never set for writes that only change software
> > > notification "Enable" bits. This results in timeouts like this:
> > > pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued 65284 msec ago)
> > > When this erratum is present, avoid these timeouts by marking commands
> > > "completed" immediately unless they change the "Control" bits.
> > > Here's the text of the erratum from the Intel document:
> > > CF118 PCIe Slot Status Register Command Completed bit not always
> > > updated on any configuration write to the Slot Control
> > > Register
> > > Problem: For PCIe root ports (devices 0 - 10) supporting hot-plug,
> > > the Slot Status Register (offset AAh) Command Completed
> > > (bit[4]) status is updated under the following condition:
> > > IOH will set Command Completed bit after delivering the new
> > > commands written in the Slot Controller register (offset
> > > A8h) to VPP. The IOH detects new commands written in Slot
> > > Control register by checking the change of value for Power
> > > Controller Control (bit[10]), Power Indicator Control
> > > (bits[9:8]), Attention Indicator Control (bits[7:6]), or
> > > Electromechanical Interlock Control (bit[11]) fields. Any
> > > other configuration writes to the Slot Control register
> > > without changing the values of these fields will not cause
> > > Command Completed bit to be set.
> > > The PCIe Base Specification Revision 2.0 or later describes
> > > the “Slot Control Register” in section 7.8.10, as follows
> > > (Reference section 7.8.10, Slot Control Register, Offset
> > > 18h). In hot-plug capable Downstream Ports, a write to the
> > > Slot Control register must cause a hot-plug command to be
> > > generated (see Section 6.7.3.2 for details on hot-plug
> > > commands). A write to the Slot Control register in a
> > > Downstream Port that is not hotplug capable must not cause a
> > > hot-plug command to be executed.
> > > The PCIe Spec intended that every write to the Slot Control
> > > Register is a command and expected a command complete status
> > > to abstract the VPP implementation specific nuances from the
> > > OS software. IOH PCIe Slot Control Register implementation
> > > is not fully conforming to the PCIe Specification in this
> > > respect.
> > > Implication: Software checking on the Command Completed status after
> > > writing to the Slot Control register may time out.
> > > Workaround: Software can read the Slot Control register and compare the
> > > existing and new values to determine if it should check the
> > > Command Completed status after writing to the Slot Control
> > > register.
> > > Link: http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-v2-spec-update.html
> > > Link: https://lkml.kernel.org/r/[email protected]
> > > Reported-by: Paul Menzel <[email protected]>
> > > Signed-off-by: Bjorn Helgaas <[email protected]>
> >
> > I applied this with Paul's tested-by on pci/hotplug for v4.18.
>
> Thank you very much. Will this also be picked up by the stable Linux kernel
> series?
I did not tag it for stable because I didn't think it was a serious enough
problem, based on this from Documentation/process/stable-kernel-rules.rst:
- It must fix a problem that causes a build error (but not for things
marked CONFIG_BROKEN), an oops, a hang, data corruption, a real
security issue, or some "oh, that's not good" issue. In short, something
critical.
I know I'm on the conservative end of the stable-tagging spectrum, so maybe
I could be convinced to add a stable tag.
My impression was that this bug caused annoying messages and annoying
delays of a couple seconds during shutdown and resume. Is it more serious
than that?
Bjorn
Dear Bjorn,
Am 08.05.2018 um 14:34 schrieb Bjorn Helgaas:
> On Tue, May 08, 2018 at 08:59:34AM +0200, Paul Menzel wrote:
>> Am 07.05.2018 um 23:33 schrieb Bjorn Helgaas:
>>> On Fri, May 04, 2018 at 08:33:27AM -0500, Bjorn Helgaas wrote:
>>>> commit b0d6f2230e12c85ae3b65a854a53c67c7c1f6406
>>>> Author: Bjorn Helgaas <[email protected]>
>>>> Date: Thu May 3 18:39:38 2018 -0500
>>>>
>>>> PCI: pciehp: Add quirk for Intel Command Completed erratum
>>>> The Intel CF118 erratum means the controller does not set the Command
>>>> Completed bit unless writes to the Slot Command register change "Control"
>>>> bits. Command Completed is never set for writes that only change software
>>>> notification "Enable" bits. This results in timeouts like this:
>>>> pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued 65284 msec ago)
>>>> When this erratum is present, avoid these timeouts by marking commands
>>>> "completed" immediately unless they change the "Control" bits.
>>>> Here's the text of the erratum from the Intel document:
>>>> CF118 PCIe Slot Status Register Command Completed bit not always
>>>> updated on any configuration write to the Slot Control
>>>> Register
>>>> Problem: For PCIe root ports (devices 0 - 10) supporting hot-plug,
>>>> the Slot Status Register (offset AAh) Command Completed
>>>> (bit[4]) status is updated under the following condition:
>>>> IOH will set Command Completed bit after delivering the new
>>>> commands written in the Slot Controller register (offset
>>>> A8h) to VPP. The IOH detects new commands written in Slot
>>>> Control register by checking the change of value for Power
>>>> Controller Control (bit[10]), Power Indicator Control
>>>> (bits[9:8]), Attention Indicator Control (bits[7:6]), or
>>>> Electromechanical Interlock Control (bit[11]) fields. Any
>>>> other configuration writes to the Slot Control register
>>>> without changing the values of these fields will not cause
>>>> Command Completed bit to be set.
>>>> The PCIe Base Specification Revision 2.0 or later describes
>>>> the “Slot Control Register” in section 7.8.10, as follows
>>>> (Reference section 7.8.10, Slot Control Register, Offset
>>>> 18h). In hot-plug capable Downstream Ports, a write to the
>>>> Slot Control register must cause a hot-plug command to be
>>>> generated (see Section 6.7.3.2 for details on hot-plug
>>>> commands). A write to the Slot Control register in a
>>>> Downstream Port that is not hotplug capable must not cause a
>>>> hot-plug command to be executed.
>>>> The PCIe Spec intended that every write to the Slot Control
>>>> Register is a command and expected a command complete status
>>>> to abstract the VPP implementation specific nuances from the
>>>> OS software. IOH PCIe Slot Control Register implementation
>>>> is not fully conforming to the PCIe Specification in this
>>>> respect.
>>>> Implication: Software checking on the Command Completed status after
>>>> writing to the Slot Control register may time out.
>>>> Workaround: Software can read the Slot Control register and compare the
>>>> existing and new values to determine if it should check the
>>>> Command Completed status after writing to the Slot Control
>>>> register.
>>>> Link: http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-v2-spec-update.html
>>>> Link: https://lkml.kernel.org/r/[email protected]
>>>> Reported-by: Paul Menzel <[email protected]>
>>>> Signed-off-by: Bjorn Helgaas <[email protected]>
>>>
>>> I applied this with Paul's tested-by on pci/hotplug for v4.18.
>>
>> Thank you very much. Will this also be picked up by the stable Linux kernel
>> series?
>
> I did not tag it for stable because I didn't think it was a serious enough
> problem, based on this from Documentation/process/stable-kernel-rules.rst:
>
> - It must fix a problem that causes a build error (but not for things
> marked CONFIG_BROKEN), an oops, a hang, data corruption, a real
> security issue, or some "oh, that's not good" issue. In short, something
> critical.
>
> I know I'm on the conservative end of the stable-tagging spectrum, so maybe
> I could be convinced to add a stable tag.
>
> My impression was that this bug caused annoying messages and annoying
> delays of a couple seconds during shutdown and resume. Is it more serious
> than that?
No, not more then that. But “oh, that’s not good” fits in my opinion. My
impression was, that’s how most stable patches get in.
Kind regards,
Paul
On Fri, Apr 27, 2018 at 02:22:07PM -0500, Bjorn Helgaas wrote:
> Sinan mooted the idea of using a "no-wait" path of sending the "don't
> generate hotplug interrupts" command. I think we should work on this
> idea a little more. If we're shutting down the whole system, I can't
> believe there's much value in *anything* we do in the pciehp_remove()
> path.
>
> Maybe we should just get rid of pciehp_remove() (and probably
> pcie_port_remove_service() and the other service driver remove methods)
> completely. That dates from when the service drivers could be modules that
> could be potentially unloaded, but unloading them hasn't been possible for
> years.
Every Thunderbolt device contains a PCIe switch with at least one
(downstream) hotplug port, so pciehp_remove() is executed on unplug
of a Thunderbolt device and the assumption that it's unnecessary
simply because it's builtin isn't correct.
Thanks,
Lukas
On Wed, May 09, 2018 at 01:41:24PM +0200, Lukas Wunner wrote:
> On Fri, Apr 27, 2018 at 02:22:07PM -0500, Bjorn Helgaas wrote:
> > Sinan mooted the idea of using a "no-wait" path of sending the "don't
> > generate hotplug interrupts" command. I think we should work on this
> > idea a little more. If we're shutting down the whole system, I can't
> > believe there's much value in *anything* we do in the pciehp_remove()
> > path.
> >
> > Maybe we should just get rid of pciehp_remove() (and probably
> > pcie_port_remove_service() and the other service driver remove methods)
> > completely. That dates from when the service drivers could be modules that
> > could be potentially unloaded, but unloading them hasn't been possible for
> > years.
>
> Every Thunderbolt device contains a PCIe switch with at least one
> (downstream) hotplug port, so pciehp_remove() is executed on unplug
> of a Thunderbolt device and the assumption that it's unnecessary
> simply because it's builtin isn't correct.
I agree that simply being builtin isn't a sufficient argument for getting
rid of pciehp_remove().
But if we do need pciehp_remove(), we should be able to make a rational
case for why that is. If we're about to turn off the power, it's not
obvious why we would need to deallocate memory, remove sysfs stuff, etc.
If we need to configure the hardware to make it easier for a kexec'd
kernel, that's a possible argument but we should make it explicit.
Bjorn
On Wed, May 09, 2018 at 07:57:52AM -0500, Bjorn Helgaas wrote:
> On Wed, May 09, 2018 at 01:41:24PM +0200, Lukas Wunner wrote:
> > On Fri, Apr 27, 2018 at 02:22:07PM -0500, Bjorn Helgaas wrote:
> > > Sinan mooted the idea of using a "no-wait" path of sending the "don't
> > > generate hotplug interrupts" command. I think we should work on this
> > > idea a little more. If we're shutting down the whole system, I can't
> > > believe there's much value in *anything* we do in the pciehp_remove()
> > > path.
> > >
> > > Maybe we should just get rid of pciehp_remove() (and probably
> > > pcie_port_remove_service() and the other service driver remove methods)
> > > completely. That dates from when the service drivers could be modules that
> > > could be potentially unloaded, but unloading them hasn't been possible for
> > > years.
> >
> > Every Thunderbolt device contains a PCIe switch with at least one
> > (downstream) hotplug port, so pciehp_remove() is executed on unplug
> > of a Thunderbolt device and the assumption that it's unnecessary
> > simply because it's builtin isn't correct.
>
> I agree that simply being builtin isn't a sufficient argument for getting
> rid of pciehp_remove().
>
> But if we do need pciehp_remove(), we should be able to make a rational
> case for why that is. If we're about to turn off the power, it's not
> obvious why we would need to deallocate memory, remove sysfs stuff, etc.
> If we need to configure the hardware to make it easier for a kexec'd
> kernel, that's a possible argument but we should make it explicit.
With Thunderbolt, up to 6 devices may be daisy-chained. This means that a
hotplug port may have further hotplug ports as (grand-)children.
If power is turned off manually via sysfs for a hotplug port, all children
(including hotplug ports) are removed by pciehp even though they physically
remain attached to the machine. If such removed-in-software-but-physically-
still-present devices send an interrupt, and interrupts were not orderly
disabled on ->remove, they will be considered spurious interrupts by genirq
code. In particular, level-triggered INTx interrupts will immediately lead
to an unpleasant user-visible splat and the interrupt will be switched to
polling.
So there's no way around orderly disabling interrupts in the ->remove path.
I agree that ->shutdown is a different story in principle and that disabling
devices seems superfluous and counter-intuitive. I imagine kexec might not
be the only reason, but also devices passed through to VMs. (What happens
if a VM hands a device back to the host in an unclean state on shutdown?)
Thanks,
Lukas