On Tue, Sep 11, 2018 at 10:17:44AM +0200, Takashi Iwai wrote:
> [ seems like my previous post didn't go out properly; if you have
> already received it, please discard this one ]
Sorry, I got it, it's just in my large queue :(
> Hi Rafael, Greg,
>
> James Wang reported on SUSE bugzilla that his machine spews many
> AMD-Vi errors at reboot like:
>
> [ 154.907879] systemd-shutdown[1]: Detaching loop devices.
> [ 154.954583] kvm: exiting hardware virtualization
> [ 154.999953] usb 5-2: USB disconnect, device number 2
> [ 155.025278] ohci-pci 0000:00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.081360] ohci-pci 0000:00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.136778] ohci-pci 0000:00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.191772] ohci-pci 0000:00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.247055] ohci-pci 0000:00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.302614] ohci-pci 0000:00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.358996] ohci-pci 0000:00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.392155] usb 4-2: new full-speed USB device number 2 using ohci-pci
> [ 155.413752] ohci-pci 0000:00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.413762] ohci-pci 0000:00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.560307] ohci-pci 0000:00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.616039] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.667843] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.719497] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.772697] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.823919] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.875490] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.927258] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 155.979318] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 156.031813] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 156.084293] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 domain=0x0006 address=0x0000000000000080 flags=0x0020]
> [ 156.272157] reboot: Restarting system
> [ 156.290316] reboot: machine restart
>
> And, James bisected and spotted that it's introduced by the commit
> 722e5f2b1eec ("driver core: Partially revert "driver core: correct
> device's shutdown order""). Reverting the commit fixes the problem.
>
> He also mentioned an Uncorrectable Machine Check Exception seen at
> shutdown, which no longer appears after the revert. (Though it's
> not clear whether that is really related.)
>
> The errors are clearly related to the USB device (a KVM device,
> IIRC), and the errors are not seen if the USB device is disconnected.
>
> We experienced this at first with SLE15 kernel (4.12 with backports),
> but later the same issue was confirmed on 4.18.y and 4.19-rc2. Also,
> it's confirmed that revert works on the upstream kernels, too.
>
> Does this hit your radar?
Ugh, no, I haven't heard of this before, Rafael?
So the revert fixes some machines, but others need the patch; this
isn't going to be fun :(
greg k-h
On Tuesday, September 11, 2018 11:33:24 AM CEST Greg Kroah-Hartman wrote:
> On Tue, Sep 11, 2018 at 10:17:44AM +0200, Takashi Iwai wrote:
> > [ seems like my previous post didn't go out properly; if you have
> > already received it, please discard this one ]
>
> Sorry, I got it, it's just in my large queue :(
>
> > Hi Rafael, Greg,
> >
> > James Wang reported on SUSE bugzilla that his machine spews many
> > AMD-Vi errors at reboot like:
> >
> > [...]
> >
> > And, James bisected and spotted that it's introduced by the commit
> > 722e5f2b1eec ("driver core: Partially revert "driver core: correct
> > device's shutdown order""). Reverting the commit fixes the problem.
Well, has anyone tried to understand why this is so?
It looks like the probe-time reordering of the devices_kset list worked around
some init-time dependency issue, but we can't reorder devices_kset then as it
breaks parent-child ordering in general.
> > He also mentioned an Uncorrectable Machine Check Exception seen at
> > shutdown, which no longer appears after the revert. (Though it's
> > not clear whether that is really related.)
> >
> > The errors are clearly related to the USB device (a KVM device,
> > IIRC), and the errors are not seen if the USB device is disconnected.
> >
> > We experienced this at first with SLE15 kernel (4.12 with backports),
> > but later the same issue was confirmed on 4.18.y and 4.19-rc2. Also,
> > it's confirmed that revert works on the upstream kernels, too.
> >
> > Does this hit your radar?
>
> Ugh, no, I haven't heard of this before, Rafael?
>
> So the revert fixes some machines, but others need the patch; this
> isn't going to be fun :(
We need to understand what's going on on the machines that stopped working
and fix them.
Calling devices_kset_move_last() from really_probe() is clearly incorrect
and restoring it would be a mistake IMO.
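To make the ordering argument concrete, here is a toy model in Python. It is
not the driver-core code. It assumes only that devices are appended to a list
when they are registered and that shutdown walks that list from the tail;
register(), move_last() and shutdown_order() are made-up names used for
illustration.

```python
# Toy model of a registration-ordered device list (illustrative only).

def register(devices, name):
    """Append a device at registration time (parents register first)."""
    devices.append(name)

def move_last(devices, name):
    """Move an already registered device to the tail of the list."""
    devices.remove(name)
    devices.append(name)

def shutdown_order(devices):
    """Shutdown walks the list from the tail, so newest entries go first."""
    return list(reversed(devices))

devices = []
register(devices, "parent")
register(devices, "child")

# Without any reordering the child is shut down before its parent.
order_before = shutdown_order(devices)

# If the parent is moved to the tail when its driver probes later,
# the parent ends up being shut down before its child.
move_last(devices, "parent")
order_after = shutdown_order(devices)

print(order_before)
print(order_after)
```

In this model the probe-time move is exactly what inverts the parent-child
order at shutdown, which is the general problem with reordering at that point.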
BTW, there is a series of patches from Pingfan Liu:
https://patchwork.kernel.org/project/linux-pm/list/?series=9535
that may help in principle, so any chance to try them on the affected
systems?
Thanks,
Rafael
On Tue, Sep 11, 2018 at 12:51:32PM +0200, Rafael J. Wysocki wrote:
> that may help in principle, so any chance to try them on the affected
> systems?
Right, and I don't recall James trying the upstream kernel on his box.
James?
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
On Tue, 11 Sep 2018 13:01:11 +0200,
Borislav Petkov wrote:
>
> On Tue, Sep 11, 2018 at 12:51:32PM +0200, Rafael J. Wysocki wrote:
> > that may help in principle, so any chance to try them on the affected
> > systems?
>
> Right, and I don't recall James trying the upstream kernel on his box.
It was tested on 4.18.5, but not with 4.19-rc, AFAIK.
> James?
Yep, James, please test the kernel in OBS Kernel:HEAD (or IBS
Devel:Kernel:master) repo for testing the latest 4.19-rc3.
I'm building a test kernel in IBS home:tiwai:test:shutdown-fix repo
containing the three patches Rafael suggested. If the kernel above
still shows the problem, try my test kernel to see whether it changes
anything.
thanks,
Takashi
On 09/11/2018 07:55 PM, Takashi Iwai wrote:
> On Tue, 11 Sep 2018 13:01:11 +0200,
> Borislav Petkov wrote:
>> On Tue, Sep 11, 2018 at 12:51:32PM +0200, Rafael J. Wysocki wrote:
>>> that may help in principle, so any chance to try them on the affected
>>> systems?
>> Right, and I don't recall James trying the upstream kernel on his box.
> It was tested on 4.18.5, but not with 4.19-rc, AFAIK.
>
>> James?
> Yep, James, please test the kernel in OBS Kernel:HEAD (or IBS
> Devel:Kernel:master) repo for testing the latest 4.19-rc3.
No problem.
> I'm building a test kernel in IBS home:tiwai:test:shutdown-fix repo
> containing the three patches Rafael suggested. If the kernel above
> still shows the problem, try my test kernel to see whether it changes
> anything.
No problem.
But I'm afraid I will be offline for the next 16 hours; I'm on the way
to the Prague office.
Sorry for the delay.
James
>
> thanks,
>
> Takashi
On Tue, Sep 11, 2018 at 5:37 PM Greg Kroah-Hartman wrote:
>
> On Tue, Sep 11, 2018 at 10:17:44AM +0200, Takashi Iwai wrote:
> > [ seems like my previous post didn't go out properly; if you have
> > already received it, please discard this one ]
>
> Sorry, I got it, it's just in my large queue :(
>
> > Hi Rafael, Greg,
> >
> > James Wang reported on SUSE bugzilla that his machine spews many
> > AMD-Vi errors at reboot like:
> >
[...]
> > The errors are clearly related to the USB device (a KVM device,
> > IIRC), and the errors are not seen if the USB device is disconnected.
> >
It sounds like the IO page table is invalidated before ohci-pci is shut
down. But I cannot figure out why, since the IOMMU is torn down very
late, after device_shutdown().
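As a quick check that all of the faults come from one device and one address,
the log lines in the report can be tallied with a small script. This is only a
sketch: the regular expression is written against the two line formats seen in
this report and may not match other AMD-Vi event formats. parse_fault() and
summarize() are names made up for this example.

```python
import re
from collections import Counter

# Matches both formats seen in the report:
#   "ohci-pci 0000:00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=...]"
#   "AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 domain=...]"
FAULT_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"(?:(?P<driver>\S+)\s+(?P<bdf_long>[0-9a-f:.]+):\s+)?"
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT\s+"
    r"(?:device=(?P<bdf_short>[0-9a-f:.]+)\s+)?"
    r"domain=(?P<domain>0x[0-9a-f]+)\s+"
    r"address=(?P<address>0x[0-9a-f]+)\s+"
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)

def parse_fault(line):
    """Return (device, address) for an IO_PAGE_FAULT line, else None."""
    m = FAULT_RE.match(line.strip())
    if m is None:
        return None
    bdf = m.group("bdf_long") or m.group("bdf_short")
    if bdf is None:
        return None
    if bdf.count(":") == 2:
        # Drop the PCI domain prefix so both formats compare equal.
        bdf = bdf.split(":", 1)[1]
    return bdf, int(m.group("address"), 16)

def summarize(lines):
    """Count faults per (device, address) pair."""
    return Counter(f for f in map(parse_fault, lines) if f is not None)

# Three lines copied from the report above.
sample = [
    "[ 154.999953] usb 5-2: USB disconnect, device number 2",
    "[ 155.025278] ohci-pci 0000:00:12.1: AMD-Vi: Event logged "
    "[IO_PAGE_FAULT domain=0x0006 address=0x0000000000000080 flags=0x0020]",
    "[ 155.616039] AMD-Vi: Event logged [IO_PAGE_FAULT device=00:12.1 "
    "domain=0x0006 address=0x0000000000000080 flags=0x0020]",
]
summary = summarize(sample)
for (device, address), count in summary.items():
    print(f"{device} address={address:#x} faults={count}")
```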
Cc James, could you try to enable initcall_debug, and paste the
shutdown seq with 722e5f2b1eec ("driver core: Partially revert "driver
core: correct device's shutdown order"") and without it?
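For comparing the two captures, something along these lines might help. It is
a rough sketch that strips the dmesg timestamps and diffs the remaining text,
so that only differences in the sequence itself show up. It makes no
assumption about what initcall_debug prints. The "without" capture below is
made up for illustration by dropping the fault line and changing the
timestamps; strip_timestamps() and diff_sequences() are hypothetical helper
names.

```python
import difflib
import re

# dmesg prefix such as "[ 154.999953] "
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")

def strip_timestamps(lines):
    """Remove the leading timestamp so runs with different timing compare."""
    return [TIMESTAMP_RE.sub("", line.rstrip("\n")) for line in lines]

def diff_sequences(with_commit, without_commit):
    """Unified diff of two shutdown logs, ignoring timestamps."""
    return list(
        difflib.unified_diff(
            strip_timestamps(with_commit),
            strip_timestamps(without_commit),
            fromfile="with-722e5f2b1eec",
            tofile="without-722e5f2b1eec",
            lineterm="",
        )
    )

# Lines taken from the report (kernel with the commit applied).
with_commit = [
    "[ 154.999953] usb 5-2: USB disconnect, device number 2",
    "[ 155.025278] ohci-pci 0000:00:12.1: AMD-Vi: Event logged "
    "[IO_PAGE_FAULT domain=0x0006 address=0x0000000000000080 flags=0x0020]",
    "[ 156.272157] reboot: Restarting system",
]
# Illustrative capture only: invented timestamps, fault line removed.
without_commit = [
    "[ 150.100000] usb 5-2: USB disconnect, device number 2",
    "[ 151.200000] reboot: Restarting system",
]

diff = diff_sequences(with_commit, without_commit)
print("\n".join(diff))
```

In real use the two lists would come from the saved console logs, for example
via open(path).readlines().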
Thanks,
Pingfan
On 09/12/2018 02:41 PM, Pingfan Liu wrote:
> Cc James, could you try to enable initcall_debug, and paste the
> shutdown seq with 722e5f2b1eec ("driver core: Partially revert "driver
> core: correct device's shutdown order"") and without it?
OK. I will schedule the test runs; haha, I'm on a business trip.
Everything will be a little delayed, but I will get it done.
James
On 9/12/2018 11:10 AM, James Wang wrote:
>
> On 09/12/2018 02:41 PM, Pingfan Liu wrote:
>> Cc James, could you try to enable initcall_debug, and paste the
>> shutdown seq with 722e5f2b1eec ("driver core: Partially revert "driver
>> core: correct device's shutdown order"") and without it?
> OK. I will schedule the test runs; haha, I'm on a business trip.
> Everything will be a little delayed, but I will get it done.
>
> James
>
Can you please open a Bugzilla entry at bugzilla.kernel.org for the
tracking of this issue and attach the results to that one for future
reference?
On 09/12/2018 11:56 AM, Rafael J. Wysocki wrote:
> On 9/12/2018 11:10 AM, James Wang wrote:
>>
>> On 09/12/2018 02:41 PM, Pingfan Liu wrote:
>>> Cc James, could you try to enable initcall_debug, and paste the
>>> shutdown seq with 722e5f2b1eec ("driver core: Partially revert "driver
>>> core: correct device's shutdown order"") and without it?
>> OK. I will schedule the test runs; haha, I'm on a business
>> trip.
>> Everything will be a little delayed, but I will get it done.
>>
>> James
>>
> Can you please open a Bugzilla entry at bugzilla.kernel.org for the
> tracking of this issue and attach the results to that one for future
> reference?
>
Hi sir,
Filing a bug is no problem, but which product category should be used?
I will attach the logs here first and then file the bug.
On 9/13/2018 5:21 PM, James Wang wrote:
>
> On 09/12/2018 11:56 AM, Rafael J. Wysocki wrote:
>> On 9/12/2018 11:10 AM, James Wang wrote:
>>> On 09/12/2018 02:41 PM, Pingfan Liu wrote:
>>>> Cc James, could you try to enable initcall_debug, and paste the
>>>> shutdown seq with 722e5f2b1eec ("driver core: Partially revert "driver
>>>> core: correct device's shutdown order"") and without it?
>>> OK. I will schedule the test runs; haha, I'm on a business
>>> trip.
>>> Everything will be a little delayed, but I will get it done.
>>>
>>> James
>>>
>> Can you please open a Bugzilla entry at bugzilla.kernel.org for the
>> tracking of this issue and attach the results to that one for future
>> reference?
>>
> Hi sir,
> Filing a bug is no problem, but which product category should be used?
Please file it under "Drivers/Other" and let me know the number. Thanks!
Hi Rafael,
Bug has been filed:
Bug 201125 - ohci-pci 0000:00:12.1: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0006 address=0x0000000000000080 flags=0x0020]
James
On 09/13/2018 10:50 PM, Rafael J. Wysocki wrote:
> On 9/13/2018 5:21 PM, James Wang wrote:
>>
>> On 09/12/2018 11:56 AM, Rafael J. Wysocki wrote:
>>> On 9/12/2018 11:10 AM, James Wang wrote:
>>>> On 09/12/2018 02:41 PM, Pingfan Liu wrote:
>>>>> Cc James, could you try to enable initcall_debug, and paste the
>>>>> shutdown seq with 722e5f2b1eec ("driver core: Partially revert
>>>>> "driver
>>>>> core: correct device's shutdown order"") and without it?
>>>> OK. I will schedule the test runs; haha, I'm on a business
>>>> trip.
>>>> Everything will be a little delayed, but I will get it done.
>>>>
>>>> James
>>>>
>>> Can you please open a Bugzilla entry at bugzilla.kernel.org for the
>>> tracking of this issue and attach the results to that one for future
>>> reference?
>>>
>> Hi sir,
>> Filing a bug is no problem, but which product category should be used?
>
> Please file it under "Drivers/Other" and let me know the number. Thanks!
>
>
>