2014-04-13 20:02:33

by Stefani Seibold

Subject: X86: kexec issues with i915 in 3.14

Rebooting my vanilla 3.14 kernel fails with tons of kernel
log messages:

[ 0.262754] IOMMU: Setting identity map for device 0000:00:1a.0 [0x7c45f000 - 0x7c46bfff]
[ 0.262780] IOMMU: Setting identity map for device 0000:00:14.0 [0x7c45f000 - 0x7c46bfff]
[ 0.262798] IOMMU: Prepare 0-16MiB unity mapping for LPC
[ 0.262807] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
[ 0.262948] PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
[ 0.262948] dmar: DRHD: handling fault status reg 3
[ 0.262951] dmar: DMAR:[DMA Write] Request device [00:02.0] fault addr ffffe000
DMAR:[fault reason 05] PTE Write access is not set
[ 0.262955] dmar: DRHD: handling fault status reg 3
[ 0.262959] dmar: DMAR:[DMA Write] Request device [00:02.0] fault addr fff3c000
DMAR:[fault reason 05] PTE Write access is not set
[ 0.262965] dmar: DRHD: handling fault status reg 3
[ 0.262968] dmar: DMAR:[DMA Write] Request device [00:02.0] fault addr ffe4a000
DMAR:[fault reason 05] PTE Write access is not set
[ 0.262974] dmar: DRHD: handling fault status reg 3
[ 0.262976] dmar: DMAR:[DMA Write] Request device [00:02.0] fault addr fff6f000
DMAR:[fault reason 05] PTE Write access is not set
[ 0.262983] dmar: DRHD: handling fault status reg 3
[ 0.262985] dmar: DMAR:[DMA Write] Request device [00:02.0] fault addr ffe8c000
DMAR:[fault reason 05] PTE Write access is not set
[ 0.262991] dmar: DRHD: handling fault status reg 3
[ 0.262994] dmar: DMAR:[DMA Write] Request device [00:02.0] fault addr fffb3000
DMAR:[fault reason 05] PTE Write access is not set
[ 0.263000] dmar: DRHD: handling fault status reg 3
[ 0.263002] dmar: DMAR:[DMA Write] Request device [00:02.0] fault addr ffecf000
DMAR:[fault reason 05] PTE Write access is not set
[ 0.263009] dmar: DRHD: handling fault status reg 3
[ 0.263011] dmar: DMAR:[DMA Write] Request device [00:02.0] fault addr ffffd000

This message repeats more than 21000 times. After that, the kernel
messages continue with:

[ 0.683267] fbcon: inteldrmfb (fb0) is primary device
[ 0.864123] Console: switching to colour frame buffer device 320x90
[ 0.880630] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
[ 0.880632] i915 0000:00:02.0: registered panic notifier
[ 0.881077] ACPI Exception: AE_NOT_FOUND, Evaluating _DOD (20131218/video-1245)
[ 0.881081] ACPI: Video Device [PEGN] (multi-head: no rom: yes post: no)
[ 0.881134] input: Video Bus as /devices/LNXSYSTM:00/device:00/PNP0A08:00/device:10/LNXVIDEO:00/input/input2
[ 0.888055] ACPI: Video Device [GFX0] (multi-head: yes rom: no post: no)
[ 0.888266] input: Video Bus as /devices/LNXSYSTM:00/device:00/PNP0A08:00/LNXVIDEO:01/input/input3
[ 0.888289] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
[ 0.888571] mei_me 0000:00:16.0: irq 57 for MSI/MSI-X
[ 0.889545] rtsx_pci 0000:3e:00.0: irq 58 for MSI/MSI-X
[ 0.889559] rtsx_pci 0000:3e:00.0: rtsx_pci_acquire_irq: pcr->msi_en = 1, pci->irq = 58
[ 0.890098] ACPI Warning: SystemIO range 0x0000000000001828-0x000000000000182f conflicts with OpRegion 0x0000000000001800-0x000000000000187f (\PMIO) (20131218/utaddress-258)
[ 0.890104] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[ 0.890107] ACPI Warning: SystemIO range 0x0000000000001c30-0x0000000000001c3f conflicts with OpRegion 0x0000000000001c00-0x0000000000001c3f (\GPRL) (20131218/utaddress-258)
[ 0.890111] ACPI Warning: SystemIO range 0x0000000000001c30-0x0000000000001c3f conflicts with OpRegion 0x0000000000001c00-0x0000000000001fff (\GPR_) (20131218/utaddress-258)
[ 0.890114] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[ 0.890115] ACPI Warning: SystemIO range 0x0000000000001c00-0x0000000000001c2f conflicts with OpRegion 0x0000000000001c00-0x0000000000001c3f (\GPRL) (20131218/utaddress-258)
[ 0.890118] ACPI Warning: SystemIO range 0x0000000000001c00-0x0000000000001c2f conflicts with OpRegion 0x0000000000001c00-0x0000000000001fff (\GPR_) (20131218/utaddress-258)
[ 0.890122] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[ 0.890123] lpc_ich: Resource conflict(s) found affecting gpio_ich
[ 0.890215] ahci 0000:00:1f.2: version 3.0

lspci reports the following for device 00:02.0:

VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen
Core Processor Integrated Graphics Controller (rev 06)

After this, the system seems to be in normal condition: X starts and I
can log on and use the machine.

Most of the time, however, the machine locks up and I see only garbage
on the screen. Any idea?

I will attach my kernel config.


Attachments:
kernel.config.gz (28.44 kB)
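
A fault storm like the one in the log above is easier to summarise than to read. The following is a minimal sketch in Python that tallies the DMAR fault lines per device, assuming the log format shown in this report. The function name `count_dmar_faults` and the sample text are illustrative and not part of any kernel tooling.

```python
import re
from collections import Counter

# Matches lines such as:
#   [ 0.262951] dmar: DMAR:[DMA Write] Request device [00:02.0] fault addr ffffe000
# The pattern is an assumption based on the log excerpt in this thread.
FAULT_RE = re.compile(
    r"DMAR:\[DMA (?P<op>Read|Write)\] "
    r"Request device \[(?P<dev>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\] "
    r"fault addr (?P<addr>[0-9a-f]+)"
)


def count_dmar_faults(log_text):
    """Return a Counter mapping (device, operation) to the number of faults."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("op"))] += 1
    return counts


# Sample input copied from the log excerpt above.
sample = """\
[ 0.262948] dmar: DRHD: handling fault status reg 3
[ 0.262951] dmar: DMAR:[DMA Write] Request device [00:02.0] fault addr ffffe000
DMAR:[fault reason 05] PTE Write access is not set
[ 0.262955] dmar: DRHD: handling fault status reg 3
[ 0.262959] dmar: DMAR:[DMA Write] Request device [00:02.0] fault addr fff3c000
DMAR:[fault reason 05] PTE Write access is not set
"""

for (dev, op), n in count_dmar_faults(sample).items():
    print(f"{dev} {op}: {n} fault(s)")
```

With a saved log, the same function could be called on the file contents, for example `count_dmar_faults(open("dmesg.txt").read())`, where `dmesg.txt` is a hypothetical file name.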

2014-04-14 00:28:28

by Woodhouse, David

Subject: Re: X86: kexec issues with i915 in 3.14

On Sun, 2014-04-13 at 22:01 +0200, Stefani Seibold wrote:
> Rebooting my vanilla 3.14 kernel fails with tons of kernel
> log messages:
>
> [ 0.262754] IOMMU: Setting identity map for device 0000:00:1a.0 [0x7c45f000 - 0x7c46bfff]
> [ 0.262780] IOMMU: Setting identity map for device 0000:00:14.0 [0x7c45f000 - 0x7c46bfff]
> [ 0.262798] IOMMU: Prepare 0-16MiB unity mapping for LPC
> [ 0.262807] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
> [ 0.262948] PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
> [ 0.262948] dmar: DRHD: handling fault status reg 3
> [ 0.262951] dmar: DMAR:[DMA Write] Request device [00:02.0] fault addr ffffe000
> DMAR:[fault reason 05] PTE Write access is not set

I'm inferring from the subject line that you mean kexec, not
"rebooting"?

It looks like a peripheral device is being left active and doing DMA by
the previous kernel, rather than being shut down. So as soon as the new
kernel resets the IOMMU mappings, that peripheral device is causing
faults.
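
The mechanism described above can be illustrated with a toy model. This is only a sketch in Python under simplifying assumptions: the class `ToyIommu` and its methods are hypothetical, the "IOMMU" is just a table of pages each device may write to, and nothing here reflects how the kernel or the hardware is actually implemented.

```python
class ToyIommu:
    """Toy model: a per-device set of pages the device is allowed to write."""

    def __init__(self):
        self.writable = {}  # device -> set of writable page addresses
        self.faults = []    # recorded (device, page) write faults

    def map_page(self, device, page):
        """Allow `device` to write to `page`."""
        self.writable.setdefault(device, set()).add(page)

    def reset(self):
        """Model a new kernel installing fresh, empty mappings."""
        self.writable = {}

    def dma_write(self, device, page):
        """Return True if the write is allowed, otherwise record a fault."""
        if page in self.writable.get(device, set()):
            return True
        self.faults.append((device, page))
        return False


iommu = ToyIommu()
gpu = "00:02.0"

# The previous kernel maps a buffer and the device writes to it.
iommu.map_page(gpu, 0xFFFFE000)
print(iommu.dma_write(gpu, 0xFFFFE000))

# kexec: the new kernel resets the mappings, but the device was never
# stopped and keeps writing to the old address, so every write faults.
iommu.reset()
print(iommu.dma_write(gpu, 0xFFFFE000))
print(len(iommu.faults))
```

The point of the sketch is only the ordering: the mappings are replaced while the device is still active, so writes that were valid under the previous kernel become faults under the new one.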

We really ought to rate-limit the faults and isolate the offending
device before there are 21,000 of them. As discussed elsewhere recently,
we could do with a way to tell the PCI layer that it offended us but I
suppose we could at *least* stop the IOMMU from reporting faults for it.
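
The idea of limiting the reports and then silencing the offending device can be sketched as follows. This is a userspace model in Python, not kernel code: the class `FaultReporter`, its `limit` parameter, and the message wording are all hypothetical, and the real fault handler would have to be changed in the IOMMU driver itself.

```python
from collections import defaultdict


class FaultReporter:
    """Log the first `limit` faults per device, then mute that device."""

    def __init__(self, limit=10):
        self.limit = limit
        self.seen = defaultdict(int)  # device -> number of faults logged
        self.muted = set()            # devices whose faults are suppressed
        self.log = []                 # emitted log lines

    def report(self, device, addr):
        """Return True if the fault was logged, False if it was suppressed."""
        if device in self.muted:
            return False
        self.seen[device] += 1
        self.log.append(f"DMA fault: device {device} addr {addr:x}")
        if self.seen[device] >= self.limit:
            self.muted.add(device)
            self.log.append(
                f"device {device}: too many faults, suppressing further reports"
            )
        return True


reporter = FaultReporter(limit=3)
for i in range(21000):
    reporter.report("00:02.0", 0xFFE00000 + i * 0x1000)

# Only the first three faults and one suppression notice are logged.
print(len(reporter.log))
```

Keeping the counter per device means that a single misbehaving device cannot hide faults from other devices, which would still be reported up to their own limit.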

Is this new behaviour? I'm not sure why this should have changed...

--
David Woodhouse
Open Source Technology Centre, Intel Corporation


Attachments:
smime.p7s (3.36 kB)

2014-04-14 19:50:28

by Stefani Seibold

Subject: Re: X86: kexec issues with i915 in 3.14

On Monday, 14.04.2014, at 00:28 +0000, Woodhouse, David wrote:
> On Sun, 2014-04-13 at 22:01 +0200, Stefani Seibold wrote:
> > Rebooting my vanilla 3.14 kernel fails with tons of kernel
> > log messages:
> >
> > [ 0.262754] IOMMU: Setting identity map for device 0000:00:1a.0 [0x7c45f000 - 0x7c46bfff]
> > [ 0.262780] IOMMU: Setting identity map for device 0000:00:14.0 [0x7c45f000 - 0x7c46bfff]
> > [ 0.262798] IOMMU: Prepare 0-16MiB unity mapping for LPC
> > [ 0.262807] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
> > [ 0.262948] PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
> > [ 0.262948] dmar: DRHD: handling fault status reg 3
> > [ 0.262951] dmar: DMAR:[DMA Write] Request device [00:02.0] fault addr ffffe000
> > DMAR:[fault reason 05] PTE Write access is not set
>
> I'm inferring from the subject line that you mean kexec, not
> "rebooting"?
>

Rebooting via the BIOS works, but booting via kexec results in either the
message storm or a hung kernel with a corrupted display.

> It looks like a peripheral device is being left active and doing DMA by
> the previous kernel, rather than being shut down. So as soon as the new
> kernel resets the IOMMU mappings, that peripheral device is causing
> faults.
>
> We really ought to rate-limit the faults and isolate the offending
> device before there are 21,000 of them. As discussed elsewhere recently,
> we could do with a way to tell the PCI layer that it offended us but I
> suppose we could at *least* stop the IOMMU from reporting faults for it.
>
> Is this new behaviour? I'm not sure why this should have changed...
>

I can also reproduce the behaviour with a 3.13.7 kernel.

One thing I found after the end of the 21,000 messages was a GPU crash:

[ 5.002484] r8169 0000:03:00.0 eth0: link up
[ 5.002489] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 6.745051] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... blitter ring idle
[ 11.743768] [drm] stuck on render ring
[ 11.743773] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 11.743774] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 11.743775] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 11.743777] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 11.743778] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 14.240743] systemd-journald[158]: File /var/log/journal/bb613621feef82d686edde0046e9bcea/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.

- Stefani

2014-04-15 08:54:16

by Jiang Liu

Subject: Re: X86: kexec issues with i915 in 3.14

Hi Stefani,
As David has mentioned, the warning messages indicate that the VGA
controller wasn't shut down correctly during reboot and keeps doing
DMA write operations after the new kernel is loaded. Have you found
any older kernel without this issue?
There is a patch set to solve a similar issue for crashdump;
please refer to https://lkml.org/lkml/2014/1/10/518.

Thanks!
Gerry


2014-04-15 18:52:47

by Stefani Seibold

Subject: Re: X86: kexec issues with i915 in 3.14

On Tuesday, 15.04.2014, at 16:54 +0800, Jiang Liu wrote:
> Hi Stefani,
> As David has mentioned, the warning messages indicate that the VGA
> controller wasn't shut down correctly during reboot and keeps doing
> DMA write operations after the new kernel is loaded. Have you found
> any older kernel without this issue?
> There is a patch set to solve a similar issue for crashdump;
> please refer to https://lkml.org/lkml/2014/1/10/518.
>
> Thanks!
> Gerry
>

I do understand.

Maybe the above patch will cure the symptoms, but it will not heal the
cause.

The driver for the Intel VGA device must not assume anything about the
current state of the device. It needs to set up the whole VGA device
during the probe phase.

Otherwise, kexec'ing a kernel produces tons of log entries or, in many
cases, garbled screen output and a hang of the whole kernel.

- Stefani