I just got a used Thinkpad X201 (Core i5 M 520, Intel QM57
chipset) and hit some kernel panics while trying to view
image/animation-intensive stuff in Firefox (X11) unless I use
"intel_iommu=igfx_off".
With Debian stable backport kernels, "linux-image-4.17.0-0.bpo.3-amd64"
(4.17.17-1~bpo9+1) has no problems. But "linux-image-4.18.0-0.bpo.3-amd64"
(4.18.20-2~bpo9+1) gives a blank screen before I can log in via agetty
and run startx.
Building 4.19.12 myself got me into X11, where starting Firefox was
enough to panic the kernel. I also updated to the latest BIOS
(1.40), but it's an EOL laptop (though it's still the most powerful
laptop I use). I intend to replace the BIOS with Coreboot soon...
Initially, I thought I was hitting another GPU hang from 4.18+:
https://bugs.freedesktop.org/show_bug.cgi?id=107945
But even with drm-tip @ commit 28bb1fc015cedadf3b099b8bd0bb27609849f362
("drm-tip: 2018y-12m-25d-08h-12m-37s UTC integration manifest"),
I was still able to reproduce the panic unless I used intel_iommu=igfx_off.
"i915.reset=1" did not help matters, either.
Below is what I got from netconsole while on drm-tip:
Kernel panic - not syncing: DMAR hardware is malfunctioning
Shutting down cpus with NMI
Kernel Offset: disabled
---[ end Kernel panic - not syncing: DMAR hardware is malfunctioning ]---
------------[ cut here ]------------
sched: Unexpected reschedule of offline CPU#3!
WARNING: CPU: 0 PID: 105 at native_smp_send_reschedule+0x34/0x40
Modules linked in: netconsole ccm snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul crc32c_intel ghash_clmulni_intel arc4 iwldvm aesni_intel aes_x86_64 crypto_simd cryptd mac80211 glue_helper intel_cstate iwlwifi intel_uncore i915 intel_gtt i2c_algo_bit iosf_mbi drm_kms_helper cfbfillrect syscopyarea intel_ips cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea thinkpad_acpi prime_numbers cfg80211 ledtrig_audio i2c_i801 sg snd_hda_intel led_class snd_hda_codec drm ac drm_panel_orientation_quirks snd_hwdep battery e1000e agpgart snd_hda_core snd_pcm snd_timer ptp snd soundcore pps_core ehci_pci ehci_hcd lpc_ich video mfd_core button acpi_cpufreq ecryptfs ip_tables x_tables ipv6 evdev thermal [last unloaded: netconsole]
CPU: 0 PID: 105 Comm: kworker/u8:3 Not tainted 4.20.0-rc7b1+ #1
Hardware name: LENOVO 3680FBU/3680FBU, BIOS 6QET70WW (1.40 ) 10/11/2012
Workqueue: i915 __i915_gem_free_work [i915]
RIP: 0010:native_smp_send_reschedule+0x34/0x40
Code: 05 69 c6 c9 00 73 15 48 8b 05 18 2d b3 00 be fd 00 00 00 48 8b 40 30 e9 9a 58 7d 00 89 fe 48 c7 c7 78 73 af 81 e8 dc c2 01 00 <0f> 0b c3 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 8b 05 0d 7d df
RSP: 0018:ffff888075003d98 EFLAGS: 00010092
RAX: 000000000000002e RBX: ffff8880751a0740 RCX: 0000000000000006
RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff888075015440
RBP: ffff88806e823700 R08: 0000000000000000 R09: ffff888072fc07c0
R10: ffff888075003d60 R11: 00000000fff5c002 R12: ffff8880751a0740
R13: ffff8880751a0740 R14: 0000000000000000 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff888075000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fdb1f53f000 CR3: 0000000001c0a004 CR4: 00000000000206f0
Call Trace:
<IRQ>
? check_preempt_curr+0x4e/0x90
? ttwu_do_wakeup.isra.19+0x14/0xf0
? try_to_wake_up+0x323/0x410
? autoremove_wake_function+0xe/0x30
? __wake_up_common+0x8d/0x140
? __wake_up_common_lock+0x6c/0x90
? irq_work_run_list+0x49/0x80
? tick_sched_handle.isra.6+0x50/0x50
? update_process_times+0x3b/0x50
? tick_sched_handle.isra.6+0x30/0x50
? tick_sched_timer+0x3b/0x80
? __hrtimer_run_queues+0xea/0x270
? hrtimer_interrupt+0x101/0x240
? smp_apic_timer_interrupt+0x6a/0x150
? apic_timer_interrupt+0xf/0x20
</IRQ>
? panic+0x1ca/0x212
? panic+0x1c7/0x212
? __iommu_flush_iotlb+0x19e/0x1c0
? iommu_flush_iotlb_psi+0x96/0xf0
? intel_unmap+0xbf/0xf0
? i915_gem_object_put_pages_gtt+0x36/0x220 [i915]
? drm_ht_remove+0x20/0x20 [drm]
? drm_mm_remove_node+0x1ad/0x310 [drm]
? __pm_runtime_resume+0x54/0x70
? __i915_gem_object_unset_pages+0x129/0x170 [i915]
? __i915_gem_object_put_pages+0x70/0xa0 [i915]
? __i915_gem_free_objects+0x245/0x4e0 [i915]
? __switch_to_asm+0x24/0x60
? __i915_gem_free_work+0x65/0xa0 [i915]
? process_one_work+0x1fd/0x410
? worker_thread+0x49/0x3f0
? kthread+0xf8/0x130
? process_one_work+0x410/0x410
? kthread_park+0x90/0x90
? ret_from_fork+0x35/0x40
WARNING: CPU: 0 PID: 105 at native_smp_send_reschedule+0x34/0x40
---[ end trace 7dd2184d8c86cef5 ]---
------------[ cut here ]------------
sched: Unexpected reschedule of offline CPU#2!
WARNING: CPU: 0 PID: 105 at native_smp_send_reschedule+0x34/0x40
Modules linked in: netconsole ccm snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul crc32c_intel ghash_clmulni_intel arc4 iwldvm aesni_intel aes_x86_64 crypto_simd cryptd mac80211 glue_helper intel_cstate iwlwifi intel_uncore i915 intel_gtt i2c_algo_bit iosf_mbi drm_kms_helper cfbfillrect syscopyarea intel_ips cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea thinkpad_acpi prime_numbers cfg80211 ledtrig_audio i2c_i801 sg snd_hda_intel led_class snd_hda_codec drm ac drm_panel_orientation_quirks snd_hwdep battery e1000e agpgart snd_hda_core snd_pcm snd_timer ptp snd soundcore pps_core ehci_pci ehci_hcd lpc_ich video mfd_core button acpi_cpufreq ecryptfs ip_tables x_tables ipv6 evdev thermal [last unloaded: netconsole]
CPU: 0 PID: 105 Comm: kworker/u8:3 Tainted: G W 4.20.0-rc7b1+ #1
Hardware name: LENOVO 3680FBU/3680FBU, BIOS 6QET70WW (1.40 ) 10/11/2012
Workqueue: i915 __i915_gem_free_work [i915]
RIP: 0010:native_smp_send_reschedule+0x34/0x40
Code: 05 69 c6 c9 00 73 15 48 8b 05 18 2d b3 00 be fd 00 00 00 48 8b 40 30 e9 9a 58 7d 00 89 fe 48 c7 c7 78 73 af 81 e8 dc c2 01 00 <0f> 0b c3 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 8b 05 0d 7d df
RSP: 0018:ffff888075003d10 EFLAGS: 00010086
RAX: 000000000000002e RBX: ffff888075120740 RCX: 0000000000000006
RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff888075015440
RBP: ffff88807378b700 R08: 0000000000000000 R09: ffff888072fc07c0
R10: ffff888075003cd8 R11: 00000000ffeb4a02 R12: ffff888075120740
R13: ffff888075120740 R14: 0000000000000004 R15: 0000000000000002
FS: 0000000000000000(0000) GS:ffff888075000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fdb1f53f000 CR3: 0000000001c0a004 CR4: 00000000000206f0
Call Trace:
<IRQ>
? check_preempt_curr+0x4e/0x90
? ttwu_do_wakeup.isra.19+0x14/0xf0
? try_to_wake_up+0x323/0x410
? __wake_up_common+0x8d/0x140
? ep_poll_callback+0xbd/0x2a0
? __wake_up_common+0x8d/0x140
? __wake_up_common_lock+0x6c/0x90
? irq_work_run_list+0x49/0x80
? tick_sched_handle.isra.6+0x50/0x50
? update_process_times+0x3b/0x50
? tick_sched_handle.isra.6+0x30/0x50
? tick_sched_timer+0x3b/0x80
? __hrtimer_run_queues+0xea/0x270
? hrtimer_interrupt+0x101/0x240
? smp_apic_timer_interrupt+0x6a/0x150
? apic_timer_interrupt+0xf/0x20
</IRQ>
? panic+0x1ca/0x212
? panic+0x1c7/0x212
? __iommu_flush_iotlb+0x19e/0x1c0
? iommu_flush_iotlb_psi+0x96/0xf0
? intel_unmap+0xbf/0xf0
? i915_gem_object_put_pages_gtt+0x36/0x220 [i915]
? drm_ht_remove+0x20/0x20 [drm]
---[ end trace 7dd2184d8c86cef6 ]---
Thanks. I barely use graphics and certainly not with KVM,
so I don't think I'll be missing anything with igfx_off. But
maybe this bug report can help other X201 users.
Quoting Eric Wong (2018-12-27 13:49:48)
> I just got a used Thinkpad X201 (Core i5 M 520, Intel QM57
> chipset) and hit some kernel panics while trying to view
> image/animation-intensive stuff in Firefox (X11) unless I use
> "intel_iommu=igfx_off".
>
> With Debian stable backport kernels, "linux-image-4.17.0-0.bpo.3-amd64"
> (4.17.17-1~bpo9+1) has no problems. But "linux-image-4.18.0-0.bpo.3-amd64"
> (4.18.20-2~bpo9+1) gives a blank screen before I can log in via agetty
> and run startx.
Could you open a new bug at (and attach relevant information there):
https://01.org/linuxgraphics/documentation/how-report-bugs
What's most confusing about this is that 4.17 worked to begin with,
without intel_iommu=igfx_off (unless that was the default for older
kernels?)
Did you maybe update other parts of the system while updating the
kernel?
If you could attach the full boot dmesg from the working and non-working
kernels, plus the config files of both kernels, in Bugzilla, that'd be a good start!
Regards, Joonas
Joonas Lahtinen <[email protected]> wrote:
> Quoting Eric Wong (2018-12-27 13:49:48)
> > I just got a used Thinkpad X201 (Core i5 M 520, Intel QM57
> > chipset) and hit some kernel panics while trying to view
> > image/animation-intensive stuff in Firefox (X11) unless I use
> > "intel_iommu=igfx_off".
> >
> > With Debian stable backport kernels, "linux-image-4.17.0-0.bpo.3-amd64"
> > (4.17.17-1~bpo9+1) has no problems. But "linux-image-4.18.0-0.bpo.3-amd64"
> > (4.18.20-2~bpo9+1) gives a blank screen before I can log in via agetty
> > and run startx.
> What's most confusing about this is that 4.17 worked to begin with,
> without intel_iommu=igfx_off (unless that was the default for older
> kernels?)
Yeah, so the Debian bpo 4.17(.17) kernel did not set
CONFIG_INTEL_IOMMU_DEFAULT_ON, so I didn't encounter problems.
My self-built kernels all set CONFIG_INTEL_IOMMU_DEFAULT_ON.
Booting the Debian 4.17 kernel with "intel_iommu=on" gives the
same hanging problem I hit with self-built 4.19.{12,13} kernels.
I'm not sure how far back the problem goes (maybe forever),
since I only just got this hardware. I'm not sure what the problem
with Debian 4.18 is, either; but (self-built) 4.19.13 is fine w/o
CONFIG_INTEL_IOMMU_DEFAULT_ON.
Debian backports doesn't have kernels for 4.19 or 4.20, yet.
> Did you maybe update other parts of the system while updating the
> kernel?
Definitely not; just the kernel + headers ("make bindeb-pkg").
> If you could attach the full boot dmesg from the working and non-working
> kernels, plus the config files of both kernels, in Bugzilla, that'd be a good start!
Sorry, I get anxiety attacks when it comes to logins and forms.
Anyways, I managed to get the Debian kernel dmesg output uploaded
with and without intel_iommu=on:
https://bugs.freedesktop.org/attachment.cgi?bugid=109219
Quoting Eric Wong (2019-01-04 03:06:26)
> Joonas Lahtinen <[email protected]> wrote:
> > Quoting Eric Wong (2018-12-27 13:49:48)
> > > I just got a used Thinkpad X201 (Core i5 M 520, Intel QM57
> > > chipset) and hit some kernel panics while trying to view
> > > image/animation-intensive stuff in Firefox (X11) unless I use
> > > "intel_iommu=igfx_off".
> > >
> > > With Debian stable backport kernels, "linux-image-4.17.0-0.bpo.3-amd64"
> > > (4.17.17-1~bpo9+1) has no problems. But "linux-image-4.18.0-0.bpo.3-amd64"
> > > (4.18.20-2~bpo9+1) gives a blank screen before I can log in via agetty
> > > and run startx.
>
> > What's most confusing about this is that 4.17 worked to begin with,
> > without intel_iommu=igfx_off (unless that was the default for older
> > kernels?)
>
> Yeah, so the Debian bpo 4.17(.17) kernel did not set
> CONFIG_INTEL_IOMMU_DEFAULT_ON, so I didn't encounter problems.
> My self-built kernels all set CONFIG_INTEL_IOMMU_DEFAULT_ON.
So it's the case that IOMMU never worked on your machine.
My recommendation would be to simply use intel_iommu=igfx_off if you
need IOMMU.
Old hardware is known to have issues with IOMMU, and retroactively
enabling IOMMU on those machines just brings those issues to the surface :/
Regards, Joonas
Joonas Lahtinen <[email protected]> wrote:
> Quoting Eric Wong (2019-01-04 03:06:26)
> > Yeah, so the Debian bpo 4.17(.17) kernel did not set
> > CONFIG_INTEL_IOMMU_DEFAULT_ON, so I didn't encounter problems.
> > My self-built kernels all set CONFIG_INTEL_IOMMU_DEFAULT_ON.
>
> So it's the case that IOMMU never worked on your machine.
>
> My recommendation would be to simply use intel_iommu=igfx_off if you
> need IOMMU.
>
> Old hardware is known to have issues with IOMMU, and retroactively
> enabling IOMMU on those machines just brings those issues to the surface :/
How about we use a quirk in case distros make IOMMU the default
one day?
--------8<--------
Subject: [PATCH] iommu/intel: quirk to disable DMAR for QM57 igfx
Like the GM45, the integrated graphics on the QM57 seems broken and
hangs with "intel_iommu=on". Add a quirk so future users can
unconditionally enable DMAR support without having to know or remember
to specify "intel_iommu=igfx_off".
cf. https://lore.kernel.org/lkml/20181227114948.ev4b3jte3ubsc5us@dcvr/
cf. https://lore.kernel.org/lkml/154659116310.4596.13613897418163029789@jlahtine-desk.ger.corp.intel.com/
Signed-off-by: Eric Wong <[email protected]>
---
drivers/iommu/intel-iommu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 048b5ab36a02..dc2507a01580 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5399,7 +5399,7 @@ const struct iommu_ops intel_iommu_ops = {
 
 static void quirk_iommu_g4x_gfx(struct pci_dev *dev)
 {
-	/* G4x/GM45 integrated gfx dmar support is totally busted. */
+	/* G4x/GM45/QM57 integrated gfx dmar support is totally busted. */
 	pr_info("Disabling IOMMU for graphics on this chipset\n");
 	dmar_map_gfx = 0;
 }
@@ -5411,6 +5411,7 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e20, quirk_iommu_g4x_gfx);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e30, quirk_iommu_g4x_gfx);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e40, quirk_iommu_g4x_gfx);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e90, quirk_iommu_g4x_gfx);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_iommu_g4x_gfx);
 
 static void quirk_iommu_rwbf(struct pci_dev *dev)
 {
@@ -5457,7 +5458,6 @@ static void quirk_calpella_no_shadow_gtt(struct pci_dev *dev)
 	}
 }
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0040, quirk_calpella_no_shadow_gtt);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_calpella_no_shadow_gtt);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0062, quirk_calpella_no_shadow_gtt);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x006a, quirk_calpella_no_shadow_gtt);
 
--
EW
On Fri, Jan 18, 2019 at 12:17:05PM +0000, Eric Wong wrote:
> @@ -5411,6 +5411,7 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e20, quirk_iommu_g4x_gfx);
> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e30, quirk_iommu_g4x_gfx);
> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e40, quirk_iommu_g4x_gfx);
> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e90, quirk_iommu_g4x_gfx);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_iommu_g4x_gfx);
>
> static void quirk_iommu_rwbf(struct pci_dev *dev)
> {
> @@ -5457,7 +5458,6 @@ static void quirk_calpella_no_shadow_gtt(struct pci_dev *dev)
> }
> }
> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0040, quirk_calpella_no_shadow_gtt);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_calpella_no_shadow_gtt);
> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0062, quirk_calpella_no_shadow_gtt);
> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x006a, quirk_calpella_no_shadow_gtt);
This seems to make sense to me. Joonas, any comments or objections?
Regards,
Joerg
On Tue, Jan 22, 2019 at 11:39 AM Joerg Roedel <[email protected]> wrote:
>
> On Fri, Jan 18, 2019 at 12:17:05PM +0000, Eric Wong wrote:
> > @@ -5411,6 +5411,7 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e20, quirk_iommu_g4x_gfx);
> > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e30, quirk_iommu_g4x_gfx);
> > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e40, quirk_iommu_g4x_gfx);
> > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e90, quirk_iommu_g4x_gfx);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_iommu_g4x_gfx);
> >
> > static void quirk_iommu_rwbf(struct pci_dev *dev)
> > {
> > @@ -5457,7 +5458,6 @@ static void quirk_calpella_no_shadow_gtt(struct pci_dev *dev)
> > }
> > }
> > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0040, quirk_calpella_no_shadow_gtt);
> > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_calpella_no_shadow_gtt);
> > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0062, quirk_calpella_no_shadow_gtt);
> > DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x006a, quirk_calpella_no_shadow_gtt);
>
> This seems to make sense to me. Joonas, any comments or objections?
This is ironlake, which has a huge iommu hack in the gpu driver to
work around hard hangs, which:
- causes massive stalls and kills performance
- isn't well tested (it's the only one that needs this), so tends to break
So if we do this then imo we should:
- probably nuke that w/a too (check for needs_idle_maps and all the
related stuff in i915_gem_gtt.c)
- roll it out for all affected chips (i.e. need to include 0x0040).
Note that the string of platforms which have various issues with iommu
and igfx is very long, thus far we only disabled it where there's no
workaround to stop it from hanging the box, but otherwise left it
enabled. So if we make a policy change to also disable it anywhere
where it doesn't work well (instead of not at all), there's a pile
more platforms to switch.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
Hi Daniel,
On Tue, Jan 22, 2019 at 11:46:39AM +0100, Daniel Vetter wrote:
> Note that the string of platforms which have various issues with iommu
> and igfx is very long, thus far we only disabled it where there's no
> workaround to stop it from hanging the box, but otherwise left it
> enabled. So if we make a policy change to also disable it anywhere
> where it doesn't work well (instead of not at all), there's a pile
> more platforms to switch.
I think it's best to just disable iommu for the igfx devices on these
platforms. Can you pick up Eric's patch and extend it with the list of
affected platforms?
Thanks,
Joerg
Quoting Joerg Roedel (2019-01-22 13:01:09)
> Hi Daniel,
>
> On Tue, Jan 22, 2019 at 11:46:39AM +0100, Daniel Vetter wrote:
> > Note that the string of platforms which have various issues with iommu
> > and igfx is very long, thus far we only disabled it where there's no
> > workaround to stop it from hanging the box, but otherwise left it
> > enabled. So if we make a policy change to also disable it anywhere
> > where it doesn't work well (instead of not at all), there's a pile
> > more platforms to switch.
>
> I think it's best to just disable iommu for the igfx devices on these
> platforms. Can you pick up Eric's patch and extend it with the list of
> affected platforms?
We've been discussing this again, more actively, for the past few months,
and the discussion is still ongoing internally.
According to our IOMMU folks there exists some desire to be able to assign
the iGFX device aka have intel_iommu=on instead of intel_iommu=igfx_off
due to how the devices might be grouped in IOMMU groups. Even when you
would not be using the iGFX device.
So for some uses, the fact that the device (group) is assignable seems
to be more important than the iGFX device actually working. I'm afraid
that retroactively disabling the assignment for such an old platform
might break those usage scenarios. By my quick reading of the code,
there's no way for the user to turn the iGFX DMAR back on once the quirk
disables it.
I guess one solution would be to default to intel_iommu=igfx_off for
platforms older than a certain threshold, while still allowing the
user to enable it. But that then requires duplicating the PCI ID database
into the iommu code.
I don't really have any winning moves to present, but I'm open to hearing
how we can avoid doing more damage than defaulting to intel_iommu=on
already did.
Regards, Joonas
On Tue, Jan 22, 2019 at 04:48:26PM +0200, Joonas Lahtinen wrote:
> According to our IOMMU folks there exists some desire to be able to assign
> the iGFX device aka have intel_iommu=on instead of intel_iommu=igfx_off
> due to how the devices might be grouped in IOMMU groups. Even when you
> would not be using the iGFX device.
You can force the igfx device into a SI domain, or does that also
trigger the iommu issues on the chipset?
In any case, if iommu=on breaks these systems I want to make them work
again with opt-out, even at the cost of breaking assignability.
Regards,
Joerg
Quoting Joerg Roedel (2019-01-22 18:51:35)
> On Tue, Jan 22, 2019 at 04:48:26PM +0200, Joonas Lahtinen wrote:
> > According to our IOMMU folks there exists some desire to be able to assign
> > the iGFX device aka have intel_iommu=on instead of intel_iommu=igfx_off
> > due to how the devices might be grouped in IOMMU groups. Even when you
> > would not be using the iGFX device.
>
> You can force the igfx device into a SI domain, or does that also
> trigger the iommu issues on the chipset?
To be honest, we've had a mixture of different issues on different SKUs
that were not hit in the past, when intel_iommu was just disabled by
default.
I know that in one group of the problems, the issue has been debugged
down to the GPU having its own address translation hardware with
caching, which fails to track changes to the mapping. So if an identity
mapping was established and never changed, I'd assume that to fix at
least that class of problems.
Would just passing intel_iommu=on already cause a non-identity mapping to
possibly be used for the integrated GPU? If it did, that would
explain quite a few of the issues.
We have many reports where just having intel_iommu=on (and using the
system normally, without any virtualization stuff going on) will cause
unexplained GPU hangs. For those users, simply switching to
intel_iommu=igfx_off solves the problems, and the debug often ends
there.
Regards, Joonas
On Wed, Jan 23, 2019 at 05:02:38PM +0200, Joonas Lahtinen wrote:
> We have many reports where just having intel_iommu=on (and using the
> system normally, without any virtualization stuff going on) will cause
> unexplained GPU hangs. For those users, simply switching to
> intel_iommu=igfx_off solves the problems, and the debug often ends
> there.
If you can reproduce problems on your side, then you can try to enable
CONFIG_INTEL_IOMMU_BROKEN_GFX_WA to force the GFX devices into the
identity mapping. We can also add a boot-parameter and workarounds if it
turns out that this is sufficient to make the GFX devices work with
IOMMU enabled.
Regards,
Joerg