Hi, everyone,
I apologise in advance if I've sent this to {too many, the wrong}
people. For some time now, I've been getting sporadic (in about one
out of five boots) DRM-related RCU complaints on an Ivy Bridge-based
(Core i7-3720QM) Mac Mini. Call trace (on Linux 6.3-rc6) follows:
[ 5.794026] rcu: INFO: rcu_sched detected expedited stalls on
CPUs/tasks: { 0-.... } 3 jiffies s: 309 root: 0x1/.
[ 5.794044] rcu: blocking rcu_node structures (internal RCU debug):
[ 5.794045] Sending NMI from CPU 1 to CPUs 0:
[ 5.794049] NMI backtrace for cpu 0
[ 5.794051] CPU: 0 PID: 537 Comm: Xorg Not tainted 6.3.0-rc6-debug+ #57
[ 5.794053] Hardware name: Apple Inc.
Macmini6,2/Mac-F65AE981FFA204ED, BIOS 429.0.0.0.0 03/18/2022
[ 5.794054] RIP: 0010:fwtable_read32+0x45/0x100
[ 5.794060] Code: 5e 00 49 89 c5 48 8b 45 08 8b b0 2c 1d 00 00 85
f6 0f 85 94 00 00 00 8d 83 00 00 fc ff 3d ff 5f 0d 00 77 28 48 03 5d
00 8b 1b <48> 8b 45 08 8b 88 2c 1d 00 00 85 c9 75 48 4c 89 ee 4c 89 e7
e8 82
[ 5.794061] RSP: 0018:ffff88810d17f968 EFLAGS: 00000082
[ 5.794063] RAX: 0000000000085010 RBX: 0000000000000200 RCX: ffff888101d80000
[ 5.794064] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff888101d81f28
[ 5.794064] RBP: ffff888101d81f08 R08: 0000000000000000 R09: 00000000000032ab
[ 5.794065] R10: 0000000000000001 R11: 00000000000061b2 R12: ffff888101d81f28
[ 5.794066] R13: 0000000000000246 R14: ffff888102f693d0 R15: ffff88810d17faef
[ 5.794067] FS: 00007f579a20da80(0000) GS:ffff888267200000(0000)
knlGS:0000000000000000
[ 5.794068] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.794069] CR2: 000055fe9b660028 CR3: 0000000111e97004 CR4: 00000000001706f0
[ 5.794069] Call Trace:
[ 5.794071] <TASK>
[ 5.794073] get_reserved.isra.0+0x23/0x30
[ 5.794076] get_clock+0x25/0x70
[ 5.794078] sclhi+0x43/0x70
[ 5.794081] i2c_stop+0x2b/0x40
[ 5.794083] bit_xfer+0x22a/0x5d0
[ 5.794085] ? __pm_runtime_resume+0x40/0x50
[ 5.794088] ? __intel_display_power_get_domain.part.0+0x4d/0x60
[ 5.794092] gmbus_xfer+0x3c/0x90
[ 5.794094] __i2c_transfer+0x15b/0x4d0
[ 5.794096] i2c_transfer+0x3c/0xa0
[ 5.794098] drm_do_probe_ddc_edid+0xad/0x120
[ 5.794101] drm_get_edid+0x71/0x90
[ 5.794104] intel_crt_get_edid+0xf/0x80
[ 5.794106] intel_crt_detect_ddc+0x28/0xc0
[ 5.794107] intel_crt_detect+0x1f5/0xc40
[ 5.794108] ? pollwake+0x61/0x70
[ 5.794111] drm_helper_probe_detect+0x43/0xa0
[ 5.794115] drm_helper_probe_single_connector_modes+0x403/0x490
[ 5.794117] drm_mode_getconnector+0x32d/0x430
[ 5.794119] ? drm_connector_list_iter_next+0x7c/0xa0
[ 5.794120] ? drm_connector_property_set_ioctl+0x30/0x30
[ 5.794122] drm_ioctl_kernel+0x91/0x130
[ 5.794124] drm_ioctl+0x1d8/0x410
[ 5.794125] ? drm_connector_property_set_ioctl+0x30/0x30
[ 5.794127] __x64_sys_ioctl+0x3b9/0x8f0
[ 5.794129] ? handle_mm_fault+0xc1/0x170
[ 5.794131] ? exc_page_fault+0x18b/0x470
[ 5.794134] do_syscall_64+0x2b/0x50
[ 5.794136] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 5.794139] RIP: 0033:0x7f579a1119ef
[ 5.794141] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24
10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00
00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28
00 00
[ 5.794142] RSP: 002b:00007ffc1dee7620 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[ 5.794143] RAX: ffffffffffffffda RBX: 000055fe9b660060 RCX: 00007f579a1119ef
[ 5.794144] RDX: 00007ffc1dee76c0 RSI: 00000000c05064a7 RDI: 000000000000000f
[ 5.794144] RBP: 00007ffc1dee76c0 R08: 0000000000000000 R09: 0000000000000000
[ 5.794145] R10: 00007f579a6bc1d8 R11: 0000000000000246 R12: 00000000c05064a7
[ 5.794146] R13: 000000000000000f R14: 00000000c05064a7 R15: 00007ffc1dee76c0
[ 5.794147] </TASK>
Please let me know if you need any additional information (kconfig,
etc.) in order to fix this. I will, of course, be happy to test any
patch thrown my way.
Kind regards,
Rui
On Wed, 12 Apr 2023, Rui Salvaterra <[email protected]> wrote:
> Hi, everyone,
>
> I apologise in advance if I've sent this to {too many, the wrong}
> people. For some time now, I've been getting sporadic (in about one
> out of five boots) DRM-related RCU complaints on an Ivy Bridge-based
> (Core i7-3720QM) Mac Mini. Call trace (on Linux 6.3-rc6) follows:
Please file a bug at fdo gitlab [1]. Add drm.debug=0xe and maybe
log_buf_len=4M or similar kernel parameters, and attach dmesg all the
way from boot to reproducing the problem.
How long is "for some time now"?
BR,
Jani.
[1] https://gitlab.freedesktop.org/drm/intel/wikis/How-to-file-i915-bugs
>
> [ 5.794026] rcu: INFO: rcu_sched detected expedited stalls on
> CPUs/tasks: { 0-.... } 3 jiffies s: 309 root: 0x1/.
> [ 5.794044] rcu: blocking rcu_node structures (internal RCU debug):
> [ 5.794045] Sending NMI from CPU 1 to CPUs 0:
> [ 5.794049] NMI backtrace for cpu 0
> [ 5.794051] CPU: 0 PID: 537 Comm: Xorg Not tainted 6.3.0-rc6-debug+ #57
> [ 5.794053] Hardware name: Apple Inc.
> Macmini6,2/Mac-F65AE981FFA204ED, BIOS 429.0.0.0.0 03/18/2022
> [ 5.794054] RIP: 0010:fwtable_read32+0x45/0x100
> [ 5.794060] Code: 5e 00 49 89 c5 48 8b 45 08 8b b0 2c 1d 00 00 85
> f6 0f 85 94 00 00 00 8d 83 00 00 fc ff 3d ff 5f 0d 00 77 28 48 03 5d
> 00 8b 1b <48> 8b 45 08 8b 88 2c 1d 00 00 85 c9 75 48 4c 89 ee 4c 89 e7
> e8 82
> [ 5.794061] RSP: 0018:ffff88810d17f968 EFLAGS: 00000082
> [ 5.794063] RAX: 0000000000085010 RBX: 0000000000000200 RCX: ffff888101d80000
> [ 5.794064] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff888101d81f28
> [ 5.794064] RBP: ffff888101d81f08 R08: 0000000000000000 R09: 00000000000032ab
> [ 5.794065] R10: 0000000000000001 R11: 00000000000061b2 R12: ffff888101d81f28
> [ 5.794066] R13: 0000000000000246 R14: ffff888102f693d0 R15: ffff88810d17faef
> [ 5.794067] FS: 00007f579a20da80(0000) GS:ffff888267200000(0000)
> knlGS:0000000000000000
> [ 5.794068] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 5.794069] CR2: 000055fe9b660028 CR3: 0000000111e97004 CR4: 00000000001706f0
> [ 5.794069] Call Trace:
> [ 5.794071] <TASK>
> [ 5.794073] get_reserved.isra.0+0x23/0x30
> [ 5.794076] get_clock+0x25/0x70
> [ 5.794078] sclhi+0x43/0x70
> [ 5.794081] i2c_stop+0x2b/0x40
> [ 5.794083] bit_xfer+0x22a/0x5d0
> [ 5.794085] ? __pm_runtime_resume+0x40/0x50
> [ 5.794088] ? __intel_display_power_get_domain.part.0+0x4d/0x60
> [ 5.794092] gmbus_xfer+0x3c/0x90
> [ 5.794094] __i2c_transfer+0x15b/0x4d0
> [ 5.794096] i2c_transfer+0x3c/0xa0
> [ 5.794098] drm_do_probe_ddc_edid+0xad/0x120
> [ 5.794101] drm_get_edid+0x71/0x90
> [ 5.794104] intel_crt_get_edid+0xf/0x80
> [ 5.794106] intel_crt_detect_ddc+0x28/0xc0
> [ 5.794107] intel_crt_detect+0x1f5/0xc40
> [ 5.794108] ? pollwake+0x61/0x70
> [ 5.794111] drm_helper_probe_detect+0x43/0xa0
> [ 5.794115] drm_helper_probe_single_connector_modes+0x403/0x490
> [ 5.794117] drm_mode_getconnector+0x32d/0x430
> [ 5.794119] ? drm_connector_list_iter_next+0x7c/0xa0
> [ 5.794120] ? drm_connector_property_set_ioctl+0x30/0x30
> [ 5.794122] drm_ioctl_kernel+0x91/0x130
> [ 5.794124] drm_ioctl+0x1d8/0x410
> [ 5.794125] ? drm_connector_property_set_ioctl+0x30/0x30
> [ 5.794127] __x64_sys_ioctl+0x3b9/0x8f0
> [ 5.794129] ? handle_mm_fault+0xc1/0x170
> [ 5.794131] ? exc_page_fault+0x18b/0x470
> [ 5.794134] do_syscall_64+0x2b/0x50
> [ 5.794136] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> [ 5.794139] RIP: 0033:0x7f579a1119ef
> [ 5.794141] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24
> 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00
> 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28
> 00 00
> [ 5.794142] RSP: 002b:00007ffc1dee7620 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [ 5.794143] RAX: ffffffffffffffda RBX: 000055fe9b660060 RCX: 00007f579a1119ef
> [ 5.794144] RDX: 00007ffc1dee76c0 RSI: 00000000c05064a7 RDI: 000000000000000f
> [ 5.794144] RBP: 00007ffc1dee76c0 R08: 0000000000000000 R09: 0000000000000000
> [ 5.794145] R10: 00007f579a6bc1d8 R11: 0000000000000246 R12: 00000000c05064a7
> [ 5.794146] R13: 000000000000000f R14: 00000000c05064a7 R15: 00007ffc1dee76c0
> [ 5.794147] </TASK>
>
> Please let me know if you need any additional information (kconfig,
> etc.) in order to fix this. I will, of course, be happy to test any
> patch thrown my way.
>
> Kind regards,
> Rui
--
Jani Nikula, Intel Open Source Graphics Center
Hi, Jani,
On Wed, 12 Apr 2023 at 10:28, Jani Nikula <[email protected]> wrote:
>
> Please file a bug at fdo gitlab [1]. Add drm.debug=0xe and maybe
> log_buf_len=4M or similar kernel parameters, and attach dmesg all the
> way from boot to reproducing the problem.
Sure, will do, thanks!
> How long is "for some time now"?
That's a wonderful question. I actually found a screenshot I took on
the 14th October(!) with one of these traces. I don't know the exact
kernel version, but since I've been running weekly release candidates
for years, it shouldn't be hard to pinpoint, if strictly needed. Since
the trace seemed informative and I haven't noticed any problems, I was
kind of hoping it would solve itself in time. I guess my patience
lasted 6 months. :)
Kind regards,
Rui
Hi again, everyone.
So, while preparing to file the bug report with the requested
information, I got a trace completely unrelated to DRM (on a swapon
call, it seems).
[ 4.868340] rcu: INFO: rcu_sched detected expedited stalls on
CPUs/tasks: { 4-.... } 3 jiffies s: 265 root: 0x10/.
[ 4.868349] rcu: blocking rcu_node structures (internal RCU debug):
[ 4.868351] Sending NMI from CPU 3 to CPUs 4:
[ 4.868355] NMI backtrace for cpu 4
[ 4.868357] CPU: 4 PID: 462 Comm: swapon Not tainted 6.3.0-rc6-debug+ #57
[ 4.868359] Hardware name: Apple Inc.
Macmini6,2/Mac-F65AE981FFA204ED, BIOS 429.0.0.0.0 03/18/2022
[ 4.868360] RIP: 0010:zram_submit_bio+0x57c/0x940
[ 4.868365] Code: 04 4c 01 f0 48 8d 48 08 f0 48 0f ba 68 08 0d 0f
82 80 00 00 00 4c 89 ef e8 01 eb ff ff 49 8b 45 00 4a 8d 44 30 09 f0
80 20 df <f0> 48 ff 45 00 48 81 eb 00 10 00 00 41 83 c4 01 48 81 fb ff
0f 00
[ 4.868366] RSP: 0018:ffff8881057dbcd8 EFLAGS: 00000246
[ 4.868368] RAX: ffffc90001c186d9 RBX: 000000003e893000 RCX: ffffc90001c186d8
[ 4.868369] RDX: ffffc90001c186d0 RSI: 0000000000000000 RDI: ffff88810083b400
[ 4.868369] RBP: ffff88810083b470 R08: 0000000000027e40 R09: 0000000000025850
[ 4.868370] R10: 000000000014b212 R11: ffff88810ba03180 R12: 00000000000c176d
[ 4.868371] R13: ffff88810083b400 R14: 0000000000c176d0 R15: 0000000000000000
[ 4.868372] FS: 00007fbd8f8ce800(0000) GS:ffff888266100000(0000)
knlGS:0000000000000000
[ 4.868373] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4.868374] CR2: 0000563005371000 CR3: 000000010355c003 CR4: 00000000001706e0
[ 4.868375] Call Trace:
[ 4.868377] <TASK>
[ 4.868378] ? block_read_full_folio+0x23e/0x2e0
[ 4.868383] ? kmem_cache_alloc+0x1b/0x110
[ 4.868385] ? mempool_alloc+0x37/0x140
[ 4.868388] ? pcpu_block_update_hint_alloc+0xce/0x2f0
[ 4.868390] __submit_bio+0x41/0xd0
[ 4.868394] submit_bio_noacct_nocheck+0xc4/0x2b0
[ 4.868396] blk_next_bio+0x55/0x70
[ 4.868398] __blkdev_issue_discard+0xc8/0x180
[ 4.868401] blkdev_issue_discard+0x3c/0x80
[ 4.868403] __x64_sys_swapon+0xb71/0x1120
[ 4.868407] do_syscall_64+0x2b/0x50
[ 4.868410] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 4.868414] RIP: 0033:0x7fbd8f712d5b
[ 4.868416] Code: 73 01 c3 48 8b 0d bd 30 0e 00 f7 d8 64 89 01 48
83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 a7 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8d 30 0e 00 f7 d8 64 89
01 48
[ 4.868417] RSP: 002b:00007ffcaf9a3448 EFLAGS: 00000246 ORIG_RAX:
00000000000000a7
[ 4.868418] RAX: ffffffffffffffda RBX: 0000000000018064 RCX: 00007fbd8f712d5b
[ 4.868419] RDX: 0000000000018064 RSI: 0000000000018064 RDI: 000056300535fb10
[ 4.868420] RBP: 00007ffcaf9a3530 R08: 000000014b213000 R09: 00007fbd8f7f70f0
[ 4.868420] R10: 0000000000001000 R11: 0000000000000246 R12: 000056300535fb10
[ 4.868421] R13: 0000000000000064 R14: 00007ffcaf9a3530 R15: 0000000000000000
[ 4.868423] </TASK>
Could it be that RCU is reporting expedited stalls too eagerly? And,
if so, why only on this machine?
On Thu, Apr 13, 2023 at 08:30:02AM +0100, Rui Salvaterra wrote:
> Hi again, everyone.
>
> So, while preparing to file the bug report with the requested
> information, I got a trace completely unrelated to DRM (on a swapon
> call, it seems).
>
> [ 4.868340] rcu: INFO: rcu_sched detected expedited stalls on
> CPUs/tasks: { 4-.... } 3 jiffies s: 265 root: 0x10/.
> [ 4.868349] rcu: blocking rcu_node structures (internal RCU debug):
> [ 4.868351] Sending NMI from CPU 3 to CPUs 4:
> [ 4.868355] NMI backtrace for cpu 4
> [ 4.868357] CPU: 4 PID: 462 Comm: swapon Not tainted 6.3.0-rc6-debug+ #57
> [ 4.868359] Hardware name: Apple Inc.
> Macmini6,2/Mac-F65AE981FFA204ED, BIOS 429.0.0.0.0 03/18/2022
> [ 4.868360] RIP: 0010:zram_submit_bio+0x57c/0x940
> [ 4.868365] Code: 04 4c 01 f0 48 8d 48 08 f0 48 0f ba 68 08 0d 0f
> 82 80 00 00 00 4c 89 ef e8 01 eb ff ff 49 8b 45 00 4a 8d 44 30 09 f0
> 80 20 df <f0> 48 ff 45 00 48 81 eb 00 10 00 00 41 83 c4 01 48 81 fb ff
> 0f 00
> [ 4.868366] RSP: 0018:ffff8881057dbcd8 EFLAGS: 00000246
> [ 4.868368] RAX: ffffc90001c186d9 RBX: 000000003e893000 RCX: ffffc90001c186d8
> [ 4.868369] RDX: ffffc90001c186d0 RSI: 0000000000000000 RDI: ffff88810083b400
> [ 4.868369] RBP: ffff88810083b470 R08: 0000000000027e40 R09: 0000000000025850
> [ 4.868370] R10: 000000000014b212 R11: ffff88810ba03180 R12: 00000000000c176d
> [ 4.868371] R13: ffff88810083b400 R14: 0000000000c176d0 R15: 0000000000000000
> [ 4.868372] FS: 00007fbd8f8ce800(0000) GS:ffff888266100000(0000)
> knlGS:0000000000000000
> [ 4.868373] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4.868374] CR2: 0000563005371000 CR3: 000000010355c003 CR4: 00000000001706e0
> [ 4.868375] Call Trace:
> [ 4.868377] <TASK>
> [ 4.868378] ? block_read_full_folio+0x23e/0x2e0
> [ 4.868383] ? kmem_cache_alloc+0x1b/0x110
> [ 4.868385] ? mempool_alloc+0x37/0x140
> [ 4.868388] ? pcpu_block_update_hint_alloc+0xce/0x2f0
> [ 4.868390] __submit_bio+0x41/0xd0
> [ 4.868394] submit_bio_noacct_nocheck+0xc4/0x2b0
> [ 4.868396] blk_next_bio+0x55/0x70
> [ 4.868398] __blkdev_issue_discard+0xc8/0x180
> [ 4.868401] blkdev_issue_discard+0x3c/0x80
> [ 4.868403] __x64_sys_swapon+0xb71/0x1120
> [ 4.868407] do_syscall_64+0x2b/0x50
> [ 4.868410] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> [ 4.868414] RIP: 0033:0x7fbd8f712d5b
> [ 4.868416] Code: 73 01 c3 48 8b 0d bd 30 0e 00 f7 d8 64 89 01 48
> 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 a7 00 00
> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8d 30 0e 00 f7 d8 64 89
> 01 48
> [ 4.868417] RSP: 002b:00007ffcaf9a3448 EFLAGS: 00000246 ORIG_RAX:
> 00000000000000a7
> [ 4.868418] RAX: ffffffffffffffda RBX: 0000000000018064 RCX: 00007fbd8f712d5b
> [ 4.868419] RDX: 0000000000018064 RSI: 0000000000018064 RDI: 000056300535fb10
> [ 4.868420] RBP: 00007ffcaf9a3530 R08: 000000014b213000 R09: 00007fbd8f7f70f0
> [ 4.868420] R10: 0000000000001000 R11: 0000000000000246 R12: 000056300535fb10
> [ 4.868421] R13: 0000000000000064 R14: 00007ffcaf9a3530 R15: 0000000000000000
> [ 4.868423] </TASK>
>
> Could it be that RCU is reporting expedited stalls too eagerly? And,
> if so, why only on this machine?
My guess would be that you have CONFIG_RCU_EXP_CPU_STALL_TIMEOUT set to
some small non-zero number, for example, you might have set up a recent
Android .config or some such. The default of zero would give you about
21 seconds rather than the three jiffies that you are seeing.
Could you please check your .config?
Thanx, Paul
Hi, Paul,
On Thu, 13 Apr 2023 at 15:43, Paul E. McKenney <[email protected]> wrote:
>
> My guess would be that you have CONFIG_RCU_EXP_CPU_STALL_TIMEOUT set to
> some small non-zero number, for example, you might have set up a recent
> Android .config or some such. The default of zero would give you about
> 21 seconds rather than the three jiffies that you are seeing.
>
> Could you please check your .config?
Well, this is embarrassing. I can't fathom why/how, but I had it set
to 20, on this machine. That is, 20 millisseconds. I guess its a
miracle I haven't seen *more* expedited RCU traces. Sorry for the
noise, everyone.
Kind regards,
Rui
On Thu, Apr 13, 2023 at 04:32:32PM +0100, Rui Salvaterra wrote:
> Hi, Paul,
>
> On Thu, 13 Apr 2023 at 15:43, Paul E. McKenney <[email protected]> wrote:
> >
> > My guess would be that you have CONFIG_RCU_EXP_CPU_STALL_TIMEOUT set to
> > some small non-zero number, for example, you might have set up a recent
> > Android .config or some such. The default of zero would give you about
> > 21 seconds rather than the three jiffies that you are seeing.
> >
> > Could you please check your .config?
>
> Well, this is embarrassing. I can't fathom why/how, but I had it set
> to 20, on this machine. That is, 20 millisseconds. I guess its a
> miracle I haven't seen *more* expedited RCU traces. Sorry for the
> noise, everyone.
Been there, done that!
And actually, it is kind of reassuring that the Linux kernel avoids
tens-of-milliseocnds latency blows in the common case on your system.
Thanx, Paul
Hi, Paul,
On Thu, 13 Apr 2023 at 17:21, Paul E. McKenney <[email protected]> wrote:
>
> Been there, done that!
>
> And actually, it is kind of reassuring that the Linux kernel avoids
> tens-of-milliseocnds latency blows in the common case on your system.
Especially since we're talking about a HZ=100 non-preemptive kernel. :)
Rui