2021-10-02 06:11:05

by Steven Rostedt

Subject: [BUG 5.15-rc3] kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!

When I tried to test patches applied to v5.15-rc3, I hit this bug (and
hence cannot test my code), on 32-bit x86.

------------[ cut here ]------------
kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!
invalid opcode: 0000 [#1] SMP PTI
CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.14.0-rc1-test+ #456
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
EIP: __i915_sw_fence_init+0x15/0x38
Code: 2b 3d 58 98 88 c1 74 05 e8 60 d9 58 00 8d 65 f4 5b 5e 5f 5d c3 3e
8d 74 26 00 55 89 e5 56 89 d6 53 85 d2 74 05 f6 c2 03 74 02 <0f> 0b 89
ca 8b 4d 08 89 c3 e8 48 94 ab ff 89 73 34 c7 43 38 01 00
EAX: c2508260 EBX: c2508000 ECX: c143de1e EDX: c09dfadd
ESI: c09dfadd EDI: c45e7200 EBP: c26c9c68 ESP: c26c9c60
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010202
CR0: 80050033 CR2: 00000000 CR3: 019e2000 CR4: 001506f0
Call Trace:
intel_context_init+0x112/0x145
intel_context_create+0x29/0x37
intel_ring_submission_setup+0x3cb/0x5a8
? kfree+0x135/0x1c6
? wa_init_finish+0x32/0x59
? wa_init_finish+0x4f/0x59
? intel_engine_init_ctx_wa+0x39a/0x3b3
intel_engines_init+0x2dd/0x4d0
? gen6_bsd_submit_request+0x97/0x97
intel_gt_init+0x122/0x20d
i915_gem_init+0x80/0xef
i915_driver_probe+0x880/0xa90
? i915_pci_remove+0x27/0x27
i915_pci_probe+0xdd/0xf6
? __pm_runtime_resume+0x63/0x6b
? i915_pci_remove+0x27/0x27
pci_device_probe+0xbc/0x11e
really_probe+0x13e/0x328
__driver_probe_device+0x140/0x176
driver_probe_device+0x1f/0x71
__driver_attach+0xf6/0x109
? __device_attach_driver+0xbd/0xbd
bus_for_each_dev+0x5b/0x88
driver_attach+0x19/0x1b
? __device_attach_driver+0xbd/0xbd
bus_add_driver+0xf2/0x199
driver_register+0x8c/0xbe
__pci_register_driver+0x5b/0x60
i915_register_pci_driver+0x19/0x1b
i915_init+0x15/0x67
? radeon_module_init+0x6a/0x6a
do_one_initcall+0xce/0x21c
? rcu_read_lock_sched_held+0x35/0x6d
? trace_initcall_level+0x5f/0x99
kernel_init_freeable+0x1fb/0x247
? rest_init+0x129/0x129
kernel_init+0x17/0xfd
ret_from_fork+0x1c/0x28
Modules linked in:
---[ end trace 791dc89810d853da ]---
EIP: __i915_sw_fence_init+0x15/0x38
Code: 2b 3d 58 98 88 c1 74 05 e8 60 d9 58 00 8d 65 f4 5b 5e 5f 5d c3 3e
8d 74 26 00 55 89 e5 56 89 d6 53 85 d2 74 05 f6 c2 03 74 02 <0f> 0b 89
ca 8b 4d 08 89 c3 e8 48 94 ab ff 89 73 34 c7 43 38 01 00
EAX: c2508260 EBX: c2508000 ECX: c143de1e EDX: c09dfadd
ESI: c09dfadd EDI: c45e7200 EBP: c26c9c68 ESP: c26c9c60
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010202
CR0: 80050033 CR2: 00000000 CR3: 019e2000 CR4: 001506f0
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

Attached is the dmesg and the config.

I bisected it down to this commit:

3ffe82d701a4 ("drm/i915/xehp: handle new steering options")

-- Steve


Attachments:
(No filename) (2.86 kB)
mitest-config.gz (35.82 kB)
mitest-dmesg.gz (10.62 kB)

2021-10-02 10:28:03

by Hugh Dickins

Subject: Re: [BUG 5.15-rc3] kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!

On Sat, 2 Oct 2021, Steven Rostedt wrote:

> When I tried to test patches applied to v5.15-rc3, I hit this bug (and
> hence cannot test my code), on 32-bit x86.
>
> ------------[ cut here ]------------
> kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!
> invalid opcode: 0000 [#1] SMP PTI
> CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.14.0-rc1-test+ #456
> Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
> EIP: __i915_sw_fence_init+0x15/0x38
> Code: 2b 3d 58 98 88 c1 74 05 e8 60 d9 58 00 8d 65 f4 5b 5e 5f 5d c3 3e
> 8d 74 26 00 55 89 e5 56 89 d6 53 85 d2 74 05 f6 c2 03 74 02 <0f> 0b 89
> ca 8b 4d 08 89 c3 e8 48 94 ab ff 89 73 34 c7 43 38 01 00
> EAX: c2508260 EBX: c2508000 ECX: c143de1e EDX: c09dfadd
> ESI: c09dfadd EDI: c45e7200 EBP: c26c9c68 ESP: c26c9c60
> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010202
> CR0: 80050033 CR2: 00000000 CR3: 019e2000 CR4: 001506f0
> Call Trace:
> intel_context_init+0x112/0x145
> intel_context_create+0x29/0x37
> intel_ring_submission_setup+0x3cb/0x5a8
> ? kfree+0x135/0x1c6
> ? wa_init_finish+0x32/0x59
> ? wa_init_finish+0x4f/0x59
> ? intel_engine_init_ctx_wa+0x39a/0x3b3
> intel_engines_init+0x2dd/0x4d0
> ? gen6_bsd_submit_request+0x97/0x97
> intel_gt_init+0x122/0x20d
> i915_gem_init+0x80/0xef
> i915_driver_probe+0x880/0xa90
> ? i915_pci_remove+0x27/0x27
> i915_pci_probe+0xdd/0xf6
> ? __pm_runtime_resume+0x63/0x6b
> ? i915_pci_remove+0x27/0x27
> pci_device_probe+0xbc/0x11e
> really_probe+0x13e/0x328
> __driver_probe_device+0x140/0x176
> driver_probe_device+0x1f/0x71
> __driver_attach+0xf6/0x109
> ? __device_attach_driver+0xbd/0xbd
> bus_for_each_dev+0x5b/0x88
> driver_attach+0x19/0x1b
> ? __device_attach_driver+0xbd/0xbd
> bus_add_driver+0xf2/0x199
> driver_register+0x8c/0xbe
> __pci_register_driver+0x5b/0x60
> i915_register_pci_driver+0x19/0x1b
> i915_init+0x15/0x67
> ? radeon_module_init+0x6a/0x6a
> do_one_initcall+0xce/0x21c
> ? rcu_read_lock_sched_held+0x35/0x6d
> ? trace_initcall_level+0x5f/0x99
> kernel_init_freeable+0x1fb/0x247
> ? rest_init+0x129/0x129
> kernel_init+0x17/0xfd
> ret_from_fork+0x1c/0x28
> Modules linked in:
> ---[ end trace 791dc89810d853da ]---
> EIP: __i915_sw_fence_init+0x15/0x38
> Code: 2b 3d 58 98 88 c1 74 05 e8 60 d9 58 00 8d 65 f4 5b 5e 5f 5d c3 3e
> 8d 74 26 00 55 89 e5 56 89 d6 53 85 d2 74 05 f6 c2 03 74 02 <0f> 0b 89
> ca 8b 4d 08 89 c3 e8 48 94 ab ff 89 73 34 c7 43 38 01 00
> EAX: c2508260 EBX: c2508000 ECX: c143de1e EDX: c09dfadd
> ESI: c09dfadd EDI: c45e7200 EBP: c26c9c68 ESP: c26c9c60
> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010202
> CR0: 80050033 CR2: 00000000 CR3: 019e2000 CR4: 001506f0
> Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> Kernel Offset: disabled
> ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
>
> Attached is the dmesg and the config.
>
> I bisected it down to this commit:
>
> 3ffe82d701a4 ("drm/i915/xehp: handle new steering options")

Yes (though bisection doesn't work right on this one): the fix
https://lore.kernel.org/lkml/[email protected]/
seems to have got lost in the system: it has not even appeared in
linux-next yet. I was going to send a reminder later this weekend.

Here it is again (but edited to replace "__aligned(4)" in the original
by the official "__i915_sw_fence_call" I discovered afterwards; and
ignoring recent discussions of where __attributes ought to appear :-)


[PATCH] drm/i915: fix blank screen booting crashes

5.15-rc1 crashes with blank screen when booting up on two ThinkPads
using i915. Bisections converge convincingly, but arrive at different
and surprising "culprits", none of them the actual culprit.

netconsole (with init_netconsole() hacked to call i915_init() when
logging has started, instead of by module_init()) tells the story:

kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!
with RSI: ffffffff814d408b pointing to sw_fence_dummy_notify().
I've been building with CONFIG_CC_OPTIMIZE_FOR_SIZE=y, and that
function needs to be 4-byte aligned.

Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Hugh Dickins <[email protected]>
---

drivers/gpu/drm/i915/gt/intel_context.c | 1 +
1 file changed, 1 insertion(+)

--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -362,6 +362,7 @@ static int __intel_context_active(struct
 	return 0;
 }
 
+__i915_sw_fence_call /* Respect the I915_SW_FENCE_MASK */
 static int sw_fence_dummy_notify(struct i915_sw_fence *sf,
 				 enum i915_sw_fence_notify state)
 {
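
The alignment requirement described above can be illustrated with a small
userspace sketch. It is not the driver's code: every demo_/DEMO_ name is
hypothetical, and it only assumes what this thread states, namely that the
low two bits of the callback address are reserved for flags, so a usable
callback must be 4-byte aligned. The aligned attribute is the GCC/Clang
spelling of the "__aligned(4)" mentioned in the mail.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace sketch only; the demo_/DEMO_ names are illustrative and are
 * not kernel API. The two low bits of the callback address are assumed
 * to be reserved for flags, so a valid callback must have them clear.
 */
#define DEMO_FENCE_FLAG_BITS 0x3UL

typedef int (*demo_notify_t)(void *fence, int state);

/* True when the address is non-NULL and leaves both low bits free. */
static bool demo_addr_ok(uintptr_t addr)
{
	return addr != 0 && (addr & DEMO_FENCE_FLAG_BITS) == 0;
}

/* Ask the compiler for at least 4-byte alignment of this function. */
__attribute__((aligned(4)))
static int demo_dummy_notify(void *fence, int state)
{
	(void)fence;
	(void)state;
	return 0;
}

/* Same check, applied to a real function pointer. */
static bool demo_callback_ok(demo_notify_t fn)
{
	return demo_addr_ok((uintptr_t)fn);
}
```

Fed the addresses from the two reports, demo_addr_ok() rejects both:
0xc09dfadd (EDX in the 32-bit trace, presumably the callback argument) has
low bits 01, and 0xffffffff814d408b (RSI in the 64-bit report) has low
bits 11.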

2021-10-02 12:42:09

by Steven Rostedt

Subject: Re: [BUG 5.15-rc3] kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!

On Sat, 2 Oct 2021 03:17:29 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

> Yes (though bisection doesn't work right on this one): the fix

Interesting, as it appeared to be very reliable. But I didn't do the
"try before / after" on the patch.

> https://lore.kernel.org/lkml/[email protected]/
> seems to have got lost in the system: it has not even appeared in
> linux-next yet. I was going to send a reminder later this weekend.
>
> Here it is again (but edited to replace "__aligned(4)" in the original
> by the official "__i915_sw_fence_call" I discovered afterwards; and
> ignoring recent discussions of where __attributes ought to appear :-)
>
>
> [PATCH] drm/i915: fix blank screen booting crashes

Thanks, I added it to my "fixes" patch set that I apply before testing.
It looks to have done the trick, and the kernel boots now.

Tested-by: Steven Rostedt (VMware) <[email protected]>

-- Steve

2021-10-02 17:17:59

by Linus Torvalds

Subject: Re: [BUG 5.15-rc3] kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!

On Sat, Oct 2, 2021 at 5:17 AM Steven Rostedt <[email protected]> wrote:
>
> On Sat, 2 Oct 2021 03:17:29 -0700 (PDT)
> Hugh Dickins <[email protected]> wrote:
>
> > Yes (though bisection doesn't work right on this one): the fix
>
> Interesting, as it appeared to be very reliable. But I didn't do the
> "try before / after" on the patch.

Well, even the before/after might well have worked, since the problem
depended on how that sw_fence_dummy_notify() function ended up
aligned. So random unrelated changes could re-align it just by
mistake.

Patch applied directly.

I'd also like to point out how that BUG_ON() actually made things
worse, and made this harder to debug. If it had been a WARN_ON_ONCE(),
this would presumably not even have needed bisecting, it would have
been obvious.

BUG_ON() really is pretty much *always* the wrong thing to do. It
only results in problems being harder to see because you end up with
a dead machine and the message is often hidden.

Linus
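
The difference between the two approaches can be sketched in userspace C.
This is only an illustration under stated assumptions: demo_warn_once() is a
hypothetical stand-in for the kernel's WARN_ON_ONCE(), and returning an
error from the init function is an assumed recovery path, not something the
driver does today.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Counts how many warnings were actually printed (for illustration). */
static unsigned int demo_warnings_printed;

/*
 * Hypothetical stand-in for WARN_ON_ONCE(): report the first failure seen
 * through this "warned" latch, stay quiet afterwards, and always hand the
 * condition back so the caller can bail out.
 */
static bool demo_warn_once(bool cond, bool *warned, const char *what)
{
	if (cond && !*warned) {
		*warned = true;
		demo_warnings_printed++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/*
 * Returns 0 on success, -1 if the callback address is unusable.
 * A BUG_ON()-style check would stop the machine here instead.
 */
static int demo_fence_init(uintptr_t fn_addr)
{
	static bool warned;

	if (demo_warn_once(fn_addr == 0 || (fn_addr & 0x3UL) != 0, &warned,
			   "fence callback is NULL or not 4-byte aligned"))
		return -1; /* the caller sees an error, the system keeps running */
	return 0;
}
```

With a check of this shape, a misaligned callback yields one visible
warning and an error return, so the message would still reach the log on a
machine that stays up.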

2021-10-02 17:21:46

by Hugh Dickins

Subject: Re: [BUG 5.15-rc3] kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!

On Sat, 2 Oct 2021, Linus Torvalds wrote:
> On Sat, Oct 2, 2021 at 5:17 AM Steven Rostedt <[email protected]> wrote:
> > On Sat, 2 Oct 2021 03:17:29 -0700 (PDT)
> > Hugh Dickins <[email protected]> wrote:
> >
> > > Yes (though bisection doesn't work right on this one): the fix
> >
> > Interesting, as it appeared to be very reliable. But I didn't do the
> > "try before / after" on the patch.
>
> Well, even the before/after might well have worked, since the problem
> depended on how that sw_fence_dummy_notify() function ended up
> aligned. So random unrelated changes could re-align it just by
> mistake.

Yup.

>
> Patch applied directly.

Great, thanks a lot.

>
> I'd also like to point out how that BUG_ON() actually made things
> worse, and made this harder to debug. If it had been a WARN_ON_ONCE(),
> this would presumably not even have needed bisecting, it would have
> been obvious.
>
> BUG_ON() really is pretty much *always* the wrong thing to do. It
> only results in problems being harder to see because you end up with
> a dead machine and the message is often hidden.

Jani made the same point. But I guess they then went off into the weeds
of how to recover when warning, so the fix itself did not progress.

Hugh

2021-10-04 08:00:20

by Jani Nikula

Subject: Re: [BUG 5.15-rc3] kernel BUG at drivers/gpu/drm/i915/i915_sw_fence.c:245!

On Sat, 02 Oct 2021, Hugh Dickins <[email protected]> wrote:
> On Sat, 2 Oct 2021, Linus Torvalds wrote:
>> On Sat, Oct 2, 2021 at 5:17 AM Steven Rostedt <[email protected]> wrote:
>> > On Sat, 2 Oct 2021 03:17:29 -0700 (PDT)
>> > Hugh Dickins <[email protected]> wrote:
>> >
>> > > Yes (though bisection doesn't work right on this one): the fix
>> >
>> > Interesting, as it appeared to be very reliable. But I didn't do the
>> > "try before / after" on the patch.
>>
>> Well, even the before/after might well have worked, since the problem
>> depended on how that sw_fence_dummy_notify() function ended up
>> aligned. So random unrelated changes could re-align it just by
>> mistake.
>
> Yup.
>
>>
>> Patch applied directly.
>
> Great, thanks a lot.

Thanks & sorry, really looks like we managed to drop this between the
cracks. :(

>
>>
>> I'd also like to point out how that BUG_ON() actually made things
>> worse, and made this harder to debug. If it had been a WARN_ON_ONCE(),
>> this would presumably not even have needed bisecting, it would have
>> been obvious.
>>
>> BUG_ON() really is pretty much *always* the wrong thing to do. It
>> only results in problems being harder to see because you end up with
>> a dead machine and the message is often hidden.
>
> Jani made the same point. But I guess they then went off into the weeds
> of how to recover when warning, so the fix itself did not progress.

Yes. That, as well as removing the entire alignment thing to reuse a
couple of bits for flags. Too fragile for its own good.

BR,
Jani.


--
Jani Nikula, Intel Open Source Graphics Center
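
One way to picture the "alignment thing" mentioned above is to compare a
packed layout, where flags share a word with the callback address, with a
split layout that stores them separately. The sketch below is userspace C
with hypothetical demo_/DEMO_ names; it does not reproduce the driver's
structures and only assumes the two-flag-bit scheme described in this
thread.

```c
#include <stdint.h>

/* Hypothetical flag values occupying the two low bits. */
#define DEMO_FLAG_CHECKED 0x1u
#define DEMO_FLAG_PRIVATE 0x2u
#define DEMO_FLAG_BITS    ((uintptr_t)0x3)

/* Packed layout: the flags are OR-ed into the callback address. */
static uintptr_t demo_packed_store(uintptr_t fn_addr, unsigned int flags)
{
	return fn_addr | (flags & DEMO_FLAG_BITS);
}

/* Recover the address by masking the flag bits off again. */
static uintptr_t demo_packed_fn(uintptr_t word)
{
	return word & ~DEMO_FLAG_BITS;
}

/* Recover the flags from the low two bits. */
static unsigned int demo_packed_flags(uintptr_t word)
{
	return (unsigned int)(word & DEMO_FLAG_BITS);
}

/* Split layout: address and flags kept apart, no alignment needed. */
struct demo_fence_split {
	uintptr_t fn_addr;
	unsigned int flags;
};

static struct demo_fence_split demo_split_store(uintptr_t fn_addr,
						unsigned int flags)
{
	struct demo_fence_split f = { fn_addr, flags };

	return f;
}
```

With an aligned address the packed form round-trips cleanly. With a
misaligned one such as 0xc09dfadd, the recovered address differs from the
original and a flag appears set that nobody set, which is the fragility the
check was guarding against. The split form costs an extra field but
round-trips any address.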