LinuxLists.cc - [BUG 6.3-rc1] Bad lock in ttm_bo_delayed

2023-03-08 02:22:34

Subject: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

In a report for a regression in my code, I tried to run v6.3-rc1 through my
tests. It crashed at boot up on my first test (my start up tests do take a
long time, hence the 206 seconds of boot!).

[ 206.238782] ------------[ cut here ]------------
[ 206.277786] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[ 206.277946] WARNING: CPU: 0 PID: 332 at kernel/locking/mutex.c:582 __ww_mutex_lock.constprop.0+0x566/0xfec
[ 206.313338] Modules linked in:
[ 206.324732] CPU: 0 PID: 332 Comm: kworker/0:13H Not tainted 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #965
[ 206.338273] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 206.353596] Workqueue: ttm ttm_bo_delayed_delete
[ 206.370520] EIP: __ww_mutex_lock.constprop.0+0x566/0xfec
[ 206.382855] Code: e8 ab 59 95 ff 85 c0 0f 84 25 fb ff ff 8b 0d 58 c0 3b cf 85 c9 0f 85 17 fb ff ff 68 e0 8d 07 cf 68 2b ac 05 cf e8 e6 e6 3f ff <0f> 0b 58 5a e9 ff fa ff ff e8 78 59 95 ff 85 c0 74 0e 8b 0d 58 c0
[ 206.411247] EAX: 00000028 EBX: 00000000 ECX: c3ae5dd8 EDX: 00000002
[ 206.425193] ESI: 00000000 EDI: c2d5f0bc EBP: c3ae5f00 ESP: c3ae5eac
[ 206.439236] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
[ 206.453597] CR0: 80050033 CR2: ff9ff000 CR3: 0f512000 CR4: 00150ef0
[ 206.467841] Call Trace:
[ 206.481059] ? ttm_bo_delayed_delete+0x30/0x94
[ 206.494980] ww_mutex_lock+0x32/0x94
[ 206.508699] ttm_bo_delayed_delete+0x30/0x94
[ 206.522371] process_one_work+0x21a/0x538
[ 206.536306] worker_thread+0x146/0x398
[ 206.549860] kthread+0xea/0x10c
[ 206.563141] ? process_one_work+0x538/0x538
[ 206.576835] ? kthread_complete_and_exit+0x1c/0x1c
[ 206.590652] ret_from_fork+0x1c/0x28
[ 206.604522] irq event stamp: 4219
[ 206.617852] hardirqs last enabled at (4219): [<ced2a039>] _raw_spin_unlock_irqrestore+0x2d/0x58
[ 206.633077] hardirqs last disabled at (4218): [<ce1d3a65>] kvfree_call_rcu+0x155/0x2ec
[ 206.648161] softirqs last enabled at (3570): [<ced2b113>] __do_softirq+0x2f3/0x48b
[ 206.663025] softirqs last disabled at (3565): [<ce0c84e9>] call_on_stack+0x45/0x4c
[ 206.678065] ---[ end trace 0000000000000000 ]---

Looks like there was a lock possibly used after free. But as commit
9bff18d13473a9fdf81d5158248472a9d8ecf2bd ("drm/ttm: use per BO cleanup
workers") changed a lot of this code, I figured it may be the culprit.

-- Steve

2023-03-08 02:26:24

by Steven Rostedt

[permalink] [raw]

Subject: Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

On Tue, 7 Mar 2023 21:22:23 -0500
Steven Rostedt <[email protected]> wrote:

> Looks like there was a lock possibly used after free. But as commit
> 9bff18d13473a9fdf81d5158248472a9d8ecf2bd ("drm/ttm: use per BO cleanup
> workers") changed a lot of this code, I figured it may be the culprit.

If I bothered to look at the second warning after this one (I usually stop
after the first), it appears to state there was a use after free issue.

[ 206.692285] ------------[ cut here ]------------
[ 206.706333] refcount_t: underflow; use-after-free.
[ 206.720577] WARNING: CPU: 0 PID: 332 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
[ 206.735810] Modules linked in:
[ 206.749493] CPU: 0 PID: 332 Comm: kworker/0:13H Tainted: G W 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #965
[ 206.765833] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 206.781767] Workqueue: ttm ttm_bo_delayed_delete
[ 206.796500] EIP: refcount_warn_saturate+0xb6/0xfc
[ 206.811121] Code: 68 50 1c 0d cf e8 66 b3 a9 ff 0f 0b 58 c9 c3 90 80 3d 57 c6 38 cf 00 75 8a c6 05 57 c6 38 cf 01 68 7c 1c 0d cf e8 46 b3 a9 ff <0f> 0b 59 c9 c3 80 3d 55 c6 38 cf 00 0f 85 67 ff ff ff c6 05 55 c6
[ 206.844560] EAX: 00000026 EBX: c2d5f150 ECX: c3ae5e40 EDX: 00000002
[ 206.862109] ESI: c2d5f0bc EDI: f6f91200 EBP: c3ae5f18 ESP: c3ae5f14
[ 206.878773] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
[ 206.895665] CR0: 80050033 CR2: ff9ff000 CR3: 0f512000 CR4: 00150ef0
[ 206.912303] Call Trace:
[ 206.927940] ttm_bo_delayed_delete+0x8c/0x94
[ 206.944179] process_one_work+0x21a/0x538
[ 206.960605] worker_thread+0x146/0x398
[ 206.976839] kthread+0xea/0x10c
[ 206.992696] ? process_one_work+0x538/0x538
[ 207.008827] ? kthread_complete_and_exit+0x1c/0x1c
[ 207.025150] ret_from_fork+0x1c/0x28
[ 207.041307] irq event stamp: 4219
[ 207.056883] hardirqs last enabled at (4219): [<ced2a039>] _raw_spin_unlock_irqrestore+0x2d/0x58
[ 207.074298] hardirqs last disabled at (4218): [<ce1d3a65>] kvfree_call_rcu+0x155/0x2ec
[ 207.091461] softirqs last enabled at (3570): [<ced2b113>] __do_softirq+0x2f3/0x48b
[ 207.107979] softirqs last disabled at (3565): [<ce0c84e9>] call_on_stack+0x45/0x4c
[ 207.123827] ---[ end trace 0000000000000000 ]---

-- Steve

2023-03-08 06:17:53

by Christian König

[permalink] [raw]

Subject: Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

Am 08.03.23 um 03:26 schrieb Steven Rostedt:
> On Tue, 7 Mar 2023 21:22:23 -0500
> Steven Rostedt <[email protected]> wrote:
>
>> Looks like there was a lock possibly used after free. But as commit
>> 9bff18d13473a9fdf81d5158248472a9d8ecf2bd ("drm/ttm: use per BO cleanup
>> workers") changed a lot of this code, I figured it may be the culprit.
> If I bothered to look at the second warning after this one (I usually stop
> after the first), it appears to state there was a use after free issue.

Yeah, that looks like the reference count was somehow messed up.

What test case/environment do you run to trigger this?

Thanks for the notice,
Christian.

>
> [ 206.692285] ------------[ cut here ]------------
> [ 206.706333] refcount_t: underflow; use-after-free.
> [ 206.720577] WARNING: CPU: 0 PID: 332 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
> [ 206.735810] Modules linked in:
> [ 206.749493] CPU: 0 PID: 332 Comm: kworker/0:13H Tainted: G W 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #965
> [ 206.765833] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [ 206.781767] Workqueue: ttm ttm_bo_delayed_delete
> [ 206.796500] EIP: refcount_warn_saturate+0xb6/0xfc
> [ 206.811121] Code: 68 50 1c 0d cf e8 66 b3 a9 ff 0f 0b 58 c9 c3 90 80 3d 57 c6 38 cf 00 75 8a c6 05 57 c6 38 cf 01 68 7c 1c 0d cf e8 46 b3 a9 ff <0f> 0b 59 c9 c3 80 3d 55 c6 38 cf 00 0f 85 67 ff ff ff c6 05 55 c6
> [ 206.844560] EAX: 00000026 EBX: c2d5f150 ECX: c3ae5e40 EDX: 00000002
> [ 206.862109] ESI: c2d5f0bc EDI: f6f91200 EBP: c3ae5f18 ESP: c3ae5f14
> [ 206.878773] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
> [ 206.895665] CR0: 80050033 CR2: ff9ff000 CR3: 0f512000 CR4: 00150ef0
> [ 206.912303] Call Trace:
> [ 206.927940] ttm_bo_delayed_delete+0x8c/0x94
> [ 206.944179] process_one_work+0x21a/0x538
> [ 206.960605] worker_thread+0x146/0x398
> [ 206.976839] kthread+0xea/0x10c
> [ 206.992696] ? process_one_work+0x538/0x538
> [ 207.008827] ? kthread_complete_and_exit+0x1c/0x1c
> [ 207.025150] ret_from_fork+0x1c/0x28
> [ 207.041307] irq event stamp: 4219
> [ 207.056883] hardirqs last enabled at (4219): [<ced2a039>] _raw_spin_unlock_irqrestore+0x2d/0x58
> [ 207.074298] hardirqs last disabled at (4218): [<ce1d3a65>] kvfree_call_rcu+0x155/0x2ec
> [ 207.091461] softirqs last enabled at (3570): [<ced2b113>] __do_softirq+0x2f3/0x48b
> [ 207.107979] softirqs last disabled at (3565): [<ce0c84e9>] call_on_stack+0x45/0x4c
> [ 207.123827] ---[ end trace 0000000000000000 ]---
>
>
> -- Steve

2023-03-08 12:43:48

by Steven Rostedt

[permalink] [raw]

Subject: Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

On Wed, 8 Mar 2023 07:17:38 +0100
Christian König <[email protected]> wrote:

> What test case/environment do you run to trigger this?

I'm running a 32bit x86 qemu instance. Attached is the config.

The libvirt xml file is here: https://rostedt.org/vm-images/tracetest-32.xml
and the VM image itself is here: https://rostedt.org/vm-images/tracetest-32.qcow2.bz2

It happened again in another test (it's not 100% reproducible).

[ 23.234838] ------------[ cut here ]------------
[ 23.236391] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[ 23.236429] WARNING: CPU: 0 PID: 61 at kernel/locking/mutex.c:582 __ww_mutex_lock.constprop.0+0x566/0xfec
[ 23.240990] Modules linked in:
[ 23.242368] CPU: 0 PID: 61 Comm: kworker/0:1H Not tainted 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #972
[ 23.245106] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 23.247900] Workqueue: ttm ttm_bo_delayed_delete
[ 23.249642] EIP: __ww_mutex_lock.constprop.0+0x566/0xfec
[ 23.251563] Code: e8 2b 5a 95 ff 85 c0 0f 84 25 fb ff ff 8b 0d 18 71 3b c8 85 c9 0f 85 17 fb ff ff 68 c0 58 07 c8 68 07 77 05 c8 e8 e6 0a 40 ff <0f> 0b 58 5a e9 ff fa ff ff e8 f8 59 95 ff 85 c0 74 0e 8b 0d 18 71
[ 23.256901] EAX: 00000028 EBX: 00000000 ECX: c1847dd8 EDX: 00000002
[ 23.258849] ESI: 00000000 EDI: c12958bc EBP: c1847f00 ESP: c1847eac
[ 23.260786] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
[ 23.262840] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
[ 23.264781] Call Trace:
[ 23.265899] ? lock_is_held_type+0xbe/0x10c
[ 23.267434] ? ttm_bo_delayed_delete+0x30/0x94
[ 23.268971] ww_mutex_lock+0x32/0x94
[ 23.270327] ttm_bo_delayed_delete+0x30/0x94
[ 23.271818] process_one_work+0x21a/0x538
[ 23.273242] worker_thread+0x146/0x398
[ 23.274616] kthread+0xea/0x10c
[ 23.275859] ? process_one_work+0x538/0x538
[ 23.277312] ? kthread_complete_and_exit+0x1c/0x1c
[ 23.278899] ret_from_fork+0x1c/0x28
[ 23.280223] irq event stamp: 33
[ 23.281440] hardirqs last enabled at (33): [<c7d28089>] _raw_spin_unlock_irqrestore+0x2d/0x58
[ 23.283860] hardirqs last disabled at (32): [<c71d4ba5>] kvfree_call_rcu+0x155/0x2ec
[ 23.286066] softirqs last enabled at (0): [<c71217fd>] copy_process+0x989/0x2368
[ 23.288220] softirqs last disabled at (0): [<00000000>] 0x0
[ 23.289952] ---[ end trace 0000000000000000 ]---
[ 23.291501] ------------[ cut here ]------------
[ 23.293027] refcount_t: underflow; use-after-free.
[ 23.294644] WARNING: CPU: 0 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
[ 23.296959] Modules linked in:
[ 23.298168] CPU: 0 PID: 61 Comm: kworker/0:1H Tainted: G W 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #972
[ 23.301073] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 23.303642] Workqueue: ttm ttm_bo_delayed_delete
[ 23.305190] EIP: refcount_warn_saturate+0xb6/0xfc
[ 23.306767] Code: 68 70 e1 0c c8 e8 f6 d6 a9 ff 0f 0b 58 c9 c3 90 80 3d 8a 78 38 c8 00 75 8a c6 05 8a 78 38 c8 01 68 9c e1 0c c8 e8 d6 d6 a9 ff <0f> 0b 59 c9 c3 80 3d 88 78 38 c8 00 0f 85 67 ff ff ff c6 05 88 78
[ 23.311935] EAX: 00000026 EBX: c1295950 ECX: c1847e40 EDX: 00000002
[ 23.313884] ESI: c12958bc EDI: f7591100 EBP: c1847f18 ESP: c1847f14
[ 23.315840] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
[ 23.317887] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
[ 23.319859] Call Trace:
[ 23.320978] ttm_bo_delayed_delete+0x8c/0x94
[ 23.322492] process_one_work+0x21a/0x538
[ 23.323959] worker_thread+0x146/0x398
[ 23.325353] kthread+0xea/0x10c
[ 23.326609] ? process_one_work+0x538/0x538
[ 23.328081] ? kthread_complete_and_exit+0x1c/0x1c
[ 23.329683] ret_from_fork+0x1c/0x28
[ 23.331011] irq event stamp: 33
[ 23.332251] hardirqs last enabled at (33): [<c7d28089>] _raw_spin_unlock_irqrestore+0x2d/0x58
[ 23.334334] hardirqs last disabled at (32): [<c71d4ba5>] kvfree_call_rcu+0x155/0x2ec
[ 23.336176] softirqs last enabled at (0): [<c71217fd>] copy_process+0x989/0x2368
[ 23.337904] softirqs last disabled at (0): [<00000000>] 0x0
[ 23.339313] ---[ end trace 0000000000000000 ]---

-- Steve

Attachments:

(No filename) (4.16 kB)
config (151.71 kB)
Download all attachments

2023-03-15 15:10:04

by Steven Rostedt

[permalink] [raw]

Subject: Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

On Wed, 8 Mar 2023 07:17:38 +0100
Christian König <[email protected]> wrote:

> Am 08.03.23 um 03:26 schrieb Steven Rostedt:
> > On Tue, 7 Mar 2023 21:22:23 -0500
> > Steven Rostedt <[email protected]> wrote:
> >
> >> Looks like there was a lock possibly used after free. But as commit
> >> 9bff18d13473a9fdf81d5158248472a9d8ecf2bd ("drm/ttm: use per BO cleanup
> >> workers") changed a lot of this code, I figured it may be the culprit.
> > If I bothered to look at the second warning after this one (I usually stop
> > after the first), it appears to state there was a use after free issue.
>
> Yeah, that looks like the reference count was somehow messed up.
>
> What test case/environment do you run to trigger this?
>
> Thanks for the notice,

I'm still getting this on Linus's latest tree.

[ 230.530222] ------------[ cut here ]------------
[ 230.569795] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[ 230.569957] WARNING: CPU: 0 PID: 212 at kernel/locking/mutex.c:582 __ww_mutex_lock.constprop.0+0x62a/0x1300
[ 230.612599] Modules linked in:
[ 230.632144] CPU: 0 PID: 212 Comm: kworker/0:8H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #992
[ 230.654939] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 230.678866] Workqueue: ttm ttm_bo_delayed_delete
[ 230.699452] EIP: __ww_mutex_lock.constprop.0+0x62a/0x1300
[ 230.720582] Code: e8 3b 9a 95 ff 85 c0 0f 84 61 fa ff ff 8b 0d 58 bc 3a c4 85 c9 0f 85 53 fa ff ff 68 54 98 06 c4 68 b7 b6 04 c4 e8 46 af 40 ff <0f> 0b 58 5a e9 3b fa ff ff 8d 74 26 00 90 a1 ec 47 b0 c4 85 c0 75
[ 230.768336] EAX: 00000028 EBX: 00000000 ECX: c51abdd8 EDX: 00000002
[ 230.792001] ESI: 00000000 EDI: c53856bc EBP: c51abf00 ESP: c51abeac
[ 230.815944] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
[ 230.840033] CR0: 80050033 CR2: ff9ff000 CR3: 04506000 CR4: 00150ef0
[ 230.864059] Call Trace:
[ 230.886369] ? ttm_bo_delayed_delete+0x30/0x94
[ 230.909902] ww_mutex_lock+0x32/0x94
[ 230.932550] ttm_bo_delayed_delete+0x30/0x94
[ 230.955798] process_one_work+0x21a/0x484
[ 230.979335] worker_thread+0x14a/0x39c
[ 231.002258] kthread+0xea/0x10c
[ 231.024769] ? process_one_work+0x484/0x484
[ 231.047870] ? kthread_complete_and_exit+0x1c/0x1c
[ 231.071498] ret_from_fork+0x1c/0x28
[ 231.094701] irq event stamp: 4023
[ 231.117272] hardirqs last enabled at (4023): [<c3d1df99>] _raw_spin_unlock_irqrestore+0x2d/0x58
[ 231.143217] hardirqs last disabled at (4022): [<c31d5a55>] kvfree_call_rcu+0x155/0x2ec
[ 231.166058] softirqs last enabled at (3460): [<c3d1f403>] __do_softirq+0x2c3/0x3bb
[ 231.183104] softirqs last disabled at (3455): [<c30c96a9>] call_on_stack+0x45/0x4c
[ 231.200336] ---[ end trace 0000000000000000 ]---
[ 231.216572] ------------[ cut here ]------------

This is preventing me from adding any of my own patches on v6.3-rcX due to
this bug failing my tests. Which means I can't add anything to linux-next
until this is fixed!

-- Steve

Attachments:

(No filename) (2.94 kB)
config-fail (151.91 kB)
Download all attachments

2023-03-15 15:25:21

by Christian König

[permalink] [raw]

Subject: Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

Am 15.03.23 um 16:09 schrieb Steven Rostedt:
> On Wed, 8 Mar 2023 07:17:38 +0100
> Christian König <[email protected]> wrote:
>
>> Am 08.03.23 um 03:26 schrieb Steven Rostedt:
>>> On Tue, 7 Mar 2023 21:22:23 -0500
>>> Steven Rostedt <[email protected]> wrote:
>>>
>>>> Looks like there was a lock possibly used after free. But as commit
>>>> 9bff18d13473a9fdf81d5158248472a9d8ecf2bd ("drm/ttm: use per BO cleanup
>>>> workers") changed a lot of this code, I figured it may be the culprit.
>>> If I bothered to look at the second warning after this one (I usually stop
>>> after the first), it appears to state there was a use after free issue.
>> Yeah, that looks like the reference count was somehow messed up.
>>
>> What test case/environment do you run to trigger this?
>>
>> Thanks for the notice,
> I'm still getting this on Linus's latest tree.

This must be some reference counting issue which only happens in your
particular use case. We have tested this quite extensively and couldn't
reproduce it so far.

Can you apply this code snippet here and see if you get any warning in
the system logs?

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 459f1b4440da..efc390bfd69c 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -314,6 +314,7 @@ static void ttm_bo_delayed_delete(struct work_struct
*work)
        dma_resv_lock(bo->base.resv, NULL);
        ttm_bo_cleanup_memtype_use(bo);
        dma_resv_unlock(bo->base.resv);
+       bo->delayed_delete.func = NULL;
        ttm_bo_put(bo);
}

@@ -327,6 +328,8 @@ static void ttm_bo_release(struct kref *kref)
        WARN_ON_ONCE(bo->pin_count);
        WARN_ON_ONCE(bo->bulk_move);

+       WARN_ON(bo->delayed_delete.func != NULL);
+
        if (!bo->deleted) {
                ret = ttm_bo_individualize_resv(bo);
                if (ret) {

Thanks,
Christian.

>
> [ 230.530222] ------------[ cut here ]------------
> [ 230.569795] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> [ 230.569957] WARNING: CPU: 0 PID: 212 at kernel/locking/mutex.c:582 __ww_mutex_lock.constprop.0+0x62a/0x1300
> [ 230.612599] Modules linked in:
> [ 230.632144] CPU: 0 PID: 212 Comm: kworker/0:8H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #992
> [ 230.654939] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [ 230.678866] Workqueue: ttm ttm_bo_delayed_delete
> [ 230.699452] EIP: __ww_mutex_lock.constprop.0+0x62a/0x1300
> [ 230.720582] Code: e8 3b 9a 95 ff 85 c0 0f 84 61 fa ff ff 8b 0d 58 bc 3a c4 85 c9 0f 85 53 fa ff ff 68 54 98 06 c4 68 b7 b6 04 c4 e8 46 af 40 ff <0f> 0b 58 5a e9 3b fa ff ff 8d 74 26 00 90 a1 ec 47 b0 c4 85 c0 75
> [ 230.768336] EAX: 00000028 EBX: 00000000 ECX: c51abdd8 EDX: 00000002
> [ 230.792001] ESI: 00000000 EDI: c53856bc EBP: c51abf00 ESP: c51abeac
> [ 230.815944] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
> [ 230.840033] CR0: 80050033 CR2: ff9ff000 CR3: 04506000 CR4: 00150ef0
> [ 230.864059] Call Trace:
> [ 230.886369] ? ttm_bo_delayed_delete+0x30/0x94
> [ 230.909902] ww_mutex_lock+0x32/0x94
> [ 230.932550] ttm_bo_delayed_delete+0x30/0x94
> [ 230.955798] process_one_work+0x21a/0x484
> [ 230.979335] worker_thread+0x14a/0x39c
> [ 231.002258] kthread+0xea/0x10c
> [ 231.024769] ? process_one_work+0x484/0x484
> [ 231.047870] ? kthread_complete_and_exit+0x1c/0x1c
> [ 231.071498] ret_from_fork+0x1c/0x28
> [ 231.094701] irq event stamp: 4023
> [ 231.117272] hardirqs last enabled at (4023): [<c3d1df99>] _raw_spin_unlock_irqrestore+0x2d/0x58
> [ 231.143217] hardirqs last disabled at (4022): [<c31d5a55>] kvfree_call_rcu+0x155/0x2ec
> [ 231.166058] softirqs last enabled at (3460): [<c3d1f403>] __do_softirq+0x2c3/0x3bb
> [ 231.183104] softirqs last disabled at (3455): [<c30c96a9>] call_on_stack+0x45/0x4c
> [ 231.200336] ---[ end trace 0000000000000000 ]---
> [ 231.216572] ------------[ cut here ]------------
>
>
> This is preventing me from adding any of my own patches on v6.3-rcX due to
> this bug failing my tests. Which means I can't add anything to linux-next
> until this is fixed!
>
> -- Steve

2023-03-15 15:58:12

by Steven Rostedt

[permalink] [raw]

Subject: Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

On Wed, 15 Mar 2023 16:25:11 +0100
Christian König <[email protected]> wrote:
> >>
> >> Thanks for the notice,
> > I'm still getting this on Linus's latest tree.
>
> This must be some reference counting issue which only happens in your
> particular use case. We have tested this quite extensively and couldn't
> reproduce it so far.

Have you tried 32 bit with my config. I also sent a link to your previous
email that gives access to the VM image I'm using that is triggering this
issue.

Here it is again:

The libvirt xml file is here: https://rostedt.org/vm-images/tracetest-32.xml
and the VM image itself is here: https://rostedt.org/vm-images/tracetest-32.qcow2.bz2

>
> Can you apply this code snippet here and see if you get any warning in
> the system logs?
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 459f1b4440da..efc390bfd69c 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -314,6 +314,7 @@ static void ttm_bo_delayed_delete(struct work_struct
> *work)
>         dma_resv_lock(bo->base.resv, NULL);
>         ttm_bo_cleanup_memtype_use(bo);
>         dma_resv_unlock(bo->base.resv);
> +       bo->delayed_delete.func = NULL;
>         ttm_bo_put(bo);
> }
>
> @@ -327,6 +328,8 @@ static void ttm_bo_release(struct kref *kref)
>         WARN_ON_ONCE(bo->pin_count);
>         WARN_ON_ONCE(bo->bulk_move);
>
> +       WARN_ON(bo->delayed_delete.func != NULL);
> +
>         if (!bo->deleted) {
>                 ret = ttm_bo_individualize_resv(bo);
>                 if (ret) {
>

The WARN_ON triggered:

[ 21.481449] mpls_gso: MPLS GSO support
[ 21.488795] IPI shorthand broadcast: enabled
[ 21.488873] ------------[ cut here ]------------
[ 21.490101] ------------[ cut here ]------------

[ 21.491693] WARNING: CPU: 1 PID: 38 at drivers/gpu/drm/ttm/ttm_bo.c:332 ttm_bo_release+0x2ac/0x2fc <<<---- Line of the added WARN_ON()

[ 21.492940] refcount_t: underflow; use-after-free.
[ 21.492965] WARNING: CPU: 0 PID: 84 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
[ 21.496116] Modules linked in:
[ 21.497197] Modules linked in:
[ 21.500105] CPU: 1 PID: 38 Comm: kworker/1:1 Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
[ 21.500789] CPU: 0 PID: 84 Comm: kworker/0:1H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
[ 21.501882] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 21.503533] sched_clock: Marking stable (20788024762, 714243692)->(22140778105, -638509651)
[ 21.504080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 21.504089] Workqueue: ttm ttm_bo_delayed_delete
[ 21.507196] Workqueue: events drm_fb_helper_damage_work
[ 21.509235]
[ 21.510291] registered taskstats version 1
[ 21.510302] Running ring buffer tests...
[ 21.511792]
[ 21.513870] EIP: refcount_warn_saturate+0xb6/0xfc
[ 21.515261] EIP: ttm_bo_release+0x2ac/0x2fc
[ 21.516566] Code: 68 00 27 0c d8 e8 36 3b aa ff 0f 0b 58 c9 c3 90 80 3d 41 c2 37 d8 00 75 8a c6 05 41 c2 37 d8 01 68 2c 27 0c d8 e8 16 3b aa ff <0f> 0b 59 c9 c3 80 3d 3f c2 37 d8 00 0f 85 67 ff ff ff c6 05 3f c2
[ 21.516998] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b 8b 43 10 85 c0 0f 84 a1 fd ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 9c fd ff ff 8d 76 00 <0f> 0b e9 92 fd ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
[ 21.517905] EAX: 00000026 EBX: c129d150 ECX: 00000040 EDX: 00000002
[ 21.518987] EAX: d78c8550 EBX: c129d134 ECX: c129d134 EDX: 00000001
[ 21.519337] ESI: c129d0bc EDI: f6f91200 EBP: c2b8bf18 ESP: c2b8bf14
[ 21.520617] ESI: c129d000 EDI: c126a7a0 EBP: c1839c24 ESP: c1839bec
[ 21.521546] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
[ 21.526154] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
[ 21.526162] CR0: 80050033 CR2: 00000000 CR3: 18506000 CR4: 00150ef0
[ 21.526166] Call Trace:
[ 21.526189] ? ww_mutex_unlock+0x3a/0x94
[ 21.530300] CR0: 80050033 CR2: ff9ff000 CR3: 18506000 CR4: 00150ef0
[ 21.531722] ? ttm_bo_cleanup_refs+0xc4/0x1e0
[ 21.533114] Call Trace:
[ 21.534516] ttm_mem_evict_first+0x3d3/0x568
[ 21.535901] ttm_bo_delayed_delete+0x9c/0xa4
[ 21.537391] ? kfree+0x6b/0xdc
[ 21.538901] process_one_work+0x21a/0x484
[ 21.540279] ? ttm_range_man_alloc+0xe0/0xec
[ 21.540854] worker_thread+0x14a/0x39c
[ 21.541714] ? ttm_range_man_fini_nocheck+0xe8/0xe8
[ 21.543332] kthread+0xea/0x10c
[ 21.544301] ttm_bo_mem_space+0x1d0/0x1e4
[ 21.544942] ? process_one_work+0x484/0x484
[ 21.545887] ttm_bo_validate+0xc5/0x19c
[ 21.546986] ? kthread_complete_and_exit+0x1c/0x1c
[ 21.547680] ttm_bo_init_reserved+0x15e/0x1fc
[ 21.548716] ret_from_fork+0x1c/0x28
[ 21.549650] qxl_bo_create+0x145/0x20c

Note, this is all on boot up before user space is running.

-- Steve

2023-03-15 17:31:55

by Steven Rostedt

[permalink] [raw]

Subject: Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

On Wed, 15 Mar 2023 11:57:12 -0400
Steven Rostedt <[email protected]> wrote:

So I'm looking at the backtraces.

> The WARN_ON triggered:
>
> [ 21.481449] mpls_gso: MPLS GSO support
> [ 21.488795] IPI shorthand broadcast: enabled
> [ 21.488873] ------------[ cut here ]------------
> [ 21.490101] ------------[ cut here ]------------
>
> [ 21.491693] WARNING: CPU: 1 PID: 38 at drivers/gpu/drm/ttm/ttm_bo.c:332 ttm_bo_release+0x2ac/0x2fc <<<---- Line of the added WARN_ON()

This happened on CPU 1.

>
> [ 21.492940] refcount_t: underflow; use-after-free.
> [ 21.492965] WARNING: CPU: 0 PID: 84 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc

This happened on CPU 0.

> [ 21.496116] Modules linked in:
> [ 21.497197] Modules linked in:
> [ 21.500105] CPU: 1 PID: 38 Comm: kworker/1:1 Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
> [ 21.500789] CPU: 0 PID: 84 Comm: kworker/0:1H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
> [ 21.501882] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [ 21.503533] sched_clock: Marking stable (20788024762, 714243692)->(22140778105, -638509651)
> [ 21.504080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [ 21.504089] Workqueue: ttm ttm_bo_delayed_delete
> [ 21.507196] Workqueue: events drm_fb_helper_damage_work
> [ 21.509235]
> [ 21.510291] registered taskstats version 1
> [ 21.510302] Running ring buffer tests...
> [ 21.511792]
> [ 21.513870] EIP: refcount_warn_saturate+0xb6/0xfc
> [ 21.515261] EIP: ttm_bo_release+0x2ac/0x2fc
> [ 21.516566] Code: 68 00 27 0c d8 e8 36 3b aa ff 0f 0b 58 c9 c3 90 80 3d 41 c2 37 d8 00 75 8a c6 05 41 c2 37 d8 01 68 2c 27 0c d8 e8 16 3b aa ff <0f> 0b 59 c9 c3 80 3d 3f c2 37 d8 00 0f 85 67 ff ff ff c6 05 3f c2
> [ 21.516998] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b 8b 43 10 85 c0 0f 84 a1 fd ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 9c fd ff ff 8d 76 00 <0f> 0b e9 92 fd ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
> [ 21.517905] EAX: 00000026 EBX: c129d150 ECX: 00000040 EDX: 00000002
> [ 21.518987] EAX: d78c8550 EBX: c129d134 ECX: c129d134 EDX: 00000001
> [ 21.519337] ESI: c129d0bc EDI: f6f91200 EBP: c2b8bf18 ESP: c2b8bf14
> [ 21.520617] ESI: c129d000 EDI: c126a7a0 EBP: c1839c24 ESP: c1839bec
> [ 21.521546] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
> [ 21.526154] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
> [ 21.526162] CR0: 80050033 CR2: 00000000 CR3: 18506000 CR4: 00150ef0
> [ 21.526166] Call Trace:
> [ 21.526189] ? ww_mutex_unlock+0x3a/0x94
> [ 21.530300] CR0: 80050033 CR2: ff9ff000 CR3: 18506000 CR4: 00150ef0
> [ 21.531722] ? ttm_bo_cleanup_refs+0xc4/0x1e0
> [ 21.533114] Call Trace:
> [ 21.534516] ttm_mem_evict_first+0x3d3/0x568
> [ 21.535901] ttm_bo_delayed_delete+0x9c/0xa4
> [ 21.537391] ? kfree+0x6b/0xdc
> [ 21.538901] process_one_work+0x21a/0x484
> [ 21.540279] ? ttm_range_man_alloc+0xe0/0xec
> [ 21.540854] worker_thread+0x14a/0x39c
> [ 21.541714] ? ttm_range_man_fini_nocheck+0xe8/0xe8
> [ 21.543332] kthread+0xea/0x10c
> [ 21.544301] ttm_bo_mem_space+0x1d0/0x1e4
> [ 21.544942] ? process_one_work+0x484/0x484
> [ 21.545887] ttm_bo_validate+0xc5/0x19c
> [ 21.546986] ? kthread_complete_and_exit+0x1c/0x1c
> [ 21.547680] ttm_bo_init_reserved+0x15e/0x1fc
> [ 21.548716] ret_from_fork+0x1c/0x28
> [ 21.549650] qxl_bo_create+0x145/0x20c

The qxl_bo_create() calls ttm_bo_init_reserved() as the object in question
is about to be freed.

I'm guessing what is happening here, is that an object was to be freed by
the delayed_delete, and in the mean time, something else picked it up.

What's protecting this from not being used again?

-- Steve

2023-03-15 17:55:09

by Steven Rostedt

[permalink] [raw]

Subject: Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

On Wed, 15 Mar 2023 11:57:12 -0400
Steven Rostedt <[email protected]> wrote:

> The WARN_ON triggered:
>
> [ 21.481449] mpls_gso: MPLS GSO support
> [ 21.488795] IPI shorthand broadcast: enabled
> [ 21.488873] ------------[ cut here ]------------
> [ 21.490101] ------------[ cut here ]------------
>
> [ 21.491693] WARNING: CPU: 1 PID: 38 at drivers/gpu/drm/ttm/ttm_bo.c:332 ttm_bo_release+0x2ac/0x2fc <<<---- Line of the added WARN_ON()
>
> [ 21.492940] refcount_t: underflow; use-after-free.
> [ 21.492965] WARNING: CPU: 0 PID: 84 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
> [ 21.496116] Modules linked in:
> [ 21.497197] Modules linked in:
> [ 21.500105] CPU: 1 PID: 38 Comm: kworker/1:1 Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
> [ 21.500789] CPU: 0 PID: 84 Comm: kworker/0:1H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
> [ 21.501882] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [ 21.503533] sched_clock: Marking stable (20788024762, 714243692)->(22140778105, -638509651)
> [ 21.504080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [ 21.504089] Workqueue: ttm ttm_bo_delayed_delete
> [ 21.507196] Workqueue: events drm_fb_helper_damage_work
> [ 21.509235]
> [ 21.510291] registered taskstats version 1
> [ 21.510302] Running ring buffer tests...
> [ 21.511792]
> [ 21.513870] EIP: refcount_warn_saturate+0xb6/0xfc
> [ 21.515261] EIP: ttm_bo_release+0x2ac/0x2fc
> [ 21.516566] Code: 68 00 27 0c d8 e8 36 3b aa ff 0f 0b 58 c9 c3 90 80 3d 41 c2 37 d8 00 75 8a c6 05 41 c2 37 d8 01 68 2c 27 0c d8 e8 16 3b aa ff <0f> 0b 59 c9 c3 80 3d 3f c2 37 d8 00 0f 85 67 ff ff ff c6 05 3f c2
> [ 21.516998] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b 8b 43 10 85 c0 0f 84 a1 fd ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 9c fd ff ff 8d 76 00 <0f> 0b e9 92 fd ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
> [ 21.517905] EAX: 00000026 EBX: c129d150 ECX: 00000040 EDX: 00000002
> [ 21.518987] EAX: d78c8550 EBX: c129d134 ECX: c129d134 EDX: 00000001
> [ 21.519337] ESI: c129d0bc EDI: f6f91200 EBP: c2b8bf18 ESP: c2b8bf14
> [ 21.520617] ESI: c129d000 EDI: c126a7a0 EBP: c1839c24 ESP: c1839bec
> [ 21.521546] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
> [ 21.526154] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
> [ 21.526162] CR0: 80050033 CR2: 00000000 CR3: 18506000 CR4: 00150ef0
> [ 21.526166] Call Trace:
> [ 21.526189] ? ww_mutex_unlock+0x3a/0x94
> [ 21.530300] CR0: 80050033 CR2: ff9ff000 CR3: 18506000 CR4: 00150ef0
> [ 21.531722] ? ttm_bo_cleanup_refs+0xc4/0x1e0
> [ 21.533114] Call Trace:
> [ 21.534516] ttm_mem_evict_first+0x3d3/0x568
> [ 21.535901] ttm_bo_delayed_delete+0x9c/0xa4
> [ 21.537391] ? kfree+0x6b/0xdc
> [ 21.538901] process_one_work+0x21a/0x484
> [ 21.540279] ? ttm_range_man_alloc+0xe0/0xec
> [ 21.540854] worker_thread+0x14a/0x39c
> [ 21.541714] ? ttm_range_man_fini_nocheck+0xe8/0xe8
> [ 21.543332] kthread+0xea/0x10c

So I triggered it again, and the same backtrace is there.

> [ 21.544301] ttm_bo_mem_space+0x1d0/0x1e4

It looks like the object is being reserved before it's fully removed. And
it's somewhere in this tty_bo_mem_space() (which comes from the
qxl_bo_create()).

I don't know this code at all, nor do I have any idea of what it's trying
to do. All I know is that this is triggering often (not always), and it has
to do with some race.

Now my config has lots of debugging enabled, which slows down the system
quite a bit. This also happens to open up race windows. Just because your
testing doesn't trigger it, doesn't mean that the race doesn't exist. It's
just likely to be very hard to hit.

> [ 21.544942] ? process_one_work+0x484/0x484
> [ 21.545887] ttm_bo_validate+0xc5/0x19c
> [ 21.546986] ? kthread_complete_and_exit+0x1c/0x1c
> [ 21.547680] ttm_bo_init_reserved+0x15e/0x1fc
> [ 21.548716] ret_from_fork+0x1c/0x28
> [ 21.549650] qxl_bo_create+0x145/0x20c

Here's the latest backtrace:

[ 170.817449] ------------[ cut here ]------------
[ 170.817455] ------------[ cut here ]------------
[ 170.818210] refcount_t: underflow; use-after-free.
[ 170.818228] WARNING: CPU: 0 PID: 267 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
[ 170.819352] WARNING: CPU: 3 PID: 2382 at drivers/gpu/drm/ttm/ttm_bo.c:332 ttm_bo_release+0x278/0x2c8
[ 170.820124] Modules linked in:
[ 170.822127] Modules linked in:
[ 170.823829]
[ 170.823832] CPU: 0 PID: 267 Comm: kworker/0:10H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #998
[ 170.824610]
[ 170.825121] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 170.825124] Workqueue: ttm ttm_bo_delayed_delete
[ 170.825498] CPU: 3 PID: 2382 Comm: kworker/3:3 Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #998
[ 170.826996]
[ 170.827367] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 170.828839] EIP: refcount_warn_saturate+0xb6/0xfc
[ 170.829842] Workqueue: events drm_fb_helper_damage_work
[ 170.831325] Code: 68 50 ab ef d3 e8 76 ce b2 ff 0f 0b 58 c9 c3 90 80 3d 4e e0 18 d4 00 75 8a c6 05 4e e0 18 d4 01 68 7c ab ef d3 e8 56 ce b2 ff <0f> 0b 59 c9 c3 80 3d 4c e0 18 d4 00 0f 85 67 ff ff ff c6 05 4c e0
[ 170.831670]
[ 170.833138] EAX: 00000026 EBX: c2aa6ef4 ECX: 00000002 EDX: 80000002
[ 170.834080] EIP: ttm_bo_release+0x278/0x2c8
[ 170.834910] ESI: c27744e0 EDI: f758eaa8 EBP: c3083f34 ESP: c3083f30
[ 170.838453] Code: 00 90 89 f8 e9 91 fe ff ff 90 0f 0b 8b 43 10 85 c0 0f 84 d5 fd ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 d0 fd ff ff 8d 76 00 <0f> 0b e9 c6 fd ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
[ 170.838714] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
[ 170.839872] EAX: d3835cf0 EBX: c2aa6ed8 ECX: c2aa6ed8 EDX: 00000001
[ 170.840546] CR0: 80050033 CR2: 005ce01c CR3: 03afa000 CR4: 00150ef0
[ 170.841669] ESI: c2aa6e00 EDI: c11333d8 EBP: c3b15ca0 ESP: c3b15c68
[ 170.844583] Call Trace:
[ 170.845738] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
[ 170.846732] ttm_bo_delayed_delete+0x7c/0x84
[ 170.847766] CR0: 80050033 CR2: 005b59b1 CR3: 03b12000 CR4: 00150ef0
[ 170.848761] process_one_work+0x18a/0x350
[ 170.849177] Call Trace:
[ 170.850250] worker_thread+0x136/0x378
[ 170.850942] ? ttm_resource_free+0x57/0x6c
[ 170.851928] ? _raw_spin_unlock_irqrestore+0x25/0x40
[ 170.852583] ? ttm_bo_cleanup_memtype_use+0x5f/0x64
[ 170.853002] kthread+0xda/0xfc
[ 170.853608] ? ttm_bo_cleanup_refs+0xbb/0x1c8
[ 170.854266] ? rescuer_thread+0x340/0x340
[ 170.855058] ttm_mem_evict_first+0x324/0x3a0
[ 170.855840] ? kthread_complete_and_exit+0x1c/0x1c
[ 170.856341] ttm_bo_mem_space+0x1d8/0x1ec
[ 170.857056] ret_from_fork+0x1c/0x28
[ 170.857702] ttm_bo_validate+0xb3/0x13c
[ 170.858393] ---[ end trace 0000000000000000 ]---
[ 170.859167] ttm_bo_init_reserved+0x12e/0x1cc
[ 170.863261] systemd[1]: Starting Create System Users...
[ 170.863964] qxl_bo_create+0x14a/0x1f8
[ 170.892037] ? qxl_ttm_debugfs_init+0x48/0x48
[ 170.892043] qxl_alloc_bo_reserved+0x39/0x98
Starting ESC[0;1;39mCreat[ 170.892047] ? kmalloc_trace+0x22/0x84
e System UsersESC[0m...
[ 170.895826] qxl_image_alloc_objects+0x91/0xfc
[ 170.897122] qxl_draw_dirty_fb+0x159/0x440
[ 170.898207] systemd[1]: Finished Set the console keyboard layout.
[ 170.898329] qxl_framebuffer_surface_dirty+0xfb/0x1cc
[ 170.901528] ? qxl_create_plane+0xc4/0xc4
[ 170.901533] drm_fbdev_fb_dirty+0x129/0x1ac
[ESC[0;32m OK ESC[0m] Finished ESC[0[ 170.901537] drm_fb_helper_damage_work+0x8f/0x158
;1;39mSet the console keyboard layoutESC[0m.
[ 170.905435] process_one_work+0x18a/0x350
[ 170.906578] worker_thread+0x136/0x378
[ 170.906832] systemd[1]: Finished Load/Save Random Seed.
[ 170.907195] ? _raw_spin_unlock_irqrestore+0x25/0x40
[ 170.910267] kthread+0xda/0xfc
[ 170.910271] ? rescuer_thread+0x340/0x340
[ 170.910275] ? kthread_complete_and_exit+0x1c/0x1c
[ 170.910278] ret_from_fork+0x1c/0x28
[ 170.910282] ---[ end trace 0000000000000000 ]---

-- Steve

2023-03-15 18:25:26

by Christian König

[permalink] [raw]

Subject: Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

Am 15.03.23 um 18:54 schrieb Steven Rostedt:
> On Wed, 15 Mar 2023 11:57:12 -0400
> Steven Rostedt <[email protected]> wrote:
>
>> The WARN_ON triggered:
>>
>> [ 21.481449] mpls_gso: MPLS GSO support
>> [ 21.488795] IPI shorthand broadcast: enabled
>> [ 21.488873] ------------[ cut here ]------------
>> [ 21.490101] ------------[ cut here ]------------
>>
>> [ 21.491693] WARNING: CPU: 1 PID: 38 at drivers/gpu/drm/ttm/ttm_bo.c:332 ttm_bo_release+0x2ac/0x2fc <<<---- Line of the added WARN_ON()
>>
>> [ 21.492940] refcount_t: underflow; use-after-free.
>> [ 21.492965] WARNING: CPU: 0 PID: 84 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
>> [ 21.496116] Modules linked in:
>> [ 21.497197] Modules linked in:

The problem here is that two backtraces mix together. So it's pretty
hard to figure out what's going on.

>> [ 21.500105] CPU: 1 PID: 38 Comm: kworker/1:1 Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
>> [ 21.500789] CPU: 0 PID: 84 Comm: kworker/0:1H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
>> [ 21.501882] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
>> [ 21.503533] sched_clock: Marking stable (20788024762, 714243692)->(22140778105, -638509651)
>> [ 21.504080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
>> [ 21.504089] Workqueue: ttm ttm_bo_delayed_delete
>> [ 21.507196] Workqueue: events drm_fb_helper_damage_work
>> [ 21.509235]
>> [ 21.510291] registered taskstats version 1
>> [ 21.510302] Running ring buffer tests...
>> [ 21.511792]
>> [ 21.513870] EIP: refcount_warn_saturate+0xb6/0xfc
>> [ 21.515261] EIP: ttm_bo_release+0x2ac/0x2fc
>> [ 21.516566] Code: 68 00 27 0c d8 e8 36 3b aa ff 0f 0b 58 c9 c3 90 80 3d 41 c2 37 d8 00 75 8a c6 05 41 c2 37 d8 01 68 2c 27 0c d8 e8 16 3b aa ff <0f> 0b 59 c9 c3 80 3d 3f c2 37 d8 00 0f 85 67 ff ff ff c6 05 3f c2
>> [ 21.516998] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b 8b 43 10 85 c0 0f 84 a1 fd ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 9c fd ff ff 8d 76 00 <0f> 0b e9 92 fd ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
>> [ 21.517905] EAX: 00000026 EBX: c129d150 ECX: 00000040 EDX: 00000002
>> [ 21.518987] EAX: d78c8550 EBX: c129d134 ECX: c129d134 EDX: 00000001
>> [ 21.519337] ESI: c129d0bc EDI: f6f91200 EBP: c2b8bf18 ESP: c2b8bf14
>> [ 21.520617] ESI: c129d000 EDI: c126a7a0 EBP: c1839c24 ESP: c1839bec
>> [ 21.521546] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
>> [ 21.526154] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
>> [ 21.526162] CR0: 80050033 CR2: 00000000 CR3: 18506000 CR4: 00150ef0
>> [ 21.526166] Call Trace:
>> [ 21.526189] ? ww_mutex_unlock+0x3a/0x94
>> [ 21.530300] CR0: 80050033 CR2: ff9ff000 CR3: 18506000 CR4: 00150ef0
>> [ 21.531722] ? ttm_bo_cleanup_refs+0xc4/0x1e0
>> [ 21.533114] Call Trace:
>> [ 21.534516] ttm_mem_evict_first+0x3d3/0x568
>> [ 21.535901] ttm_bo_delayed_delete+0x9c/0xa4
>> [ 21.537391] ? kfree+0x6b/0xdc
>> [ 21.538901] process_one_work+0x21a/0x484
>> [ 21.540279] ? ttm_range_man_alloc+0xe0/0xec
>> [ 21.540854] worker_thread+0x14a/0x39c
>> [ 21.541714] ? ttm_range_man_fini_nocheck+0xe8/0xe8
>> [ 21.543332] kthread+0xea/0x10c
> So I triggered it again, and the same backtrace is there.
>
>> [ 21.544301] ttm_bo_mem_space+0x1d0/0x1e4
> It looks like the object is being reserved before it's fully removed. And
> it's somewhere in this tty_bo_mem_space() (which comes from the
> qxl_bo_create()).
>
> I don't know this code at all, nor do I have any idea of what it's trying
> to do. All I know is that this is triggering often (not always), and it has
> to do with some race.
>
> Now my config has lots of debugging enabled, which slows down the system
> quite a bit. This also happens to open up race windows. Just because your
> testing doesn't trigger it, doesn't mean that the race doesn't exist. It's
> just likely to be very hard to hit.
>
>> [ 21.544942] ? process_one_work+0x484/0x484
>> [ 21.545887] ttm_bo_validate+0xc5/0x19c
>> [ 21.546986] ? kthread_complete_and_exit+0x1c/0x1c
>> [ 21.547680] ttm_bo_init_reserved+0x15e/0x1fc
>> [ 21.548716] ret_from_fork+0x1c/0x28
>> [ 21.549650] qxl_bo_create+0x145/0x20c
> Here's the latest backtrace:
>
> [ 170.817449] ------------[ cut here ]------------
> [ 170.817455] ------------[ cut here ]------------
> [ 170.818210] refcount_t: underflow; use-after-free.
> [ 170.818228] WARNING: CPU: 0 PID: 267 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
> [ 170.819352] WARNING: CPU: 3 PID: 2382 at drivers/gpu/drm/ttm/ttm_bo.c:332 ttm_bo_release+0x278/0x2c8
> [ 170.820124] Modules linked in:
> [ 170.822127] Modules linked in:
> [ 170.823829]
> [ 170.823832] CPU: 0 PID: 267 Comm: kworker/0:10H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #998
> [ 170.824610]
> [ 170.825121] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [ 170.825124] Workqueue: ttm ttm_bo_delayed_delete
> [ 170.825498] CPU: 3 PID: 2382 Comm: kworker/3:3 Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #998
> [ 170.826996]
> [ 170.827367] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [ 170.828839] EIP: refcount_warn_saturate+0xb6/0xfc
> [ 170.829842] Workqueue: events drm_fb_helper_damage_work
> [ 170.831325] Code: 68 50 ab ef d3 e8 76 ce b2 ff 0f 0b 58 c9 c3 90 80 3d 4e e0 18 d4 00 75 8a c6 05 4e e0 18 d4 01 68 7c ab ef d3 e8 56 ce b2 ff <0f> 0b 59 c9 c3 80 3d 4c e0 18 d4 00 0f 85 67 ff ff ff c6 05 4c e0
> [ 170.831670]
> [ 170.833138] EAX: 00000026 EBX: c2aa6ef4 ECX: 00000002 EDX: 80000002
> [ 170.834080] EIP: ttm_bo_release+0x278/0x2c8
> [ 170.834910] ESI: c27744e0 EDI: f758eaa8 EBP: c3083f34 ESP: c3083f30
> [ 170.838453] Code: 00 90 89 f8 e9 91 fe ff ff 90 0f 0b 8b 43 10 85 c0 0f 84 d5 fd ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 d0 fd ff ff 8d 76 00 <0f> 0b e9 c6 fd ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
> [ 170.838714] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
> [ 170.839872] EAX: d3835cf0 EBX: c2aa6ed8 ECX: c2aa6ed8 EDX: 00000001
> [ 170.840546] CR0: 80050033 CR2: 005ce01c CR3: 03afa000 CR4: 00150ef0
> [ 170.841669] ESI: c2aa6e00 EDI: c11333d8 EBP: c3b15ca0 ESP: c3b15c68
> [ 170.844583] Call Trace:
> [ 170.845738] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
> [ 170.846732] ttm_bo_delayed_delete+0x7c/0x84
> [ 170.847766] CR0: 80050033 CR2: 005b59b1 CR3: 03b12000 CR4: 00150ef0
> [ 170.848761] process_one_work+0x18a/0x350
> [ 170.849177] Call Trace:
> [ 170.850250] worker_thread+0x136/0x378
> [ 170.850942] ? ttm_resource_free+0x57/0x6c
> [ 170.851928] ? _raw_spin_unlock_irqrestore+0x25/0x40
> [ 170.852583] ? ttm_bo_cleanup_memtype_use+0x5f/0x64
> [ 170.853002] kthread+0xda/0xfc
> [ 170.853608] ? ttm_bo_cleanup_refs+0xbb/0x1c8
> [ 170.854266] ? rescuer_thread+0x340/0x340
> [ 170.855058] ttm_mem_evict_first+0x324/0x3a0
> [ 170.855840] ? kthread_complete_and_exit+0x1c/0x1c
> [ 170.856341] ttm_bo_mem_space+0x1d8/0x1ec
> [ 170.857056] ret_from_fork+0x1c/0x28
> [ 170.857702] ttm_bo_validate+0xb3/0x13c
> [ 170.858393] ---[ end trace 0000000000000000 ]---
> [ 170.859167] ttm_bo_init_reserved+0x12e/0x1cc
> [ 170.863261] systemd[1]: Starting Create System Users...
> [ 170.863964] qxl_bo_create+0x14a/0x1f8
> [ 170.892037] ? qxl_ttm_debugfs_init+0x48/0x48
> [ 170.892043] qxl_alloc_bo_reserved+0x39/0x98
> Starting ESC[0;1;39mCreat[ 170.892047] ? kmalloc_trace+0x22/0x84
> e System UsersESC[0m...
> [ 170.895826] qxl_image_alloc_objects+0x91/0xfc
> [ 170.897122] qxl_draw_dirty_fb+0x159/0x440
> [ 170.898207] systemd[1]: Finished Set the console keyboard layout.
> [ 170.898329] qxl_framebuffer_surface_dirty+0xfb/0x1cc
> [ 170.901528] ? qxl_create_plane+0xc4/0xc4
> [ 170.901533] drm_fbdev_fb_dirty+0x129/0x1ac
> [ESC[0;32m OK ESC[0m] Finished ESC[0[ 170.901537] drm_fb_helper_damage_work+0x8f/0x158
> ;1;39mSet the console keyboard layoutESC[0m.
> [ 170.905435] process_one_work+0x18a/0x350
> [ 170.906578] worker_thread+0x136/0x378
> [ 170.906832] systemd[1]: Finished Load/Save Random Seed.
> [ 170.907195] ? _raw_spin_unlock_irqrestore+0x25/0x40
> [ 170.910267] kthread+0xda/0xfc
> [ 170.910271] ? rescuer_thread+0x340/0x340
> [ 170.910275] ? kthread_complete_and_exit+0x1c/0x1c
> [ 170.910278] ret_from_fork+0x1c/0x28
> [ 170.910282] ---[ end trace 0000000000000000 ]---

That looks like an eviction is racing with finishing this BO. Let me
take a closer look.

It would help if you replace the WARN_ON(bo->delayed_delete.func) with
an "if (WARN_ON(bo->delayed_delete.func)) return;" and disable the
refcount warning.

This way we should not get other warnings except for the one we are
looking for.

Regards,
Christian.

>
> -- Steve

2023-03-15 18:35:50

by Christian König

[permalink] [raw]

Subject: Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

Am 15.03.23 um 18:31 schrieb Steven Rostedt:
> On Wed, 15 Mar 2023 11:57:12 -0400
> Steven Rostedt <[email protected]> wrote:
>
> So I'm looking at the backtraces.
>
>> The WARN_ON triggered:
>>
>> [ 21.481449] mpls_gso: MPLS GSO support
>> [ 21.488795] IPI shorthand broadcast: enabled
>> [ 21.488873] ------------[ cut here ]------------
>> [ 21.490101] ------------[ cut here ]------------
>>
>> [ 21.491693] WARNING: CPU: 1 PID: 38 at drivers/gpu/drm/ttm/ttm_bo.c:332 ttm_bo_release+0x2ac/0x2fc <<<---- Line of the added WARN_ON()
> This happened on CPU 1.
>
>> [ 21.492940] refcount_t: underflow; use-after-free.
>> [ 21.492965] WARNING: CPU: 0 PID: 84 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
> This happened on CPU 0.
>
>> [ 21.496116] Modules linked in:
>> [ 21.497197] Modules linked in:
>> [ 21.500105] CPU: 1 PID: 38 Comm: kworker/1:1 Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
>> [ 21.500789] CPU: 0 PID: 84 Comm: kworker/0:1H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #993
>> [ 21.501882] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
>> [ 21.503533] sched_clock: Marking stable (20788024762, 714243692)->(22140778105, -638509651)
>> [ 21.504080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
>> [ 21.504089] Workqueue: ttm ttm_bo_delayed_delete
>> [ 21.507196] Workqueue: events drm_fb_helper_damage_work
>> [ 21.509235]
>> [ 21.510291] registered taskstats version 1
>> [ 21.510302] Running ring buffer tests...
>> [ 21.511792]
>> [ 21.513870] EIP: refcount_warn_saturate+0xb6/0xfc
>> [ 21.515261] EIP: ttm_bo_release+0x2ac/0x2fc
>> [ 21.516566] Code: 68 00 27 0c d8 e8 36 3b aa ff 0f 0b 58 c9 c3 90 80 3d 41 c2 37 d8 00 75 8a c6 05 41 c2 37 d8 01 68 2c 27 0c d8 e8 16 3b aa ff <0f> 0b 59 c9 c3 80 3d 3f c2 37 d8 00 0f 85 67 ff ff ff c6 05 3f c2
>> [ 21.516998] Code: ff 8d b4 26 00 00 00 00 66 90 0f 0b 8b 43 10 85 c0 0f 84 a1 fd ff ff 8d 76 00 0f 0b 8b 43 28 85 c0 0f 84 9c fd ff ff 8d 76 00 <0f> 0b e9 92 fd ff ff 8d b4 26 00 00 00 00 66 90 c7 43 18 00 00 00
>> [ 21.517905] EAX: 00000026 EBX: c129d150 ECX: 00000040 EDX: 00000002
>> [ 21.518987] EAX: d78c8550 EBX: c129d134 ECX: c129d134 EDX: 00000001
>> [ 21.519337] ESI: c129d0bc EDI: f6f91200 EBP: c2b8bf18 ESP: c2b8bf14
>> [ 21.520617] ESI: c129d000 EDI: c126a7a0 EBP: c1839c24 ESP: c1839bec
>> [ 21.521546] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
>> [ 21.526154] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
>> [ 21.526162] CR0: 80050033 CR2: 00000000 CR3: 18506000 CR4: 00150ef0
>> [ 21.526166] Call Trace:
>> [ 21.526189] ? ww_mutex_unlock+0x3a/0x94
>> [ 21.530300] CR0: 80050033 CR2: ff9ff000 CR3: 18506000 CR4: 00150ef0
>> [ 21.531722] ? ttm_bo_cleanup_refs+0xc4/0x1e0
>> [ 21.533114] Call Trace:
>> [ 21.534516] ttm_mem_evict_first+0x3d3/0x568
>> [ 21.535901] ttm_bo_delayed_delete+0x9c/0xa4
>> [ 21.537391] ? kfree+0x6b/0xdc
>> [ 21.538901] process_one_work+0x21a/0x484
>> [ 21.540279] ? ttm_range_man_alloc+0xe0/0xec
>> [ 21.540854] worker_thread+0x14a/0x39c
>> [ 21.541714] ? ttm_range_man_fini_nocheck+0xe8/0xe8
>> [ 21.543332] kthread+0xea/0x10c
>> [ 21.544301] ttm_bo_mem_space+0x1d0/0x1e4
>> [ 21.544942] ? process_one_work+0x484/0x484
>> [ 21.545887] ttm_bo_validate+0xc5/0x19c
>> [ 21.546986] ? kthread_complete_and_exit+0x1c/0x1c
>> [ 21.547680] ttm_bo_init_reserved+0x15e/0x1fc
>> [ 21.548716] ret_from_fork+0x1c/0x28
>> [ 21.549650] qxl_bo_create+0x145/0x20c
> The qxl_bo_create() calls ttm_bo_init_reserved() as the object in question
> is about to be freed.
>
> I'm guessing what is happening here, is that an object was to be freed by
> the delayed_delete, and in the mean time, something else picked it up.
>
> What's protecting this from not being used again?

The reference count. This is pretty clearly an unbalanced reference
counting issue.

It's just that previously you wouldn't notice it much because we were
just silently removing the BO from the LRU list without checking if it
was already removed (and so just damaging a bit of memory).

While now we get tons of errors because the delayed worker actually runs
no matter if the BO is already freed or not.

Christian.

>
> -- Steve
>

2023-03-15 18:42:50

by Christian König

[permalink] [raw]

Subject: Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

Am 08.03.23 um 13:43 schrieb Steven Rostedt:
> On Wed, 8 Mar 2023 07:17:38 +0100
> Christian König <[email protected]> wrote:
>
>> What test case/environment do you run to trigger this?
> I'm running a 32bit x86 qemu instance. Attached is the config.
>
> The libvirt xml file is here: https://rostedt.org/vm-images/tracetest-32.xml
> and the VM image itself is here: https://rostedt.org/vm-images/tracetest-32.qcow2.bz2

I've started to download that, but it will take about an hour. So I
tried to avoid that for now.

But looks like there isn't any other way to reproduce this, the code
seems to work with both amdgpu and radeon.

My suspicion is that we just have a reference count issue in qxl or ttm
which was never noticed because it didn't caused any problems (except
for a minor memory corruption).

Now you get a rain of warnings because we try to grab the lock in the
delete worker.

Christian.

>
> It happened again in another test (it's not 100% reproducible).
>
> [ 23.234838] ------------[ cut here ]------------
> [ 23.236391] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> [ 23.236429] WARNING: CPU: 0 PID: 61 at kernel/locking/mutex.c:582 __ww_mutex_lock.constprop.0+0x566/0xfec
> [ 23.240990] Modules linked in:
> [ 23.242368] CPU: 0 PID: 61 Comm: kworker/0:1H Not tainted 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #972
> [ 23.245106] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [ 23.247900] Workqueue: ttm ttm_bo_delayed_delete
> [ 23.249642] EIP: __ww_mutex_lock.constprop.0+0x566/0xfec
> [ 23.251563] Code: e8 2b 5a 95 ff 85 c0 0f 84 25 fb ff ff 8b 0d 18 71 3b c8 85 c9 0f 85 17 fb ff ff 68 c0 58 07 c8 68 07 77 05 c8 e8 e6 0a 40 ff <0f> 0b 58 5a e9 ff fa ff ff e8 f8 59 95 ff 85 c0 74 0e 8b 0d 18 71
> [ 23.256901] EAX: 00000028 EBX: 00000000 ECX: c1847dd8 EDX: 00000002
> [ 23.258849] ESI: 00000000 EDI: c12958bc EBP: c1847f00 ESP: c1847eac
> [ 23.260786] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
> [ 23.262840] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
> [ 23.264781] Call Trace:
> [ 23.265899] ? lock_is_held_type+0xbe/0x10c
> [ 23.267434] ? ttm_bo_delayed_delete+0x30/0x94
> [ 23.268971] ww_mutex_lock+0x32/0x94
> [ 23.270327] ttm_bo_delayed_delete+0x30/0x94
> [ 23.271818] process_one_work+0x21a/0x538
> [ 23.273242] worker_thread+0x146/0x398
> [ 23.274616] kthread+0xea/0x10c
> [ 23.275859] ? process_one_work+0x538/0x538
> [ 23.277312] ? kthread_complete_and_exit+0x1c/0x1c
> [ 23.278899] ret_from_fork+0x1c/0x28
> [ 23.280223] irq event stamp: 33
> [ 23.281440] hardirqs last enabled at (33): [<c7d28089>] _raw_spin_unlock_irqrestore+0x2d/0x58
> [ 23.283860] hardirqs last disabled at (32): [<c71d4ba5>] kvfree_call_rcu+0x155/0x2ec
> [ 23.286066] softirqs last enabled at (0): [<c71217fd>] copy_process+0x989/0x2368
> [ 23.288220] softirqs last disabled at (0): [<00000000>] 0x0
> [ 23.289952] ---[ end trace 0000000000000000 ]---
> [ 23.291501] ------------[ cut here ]------------
> [ 23.293027] refcount_t: underflow; use-after-free.
> [ 23.294644] WARNING: CPU: 0 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
> [ 23.296959] Modules linked in:
> [ 23.298168] CPU: 0 PID: 61 Comm: kworker/0:1H Tainted: G W 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #972
> [ 23.301073] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [ 23.303642] Workqueue: ttm ttm_bo_delayed_delete
> [ 23.305190] EIP: refcount_warn_saturate+0xb6/0xfc
> [ 23.306767] Code: 68 70 e1 0c c8 e8 f6 d6 a9 ff 0f 0b 58 c9 c3 90 80 3d 8a 78 38 c8 00 75 8a c6 05 8a 78 38 c8 01 68 9c e1 0c c8 e8 d6 d6 a9 ff <0f> 0b 59 c9 c3 80 3d 88 78 38 c8 00 0f 85 67 ff ff ff c6 05 88 78
> [ 23.311935] EAX: 00000026 EBX: c1295950 ECX: c1847e40 EDX: 00000002
> [ 23.313884] ESI: c12958bc EDI: f7591100 EBP: c1847f18 ESP: c1847f14
> [ 23.315840] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
> [ 23.317887] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
> [ 23.319859] Call Trace:
> [ 23.320978] ttm_bo_delayed_delete+0x8c/0x94
> [ 23.322492] process_one_work+0x21a/0x538
> [ 23.323959] worker_thread+0x146/0x398
> [ 23.325353] kthread+0xea/0x10c
> [ 23.326609] ? process_one_work+0x538/0x538
> [ 23.328081] ? kthread_complete_and_exit+0x1c/0x1c
> [ 23.329683] ret_from_fork+0x1c/0x28
> [ 23.331011] irq event stamp: 33
> [ 23.332251] hardirqs last enabled at (33): [<c7d28089>] _raw_spin_unlock_irqrestore+0x2d/0x58
> [ 23.334334] hardirqs last disabled at (32): [<c71d4ba5>] kvfree_call_rcu+0x155/0x2ec
> [ 23.336176] softirqs last enabled at (0): [<c71217fd>] copy_process+0x989/0x2368
> [ 23.337904] softirqs last disabled at (0): [<00000000>] 0x0
> [ 23.339313] ---[ end trace 0000000000000000 ]---
>
> -- Steve

2023-03-15 19:15:55

by Matthew Auld

[permalink] [raw]

Subject: Re: [Intel-gfx] [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

On Wed, 15 Mar 2023 at 18:41, Christian König
<[email protected]> wrote:
>
> Am 08.03.23 um 13:43 schrieb Steven Rostedt:
> > On Wed, 8 Mar 2023 07:17:38 +0100
> > Christian König <[email protected]> wrote:
> >
> >> What test case/environment do you run to trigger this?
> > I'm running a 32bit x86 qemu instance. Attached is the config.
> >
> > The libvirt xml file is here: https://rostedt.org/vm-images/tracetest-32.xml
> > and the VM image itself is here: https://rostedt.org/vm-images/tracetest-32.qcow2.bz2
>
> I've started to download that, but it will take about an hour. So I
> tried to avoid that for now.
>
> But looks like there isn't any other way to reproduce this, the code
> seems to work with both amdgpu and radeon.
>
> My suspicion is that we just have a reference count issue in qxl or ttm
> which was never noticed because it didn't caused any problems (except
> for a minor memory corruption).

Why does ttm_bo_cleanup_refs() do a bo_put() at the end? It doesn't
make sense to me. Say if the BO is in the process of being delay freed
(bo->deleted = true), and we just did the kref_init() in
ttm_bo_release(), it might drop that ref hitting ttm_bo_release() yet
again, this time doing the actual bo->destroy(), which frees the
object. The worker then fires at some later point calling
ttm_bo_delayed_delete(), but the BO has already been freed.

>
> Now you get a rain of warnings because we try to grab the lock in the
> delete worker.
>
> Christian.
>
> >
> > It happened again in another test (it's not 100% reproducible).
> >
> > [ 23.234838] ------------[ cut here ]------------
> > [ 23.236391] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> > [ 23.236429] WARNING: CPU: 0 PID: 61 at kernel/locking/mutex.c:582 __ww_mutex_lock.constprop.0+0x566/0xfec
> > [ 23.240990] Modules linked in:
> > [ 23.242368] CPU: 0 PID: 61 Comm: kworker/0:1H Not tainted 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #972
> > [ 23.245106] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> > [ 23.247900] Workqueue: ttm ttm_bo_delayed_delete
> > [ 23.249642] EIP: __ww_mutex_lock.constprop.0+0x566/0xfec
> > [ 23.251563] Code: e8 2b 5a 95 ff 85 c0 0f 84 25 fb ff ff 8b 0d 18 71 3b c8 85 c9 0f 85 17 fb ff ff 68 c0 58 07 c8 68 07 77 05 c8 e8 e6 0a 40 ff <0f> 0b 58 5a e9 ff fa ff ff e8 f8 59 95 ff 85 c0 74 0e 8b 0d 18 71
> > [ 23.256901] EAX: 00000028 EBX: 00000000 ECX: c1847dd8 EDX: 00000002
> > [ 23.258849] ESI: 00000000 EDI: c12958bc EBP: c1847f00 ESP: c1847eac
> > [ 23.260786] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
> > [ 23.262840] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
> > [ 23.264781] Call Trace:
> > [ 23.265899] ? lock_is_held_type+0xbe/0x10c
> > [ 23.267434] ? ttm_bo_delayed_delete+0x30/0x94
> > [ 23.268971] ww_mutex_lock+0x32/0x94
> > [ 23.270327] ttm_bo_delayed_delete+0x30/0x94
> > [ 23.271818] process_one_work+0x21a/0x538
> > [ 23.273242] worker_thread+0x146/0x398
> > [ 23.274616] kthread+0xea/0x10c
> > [ 23.275859] ? process_one_work+0x538/0x538
> > [ 23.277312] ? kthread_complete_and_exit+0x1c/0x1c
> > [ 23.278899] ret_from_fork+0x1c/0x28
> > [ 23.280223] irq event stamp: 33
> > [ 23.281440] hardirqs last enabled at (33): [<c7d28089>] _raw_spin_unlock_irqrestore+0x2d/0x58
> > [ 23.283860] hardirqs last disabled at (32): [<c71d4ba5>] kvfree_call_rcu+0x155/0x2ec
> > [ 23.286066] softirqs last enabled at (0): [<c71217fd>] copy_process+0x989/0x2368
> > [ 23.288220] softirqs last disabled at (0): [<00000000>] 0x0
> > [ 23.289952] ---[ end trace 0000000000000000 ]---
> > [ 23.291501] ------------[ cut here ]------------
> > [ 23.293027] refcount_t: underflow; use-after-free.
> > [ 23.294644] WARNING: CPU: 0 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
> > [ 23.296959] Modules linked in:
> > [ 23.298168] CPU: 0 PID: 61 Comm: kworker/0:1H Tainted: G W 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #972
> > [ 23.301073] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> > [ 23.303642] Workqueue: ttm ttm_bo_delayed_delete
> > [ 23.305190] EIP: refcount_warn_saturate+0xb6/0xfc
> > [ 23.306767] Code: 68 70 e1 0c c8 e8 f6 d6 a9 ff 0f 0b 58 c9 c3 90 80 3d 8a 78 38 c8 00 75 8a c6 05 8a 78 38 c8 01 68 9c e1 0c c8 e8 d6 d6 a9 ff <0f> 0b 59 c9 c3 80 3d 88 78 38 c8 00 0f 85 67 ff ff ff c6 05 88 78
> > [ 23.311935] EAX: 00000026 EBX: c1295950 ECX: c1847e40 EDX: 00000002
> > [ 23.313884] ESI: c12958bc EDI: f7591100 EBP: c1847f18 ESP: c1847f14
> > [ 23.315840] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
> > [ 23.317887] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
> > [ 23.319859] Call Trace:
> > [ 23.320978] ttm_bo_delayed_delete+0x8c/0x94
> > [ 23.322492] process_one_work+0x21a/0x538
> > [ 23.323959] worker_thread+0x146/0x398
> > [ 23.325353] kthread+0xea/0x10c
> > [ 23.326609] ? process_one_work+0x538/0x538
> > [ 23.328081] ? kthread_complete_and_exit+0x1c/0x1c
> > [ 23.329683] ret_from_fork+0x1c/0x28
> > [ 23.331011] irq event stamp: 33
> > [ 23.332251] hardirqs last enabled at (33): [<c7d28089>] _raw_spin_unlock_irqrestore+0x2d/0x58
> > [ 23.334334] hardirqs last disabled at (32): [<c71d4ba5>] kvfree_call_rcu+0x155/0x2ec
> > [ 23.336176] softirqs last enabled at (0): [<c71217fd>] copy_process+0x989/0x2368
> > [ 23.337904] softirqs last disabled at (0): [<00000000>] 0x0
> > [ 23.339313] ---[ end trace 0000000000000000 ]---
> >
> > -- Steve
>

2023-03-15 19:52:13

by Christian König

[permalink] [raw]

Subject: Re: [Intel-gfx] [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

Am 15.03.23 um 20:15 schrieb Matthew Auld:
> On Wed, 15 Mar 2023 at 18:41, Christian König
> <[email protected]> wrote:
>> Am 08.03.23 um 13:43 schrieb Steven Rostedt:
>>> On Wed, 8 Mar 2023 07:17:38 +0100
>>> Christian König <[email protected]> wrote:
>>>
>>>> What test case/environment do you run to trigger this?
>>> I'm running a 32bit x86 qemu instance. Attached is the config.
>>>
>>> The libvirt xml file is here: https://rostedt.org/vm-images/tracetest-32.xml
>>> and the VM image itself is here: https://rostedt.org/vm-images/tracetest-32.qcow2.bz2
>> I've started to download that, but it will take about an hour. So I
>> tried to avoid that for now.
>>
>> But looks like there isn't any other way to reproduce this, the code
>> seems to work with both amdgpu and radeon.
>>
>> My suspicion is that we just have a reference count issue in qxl or ttm
>> which was never noticed because it didn't caused any problems (except
>> for a minor memory corruption).
> Why does ttm_bo_cleanup_refs() do a bo_put() at the end?

Yeah, that's the bug. I own you a beer!

In the old model we had an extra reference while the BOs where on the
deleted list and I've forgot to remove this put here.

Steven please try the attached patch.

Thanks,
Christian.

> It doesn't
> make sense to me. Say if the BO is in the process of being delay freed
> (bo->deleted = true), and we just did the kref_init() in
> ttm_bo_release(), it might drop that ref hitting ttm_bo_release() yet
> again, this time doing the actual bo->destroy(), which frees the
> object. The worker then fires at some later point calling
> ttm_bo_delayed_delete(), but the BO has already been freed.
>
>> Now you get a rain of warnings because we try to grab the lock in the
>> delete worker.
>>
>> Christian.
>>
>>> It happened again in another test (it's not 100% reproducible).
>>>
>>> [ 23.234838] ------------[ cut here ]------------
>>> [ 23.236391] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
>>> [ 23.236429] WARNING: CPU: 0 PID: 61 at kernel/locking/mutex.c:582 __ww_mutex_lock.constprop.0+0x566/0xfec
>>> [ 23.240990] Modules linked in:
>>> [ 23.242368] CPU: 0 PID: 61 Comm: kworker/0:1H Not tainted 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #972
>>> [ 23.245106] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
>>> [ 23.247900] Workqueue: ttm ttm_bo_delayed_delete
>>> [ 23.249642] EIP: __ww_mutex_lock.constprop.0+0x566/0xfec
>>> [ 23.251563] Code: e8 2b 5a 95 ff 85 c0 0f 84 25 fb ff ff 8b 0d 18 71 3b c8 85 c9 0f 85 17 fb ff ff 68 c0 58 07 c8 68 07 77 05 c8 e8 e6 0a 40 ff <0f> 0b 58 5a e9 ff fa ff ff e8 f8 59 95 ff 85 c0 74 0e 8b 0d 18 71
>>> [ 23.256901] EAX: 00000028 EBX: 00000000 ECX: c1847dd8 EDX: 00000002
>>> [ 23.258849] ESI: 00000000 EDI: c12958bc EBP: c1847f00 ESP: c1847eac
>>> [ 23.260786] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
>>> [ 23.262840] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
>>> [ 23.264781] Call Trace:
>>> [ 23.265899] ? lock_is_held_type+0xbe/0x10c
>>> [ 23.267434] ? ttm_bo_delayed_delete+0x30/0x94
>>> [ 23.268971] ww_mutex_lock+0x32/0x94
>>> [ 23.270327] ttm_bo_delayed_delete+0x30/0x94
>>> [ 23.271818] process_one_work+0x21a/0x538
>>> [ 23.273242] worker_thread+0x146/0x398
>>> [ 23.274616] kthread+0xea/0x10c
>>> [ 23.275859] ? process_one_work+0x538/0x538
>>> [ 23.277312] ? kthread_complete_and_exit+0x1c/0x1c
>>> [ 23.278899] ret_from_fork+0x1c/0x28
>>> [ 23.280223] irq event stamp: 33
>>> [ 23.281440] hardirqs last enabled at (33): [<c7d28089>] _raw_spin_unlock_irqrestore+0x2d/0x58
>>> [ 23.283860] hardirqs last disabled at (32): [<c71d4ba5>] kvfree_call_rcu+0x155/0x2ec
>>> [ 23.286066] softirqs last enabled at (0): [<c71217fd>] copy_process+0x989/0x2368
>>> [ 23.288220] softirqs last disabled at (0): [<00000000>] 0x0
>>> [ 23.289952] ---[ end trace 0000000000000000 ]---
>>> [ 23.291501] ------------[ cut here ]------------
>>> [ 23.293027] refcount_t: underflow; use-after-free.
>>> [ 23.294644] WARNING: CPU: 0 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
>>> [ 23.296959] Modules linked in:
>>> [ 23.298168] CPU: 0 PID: 61 Comm: kworker/0:1H Tainted: G W 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #972
>>> [ 23.301073] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
>>> [ 23.303642] Workqueue: ttm ttm_bo_delayed_delete
>>> [ 23.305190] EIP: refcount_warn_saturate+0xb6/0xfc
>>> [ 23.306767] Code: 68 70 e1 0c c8 e8 f6 d6 a9 ff 0f 0b 58 c9 c3 90 80 3d 8a 78 38 c8 00 75 8a c6 05 8a 78 38 c8 01 68 9c e1 0c c8 e8 d6 d6 a9 ff <0f> 0b 59 c9 c3 80 3d 88 78 38 c8 00 0f 85 67 ff ff ff c6 05 88 78
>>> [ 23.311935] EAX: 00000026 EBX: c1295950 ECX: c1847e40 EDX: 00000002
>>> [ 23.313884] ESI: c12958bc EDI: f7591100 EBP: c1847f18 ESP: c1847f14
>>> [ 23.315840] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
>>> [ 23.317887] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
>>> [ 23.319859] Call Trace:
>>> [ 23.320978] ttm_bo_delayed_delete+0x8c/0x94
>>> [ 23.322492] process_one_work+0x21a/0x538
>>> [ 23.323959] worker_thread+0x146/0x398
>>> [ 23.325353] kthread+0xea/0x10c
>>> [ 23.326609] ? process_one_work+0x538/0x538
>>> [ 23.328081] ? kthread_complete_and_exit+0x1c/0x1c
>>> [ 23.329683] ret_from_fork+0x1c/0x28
>>> [ 23.331011] irq event stamp: 33
>>> [ 23.332251] hardirqs last enabled at (33): [<c7d28089>] _raw_spin_unlock_irqrestore+0x2d/0x58
>>> [ 23.334334] hardirqs last disabled at (32): [<c71d4ba5>] kvfree_call_rcu+0x155/0x2ec
>>> [ 23.336176] softirqs last enabled at (0): [<c71217fd>] copy_process+0x989/0x2368
>>> [ 23.337904] softirqs last disabled at (0): [<00000000>] 0x0
>>> [ 23.339313] ---[ end trace 0000000000000000 ]---
>>>
>>> -- Steve

Attachments:

0001-drm-ttm-drop-extra-ttm_bo_put-in-ttm_bo_cleanup_refs.patch (982.00 B)

2023-03-15 20:20:20

by Steven Rostedt

[permalink] [raw]

Subject: Re: [Intel-gfx] [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

On Wed, 15 Mar 2023 20:51:49 +0100
Christian König <[email protected]> wrote:

> Steven please try the attached patch.

I applied it, but as it's not always reproducible, I'll have to give it
several runs before I give you my "tested-by" tag.

-- Steve

2023-03-16 00:21:44

by Steven Rostedt

[permalink] [raw]

Subject: Re: [Intel-gfx] [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

On Wed, 15 Mar 2023 16:20:11 -0400
Steven Rostedt <[email protected]> wrote:

> On Wed, 15 Mar 2023 20:51:49 +0100
> Christian König <[email protected]> wrote:
>
> > Steven please try the attached patch.
>
> I applied it, but as it's not always reproducible, I'll have to give it
> several runs before I give you my "tested-by" tag.

I ran my tests a bunch of times with this applied and it didn't fail once.
As it got further than it ever did before (it usually took 1 to 3 runs to
trigger, and I ran it over 10 times).

Reported-by: Steven Rostedt (Google) <[email protected]>
Tested-by: Steven Rostedt (Google) <[email protected]>

Thanks!

-- Steve

2023-03-16 00:22:50

by Steven Rostedt

[permalink] [raw]

Subject: Re: [Intel-gfx] [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

On Wed, 15 Mar 2023 20:21:33 -0400
Steven Rostedt <[email protected]> wrote:

> On Wed, 15 Mar 2023 16:20:11 -0400
> Steven Rostedt <[email protected]> wrote:
>
> > On Wed, 15 Mar 2023 20:51:49 +0100
> > Christian König <[email protected]> wrote:
> >
> > > Steven please try the attached patch.
> >
> > I applied it, but as it's not always reproducible, I'll have to give it
> > several runs before I give you my "tested-by" tag.
>
> I ran my tests a bunch of times with this applied and it didn't fail once.
> As it got further than it ever did before (it usually took 1 to 3 runs to
> trigger, and I ran it over 10 times).
>
> Reported-by: Steven Rostedt (Google) <[email protected]>
> Tested-by: Steven Rostedt (Google) <[email protected]>

I hope that this gets in by -rc3, as I want to start basing my next branch
on that tag.

Thanks,

-- Steve

2023-03-17 17:43:29

by Linus Torvalds

[permalink] [raw]

Subject: Re: [Intel-gfx] [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

On Wed, Mar 15, 2023 at 5:22 PM Steven Rostedt <[email protected]> wrote:
>
> I hope that this gets in by -rc3, as I want to start basing my next branch
> on that tag.

My tree should have it now as commit c00133a9e87e ("drm/ttm: drop
extra ttm_bo_put in ttm_bo_cleanup_refs").

Linus