LinuxLists.cc - [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

2024-03-07 22:17:05

Subject: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

An errant disk backup on my desktop got into debugfs and triggered the
following deadlock scenario in the amdgpu debugfs files. The machine
also hard-resets immediately after those lines are printed (although I
wasn't able to reproduce that part when reading by hand):

[ 1318.016074][ T1082] ======================================================
[ 1318.016607][ T1082] WARNING: possible circular locking dependency detected
[ 1318.017107][ T1082] 6.8.0-rc7-00015-ge0c8221b72c0 #17 Not tainted
[ 1318.017598][ T1082] ------------------------------------------------------
[ 1318.018096][ T1082] tar/1082 is trying to acquire lock:
[ 1318.018585][ T1082] ffff98c44175d6a0 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x40/0x80
[ 1318.019084][ T1082]
[ 1318.019084][ T1082] but task is already holding lock:
[ 1318.020052][ T1082] ffff98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
[ 1318.020607][ T1082]
[ 1318.020607][ T1082] which lock already depends on the new lock.
[ 1318.020607][ T1082]
[ 1318.022081][ T1082]
[ 1318.022081][ T1082] the existing dependency chain (in reverse order) is:
[ 1318.023083][ T1082]
[ 1318.023083][ T1082] -> #2 (reservation_ww_class_mutex){+.+.}-{3:3}:
[ 1318.024114][ T1082] __ww_mutex_lock.constprop.0+0xe0/0x12f0
[ 1318.024639][ T1082] ww_mutex_lock+0x32/0x90
[ 1318.025161][ T1082] dma_resv_lockdep+0x18a/0x330
[ 1318.025683][ T1082] do_one_initcall+0x6a/0x350
[ 1318.026210][ T1082] kernel_init_freeable+0x1a3/0x310
[ 1318.026728][ T1082] kernel_init+0x15/0x1a0
[ 1318.027242][ T1082] ret_from_fork+0x2c/0x40
[ 1318.027759][ T1082] ret_from_fork_asm+0x11/0x20
[ 1318.028281][ T1082]
[ 1318.028281][ T1082] -> #1 (reservation_ww_class_acquire){+.+.}-{0:0}:
[ 1318.029297][ T1082] dma_resv_lockdep+0x16c/0x330
[ 1318.029790][ T1082] do_one_initcall+0x6a/0x350
[ 1318.030263][ T1082] kernel_init_freeable+0x1a3/0x310
[ 1318.030722][ T1082] kernel_init+0x15/0x1a0
[ 1318.031168][ T1082] ret_from_fork+0x2c/0x40
[ 1318.031598][ T1082] ret_from_fork_asm+0x11/0x20
[ 1318.032011][ T1082]
[ 1318.032011][ T1082] -> #0 (&mm->mmap_lock){++++}-{3:3}:
[ 1318.032778][ T1082] __lock_acquire+0x14bf/0x2680
[ 1318.033141][ T1082] lock_acquire+0xcd/0x2c0
[ 1318.033487][ T1082] __might_fault+0x58/0x80
[ 1318.033814][ T1082] amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu]
[ 1318.034181][ T1082] full_proxy_read+0x55/0x80
[ 1318.034487][ T1082] vfs_read+0xa7/0x360
[ 1318.034788][ T1082] ksys_read+0x70/0xf0
[ 1318.035085][ T1082] do_syscall_64+0x94/0x180
[ 1318.035375][ T1082] entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 1318.035664][ T1082]
[ 1318.035664][ T1082] other info that might help us debug this:
[ 1318.035664][ T1082]
[ 1318.036487][ T1082] Chain exists of:
[ 1318.036487][ T1082] &mm->mmap_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex
[ 1318.036487][ T1082]
[ 1318.037310][ T1082] Possible unsafe locking scenario:
[ 1318.037310][ T1082]
[ 1318.037838][ T1082] CPU0 CPU1
[ 1318.038101][ T1082] ---- ----
[ 1318.038350][ T1082] lock(reservation_ww_class_mutex);
[ 1318.038590][ T1082] lock(reservation_ww_class_acquire);
[ 1318.038839][ T1082] lock(reservation_ww_class_mutex);
[ 1318.039083][ T1082] rlock(&mm->mmap_lock);
[ 1318.039328][ T1082]
[ 1318.039328][ T1082] *** DEADLOCK ***
[ 1318.039328][ T1082]
[ 1318.040029][ T1082] 1 lock held by tar/1082:
[ 1318.040259][ T1082] #0: ffff98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
[ 1318.040560][ T1082]
[ 1318.040560][ T1082] stack backtrace:
[ 1318.041053][ T1082] CPU: 22 PID: 1082 Comm: tar Not tainted 6.8.0-rc7-00015-ge0c8221b72c0 #17 3316c85d50e282c5643b075d1f01a4f6365e39c2
[ 1318.041329][ T1082] Hardware name: Gigabyte Technology Co., Ltd. B650 AORUS PRO AX/B650 AORUS PRO AX, BIOS F20 12/14/2023
[ 1318.041614][ T1082] Call Trace:
[ 1318.041895][ T1082] <TASK>
[ 1318.042175][ T1082] dump_stack_lvl+0x4a/0x80
[ 1318.042460][ T1082] check_noncircular+0x145/0x160
[ 1318.042743][ T1082] __lock_acquire+0x14bf/0x2680
[ 1318.043022][ T1082] lock_acquire+0xcd/0x2c0
[ 1318.043301][ T1082] ? __might_fault+0x40/0x80
[ 1318.043580][ T1082] ? __might_fault+0x40/0x80
[ 1318.043856][ T1082] __might_fault+0x58/0x80
[ 1318.044131][ T1082] ? __might_fault+0x40/0x80
[ 1318.044408][ T1082] amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu 8fe2afaa910cbd7654c8cab23563a94d6caebaab]
[ 1318.044749][ T1082] full_proxy_read+0x55/0x80
[ 1318.045042][ T1082] vfs_read+0xa7/0x360
[ 1318.045333][ T1082] ksys_read+0x70/0xf0
[ 1318.045623][ T1082] do_syscall_64+0x94/0x180
[ 1318.045913][ T1082] ? do_syscall_64+0xa0/0x180
[ 1318.046201][ T1082] ? lockdep_hardirqs_on+0x7d/0x100
[ 1318.046487][ T1082] ? do_syscall_64+0xa0/0x180
[ 1318.046773][ T1082] ? do_syscall_64+0xa0/0x180
[ 1318.047057][ T1082] ? do_syscall_64+0xa0/0x180
[ 1318.047337][ T1082] ? do_syscall_64+0xa0/0x180
[ 1318.047611][ T1082] entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 1318.047887][ T1082] RIP: 0033:0x7f480b70a39d
[ 1318.048162][ T1082] Code: 91 ba 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb b2 e8 18 a3 01 00 0f 1f 84 00 00 00 00 00 80 3d a9 3c 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 53 48 83
[ 1318.048769][ T1082] RSP: 002b:00007ffde77f5c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 1318.049083][ T1082] RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f480b70a39d
[ 1318.049392][ T1082] RDX: 0000000000000800 RSI: 000055c9f2120c00 RDI: 0000000000000008
[ 1318.049703][ T1082] RBP: 0000000000000800 R08: 000055c9f2120a94 R09: 0000000000000007
[ 1318.050011][ T1082] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c9f2120c00
[ 1318.050324][ T1082] R13: 0000000000000008 R14: 0000000000000008 R15: 0000000000000800
[ 1318.050638][ T1082] </TASK>

amdgpu_debugfs_mqd_read() holds a reservation when it calls
put_user(), which may fault and acquire the mmap_sem. This violates
the established locking order.

Bounce the mqd data through a kernel buffer to get put_user() out of
the illegal section.

Fixes: 445d85e3c1df ("drm/amdgpu: add debugfs interface for reading MQDs")
Cc: [email protected] [v6.5+]
Signed-off-by: Johannes Weiner <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 46 +++++++++++++++---------
1 file changed, 29 insertions(+), 17 deletions(-)

This fixes the lockdep splat for me, and the hexdump of the output
looks sane after the patch. However, I'm not at all familiar with this
code to say for sure that this is the right solution. The mqd seems
small enough that the kmalloc won't get crazy. I'm also assuming that
ring->mqd_size is safe to access before the reserve & kmap. Lastly I
went with an open loop instead of a memcpy() as I wasn't sure if that
memory is safe to address a byte at at time.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 5505d646f43a..06f0a6534a94 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -524,46 +524,58 @@ static ssize_t amdgpu_debugfs_mqd_read(struct file *f, char __user *buf,
{
struct amdgpu_ring *ring = file_inode(f)->i_private;
volatile u32 *mqd;
- int r;
+ u32 *kbuf;
+ int r, i;
uint32_t value, result;

if (*pos & 3 || size & 3)
return -EINVAL;

- result = 0;
+ kbuf = kmalloc(ring->mqd_size, GFP_KERNEL);
+ if (!kbuf)
+ return -ENOMEM;

r = amdgpu_bo_reserve(ring->mqd_obj, false);
if (unlikely(r != 0))
- return r;
+ goto err_free;

r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&mqd);
- if (r) {
- amdgpu_bo_unreserve(ring->mqd_obj);
- return r;
- }
+ if (r)
+ goto err_unreserve;

+ /*
+ * Copy to local buffer to avoid put_user(), which might fault
+ * and acquire mmap_sem, under reservation_ww_class_mutex.
+ */
+ for (i = 0; i < ring->mqd_size/sizeof(u32); i++)
+ kbuf[i] = mqd[i];
+
+ amdgpu_bo_kunmap(ring->mqd_obj);
+ amdgpu_bo_unreserve(ring->mqd_obj);
+
+ result = 0;
while (size) {
if (*pos >= ring->mqd_size)
- goto done;
+ break;

- value = mqd[*pos/4];
+ value = kbuf[*pos/4];
r = put_user(value, (uint32_t *)buf);
if (r)
- goto done;
+ goto err_free;
buf += 4;
result += 4;
size -= 4;
*pos += 4;
}

-done:
- amdgpu_bo_kunmap(ring->mqd_obj);
- mqd = NULL;
- amdgpu_bo_unreserve(ring->mqd_obj);
- if (r)
- return r;
-
+ kfree(kbuf);
return result;
+
+err_unreserve:
+ amdgpu_bo_unreserve(ring->mqd_obj);
+err_free:
+ kfree(kbuf);
+ return r;
}

static const struct file_operations amdgpu_debugfs_mqd_fops = {
--
2.44.0

2024-03-08 11:32:48

by Christian König

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

Good catch, Shashank can you take a closer look?

Thanks,
Christian.

Am 07.03.24 um 23:07 schrieb Johannes Weiner:
> An errant disk backup on my desktop got into debugfs and triggered the
> following deadlock scenario in the amdgpu debugfs files. The machine
> also hard-resets immediately after those lines are printed (although I
> wasn't able to reproduce that part when reading by hand):
>
> [ 1318.016074][ T1082] ======================================================
> [ 1318.016607][ T1082] WARNING: possible circular locking dependency detected
> [ 1318.017107][ T1082] 6.8.0-rc7-00015-ge0c8221b72c0 #17 Not tainted
> [ 1318.017598][ T1082] ------------------------------------------------------
> [ 1318.018096][ T1082] tar/1082 is trying to acquire lock:
> [ 1318.018585][ T1082] ffff98c44175d6a0 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x40/0x80
> [ 1318.019084][ T1082]
> [ 1318.019084][ T1082] but task is already holding lock:
> [ 1318.020052][ T1082] ffff98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
> [ 1318.020607][ T1082]
> [ 1318.020607][ T1082] which lock already depends on the new lock.
> [ 1318.020607][ T1082]
> [ 1318.022081][ T1082]
> [ 1318.022081][ T1082] the existing dependency chain (in reverse order) is:
> [ 1318.023083][ T1082]
> [ 1318.023083][ T1082] -> #2 (reservation_ww_class_mutex){+.+.}-{3:3}:
> [ 1318.024114][ T1082] __ww_mutex_lock.constprop.0+0xe0/0x12f0
> [ 1318.024639][ T1082] ww_mutex_lock+0x32/0x90
> [ 1318.025161][ T1082] dma_resv_lockdep+0x18a/0x330
> [ 1318.025683][ T1082] do_one_initcall+0x6a/0x350
> [ 1318.026210][ T1082] kernel_init_freeable+0x1a3/0x310
> [ 1318.026728][ T1082] kernel_init+0x15/0x1a0
> [ 1318.027242][ T1082] ret_from_fork+0x2c/0x40
> [ 1318.027759][ T1082] ret_from_fork_asm+0x11/0x20
> [ 1318.028281][ T1082]
> [ 1318.028281][ T1082] -> #1 (reservation_ww_class_acquire){+.+.}-{0:0}:
> [ 1318.029297][ T1082] dma_resv_lockdep+0x16c/0x330
> [ 1318.029790][ T1082] do_one_initcall+0x6a/0x350
> [ 1318.030263][ T1082] kernel_init_freeable+0x1a3/0x310
> [ 1318.030722][ T1082] kernel_init+0x15/0x1a0
> [ 1318.031168][ T1082] ret_from_fork+0x2c/0x40
> [ 1318.031598][ T1082] ret_from_fork_asm+0x11/0x20
> [ 1318.032011][ T1082]
> [ 1318.032011][ T1082] -> #0 (&mm->mmap_lock){++++}-{3:3}:
> [ 1318.032778][ T1082] __lock_acquire+0x14bf/0x2680
> [ 1318.033141][ T1082] lock_acquire+0xcd/0x2c0
> [ 1318.033487][ T1082] __might_fault+0x58/0x80
> [ 1318.033814][ T1082] amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu]
> [ 1318.034181][ T1082] full_proxy_read+0x55/0x80
> [ 1318.034487][ T1082] vfs_read+0xa7/0x360
> [ 1318.034788][ T1082] ksys_read+0x70/0xf0
> [ 1318.035085][ T1082] do_syscall_64+0x94/0x180
> [ 1318.035375][ T1082] entry_SYSCALL_64_after_hwframe+0x46/0x4e
> [ 1318.035664][ T1082]
> [ 1318.035664][ T1082] other info that might help us debug this:
> [ 1318.035664][ T1082]
> [ 1318.036487][ T1082] Chain exists of:
> [ 1318.036487][ T1082] &mm->mmap_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex
> [ 1318.036487][ T1082]
> [ 1318.037310][ T1082] Possible unsafe locking scenario:
> [ 1318.037310][ T1082]
> [ 1318.037838][ T1082] CPU0 CPU1
> [ 1318.038101][ T1082] ---- ----
> [ 1318.038350][ T1082] lock(reservation_ww_class_mutex);
> [ 1318.038590][ T1082] lock(reservation_ww_class_acquire);
> [ 1318.038839][ T1082] lock(reservation_ww_class_mutex);
> [ 1318.039083][ T1082] rlock(&mm->mmap_lock);
> [ 1318.039328][ T1082]
> [ 1318.039328][ T1082] *** DEADLOCK ***
> [ 1318.039328][ T1082]
> [ 1318.040029][ T1082] 1 lock held by tar/1082:
> [ 1318.040259][ T1082] #0: ffff98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
> [ 1318.040560][ T1082]
> [ 1318.040560][ T1082] stack backtrace:
> [ 1318.041053][ T1082] CPU: 22 PID: 1082 Comm: tar Not tainted 6.8.0-rc7-00015-ge0c8221b72c0 #17 3316c85d50e282c5643b075d1f01a4f6365e39c2
> [ 1318.041329][ T1082] Hardware name: Gigabyte Technology Co., Ltd. B650 AORUS PRO AX/B650 AORUS PRO AX, BIOS F20 12/14/2023
> [ 1318.041614][ T1082] Call Trace:
> [ 1318.041895][ T1082] <TASK>
> [ 1318.042175][ T1082] dump_stack_lvl+0x4a/0x80
> [ 1318.042460][ T1082] check_noncircular+0x145/0x160
> [ 1318.042743][ T1082] __lock_acquire+0x14bf/0x2680
> [ 1318.043022][ T1082] lock_acquire+0xcd/0x2c0
> [ 1318.043301][ T1082] ? __might_fault+0x40/0x80
> [ 1318.043580][ T1082] ? __might_fault+0x40/0x80
> [ 1318.043856][ T1082] __might_fault+0x58/0x80
> [ 1318.044131][ T1082] ? __might_fault+0x40/0x80
> [ 1318.044408][ T1082] amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu 8fe2afaa910cbd7654c8cab23563a94d6caebaab]
> [ 1318.044749][ T1082] full_proxy_read+0x55/0x80
> [ 1318.045042][ T1082] vfs_read+0xa7/0x360
> [ 1318.045333][ T1082] ksys_read+0x70/0xf0
> [ 1318.045623][ T1082] do_syscall_64+0x94/0x180
> [ 1318.045913][ T1082] ? do_syscall_64+0xa0/0x180
> [ 1318.046201][ T1082] ? lockdep_hardirqs_on+0x7d/0x100
> [ 1318.046487][ T1082] ? do_syscall_64+0xa0/0x180
> [ 1318.046773][ T1082] ? do_syscall_64+0xa0/0x180
> [ 1318.047057][ T1082] ? do_syscall_64+0xa0/0x180
> [ 1318.047337][ T1082] ? do_syscall_64+0xa0/0x180
> [ 1318.047611][ T1082] entry_SYSCALL_64_after_hwframe+0x46/0x4e
> [ 1318.047887][ T1082] RIP: 0033:0x7f480b70a39d
> [ 1318.048162][ T1082] Code: 91 ba 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb b2 e8 18 a3 01 00 0f 1f 84 00 00 00 00 00 80 3d a9 3c 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 53 48 83
> [ 1318.048769][ T1082] RSP: 002b:00007ffde77f5c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 1318.049083][ T1082] RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f480b70a39d
> [ 1318.049392][ T1082] RDX: 0000000000000800 RSI: 000055c9f2120c00 RDI: 0000000000000008
> [ 1318.049703][ T1082] RBP: 0000000000000800 R08: 000055c9f2120a94 R09: 0000000000000007
> [ 1318.050011][ T1082] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c9f2120c00
> [ 1318.050324][ T1082] R13: 0000000000000008 R14: 0000000000000008 R15: 0000000000000800
> [ 1318.050638][ T1082] </TASK>
>
> amdgpu_debugfs_mqd_read() holds a reservation when it calls
> put_user(), which may fault and acquire the mmap_sem. This violates
> the established locking order.
>
> Bounce the mqd data through a kernel buffer to get put_user() out of
> the illegal section.
>
> Fixes: 445d85e3c1df ("drm/amdgpu: add debugfs interface for reading MQDs")
> Cc: [email protected] [v6.5+]
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 46 +++++++++++++++---------
> 1 file changed, 29 insertions(+), 17 deletions(-)
>
> This fixes the lockdep splat for me, and the hexdump of the output
> looks sane after the patch. However, I'm not at all familiar with this
> code to say for sure that this is the right solution. The mqd seems
> small enough that the kmalloc won't get crazy. I'm also assuming that
> ring->mqd_size is safe to access before the reserve & kmap. Lastly I
> went with an open loop instead of a memcpy() as I wasn't sure if that
> memory is safe to address a byte at at time.
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> index 5505d646f43a..06f0a6534a94 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> @@ -524,46 +524,58 @@ static ssize_t amdgpu_debugfs_mqd_read(struct file *f, char __user *buf,
> {
> struct amdgpu_ring *ring = file_inode(f)->i_private;
> volatile u32 *mqd;
> - int r;
> + u32 *kbuf;
> + int r, i;
> uint32_t value, result;
>
> if (*pos & 3 || size & 3)
> return -EINVAL;
>
> - result = 0;
> + kbuf = kmalloc(ring->mqd_size, GFP_KERNEL);
> + if (!kbuf)
> + return -ENOMEM;
>
> r = amdgpu_bo_reserve(ring->mqd_obj, false);
> if (unlikely(r != 0))
> - return r;
> + goto err_free;
>
> r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&mqd);
> - if (r) {
> - amdgpu_bo_unreserve(ring->mqd_obj);
> - return r;
> - }
> + if (r)
> + goto err_unreserve;
>
> + /*
> + * Copy to local buffer to avoid put_user(), which might fault
> + * and acquire mmap_sem, under reservation_ww_class_mutex.
> + */
> + for (i = 0; i < ring->mqd_size/sizeof(u32); i++)
> + kbuf[i] = mqd[i];
> +
> + amdgpu_bo_kunmap(ring->mqd_obj);
> + amdgpu_bo_unreserve(ring->mqd_obj);
> +
> + result = 0;
> while (size) {
> if (*pos >= ring->mqd_size)
> - goto done;
> + break;
>
> - value = mqd[*pos/4];
> + value = kbuf[*pos/4];
> r = put_user(value, (uint32_t *)buf);
> if (r)
> - goto done;
> + goto err_free;
> buf += 4;
> result += 4;
> size -= 4;
> *pos += 4;
> }
>
> -done:
> - amdgpu_bo_kunmap(ring->mqd_obj);
> - mqd = NULL;
> - amdgpu_bo_unreserve(ring->mqd_obj);
> - if (r)
> - return r;
> -
> + kfree(kbuf);
> return result;
> +
> +err_unreserve:
> + amdgpu_bo_unreserve(ring->mqd_obj);
> +err_free:
> + kfree(kbuf);
> + return r;
> }
>
> static const struct file_operations amdgpu_debugfs_mqd_fops = {

2024-03-14 17:10:10

by Johannes Weiner

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

Hello,

On Fri, Mar 08, 2024 at 12:32:33PM +0100, Christian K?nig wrote:
> Am 07.03.24 um 23:07 schrieb Johannes Weiner:
> > Lastly I went with an open loop instead of a memcpy() as I wasn't
> > sure if that memory is safe to address a byte at at time.

Shashank pointed out to me in private that byte access would indeed be
safe. However, after actually trying it it won't work because memcpy()
doesn't play nice with mqd being volatile:

/home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c: In function 'amdgpu_debugfs_mqd_read':
/home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:550:22: warning: passing argument 1 of '__builtin_dynamic_object_size' discards 'volatil' qualifier from pointer target type [-Wdiscarded-qualifiers]
550 | memcpy(kbuf, mqd, ring->mqd_size);

So I would propose leaving the patch as-is. Shashank, does that sound
good to you?

(Please keep me CC'd on replies, as I'm not subscribed to the graphics
lists.)

Thanks!

2024-03-23 14:53:08

by Johannes Weiner

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

On Thu, Mar 14, 2024 at 01:09:57PM -0400, Johannes Weiner wrote:
> Hello,
>
> On Fri, Mar 08, 2024 at 12:32:33PM +0100, Christian K?nig wrote:
> > Am 07.03.24 um 23:07 schrieb Johannes Weiner:
> > > Lastly I went with an open loop instead of a memcpy() as I wasn't
> > > sure if that memory is safe to address a byte at at time.
>
> Shashank pointed out to me in private that byte access would indeed be
> safe. However, after actually trying it it won't work because memcpy()
> doesn't play nice with mqd being volatile:
>
> /home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c: In function 'amdgpu_debugfs_mqd_read':
> /home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:550:22: warning: passing argument 1 of '__builtin_dynamic_object_size' discards 'volatil' qualifier from pointer target type [-Wdiscarded-qualifiers]
> 550 | memcpy(kbuf, mqd, ring->mqd_size);
>
> So I would propose leaving the patch as-is. Shashank, does that sound
> good to you?

Friendly ping :)

Shashank, is your Reviewed-by still good for this patch, given the
above?

Thanks

2024-03-23 20:21:50

by Sharma, Shashank

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

On 23/03/2024 15:52, Johannes Weiner wrote:
> On Thu, Mar 14, 2024 at 01:09:57PM -0400, Johannes Weiner wrote:
>> Hello,
>>
>> On Fri, Mar 08, 2024 at 12:32:33PM +0100, Christian König wrote:
>>> Am 07.03.24 um 23:07 schrieb Johannes Weiner:
>>>> Lastly I went with an open loop instead of a memcpy() as I wasn't
>>>> sure if that memory is safe to address a byte at at time.
>> Shashank pointed out to me in private that byte access would indeed be
>> safe. However, after actually trying it it won't work because memcpy()
>> doesn't play nice with mqd being volatile:
>>
>> /home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c: In function 'amdgpu_debugfs_mqd_read':
>> /home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:550:22: warning: passing argument 1 of '__builtin_dynamic_object_size' discards 'volatil' qualifier from pointer target type [-Wdiscarded-qualifiers]
>> 550 | memcpy(kbuf, mqd, ring->mqd_size);
>>
>> So I would propose leaving the patch as-is. Shashank, does that sound
>> good to you?
> Friendly ping :)
>
> Shashank, is your Reviewed-by still good for this patch, given the
> above?

Ah, sorry I missed this due to some parallel work, and just realized the
memcpy/volatile limitation.

I also feel the need of protecting MQD read under a lock to avoid
parallel change in MQD while we do byte-by-byte copy, but I will add
that in my to-do list.

Please feel free to use my R-b.

- Shashank

> Thanks

2024-03-25 06:38:15

by Alex Deucher

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

On Sat, Mar 23, 2024 at 4:47 PM Sharma, Shashank
<[email protected]> wrote:
>
>
> On 23/03/2024 15:52, Johannes Weiner wrote:
> > On Thu, Mar 14, 2024 at 01:09:57PM -0400, Johannes Weiner wrote:
> >> Hello,
> >>
> >> On Fri, Mar 08, 2024 at 12:32:33PM +0100, Christian König wrote:
> >>> Am 07.03.24 um 23:07 schrieb Johannes Weiner:
> >>>> Lastly I went with an open loop instead of a memcpy() as I wasn't
> >>>> sure if that memory is safe to address a byte at at time.
> >> Shashank pointed out to me in private that byte access would indeed be
> >> safe. However, after actually trying it it won't work because memcpy()
> >> doesn't play nice with mqd being volatile:
> >>
> >> /home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c: In function 'amdgpu_debugfs_mqd_read':
> >> /home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:550:22: warning: passing argument 1 of '__builtin_dynamic_object_size' discards 'volatil' qualifier from pointer target type [-Wdiscarded-qualifiers]
> >> 550 | memcpy(kbuf, mqd, ring->mqd_size);
> >>
> >> So I would propose leaving the patch as-is. Shashank, does that sound
> >> good to you?
> > Friendly ping :)
> >
> > Shashank, is your Reviewed-by still good for this patch, given the
> > above?
>
> Ah, sorry I missed this due to some parallel work, and just realized the
> memcpy/volatile limitation.
>
> I also feel the need of protecting MQD read under a lock to avoid
> parallel change in MQD while we do byte-by-byte copy, but I will add
> that in my to-do list.
>
> Please feel free to use my R-b.

Shashank, if the patch looks good, can you pick it up and apply it?

Alex

>
> - Shashank
>
> > Thanks

2024-03-26 15:25:33

by Sharma, Shashank

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix deadlock while reading mqd from debugfs

Thanks for the patch,

Patch pushed for staging.

Regards

Shashank

On 25/03/2024 00:23, Alex Deucher wrote:
> On Sat, Mar 23, 2024 at 4:47 PM Sharma, Shashank
> <[email protected]> wrote:
>>
>> On 23/03/2024 15:52, Johannes Weiner wrote:
>>> On Thu, Mar 14, 2024 at 01:09:57PM -0400, Johannes Weiner wrote:
>>>> Hello,
>>>>
>>>> On Fri, Mar 08, 2024 at 12:32:33PM +0100, Christian König wrote:
>>>>> Am 07.03.24 um 23:07 schrieb Johannes Weiner:
>>>>>> Lastly I went with an open loop instead of a memcpy() as I wasn't
>>>>>> sure if that memory is safe to address a byte at at time.
>>>> Shashank pointed out to me in private that byte access would indeed be
>>>> safe. However, after actually trying it it won't work because memcpy()
>>>> doesn't play nice with mqd being volatile:
>>>>
>>>> /home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c: In function 'amdgpu_debugfs_mqd_read':
>>>> /home/hannes/src/linux/linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:550:22: warning: passing argument 1 of '__builtin_dynamic_object_size' discards 'volatil' qualifier from pointer target type [-Wdiscarded-qualifiers]
>>>> 550 | memcpy(kbuf, mqd, ring->mqd_size);
>>>>
>>>> So I would propose leaving the patch as-is. Shashank, does that sound
>>>> good to you?
>>> Friendly ping :)
>>>
>>> Shashank, is your Reviewed-by still good for this patch, given the
>>> above?
>> Ah, sorry I missed this due to some parallel work, and just realized the
>> memcpy/volatile limitation.
>>
>> I also feel the need of protecting MQD read under a lock to avoid
>> parallel change in MQD while we do byte-by-byte copy, but I will add
>> that in my to-do list.
>>
>> Please feel free to use my R-b.
> Shashank, if the patch looks good, can you pick it up and apply it?
>
> Alex
>
>
>> - Shashank
>>
>>> Thanks