If the initial value of `num_entires` (calculated at line 1654) is not
an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a
value greater than the initial value will be assigned to it. That causes
`start > last + 1` after line 1708. Then in the next iteration an
underflow happens at line 1654. It causes message
*ERROR* Couldn't update BO_VA (-12)
printed in kernel log, and GPU hanging.
Fortify the criteria of the loop to fix this issue.
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
Reported-by: Xi Ruoyao <[email protected]>
Reported-by: Dan Horák <[email protected]>
Cc: [email protected]
Signed-off-by: Xi Ruoyao <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index ad91c0c3c423..cee0cc9c8085 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
}
start = tmp;
- } while (unlikely(start != last + 1));
+ } while (unlikely(start < last + 1));
r = vm->update_funcs->commit(¶ms, fence);
base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
--
2.31.1
Am 29.03.21 um 19:53 schrieb Xℹ Ruoyao:
> If the initial value of `num_entires` (calculated at line 1654) is not
> an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a
> value greater than the initial value will be assigned to it. That causes
> `start > last + 1` after line 1708. Then in the next iteration an
> underflow happens at line 1654. It causes message
>
> *ERROR* Couldn't update BO_VA (-12)
>
> printed in kernel log, and GPU hanging.
>
> Fortify the criteria of the loop to fix this issue.
NAK the value of num_entries must always be a multiple of
AMDGPU_GPU_PAGES_IN_CPU_PAGE or otherwise we corrupt the page tables.
How do you trigger that?
Christian.
>
> BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
> Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
> Reported-by: Xi Ruoyao <[email protected]>
> Reported-by: Dan Horák <[email protected]>
> Cc: [email protected]
> Signed-off-by: Xi Ruoyao <[email protected]>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index ad91c0c3c423..cee0cc9c8085 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
> }
> start = tmp;
>
> - } while (unlikely(start != last + 1));
> + } while (unlikely(start < last + 1));
>
> r = vm->update_funcs->commit(¶ms, fence);
>
>
> base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
On 2021-03-29 20:04 +0200, Christian König wrote:
> Am 29.03.21 um 19:53 schrieb Xℹ Ruoyao:
> > If the initial value of `num_entires` (calculated at line 1654) is not
> > an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a
> > value greater than the initial value will be assigned to it. That causes
> > `start > last + 1` after line 1708. Then in the next iteration an
> > underflow happens at line 1654. It causes message
> >
> > *ERROR* Couldn't update BO_VA (-12)
> >
> > printed in kernel log, and GPU hanging.
> >
> > Fortify the criteria of the loop to fix this issue.
>
> NAK the value of num_entries must always be a multiple of
> AMDGPU_GPU_PAGES_IN_CPU_PAGE or otherwise we corrupt the page tables.
>
> How do you trigger that?
Simply run "OpenGL area" from gtk3-demo (which just renders a triangle with GL)
under Xorg, on MIPS64. See the BugLink.
> Christian.
>
> >
> > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
> > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
> > Reported-by: Xi Ruoyao <[email protected]>
> > Reported-by: Dan Horák <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: Xi Ruoyao <[email protected]>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > index ad91c0c3c423..cee0cc9c8085 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
> > amdgpu_device *adev,
> > }
> > start = tmp;
> >
> > - } while (unlikely(start != last + 1));
> > + } while (unlikely(start < last + 1));
> >
> > r = vm->update_funcs->commit(¶ms, fence);
> >
> >
> > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
>
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University
Am 29.03.21 um 20:08 schrieb Xi Ruoyao:
> On 2021-03-29 20:04 +0200, Christian König wrote:
>> Am 29.03.21 um 19:53 schrieb Xℹ Ruoyao:
>>> If the initial value of `num_entires` (calculated at line 1654) is not
>>> an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a
>>> value greater than the initial value will be assigned to it. That causes
>>> `start > last + 1` after line 1708. Then in the next iteration an
>>> underflow happens at line 1654. It causes message
>>>
>>> *ERROR* Couldn't update BO_VA (-12)
>>>
>>> printed in kernel log, and GPU hanging.
>>>
>>> Fortify the criteria of the loop to fix this issue.
>> NAK the value of num_entries must always be a multiple of
>> AMDGPU_GPU_PAGES_IN_CPU_PAGE or otherwise we corrupt the page tables.
>>
>> How do you trigger that?
> Simply run "OpenGL area" from gtk3-demo (which just renders a triangle with GL)
> under Xorg, on MIPS64. See the BugLink.
You need to identify the root cause of this, most likely start or last
are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.
Christian.
>
>> Christian.
>>
>>> BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
>>> Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
>>> Reported-by: Xi Ruoyao <[email protected]>
>>> Reported-by: Dan Horák <[email protected]>
>>> Cc: [email protected]
>>> Signed-off-by: Xi Ruoyao <[email protected]>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index ad91c0c3c423..cee0cc9c8085 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
>>> amdgpu_device *adev,
>>> }
>>> start = tmp;
>>>
>>> - } while (unlikely(start != last + 1));
>>> + } while (unlikely(start < last + 1));
>>>
>>> r = vm->update_funcs->commit(¶ms, fence);
>>>
>>>
>>> base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
On 2021-03-29 20:10 +0200, Christian König wrote:
> You need to identify the root cause of this, most likely start or last
> are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.
I printk'ed the value of start & last, they are all a multiple of 4
(AMDGPU_GPU_PAGES_IN_CPU_PAGE).
However... `num_entries = last - start + 1` so it became some irrational
thing... Either this line is wrong, or someone called
amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which
is unexpected.
> > > > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
> > > > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
> > > > Reported-by: Xi Ruoyao <[email protected]>
> > > > Reported-by: Dan Horák <[email protected]>
> > > > Cc: [email protected]
> > > > Signed-off-by: Xi Ruoyao <[email protected]>
> > > > ---
> > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > index ad91c0c3c423..cee0cc9c8085 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
> > > > amdgpu_device *adev,
> > > > }
> > > > start = tmp;
> > > >
> > > > - } while (unlikely(start != last + 1));
> > > > + } while (unlikely(start < last + 1));
> > > >
> > > > r = vm->update_funcs->commit(¶ms, fence);
> > > >
> > > >
> > > > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
>
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University
On 2021-03-30 02:21 +0800, Xi Ruoyao wrote:
> On 2021-03-29 20:10 +0200, Christian König wrote:
> > You need to identify the root cause of this, most likely start or last
> > are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.
>
> I printk'ed the value of start & last, they are all a multiple of 4
> (AMDGPU_GPU_PAGES_IN_CPU_PAGE).
>
> However... `num_entries = last - start + 1` so it became some irrational
> thing... Either this line is wrong, or someone called
> amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which
> is unexpected.
I added BUG_ON(num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE != 0), get:
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c90750>]
> amdgpu_vm_bo_update_mapping.constprop.0+0x218/0xae8
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c922b8>]
> amdgpu_vm_bo_update+0x270/0x4c0
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c823ec>]
> amdgpu_gem_va_ioctl+0x40c/0x430
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cae4>]
> drm_ioctl_kernel+0xcc/0x120
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cfd8>]
> drm_ioctl+0x220/0x408
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c5ea48>]
> amdgpu_drm_ioctl+0x58/0x98
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff838feac4>] sys_ioctl+0xcc/0xe8
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff836e74f0>]
> syscall_common+0x34/0x58
>
> > > > > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
> > > > > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
> > > > > Reported-by: Xi Ruoyao <[email protected]>
> > > > > Reported-by: Dan Horák <[email protected]>
> > > > > Cc: [email protected]
> > > > > Signed-off-by: Xi Ruoyao <[email protected]>
> > > > > ---
> > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > index ad91c0c3c423..cee0cc9c8085 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
> > > > > amdgpu_device *adev,
> > > > > }
> > > > > start = tmp;
> > > > >
> > > > > - } while (unlikely(start != last + 1));
> > > > > + } while (unlikely(start < last + 1));
> > > > >
> > > > > r = vm->update_funcs->commit(¶ms, fence);
> > > > >
> > > > >
> > > > > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
> >
>
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University
Hi Christian,
I don't think there is any constraint implemented to ensure `num_entries %
AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
/* validate the parameters */
if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK ||
size == 0 || size & AMDGPU_GPU_PAGE_MASK)
return -EINVAL;
/* snip */
saddr /= AMDGPU_GPU_PAGE_SIZE;
eaddr /= AMDGPU_GPU_PAGE_SIZE;
/* snip */
mapping->start = saddr;
mapping->last = eaddr;
If we really want to ensure (mapping->last - mapping->start + 1) %
AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK"
in "validate the parameters" with "PAGE_MASK".
I tried it and it broke userspace: Xorg startup fails with EINVAL with this
change.
On 2021-03-30 02:30 +0800, Xi Ruoyao wrote:
> On 2021-03-30 02:21 +0800, Xi Ruoyao wrote:
> > On 2021-03-29 20:10 +0200, Christian König wrote:
> > > You need to identify the root cause of this, most likely start or last
> > > are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.
> >
> > I printk'ed the value of start & last, they are all a multiple of 4
> > (AMDGPU_GPU_PAGES_IN_CPU_PAGE).
> >
> > However... `num_entries = last - start + 1` so it became some irrational
> > thing... Either this line is wrong, or someone called
> > amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which
> > is unexpected.
>
> I added BUG_ON(num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE != 0), get:
>
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c90750>]
> > amdgpu_vm_bo_update_mapping.constprop.0+0x218/0xae8
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c922b8>]
> > amdgpu_vm_bo_update+0x270/0x4c0
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c823ec>]
> > amdgpu_gem_va_ioctl+0x40c/0x430
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cae4>]
> > drm_ioctl_kernel+0xcc/0x120
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cfd8>]
> > drm_ioctl+0x220/0x408
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c5ea48>]
> > amdgpu_drm_ioctl+0x58/0x98
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff838feac4>] sys_ioctl+0xcc/0xe8
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff836e74f0>]
> > syscall_common+0x34/0x58
> >
>
> > > > > > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
> > > > > > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
> > > > > > Reported-by: Xi Ruoyao <[email protected]>
> > > > > > Reported-by: Dan Horák <[email protected]>
> > > > > > Cc: [email protected]
> > > > > > Signed-off-by: Xi Ruoyao <[email protected]>
> > > > > > ---
> > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> > > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > > index ad91c0c3c423..cee0cc9c8085 100644
> > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
> > > > > > amdgpu_device *adev,
> > > > > > }
> > > > > > start = tmp;
> > > > > >
> > > > > > - } while (unlikely(start != last + 1));
> > > > > > + } while (unlikely(start < last + 1));
> > > > > >
> > > > > > r = vm->update_funcs->commit(¶ms, fence);
> > > > > >
> > > > > >
> > > > > > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
> > >
> >
>
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University
Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
> Hi Christian,
>
> I don't think there is any constraint implemented to ensure `num_entries %
> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
>
> /* validate the parameters */
> if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK ||
> size == 0 || size & AMDGPU_GPU_PAGE_MASK)
> return -EINVAL;
>
> /* snip */
>
> saddr /= AMDGPU_GPU_PAGE_SIZE;
> eaddr /= AMDGPU_GPU_PAGE_SIZE;
>
> /* snip */
>
> mapping->start = saddr;
> mapping->last = eaddr;
>
>
> If we really want to ensure (mapping->last - mapping->start + 1) %
> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK"
> in "validate the parameters" with "PAGE_MASK".
Yeah, good point.
> I tried it and it broke userspace: Xorg startup fails with EINVAL with this
> change.
Well in theory it is possible that we always fill the GPUVM on a 4k
basis while the native page size of the CPU is larger. Let me double
check the code.
BTW: What code base are you based on? The code your post here is quite
outdated.
Christian.
>
> On 2021-03-30 02:30 +0800, Xi Ruoyao wrote:
>> On 2021-03-30 02:21 +0800, Xi Ruoyao wrote:
>>> On 2021-03-29 20:10 +0200, Christian König wrote:
>>>> You need to identify the root cause of this, most likely start or last
>>>> are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.
>>> I printk'ed the value of start & last, they are all a multiple of 4
>>> (AMDGPU_GPU_PAGES_IN_CPU_PAGE).
>>>
>>> However... `num_entries = last - start + 1` so it became some irrational
>>> thing... Either this line is wrong, or someone called
>>> amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which
>>> is unexpected.
>> I added BUG_ON(num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE != 0), get:
>>
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c90750>]
>>> amdgpu_vm_bo_update_mapping.constprop.0+0x218/0xae8
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c922b8>]
>>> amdgpu_vm_bo_update+0x270/0x4c0
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c823ec>]
>>> amdgpu_gem_va_ioctl+0x40c/0x430
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cae4>]
>>> drm_ioctl_kernel+0xcc/0x120
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cfd8>]
>>> drm_ioctl+0x220/0x408
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c5ea48>]
>>> amdgpu_drm_ioctl+0x58/0x98
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff838feac4>] sys_ioctl+0xcc/0xe8
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff836e74f0>]
>>> syscall_common+0x34/0x58
>>>
>>>>>>> BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
>>>>>>> Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
>>>>>>> Reported-by: Xi Ruoyao <[email protected]>
>>>>>>> Reported-by: Dan Horák <[email protected]>
>>>>>>> Cc: [email protected]
>>>>>>> Signed-off-by: Xi Ruoyao <[email protected]>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>> index ad91c0c3c423..cee0cc9c8085 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>> @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
>>>>>>> amdgpu_device *adev,
>>>>>>> }
>>>>>>> start = tmp;
>>>>>>>
>>>>>>> - } while (unlikely(start != last + 1));
>>>>>>> + } while (unlikely(start < last + 1));
>>>>>>>
>>>>>>> r = vm->update_funcs->commit(¶ms, fence);
>>>>>>>
>>>>>>>
>>>>>>> base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
On 2021-03-29 21:36 +0200, Christian König wrote:
> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
> > Hi Christian,
> >
> > I don't think there is any constraint implemented to ensure `num_entries %
> > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
> >
> > /* validate the parameters */
> > if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
> > ||
> > size == 0 || size & AMDGPU_GPU_PAGE_MASK)
> > return -EINVAL;
> >
> > /* snip */
> >
> > saddr /= AMDGPU_GPU_PAGE_SIZE;
> > eaddr /= AMDGPU_GPU_PAGE_SIZE;
> >
> > /* snip */
> >
> > mapping->start = saddr;
> > mapping->last = eaddr;
> >
> >
> > If we really want to ensure (mapping->last - mapping->start + 1) %
> > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
> > "AMDGPU_GPU_PAGE_MASK"
> > in "validate the parameters" with "PAGE_MASK".
>
> Yeah, good point.
>
> > I tried it and it broke userspace: Xorg startup fails with EINVAL with this
> > change.
>
> Well in theory it is possible that we always fill the GPUVM on a 4k
> basis while the native page size of the CPU is larger. Let me double
> check the code.
>
> BTW: What code base are you based on? The code your post here is quite
> outdated.
Linus' tree.
I'll go to sleep now (it's 03:39 here :( ), when I wake up I can try to fetch
drm-next or something.
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University
On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
> On 2021-03-29 21:36 +0200, Christian König wrote:
> > Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
> > > Hi Christian,
> > >
> > > I don't think there is any constraint implemented to ensure `num_entries %
> > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
> > >
> > > /* validate the parameters */
> > > if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
> > > > >
> > > size == 0 || size & AMDGPU_GPU_PAGE_MASK)
> > > return -EINVAL;
> > >
> > > /* snip */
> > >
> > > saddr /= AMDGPU_GPU_PAGE_SIZE;
> > > eaddr /= AMDGPU_GPU_PAGE_SIZE;
> > >
> > > /* snip */
> > >
> > > mapping->start = saddr;
> > > mapping->last = eaddr;
> > >
> > >
> > > If we really want to ensure (mapping->last - mapping->start + 1) %
> > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
> > > "AMDGPU_GPU_PAGE_MASK"
> > > in "validate the parameters" with "PAGE_MASK".
It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
"AMDGPU_GPU_PAGE_MASK" :(.
> > Yeah, good point.
> >
> > > I tried it and it broke userspace: Xorg startup fails with EINVAL with
> > > this
> > > change.
> >
> > Well in theory it is possible that we always fill the GPUVM on a 4k
> > basis while the native page size of the CPU is larger. Let me double
> > check the code.
On my platform, there are two issues both causing the VM corruption. One is
fixed by agd5f/linux@fe001e7. Another is in Mesa from userspace: it uses
`dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
expects it to use `dev_info->virtual_address_alignment`.
If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
and make virtual_address_alignment = 4K. Otherwise, we should fortify the
parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the
userspace will just get an EINVAL, instead of a slient VM corruption. And
someone should tell Mesa developers to fix the code in this case.
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University
Am 30.03.21 um 14:04 schrieb Xi Ruoyao:
> On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
>> On 2021-03-29 21:36 +0200, Christian König wrote:
>>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
>>>> Hi Christian,
>>>>
>>>> I don't think there is any constraint implemented to ensure `num_entries %
>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
>>>>
>>>> /* validate the parameters */
>>>> if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
>>>> size == 0 || size & AMDGPU_GPU_PAGE_MASK)
>>>> return -EINVAL;
>>>>
>>>> /* snip */
>>>>
>>>> saddr /= AMDGPU_GPU_PAGE_SIZE;
>>>> eaddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>
>>>> /* snip */
>>>>
>>>> mapping->start = saddr;
>>>> mapping->last = eaddr;
>>>>
>>>>
>>>> If we really want to ensure (mapping->last - mapping->start + 1) %
>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
>>>> "AMDGPU_GPU_PAGE_MASK"
>>>> in "validate the parameters" with "PAGE_MASK".
> It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
> "AMDGPU_GPU_PAGE_MASK" :(.
>
>>> Yeah, good point.
>>>
>>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with
>>>> this
>>>> change.
>>> Well in theory it is possible that we always fill the GPUVM on a 4k
>>> basis while the native page size of the CPU is larger. Let me double
>>> check the code.
> On my platform, there are two issues both causing the VM corruption. One is
> fixed by agd5f/linux@fe001e7.
What is fe001e7? A commit id? If yes then that is to short and I can't
find it.
> Another is in Mesa from userspace: it uses
> `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
> expects it to use `dev_info->virtual_address_alignment`.
Mhm, looking at the kernel code I would rather say Mesa is correct and
the kernel driver is broken.
The gart_page_size is limited by the CPU page size, but the
virtual_address_alignment isn't.
> If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
> and make virtual_address_alignment = 4K. Otherwise, we should fortify the
> parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the
> userspace will just get an EINVAL, instead of a slient VM corruption. And
> someone should tell Mesa developers to fix the code in this case.
I rather see this as a kernel bug. Can you test if this code fragment
fixes your issue:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 64beb3399604..e1260b517e1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
*data, struct drm_file *filp)
}
dev_info->virtual_address_alignment =
max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
dev_info->pte_fragment_size = (1 <<
adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
- dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
+ dev_info->gart_page_size =
dev_info->virtual_address_alignment;
dev_info->cu_active_number = adev->gfx.cu_info.number;
dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
dev_info->ce_ram_size = adev->gfx.ce_ram_size;
Thanks,
Christian.
> --
> Xi Ruoyao <[email protected]>
> School of Aerospace Science and Technology, Xidian University
>
Am 30.03.21 um 15:00 schrieb Dan Horák:
> On Tue, 30 Mar 2021 14:55:01 +0200
> Christian König <[email protected]> wrote:
>
>> Am 30.03.21 um 14:04 schrieb Xi Ruoyao:
>>> On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
>>>> On 2021-03-29 21:36 +0200, Christian König wrote:
>>>>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
>>>>>> Hi Christian,
>>>>>>
>>>>>> I don't think there is any constraint implemented to ensure `num_entries %
>>>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
>>>>>>
>>>>>> /* validate the parameters */
>>>>>> if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
>>>>>> size == 0 || size & AMDGPU_GPU_PAGE_MASK)
>>>>>> return -EINVAL;
>>>>>>
>>>>>> /* snip */
>>>>>>
>>>>>> saddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>>> eaddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>>>
>>>>>> /* snip */
>>>>>>
>>>>>> mapping->start = saddr;
>>>>>> mapping->last = eaddr;
>>>>>>
>>>>>>
>>>>>> If we really want to ensure (mapping->last - mapping->start + 1) %
>>>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
>>>>>> "AMDGPU_GPU_PAGE_MASK"
>>>>>> in "validate the parameters" with "PAGE_MASK".
>>> It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
>>> "AMDGPU_GPU_PAGE_MASK" :(.
>>>
>>>>> Yeah, good point.
>>>>>
>>>>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with
>>>>>> this
>>>>>> change.
>>>>> Well in theory it is possible that we always fill the GPUVM on a 4k
>>>>> basis while the native page size of the CPU is larger. Let me double
>>>>> check the code.
>>> On my platform, there are two issues both causing the VM corruption. One is
>>> fixed by agd5f/linux@fe001e7.
>> What is fe001e7? A commit id? If yes then that is to short and I can't
>> find it.
> it's a gitlab shortcut for
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fagd5f%2Flinux%2F-%2Fcommit%2Ffe001e70a55d0378328612be1fdc3abfc68b9ccc&data=04%7C01%7Cchristian.koenig%40amd.com%7Cd16d123aaa01420ebd0808d8f37bbf2f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637527060812278536%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=5rFVLxSRnfHUGjhoiqx1e6SeROqbg4ZPef%2BxEvgv%2BTg%3D&reserved=0
Ah! Yes I would expect that this patch is fixing something in this use case.
Thanks,
Christian.
>
>
> Dan
>>> Another is in Mesa from userspace: it uses
>>> `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
>>> expects it to use `dev_info->virtual_address_alignment`.
>> Mhm, looking at the kernel code I would rather say Mesa is correct and
>> the kernel driver is broken.
>>
>> The gart_page_size is limited by the CPU page size, but the
>> virtual_address_alignment isn't.
>>
>>> If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
>>> and make virtual_address_alignment = 4K. Otherwise, we should fortify the
>>> parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the
>>> userspace will just get an EINVAL, instead of a slient VM corruption. And
>>> someone should tell Mesa developers to fix the code in this case.
>> I rather see this as a kernel bug. Can you test if this code fragment
>> fixes your issue:
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index 64beb3399604..e1260b517e1b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
>> *data, struct drm_file *filp)
>> }
>> dev_info->virtual_address_alignment =
>> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
>> dev_info->pte_fragment_size = (1 <<
>> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
>> - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
>> + dev_info->gart_page_size =
>> dev_info->virtual_address_alignment;
>> dev_info->cu_active_number = adev->gfx.cu_info.number;
>> dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
>> dev_info->ce_ram_size = adev->gfx.ce_ram_size;
>>
>>
>> Thanks,
>> Christian.
>>
>>> --
>>> Xi Ruoyao <[email protected]>
>>> School of Aerospace Science and Technology, Xidian University
>>>
On 2021-03-30 14:55 +0200, Christian König wrote:
> Am 30.03.21 um 14:04 schrieb Xi Ruoyao:
> > On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
> > > On 2021-03-29 21:36 +0200, Christian König wrote:
> > > > Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
> > > > > Hi Christian,
> > > > >
> > > > > I don't think there is any constraint implemented to ensure `num_entries
> > > > > %
> > > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in
> > > > > `amdgpu_vm_bo_map()`:
> > > > >
> > > > > /* validate the parameters */
> > > > > if (saddr & AMDGPU_GPU_PAGE_MASK || offset &
> > > > > AMDGPU_GPU_PAGE_MASK
> > > > > size == 0 || size & AMDGPU_GPU_PAGE_MASK)
> > > > > return -EINVAL;
> > > > >
> > > > > /* snip */
> > > > >
> > > > > saddr /= AMDGPU_GPU_PAGE_SIZE;
> > > > > eaddr /= AMDGPU_GPU_PAGE_SIZE;
> > > > >
> > > > > /* snip */
> > > > >
> > > > > mapping->start = saddr;
> > > > > mapping->last = eaddr;
> > > > >
> > > > >
> > > > > If we really want to ensure (mapping->last - mapping->start + 1) %
> > > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
> > > > > "AMDGPU_GPU_PAGE_MASK"
> > > > > in "validate the parameters" with "PAGE_MASK".
> > It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
> > "AMDGPU_GPU_PAGE_MASK" :(.
> >
> > > > Yeah, good point.
> > > >
> > > > > I tried it and it broke userspace: Xorg startup fails with EINVAL with
> > > > > this
> > > > > change.
> > > > Well in theory it is possible that we always fill the GPUVM on a 4k
> > > > basis while the native page size of the CPU is larger. Let me double
> > > > check the code.
> > On my platform, there are two issues both causing the VM corruption. One is
> > fixed by agd5f/linux@fe001e7.
>
> What is fe001e7? A commit id? If yes then that is to short and I can't
> find it.
>
> > Another is in Mesa from userspace: it uses
> > `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
> > expects it to use `dev_info->virtual_address_alignment`.
>
> Mhm, looking at the kernel code I would rather say Mesa is correct and
> the kernel driver is broken.
>
> The gart_page_size is limited by the CPU page size, but the
> virtual_address_alignment isn't.
>
> > If we can make the change to fill the GPUVM on a 4k basis, we can fix this
> > issue
> > and make virtual_address_alignment = 4K. Otherwise, we should fortify the
> > parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then
> > the
> > userspace will just get an EINVAL, instead of a slient VM corruption. And
> > someone should tell Mesa developers to fix the code in this case.
>
> I rather see this as a kernel bug. Can you test if this code fragment
> fixes your issue:
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 64beb3399604..e1260b517e1b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
> *data, struct drm_file *filp)
> }
> dev_info->virtual_address_alignment =
> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
> dev_info->pte_fragment_size = (1 <<
> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
> - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
> + dev_info->gart_page_size =
> dev_info->virtual_address_alignment;
> dev_info->cu_active_number = adev->gfx.cu_info.number;
> dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
> dev_info->ce_ram_size = adev->gfx.ce_ram_size;
It works. I've seen it at
https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae191
before (with a common sub-expression, though :).
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University
On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:
> On 2021-03-30 14:55 +0200, Christian König wrote:
> >
> > I rather see this as a kernel bug. Can you test if this code fragment
> > fixes your issue:
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > index 64beb3399604..e1260b517e1b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
> > *data, struct drm_file *filp)
> > }
> > dev_info->virtual_address_alignment =
> > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
> > dev_info->pte_fragment_size = (1 <<
> > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
> > - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
> > + dev_info->gart_page_size =
> > dev_info->virtual_address_alignment;
> > dev_info->cu_active_number = adev->gfx.cu_info.number;
> > dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
> > dev_info->ce_ram_size = adev->gfx.ce_ram_size;
>
> It works. I've seen it at
> https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae191
> before (with a common sub-expression, though :).
Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs
without this commit. But on the system I built from source, I didn't see any
issue before Linux 5.11. So I misbelieved that it was something already fixed.
Dan: you can try it on your PPC 64 with non-4K page as well.
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University
On Tue, 30 Mar 2021 14:55:01 +0200
Christian König <[email protected]> wrote:
> Am 30.03.21 um 14:04 schrieb Xi Ruoyao:
> > On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
> >> On 2021-03-29 21:36 +0200, Christian König wrote:
> >>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
> >>>> Hi Christian,
> >>>>
> >>>> I don't think there is any constraint implemented to ensure `num_entries %
> >>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
> >>>>
> >>>> /* validate the parameters */
> >>>> if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
> >>>> size == 0 || size & AMDGPU_GPU_PAGE_MASK)
> >>>> return -EINVAL;
> >>>>
> >>>> /* snip */
> >>>>
> >>>> saddr /= AMDGPU_GPU_PAGE_SIZE;
> >>>> eaddr /= AMDGPU_GPU_PAGE_SIZE;
> >>>>
> >>>> /* snip */
> >>>>
> >>>> mapping->start = saddr;
> >>>> mapping->last = eaddr;
> >>>>
> >>>>
> >>>> If we really want to ensure (mapping->last - mapping->start + 1) %
> >>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
> >>>> "AMDGPU_GPU_PAGE_MASK"
> >>>> in "validate the parameters" with "PAGE_MASK".
> > It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
> > "AMDGPU_GPU_PAGE_MASK" :(.
> >
> >>> Yeah, good point.
> >>>
> >>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with
> >>>> this
> >>>> change.
> >>> Well in theory it is possible that we always fill the GPUVM on a 4k
> >>> basis while the native page size of the CPU is larger. Let me double
> >>> check the code.
> > On my platform, there are two issues both causing the VM corruption. One is
> > fixed by agd5f/linux@fe001e7.
>
> What is fe001e7? A commit id? If yes then that is to short and I can't
> find it.
it's a gitlab shortcut for
https://gitlab.freedesktop.org/agd5f/linux/-/commit/fe001e70a55d0378328612be1fdc3abfc68b9ccc
Dan
>
> > Another is in Mesa from userspace: it uses
> > `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
> > expects it to use `dev_info->virtual_address_alignment`.
>
> Mhm, looking at the kernel code I would rather say Mesa is correct and
> the kernel driver is broken.
>
> The gart_page_size is limited by the CPU page size, but the
> virtual_address_alignment isn't.
>
> > If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
> > and make virtual_address_alignment = 4K. Otherwise, we should fortify the
> > parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the
> > userspace will just get an EINVAL, instead of a slient VM corruption. And
> > someone should tell Mesa developers to fix the code in this case.
>
> I rather see this as a kernel bug. Can you test if this code fragment
> fixes your issue:
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 64beb3399604..e1260b517e1b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
> *data, struct drm_file *filp)
> }
> dev_info->virtual_address_alignment =
> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
> dev_info->pte_fragment_size = (1 <<
> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
> - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
> + dev_info->gart_page_size =
> dev_info->virtual_address_alignment;
> dev_info->cu_active_number = adev->gfx.cu_info.number;
> dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
> dev_info->ce_ram_size = adev->gfx.ce_ram_size;
>
>
> Thanks,
> Christian.
>
> > --
> > Xi Ruoyao <[email protected]>
> > School of Aerospace Science and Technology, Xidian University
> >
>
On Tue, 30 Mar 2021 21:09:12 +0800
Xi Ruoyao <[email protected]> wrote:
> On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:
> > On 2021-03-30 14:55 +0200, Christian König wrote:
> > >
> > > I rather see this as a kernel bug. Can you test if this code fragment
> > > fixes your issue:
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > index 64beb3399604..e1260b517e1b 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
> > > *data, struct drm_file *filp)
> > > }
> > > dev_info->virtual_address_alignment =
> > > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
> > > dev_info->pte_fragment_size = (1 <<
> > > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
> > > - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
> > > + dev_info->gart_page_size =
> > > dev_info->virtual_address_alignment;
> > > dev_info->cu_active_number = adev->gfx.cu_info.number;
> > > dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
> > > dev_info->ce_ram_size = adev->gfx.ce_ram_size;
> >
> > It works. I've seen it at
> > https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae191
> > before (with a common sub-expression, though :).
>
> Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs
> without this commit. But on the system I built from source, I didn't see any
> issue before Linux 5.11. So I misbelieved that it was something already fixed.
>
> Dan: you can try it on your PPC 64 with non-4K page as well.
yup, looks good here as well, ppc64le (Power9) system with 64KB pages
Dan
Am 30.03.21 um 15:23 schrieb Dan Horák:
> On Tue, 30 Mar 2021 21:09:12 +0800
> Xi Ruoyao <[email protected]> wrote:
>
>> On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:
>>> On 2021-03-30 14:55 +0200, Christian König wrote:
>>>> I rather see this as a kernel bug. Can you test if this code fragment
>>>> fixes your issue:
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> index 64beb3399604..e1260b517e1b 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
>>>> *data, struct drm_file *filp)
>>>> }
>>>> dev_info->virtual_address_alignment =
>>>> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
>>>> dev_info->pte_fragment_size = (1 <<
>>>> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
>>>> - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
>>>> + dev_info->gart_page_size =
>>>> dev_info->virtual_address_alignment;
>>>> dev_info->cu_active_number = adev->gfx.cu_info.number;
>>>> dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
>>>> dev_info->ce_ram_size = adev->gfx.ce_ram_size;
>>> It works. I've seen it at
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fxen0n%2Flinux%2Fcommit%2F84ada72983838bd7ce54bc32f5d34ac5b5aae191&data=04%7C01%7Cchristian.koenig%40amd.com%7Cf37fddf20a8847edf67808d8f37ef23c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637527074118791321%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=DZnmee38NGpiWRMX5LmlxOhxAzBMhAusnAWNnCxXTJ0%3D&reserved=0
>>> before (with a common sub-expression, though :).
>> Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs
>> without this commit. But on the system I built from source, I didn't see any
>> issue before Linux 5.11. So I misbelieved that it was something already fixed.
>>
>> Dan: you can try it on your PPC 64 with non-4K page as well.
> yup, looks good here as well, ppc64le (Power9) system with 64KB pages
Mhm, as far as I can say this patch never made it to us.
Can you please send it to the mailing list with me on CC?
Thanks,
Christian.
>
>
> Dan
On 2021-03-30 15:24 +0200, Christian König wrote:
> Am 30.03.21 um 15:23 schrieb Dan Horák:
> > On Tue, 30 Mar 2021 21:09:12 +0800
> > Xi Ruoyao <[email protected]> wrote:
> >
> > > On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:
> > > > On 2021-03-30 14:55 +0200, Christian König wrote:
> > > > > I rather see this as a kernel bug. Can you test if this code fragment
> > > > > fixes your issue:
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > > > index 64beb3399604..e1260b517e1b 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > > > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
> > > > > *data, struct drm_file *filp)
> > > > > }
> > > > > dev_info->virtual_address_alignment =
> > > > > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
> > > > > dev_info->pte_fragment_size = (1 <<
> > > > > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
> > > > > - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
> > > > > + dev_info->gart_page_size =
> > > > > dev_info->virtual_address_alignment;
> > > > > dev_info->cu_active_number = adev-
> > > > > >gfx.cu_info.number;
> > > > > dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
> > > > > dev_info->ce_ram_size = adev->gfx.ce_ram_size;
> > > > It works. I've seen it at
> > > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fxen0n%2Flinux%2Fcommit%2F84ada72983838bd7ce54bc32f5d34ac5b5aae191&data=04%7C01%7Cchristian.koenig%40amd.com%7Cf37fddf20a8847edf67808d8f37ef23c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637527074118791321%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=DZnmee38NGpiWRMX5LmlxOhxAzBMhAusnAWNnCxXTJ0%3D&reserved=0
> > > > before (with a common sub-expression, though :).
> > > Some comment: on an old version of Fedora ported by Loongson, Xorg just
> > > hangs
> > > without this commit. But on the system I built from source, I didn't see
> > > any
> > > issue before Linux 5.11. So I misbelieved that it was something already
> > > fixed.
> > >
> > > Dan: you can try it on your PPC 64 with non-4K page as well.
> > yup, looks good here as well, ppc64le (Power9) system with 64KB pages
>
> Mhm, as far as I can say this patch never made it to us.
I think maybe its author considers it a "workaround" like me, as there is
already a "virtual_address_alignment" which seems correct.
> Can you please send it to the mailing list with me on CC?
I've sent it, together with my patch for using ~PAGE_MASK in parameter
validation.
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University