LinuxLists.cc - [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

2021-03-29 18:03:28

Subject: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

If the initial value of `num_entires` (calculated at line 1654) is not
an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a
value greater than the initial value will be assigned to it. That causes
`start > last + 1` after line 1708. Then in the next iteration an
underflow happens at line 1654. It causes message

*ERROR* Couldn't update BO_VA (-12)

printed in kernel log, and GPU hanging.

Fortify the criteria of the loop to fix this issue.

BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
Reported-by: Xi Ruoyao <[email protected]>
Reported-by: Dan Horák <[email protected]>
Cc: [email protected]
Signed-off-by: Xi Ruoyao <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index ad91c0c3c423..cee0cc9c8085 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
}
start = tmp;

- } while (unlikely(start != last + 1));
+ } while (unlikely(start < last + 1));

r = vm->update_funcs->commit(&params, fence);

base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
--
2.31.1

2021-03-29 18:05:57

by Christian König

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

Am 29.03.21 um 19:53 schrieb Xℹ Ruoyao:
> If the initial value of `num_entires` (calculated at line 1654) is not
> an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a
> value greater than the initial value will be assigned to it. That causes
> `start > last + 1` after line 1708. Then in the next iteration an
> underflow happens at line 1654. It causes message
>
> *ERROR* Couldn't update BO_VA (-12)
>
> printed in kernel log, and GPU hanging.
>
> Fortify the criteria of the loop to fix this issue.

NAK the value of num_entries must always be a multiple of
AMDGPU_GPU_PAGES_IN_CPU_PAGE or otherwise we corrupt the page tables.

How do you trigger that?

Christian.

>
> BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
> Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
> Reported-by: Xi Ruoyao <[email protected]>
> Reported-by: Dan Horák <[email protected]>
> Cc: [email protected]
> Signed-off-by: Xi Ruoyao <[email protected]>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index ad91c0c3c423..cee0cc9c8085 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
> }
> start = tmp;
>
> - } while (unlikely(start != last + 1));
> + } while (unlikely(start < last + 1));
>
> r = vm->update_funcs->commit(&params, fence);
>
>
> base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3

2021-03-29 18:10:40

by Xi Ruoyao

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

On 2021-03-29 20:04 +0200, Christian König wrote:
> Am 29.03.21 um 19:53 schrieb Xℹ Ruoyao:
> > If the initial value of `num_entires` (calculated at line 1654) is not
> > an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a
> > value greater than the initial value will be assigned to it. That causes
> > `start > last + 1` after line 1708. Then in the next iteration an
> > underflow happens at line 1654. It causes message
> >
> >      *ERROR* Couldn't update BO_VA (-12)
> >
> > printed in kernel log, and GPU hanging.
> >
> > Fortify the criteria of the loop to fix this issue.
>
> NAK the value of num_entries must always be a multiple of
> AMDGPU_GPU_PAGES_IN_CPU_PAGE or otherwise we corrupt the page tables.
>
> How do you trigger that?

Simply run "OpenGL area" from gtk3-demo (which just renders a triangle with GL)
under Xorg, on MIPS64. See the BugLink.

> Christian.
>
> >
> > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
> > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
> > Reported-by: Xi Ruoyao <[email protected]>
> > Reported-by: Dan Horák <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: Xi Ruoyao <[email protected]>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > index ad91c0c3c423..cee0cc9c8085 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
> > amdgpu_device *adev,
> >                 }
> >                 start = tmp;
> >
> > -       } while (unlikely(start != last + 1));
> > +       } while (unlikely(start < last + 1));
> >
> >         r = vm->update_funcs->commit(&params, fence);
> >
> >
> > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
>

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2021-03-29 18:12:41

by Christian König

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

Am 29.03.21 um 20:08 schrieb Xi Ruoyao:
> On 2021-03-29 20:04 +0200, Christian König wrote:
>> Am 29.03.21 um 19:53 schrieb Xℹ Ruoyao:
>>> If the initial value of `num_entires` (calculated at line 1654) is not
>>> an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a
>>> value greater than the initial value will be assigned to it. That causes
>>> `start > last + 1` after line 1708. Then in the next iteration an
>>> underflow happens at line 1654. It causes message
>>>
>>>      *ERROR* Couldn't update BO_VA (-12)
>>>
>>> printed in kernel log, and GPU hanging.
>>>
>>> Fortify the criteria of the loop to fix this issue.
>> NAK the value of num_entries must always be a multiple of
>> AMDGPU_GPU_PAGES_IN_CPU_PAGE or otherwise we corrupt the page tables.
>>
>> How do you trigger that?
> Simply run "OpenGL area" from gtk3-demo (which just renders a triangle with GL)
> under Xorg, on MIPS64. See the BugLink.

You need to identify the root cause of this, most likely start or last
are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.

Christian.

>
>> Christian.
>>
>>> BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
>>> Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
>>> Reported-by: Xi Ruoyao <[email protected]>
>>> Reported-by: Dan Horák <[email protected]>
>>> Cc: [email protected]
>>> Signed-off-by: Xi Ruoyao <[email protected]>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index ad91c0c3c423..cee0cc9c8085 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
>>> amdgpu_device *adev,
>>>                 }
>>>                 start = tmp;
>>>
>>> -       } while (unlikely(start != last + 1));
>>> +       } while (unlikely(start < last + 1));
>>>
>>>         r = vm->update_funcs->commit(&params, fence);
>>>
>>>
>>> base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3

2021-03-29 18:24:01

by Xi Ruoyao

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

On 2021-03-29 20:10 +0200, Christian König wrote:
> You need to identify the root cause of this, most likely start or last
> are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.

I printk'ed the value of start & last, they are all a multiple of 4
(AMDGPU_GPU_PAGES_IN_CPU_PAGE).

However... `num_entries = last - start + 1` so it became some irrational
thing... Either this line is wrong, or someone called
amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which
is unexpected.

> > > > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
> > > > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
> > > > Reported-by: Xi Ruoyao <[email protected]>
> > > > Reported-by: Dan Horák <[email protected]>
> > > > Cc: [email protected]
> > > > Signed-off-by: Xi Ruoyao <[email protected]>
> > > > ---
> > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > index ad91c0c3c423..cee0cc9c8085 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
> > > > amdgpu_device *adev,
> > > >                  }
> > > >                  start = tmp;
> > > >
> > > > -       } while (unlikely(start != last + 1));
> > > > +       } while (unlikely(start < last + 1));
> > > >
> > > >          r = vm->update_funcs->commit(&params, fence);
> > > >
> > > >
> > > > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
>

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2021-03-29 18:32:44

by Xi Ruoyao

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

On 2021-03-30 02:21 +0800, Xi Ruoyao wrote:
> On 2021-03-29 20:10 +0200, Christian König wrote:
> > You need to identify the root cause of this, most likely start or last
> > are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.
>
> I printk'ed the value of start & last, they are all a multiple of 4
> (AMDGPU_GPU_PAGES_IN_CPU_PAGE).
>
> However... `num_entries = last - start + 1` so it became some irrational
> thing... Either this line is wrong, or someone called
> amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which
> is unexpected.

I added BUG_ON(num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE != 0), get:

> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c90750>]
> amdgpu_vm_bo_update_mapping.constprop.0+0x218/0xae8
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c922b8>]
> amdgpu_vm_bo_update+0x270/0x4c0
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c823ec>]
> amdgpu_gem_va_ioctl+0x40c/0x430
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cae4>]
> drm_ioctl_kernel+0xcc/0x120
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cfd8>]
> drm_ioctl+0x220/0x408
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c5ea48>]
> amdgpu_drm_ioctl+0x58/0x98
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff838feac4>] sys_ioctl+0xcc/0xe8
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff836e74f0>]
> syscall_common+0x34/0x58
>

> > > > > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
> > > > > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
> > > > > Reported-by: Xi Ruoyao <[email protected]>
> > > > > Reported-by: Dan Horák <[email protected]>
> > > > > Cc: [email protected]
> > > > > Signed-off-by: Xi Ruoyao <[email protected]>
> > > > > ---
> > > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> > > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > index ad91c0c3c423..cee0cc9c8085 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
> > > > > amdgpu_device *adev,
> > > > >                  }
> > > > >                  start = tmp;
> > > > >
> > > > > -       } while (unlikely(start != last + 1));
> > > > > +       } while (unlikely(start < last + 1));
> > > > >
> > > > >          r = vm->update_funcs->commit(&params, fence);
> > > > >
> > > > >
> > > > > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
> >
>

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2021-03-29 19:29:16

by Xi Ruoyao

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

Hi Christian,

I don't think there is any constraint implemented to ensure `num_entries %
AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:

/* validate the parameters */
if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK ||
size == 0 || size & AMDGPU_GPU_PAGE_MASK)
return -EINVAL;

/* snip */

saddr /= AMDGPU_GPU_PAGE_SIZE;
eaddr /= AMDGPU_GPU_PAGE_SIZE;

/* snip */

mapping->start = saddr;
mapping->last = eaddr;

If we really want to ensure (mapping->last - mapping->start + 1) %
AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK"
in "validate the parameters" with "PAGE_MASK".

I tried it and it broke userspace: Xorg startup fails with EINVAL with this
change.

On 2021-03-30 02:30 +0800, Xi Ruoyao wrote:
> On 2021-03-30 02:21 +0800, Xi Ruoyao wrote:
> > On 2021-03-29 20:10 +0200, Christian König wrote:
> > > You need to identify the root cause of this, most likely start or last
> > > are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.
> >
> > I printk'ed the value of start & last, they are all a multiple of 4
> > (AMDGPU_GPU_PAGES_IN_CPU_PAGE).
> >
> > However... `num_entries = last - start + 1` so it became some irrational
> > thing... Either this line is wrong, or someone called
> > amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which
> > is unexpected.
>
> I added BUG_ON(num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE != 0), get:
>
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c90750>]
> > amdgpu_vm_bo_update_mapping.constprop.0+0x218/0xae8
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c922b8>]
> > amdgpu_vm_bo_update+0x270/0x4c0
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c823ec>]
> > amdgpu_gem_va_ioctl+0x40c/0x430
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cae4>]
> > drm_ioctl_kernel+0xcc/0x120
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cfd8>]
> > drm_ioctl+0x220/0x408
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c5ea48>]
> > amdgpu_drm_ioctl+0x58/0x98
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff838feac4>] sys_ioctl+0xcc/0xe8
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff836e74f0>]
> > syscall_common+0x34/0x58
> >
>
> > > > > > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
> > > > > > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
> > > > > > Reported-by: Xi Ruoyao <[email protected]>
> > > > > > Reported-by: Dan Horák <[email protected]>
> > > > > > Cc: [email protected]
> > > > > > Signed-off-by: Xi Ruoyao <[email protected]>
> > > > > > ---
> > > > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> > > > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > > index ad91c0c3c423..cee0cc9c8085 100644
> > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
> > > > > > amdgpu_device *adev,
> > > > > >                  }
> > > > > >                  start = tmp;
> > > > > >
> > > > > > -       } while (unlikely(start != last + 1));
> > > > > > +       } while (unlikely(start < last + 1));
> > > > > >
> > > > > >          r = vm->update_funcs->commit(&params, fence);
> > > > > >
> > > > > >
> > > > > > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
> > >
> >
>

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2021-03-29 19:38:20

by Christian König

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
> Hi Christian,
>
> I don't think there is any constraint implemented to ensure `num_entries %
> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
>
> /* validate the parameters */
> if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK ||
> size == 0 || size & AMDGPU_GPU_PAGE_MASK)
> return -EINVAL;
>
> /* snip */
>
> saddr /= AMDGPU_GPU_PAGE_SIZE;
> eaddr /= AMDGPU_GPU_PAGE_SIZE;
>
> /* snip */
>
> mapping->start = saddr;
> mapping->last = eaddr;
>
>
> If we really want to ensure (mapping->last - mapping->start + 1) %
> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK"
> in "validate the parameters" with "PAGE_MASK".

Yeah, good point.

> I tried it and it broke userspace: Xorg startup fails with EINVAL with this
> change.

Well in theory it is possible that we always fill the GPUVM on a 4k
basis while the native page size of the CPU is larger. Let me double
check the code.

BTW: What code base are you based on? The code your post here is quite
outdated.

Christian.

>
> On 2021-03-30 02:30 +0800, Xi Ruoyao wrote:
>> On 2021-03-30 02:21 +0800, Xi Ruoyao wrote:
>>> On 2021-03-29 20:10 +0200, Christian König wrote:
>>>> You need to identify the root cause of this, most likely start or last
>>>> are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.
>>> I printk'ed the value of start & last, they are all a multiple of 4
>>> (AMDGPU_GPU_PAGES_IN_CPU_PAGE).
>>>
>>> However... `num_entries = last - start + 1` so it became some irrational
>>> thing... Either this line is wrong, or someone called
>>> amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which
>>> is unexpected.
>> I added BUG_ON(num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE != 0), get:
>>
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c90750>]
>>> amdgpu_vm_bo_update_mapping.constprop.0+0x218/0xae8
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c922b8>]
>>> amdgpu_vm_bo_update+0x270/0x4c0
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c823ec>]
>>> amdgpu_gem_va_ioctl+0x40c/0x430
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cae4>]
>>> drm_ioctl_kernel+0xcc/0x120
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cfd8>]
>>> drm_ioctl+0x220/0x408
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c5ea48>]
>>> amdgpu_drm_ioctl+0x58/0x98
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff838feac4>] sys_ioctl+0xcc/0xe8
>>> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff836e74f0>]
>>> syscall_common+0x34/0x58
>>>
>>>>>>> BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
>>>>>>> Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
>>>>>>> Reported-by: Xi Ruoyao <[email protected]>
>>>>>>> Reported-by: Dan Horák <[email protected]>
>>>>>>> Cc: [email protected]
>>>>>>> Signed-off-by: Xi Ruoyao <[email protected]>
>>>>>>> ---
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>>>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>> index ad91c0c3c423..cee0cc9c8085 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>>> @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>                  }
>>>>>>>                  start = tmp;
>>>>>>>
>>>>>>> -       } while (unlikely(start != last + 1));
>>>>>>> +       } while (unlikely(start < last + 1));
>>>>>>>
>>>>>>>          r = vm->update_funcs->commit(&params, fence);
>>>>>>>
>>>>>>>
>>>>>>> base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3

2021-03-29 19:42:07

by Xi Ruoyao

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

On 2021-03-29 21:36 +0200, Christian König wrote:
> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
> > Hi Christian,
> >
> > I don't think there is any constraint implemented to ensure `num_entries %
> > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
> >
> >          /* validate the parameters */
> >          if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
> > ||
> >              size == 0 || size & AMDGPU_GPU_PAGE_MASK)
> >                  return -EINVAL;
> >
> > /* snip */
> >
> >          saddr /= AMDGPU_GPU_PAGE_SIZE;
> >          eaddr /= AMDGPU_GPU_PAGE_SIZE;
> >
> > /* snip */
> >
> >          mapping->start = saddr;
> >          mapping->last = eaddr;
> >
> >
> > If we really want to ensure (mapping->last - mapping->start + 1) %
> > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
> > "AMDGPU_GPU_PAGE_MASK"
> > in "validate the parameters" with "PAGE_MASK".
>
> Yeah, good point.
>
> > I tried it and it broke userspace: Xorg startup fails with EINVAL with this
> > change.
>
> Well in theory it is possible that we always fill the GPUVM on a 4k
> basis while the native page size of the CPU is larger. Let me double
> check the code.
>
> BTW: What code base are you based on? The code your post here is quite
> outdated.

Linus' tree.

I'll go to sleep now (it's 03:39 here :( ), when I wake up I can try to fetch
drm-next or something.
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2021-03-30 12:06:40

by Xi Ruoyao

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
> On 2021-03-29 21:36 +0200, Christian König wrote:
> > Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
> > > Hi Christian,
> > >
> > > I don't think there is any constraint implemented to ensure `num_entries %
> > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
> > >
> > >          /* validate the parameters */
> > >          if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
> > > > >
> > >              size == 0 || size & AMDGPU_GPU_PAGE_MASK)
> > >                  return -EINVAL;
> > >
> > > /* snip */
> > >
> > >          saddr /= AMDGPU_GPU_PAGE_SIZE;
> > >          eaddr /= AMDGPU_GPU_PAGE_SIZE;
> > >
> > > /* snip */
> > >
> > >          mapping->start = saddr;
> > >          mapping->last = eaddr;
> > >
> > >
> > > If we really want to ensure (mapping->last - mapping->start + 1) %
> > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
> > > "AMDGPU_GPU_PAGE_MASK"
> > > in "validate the parameters" with "PAGE_MASK".

It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
"AMDGPU_GPU_PAGE_MASK" :(.

> > Yeah, good point.
> >
> > > I tried it and it broke userspace: Xorg startup fails with EINVAL with
> > > this
> > > change.
> >
> > Well in theory it is possible that we always fill the GPUVM on a 4k
> > basis while the native page size of the CPU is larger. Let me double
> > check the code.

On my platform, there are two issues both causing the VM corruption. One is
fixed by agd5f/linux@fe001e7. Another is in Mesa from userspace: it uses
`dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
expects it to use `dev_info->virtual_address_alignment`.

If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
and make virtual_address_alignment = 4K. Otherwise, we should fortify the
parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the
userspace will just get an EINVAL, instead of a slient VM corruption. And
someone should tell Mesa developers to fix the code in this case.
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2021-03-30 12:57:01

by Christian König

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

Am 30.03.21 um 14:04 schrieb Xi Ruoyao:
> On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
>> On 2021-03-29 21:36 +0200, Christian König wrote:
>>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
>>>> Hi Christian,
>>>>
>>>> I don't think there is any constraint implemented to ensure `num_entries %
>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
>>>>
>>>>          /* validate the parameters */
>>>>          if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
>>>>              size == 0 || size & AMDGPU_GPU_PAGE_MASK)
>>>>                  return -EINVAL;
>>>>
>>>> /* snip */
>>>>
>>>>          saddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>          eaddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>
>>>> /* snip */
>>>>
>>>>          mapping->start = saddr;
>>>>          mapping->last = eaddr;
>>>>
>>>>
>>>> If we really want to ensure (mapping->last - mapping->start + 1) %
>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
>>>> "AMDGPU_GPU_PAGE_MASK"
>>>> in "validate the parameters" with "PAGE_MASK".
> It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
> "AMDGPU_GPU_PAGE_MASK" :(.
>
>>> Yeah, good point.
>>>
>>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with
>>>> this
>>>> change.
>>> Well in theory it is possible that we always fill the GPUVM on a 4k
>>> basis while the native page size of the CPU is larger. Let me double
>>> check the code.
> On my platform, there are two issues both causing the VM corruption. One is
> fixed by agd5f/linux@fe001e7.

What is fe001e7? A commit id? If yes then that is to short and I can't
find it.

> Another is in Mesa from userspace: it uses
> `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
> expects it to use `dev_info->virtual_address_alignment`.

Mhm, looking at the kernel code I would rather say Mesa is correct and
the kernel driver is broken.

The gart_page_size is limited by the CPU page size, but the
virtual_address_alignment isn't.

> If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
> and make virtual_address_alignment = 4K. Otherwise, we should fortify the
> parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the
> userspace will just get an EINVAL, instead of a slient VM corruption. And
> someone should tell Mesa developers to fix the code in this case.

I rather see this as a kernel bug. Can you test if this code fragment
fixes your issue:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 64beb3399604..e1260b517e1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
*data, struct drm_file *filp)
                }
                dev_info->virtual_address_alignment =
max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
                dev_info->pte_fragment_size = (1 <<
adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
-               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
+               dev_info->gart_page_size =
dev_info->virtual_address_alignment;
                dev_info->cu_active_number = adev->gfx.cu_info.number;
                dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
                dev_info->ce_ram_size = adev->gfx.ce_ram_size;

Thanks,
Christian.

> --
> Xi Ruoyao <[email protected]>
> School of Aerospace Science and Technology, Xidian University
>

2021-03-30 13:04:46

by Christian König

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

Am 30.03.21 um 15:00 schrieb Dan Horák:
> On Tue, 30 Mar 2021 14:55:01 +0200
> Christian König <[email protected]> wrote:
>
>> Am 30.03.21 um 14:04 schrieb Xi Ruoyao:
>>> On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
>>>> On 2021-03-29 21:36 +0200, Christian König wrote:
>>>>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
>>>>>> Hi Christian,
>>>>>>
>>>>>> I don't think there is any constraint implemented to ensure `num_entries %
>>>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
>>>>>>
>>>>>>          /* validate the parameters */
>>>>>>          if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
>>>>>>              size == 0 || size & AMDGPU_GPU_PAGE_MASK)
>>>>>>                  return -EINVAL;
>>>>>>
>>>>>> /* snip */
>>>>>>
>>>>>>          saddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>>>          eaddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>>>
>>>>>> /* snip */
>>>>>>
>>>>>>          mapping->start = saddr;
>>>>>>          mapping->last = eaddr;
>>>>>>
>>>>>>
>>>>>> If we really want to ensure (mapping->last - mapping->start + 1) %
>>>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
>>>>>> "AMDGPU_GPU_PAGE_MASK"
>>>>>> in "validate the parameters" with "PAGE_MASK".
>>> It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
>>> "AMDGPU_GPU_PAGE_MASK" :(.
>>>
>>>>> Yeah, good point.
>>>>>
>>>>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with
>>>>>> this
>>>>>> change.
>>>>> Well in theory it is possible that we always fill the GPUVM on a 4k
>>>>> basis while the native page size of the CPU is larger. Let me double
>>>>> check the code.
>>> On my platform, there are two issues both causing the VM corruption. One is
>>> fixed by agd5f/linux@fe001e7.
>> What is fe001e7? A commit id? If yes then that is to short and I can't
>> find it.
> it's a gitlab shortcut for
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fagd5f%2Flinux%2F-%2Fcommit%2Ffe001e70a55d0378328612be1fdc3abfc68b9ccc&data=04%7C01%7Cchristian.koenig%40amd.com%7Cd16d123aaa01420ebd0808d8f37bbf2f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637527060812278536%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=5rFVLxSRnfHUGjhoiqx1e6SeROqbg4ZPef%2BxEvgv%2BTg%3D&reserved=0

Ah! Yes I would expect that this patch is fixing something in this use case.

Thanks,
Christian.

>
>
> Dan
>>> Another is in Mesa from userspace: it uses
>>> `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
>>> expects it to use `dev_info->virtual_address_alignment`.
>> Mhm, looking at the kernel code I would rather say Mesa is correct and
>> the kernel driver is broken.
>>
>> The gart_page_size is limited by the CPU page size, but the
>> virtual_address_alignment isn't.
>>
>>> If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
>>> and make virtual_address_alignment = 4K. Otherwise, we should fortify the
>>> parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the
>>> userspace will just get an EINVAL, instead of a slient VM corruption. And
>>> someone should tell Mesa developers to fix the code in this case.
>> I rather see this as a kernel bug. Can you test if this code fragment
>> fixes your issue:
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index 64beb3399604..e1260b517e1b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
>> *data, struct drm_file *filp)
>>                 }
>>                 dev_info->virtual_address_alignment =
>> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
>>                 dev_info->pte_fragment_size = (1 <<
>> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
>> -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
>> +               dev_info->gart_page_size =
>> dev_info->virtual_address_alignment;
>>                 dev_info->cu_active_number = adev->gfx.cu_info.number;
>>                 dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
>>                 dev_info->ce_ram_size = adev->gfx.ce_ram_size;
>>
>>
>> Thanks,
>> Christian.
>>
>>> --
>>> Xi Ruoyao <[email protected]>
>>> School of Aerospace Science and Technology, Xidian University
>>>

2021-03-30 13:05:34

by Xi Ruoyao

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

On 2021-03-30 14:55 +0200, Christian König wrote:
> Am 30.03.21 um 14:04 schrieb Xi Ruoyao:
> > On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
> > > On 2021-03-29 21:36 +0200, Christian König wrote:
> > > > Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
> > > > > Hi Christian,
> > > > >
> > > > > I don't think there is any constraint implemented to ensure `num_entries
> > > > > %
> > > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in
> > > > > `amdgpu_vm_bo_map()`:
> > > > >
> > > > >           /* validate the parameters */
> > > > >           if (saddr & AMDGPU_GPU_PAGE_MASK || offset &
> > > > > AMDGPU_GPU_PAGE_MASK
> > > > >               size == 0 || size & AMDGPU_GPU_PAGE_MASK)
> > > > >                   return -EINVAL;
> > > > >
> > > > > /* snip */
> > > > >
> > > > >           saddr /= AMDGPU_GPU_PAGE_SIZE;
> > > > >           eaddr /= AMDGPU_GPU_PAGE_SIZE;
> > > > >
> > > > > /* snip */
> > > > >
> > > > >           mapping->start = saddr;
> > > > >           mapping->last = eaddr;
> > > > >
> > > > >
> > > > > If we really want to ensure (mapping->last - mapping->start + 1) %
> > > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
> > > > > "AMDGPU_GPU_PAGE_MASK"
> > > > > in "validate the parameters" with "PAGE_MASK".
> > It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
> > "AMDGPU_GPU_PAGE_MASK" :(.
> >
> > > > Yeah, good point.
> > > >
> > > > > I tried it and it broke userspace: Xorg startup fails with EINVAL with
> > > > > this
> > > > > change.
> > > > Well in theory it is possible that we always fill the GPUVM on a 4k
> > > > basis while the native page size of the CPU is larger. Let me double
> > > > check the code.
> > On my platform, there are two issues both causing the VM corruption. One is
> > fixed by agd5f/linux@fe001e7.
>
> What is fe001e7? A commit id? If yes then that is to short and I can't
> find it.
>
> >    Another is in Mesa from userspace: it uses
> > `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
> > expects it to use `dev_info->virtual_address_alignment`.
>
> Mhm, looking at the kernel code I would rather say Mesa is correct and
> the kernel driver is broken.
>
> The gart_page_size is limited by the CPU page size, but the
> virtual_address_alignment isn't.
>
> > If we can make the change to fill the GPUVM on a 4k basis, we can fix this
> > issue
> > and make virtual_address_alignment = 4K. Otherwise, we should fortify the
> > parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then
> > the
> > userspace will just get an EINVAL, instead of a slient VM corruption. And
> > someone should tell Mesa developers to fix the code in this case.
>
> I rather see this as a kernel bug. Can you test if this code fragment
> fixes your issue:
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 64beb3399604..e1260b517e1b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
> *data, struct drm_file *filp)
>                  }
>                  dev_info->virtual_address_alignment =
> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
>                  dev_info->pte_fragment_size = (1 <<
> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
> -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
> +               dev_info->gart_page_size =
> dev_info->virtual_address_alignment;
>                  dev_info->cu_active_number = adev->gfx.cu_info.number;
>                  dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
>                  dev_info->ce_ram_size = adev->gfx.ce_ram_size;

It works. I've seen it at
https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae191
before (with a common sub-expression, though :).

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2021-03-30 13:11:25

by Xi Ruoyao

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:
> On 2021-03-30 14:55 +0200, Christian König wrote:
> >
> > I rather see this as a kernel bug. Can you test if this code fragment
> > fixes your issue:
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > index 64beb3399604..e1260b517e1b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
> > *data, struct drm_file *filp)
> >                  }
> >                  dev_info->virtual_address_alignment =
> > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
> >                  dev_info->pte_fragment_size = (1 <<
> > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
> > -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
> > +               dev_info->gart_page_size =
> > dev_info->virtual_address_alignment;
> >                  dev_info->cu_active_number = adev->gfx.cu_info.number;
> >                  dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
> >                  dev_info->ce_ram_size = adev->gfx.ce_ram_size;
>
> It works. I've seen it at
> https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae191
> before (with a common sub-expression, though :).

Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs
without this commit. But on the system I built from source, I didn't see any
issue before Linux 5.11. So I misbelieved that it was something already fixed.

Dan: you can try it on your PPC 64 with non-4K page as well.
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2021-03-30 13:12:13

by Dan Horák

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

On Tue, 30 Mar 2021 14:55:01 +0200
Christian König <[email protected]> wrote:

> Am 30.03.21 um 14:04 schrieb Xi Ruoyao:
> > On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
> >> On 2021-03-29 21:36 +0200, Christian König wrote:
> >>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
> >>>> Hi Christian,
> >>>>
> >>>> I don't think there is any constraint implemented to ensure `num_entries %
> >>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
> >>>>
> >>>>          /* validate the parameters */
> >>>>          if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
> >>>>              size == 0 || size & AMDGPU_GPU_PAGE_MASK)
> >>>>                  return -EINVAL;
> >>>>
> >>>> /* snip */
> >>>>
> >>>>          saddr /= AMDGPU_GPU_PAGE_SIZE;
> >>>>          eaddr /= AMDGPU_GPU_PAGE_SIZE;
> >>>>
> >>>> /* snip */
> >>>>
> >>>>          mapping->start = saddr;
> >>>>          mapping->last = eaddr;
> >>>>
> >>>>
> >>>> If we really want to ensure (mapping->last - mapping->start + 1) %
> >>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
> >>>> "AMDGPU_GPU_PAGE_MASK"
> >>>> in "validate the parameters" with "PAGE_MASK".
> > It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
> > "AMDGPU_GPU_PAGE_MASK" :(.
> >
> >>> Yeah, good point.
> >>>
> >>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with
> >>>> this
> >>>> change.
> >>> Well in theory it is possible that we always fill the GPUVM on a 4k
> >>> basis while the native page size of the CPU is larger. Let me double
> >>> check the code.
> > On my platform, there are two issues both causing the VM corruption. One is
> > fixed by agd5f/linux@fe001e7.
>
> What is fe001e7? A commit id? If yes then that is to short and I can't
> find it.

it's a gitlab shortcut for
https://gitlab.freedesktop.org/agd5f/linux/-/commit/fe001e70a55d0378328612be1fdc3abfc68b9ccc

Dan
>
> > Another is in Mesa from userspace: it uses
> > `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
> > expects it to use `dev_info->virtual_address_alignment`.
>
> Mhm, looking at the kernel code I would rather say Mesa is correct and
> the kernel driver is broken.
>
> The gart_page_size is limited by the CPU page size, but the
> virtual_address_alignment isn't.
>
> > If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
> > and make virtual_address_alignment = 4K. Otherwise, we should fortify the
> > parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the
> > userspace will just get an EINVAL, instead of a slient VM corruption. And
> > someone should tell Mesa developers to fix the code in this case.
>
> I rather see this as a kernel bug. Can you test if this code fragment
> fixes your issue:
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 64beb3399604..e1260b517e1b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
> *data, struct drm_file *filp)
>                 }
>                 dev_info->virtual_address_alignment =
> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
>                 dev_info->pte_fragment_size = (1 <<
> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
> -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
> +               dev_info->gart_page_size =
> dev_info->virtual_address_alignment;
>                 dev_info->cu_active_number = adev->gfx.cu_info.number;
>                 dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
>                 dev_info->ce_ram_size = adev->gfx.ce_ram_size;
>
>
> Thanks,
> Christian.
>
> > --
> > Xi Ruoyao <[email protected]>
> > School of Aerospace Science and Technology, Xidian University
> >
>

2021-03-30 13:24:52

by Dan Horák

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

On Tue, 30 Mar 2021 21:09:12 +0800
Xi Ruoyao <[email protected]> wrote:

> On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:
> > On 2021-03-30 14:55 +0200, Christian König wrote:
> > >
> > > I rather see this as a kernel bug. Can you test if this code fragment
> > > fixes your issue:
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > index 64beb3399604..e1260b517e1b 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
> > > *data, struct drm_file *filp)
> > >                  }
> > >                  dev_info->virtual_address_alignment =
> > > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
> > >                  dev_info->pte_fragment_size = (1 <<
> > > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
> > > -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
> > > +               dev_info->gart_page_size =
> > > dev_info->virtual_address_alignment;
> > >                  dev_info->cu_active_number = adev->gfx.cu_info.number;
> > >                  dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
> > >                  dev_info->ce_ram_size = adev->gfx.ce_ram_size;
> >
> > It works. I've seen it at
> > https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae191
> > before (with a common sub-expression, though :).
>
> Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs
> without this commit. But on the system I built from source, I didn't see any
> issue before Linux 5.11. So I misbelieved that it was something already fixed.
>
> Dan: you can try it on your PPC 64 with non-4K page as well.

yup, looks good here as well, ppc64le (Power9) system with 64KB pages

Dan

2021-03-30 13:26:54

by Christian König

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

Am 30.03.21 um 15:23 schrieb Dan Horák:
> On Tue, 30 Mar 2021 21:09:12 +0800
> Xi Ruoyao <[email protected]> wrote:
>
>> On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:
>>> On 2021-03-30 14:55 +0200, Christian König wrote:
>>>> I rather see this as a kernel bug. Can you test if this code fragment
>>>> fixes your issue:
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> index 64beb3399604..e1260b517e1b 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
>>>> *data, struct drm_file *filp)
>>>>                  }
>>>>                  dev_info->virtual_address_alignment =
>>>> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
>>>>                  dev_info->pte_fragment_size = (1 <<
>>>> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
>>>> -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
>>>> +               dev_info->gart_page_size =
>>>> dev_info->virtual_address_alignment;
>>>>                  dev_info->cu_active_number = adev->gfx.cu_info.number;
>>>>                  dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
>>>>                  dev_info->ce_ram_size = adev->gfx.ce_ram_size;
>>> It works. I've seen it at
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fxen0n%2Flinux%2Fcommit%2F84ada72983838bd7ce54bc32f5d34ac5b5aae191&data=04%7C01%7Cchristian.koenig%40amd.com%7Cf37fddf20a8847edf67808d8f37ef23c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637527074118791321%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=DZnmee38NGpiWRMX5LmlxOhxAzBMhAusnAWNnCxXTJ0%3D&reserved=0
>>> before (with a common sub-expression, though :).
>> Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs
>> without this commit. But on the system I built from source, I didn't see any
>> issue before Linux 5.11. So I misbelieved that it was something already fixed.
>>
>> Dan: you can try it on your PPC 64 with non-4K page as well.
> yup, looks good here as well, ppc64le (Power9) system with 64KB pages

Mhm, as far as I can say this patch never made it to us.

Can you please send it to the mailing list with me on CC?

Thanks,
Christian.

>
>
> Dan

2021-03-30 15:50:31

by Xi Ruoyao

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

On 2021-03-30 15:24 +0200, Christian König wrote:
> Am 30.03.21 um 15:23 schrieb Dan Horák:
> > On Tue, 30 Mar 2021 21:09:12 +0800
> > Xi Ruoyao <[email protected]> wrote:
> >
> > > On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:
> > > > On 2021-03-30 14:55 +0200, Christian König wrote:
> > > > > I rather see this as a kernel bug. Can you test if this code fragment
> > > > > fixes your issue:
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > > > index 64beb3399604..e1260b517e1b 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > > > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
> > > > > *data, struct drm_file *filp)
> > > > >                   }
> > > > >                   dev_info->virtual_address_alignment =
> > > > > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
> > > > >                   dev_info->pte_fragment_size = (1 <<
> > > > > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
> > > > > -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
> > > > > +               dev_info->gart_page_size =
> > > > > dev_info->virtual_address_alignment;
> > > > >                   dev_info->cu_active_number = adev-
> > > > > >gfx.cu_info.number;
> > > > >                   dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
> > > > >                   dev_info->ce_ram_size = adev->gfx.ce_ram_size;
> > > > It works. I've seen it at
> > > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fxen0n%2Flinux%2Fcommit%2F84ada72983838bd7ce54bc32f5d34ac5b5aae191&data=04%7C01%7Cchristian.koenig%40amd.com%7Cf37fddf20a8847edf67808d8f37ef23c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637527074118791321%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=DZnmee38NGpiWRMX5LmlxOhxAzBMhAusnAWNnCxXTJ0%3D&reserved=0
> > > > before (with a common sub-expression, though :).
> > > Some comment: on an old version of Fedora ported by Loongson, Xorg just
> > > hangs
> > > without this commit. But on the system I built from source, I didn't see
> > > any
> > > issue before Linux 5.11. So I misbelieved that it was something already
> > > fixed.
> > >
> > > Dan: you can try it on your PPC 64 with non-4K page as well.
> > yup, looks good here as well, ppc64le (Power9) system with 64KB pages
>
> Mhm, as far as I can say this patch never made it to us.

I think maybe its author considers it a "workaround" like me, as there is
already a "virtual_address_alignment" which seems correct.

> Can you please send it to the mailing list with me on CC?

I've sent it, together with my patch for using ~PAGE_MASK in parameter
validation.

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University