In the amdgpu_amdkfd_get_xgmi_bandwidth_mbytes function,
the peer_adev pointer can be NULL and is passed to amdgpu_xgmi_get_num_links.
In amdgpu_xgmi_get_num_links, peer_adev pointer is dereferenced
without any checks: peer_adev->gmc.xgmi.node_id .
Signed-off-by: Grigory Vasilyev <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index be1a55f8b8c5..1a1006b18016 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -541,11 +541,10 @@ int amdgpu_amdkfd_get_xgmi_bandwidth_mbytes(struct amdgpu_device *dst,
struct amdgpu_device *adev = dst, *peer_adev;
int num_links;
- if (adev->asic_type != CHIP_ALDEBARAN)
+ if (!src || adev->asic_type != CHIP_ALDEBARAN)
return 0;
- if (src)
- peer_adev = src;
+ peer_adev = src;
/* num links returns 0 for indirect peers since indirect route is unknown. */
num_links = is_min ? 1 : amdgpu_xgmi_get_num_links(adev, peer_adev);
--
2.35.1
Am 2022-04-04 um 18:21 schrieb Grigory Vasilyev:
> In the amdgpu_amdkfd_get_xgmi_bandwidth_mbytes function,
> the peer_adev pointer can be NULL and is passed to amdgpu_xgmi_get_num_links.
>
> In amdgpu_xgmi_get_num_links, peer_adev pointer is dereferenced
> without any checks: peer_adev->gmc.xgmi.node_id .
What's worse, peer_adev is uninitialized with an undefined value if src
is NULL. So that code was definitely bogus.
However, I think your patch will result in incorrect results. Currently
amdgpu_amdkfd_get_xgmi_bandwidth is always called with is_min=true if
src==NULL, and with is_min=false if src!=NULL. The intention is, that we
assume a single XGMI link in the case that src==NULL. That means the
is_min parameter is redundant. What we should do instead is, assume that
num_links==1 if src==NULL, and drop the is_min parameter.
That would keep things working the way they do now, and prevent
potential problems in the future.
Regards,
Felix
>
> Signed-off-by: Grigory Vasilyev <[email protected]>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index be1a55f8b8c5..1a1006b18016 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -541,11 +541,10 @@ int amdgpu_amdkfd_get_xgmi_bandwidth_mbytes(struct amdgpu_device *dst,
> struct amdgpu_device *adev = dst, *peer_adev;
> int num_links;
>
> - if (adev->asic_type != CHIP_ALDEBARAN)
> + if (!src || adev->asic_type != CHIP_ALDEBARAN)
> return 0;
>
> - if (src)
> - peer_adev = src;
> + peer_adev = src;
>
> /* num links returns 0 for indirect peers since indirect route is unknown. */
> num_links = is_min ? 1 : amdgpu_xgmi_get_num_links(adev, peer_adev);
Thanks. You are right. I found a potential bug, and as I understand
it, the code only applies to the Aldebaran GPU and I can not check the
correctness of the code. I only test code on my navi 10 and run GPU
stress tests.
My knowledge of amdgpu is limited, and fixing potential bugs allows me
to learn more about amdgpu code. But there are many that I still don't
understand. In any case, we need to fix the code to eliminate
problems in the future.
Regards, Grigory.
вт, 5 апр. 2022 г. в 20:00, Felix Kuehling <[email protected]>:
>
> Am 2022-04-04 um 18:21 schrieb Grigory Vasilyev:
> > In the amdgpu_amdkfd_get_xgmi_bandwidth_mbytes function,
> > the peer_adev pointer can be NULL and is passed to amdgpu_xgmi_get_num_links.
> >
> > In amdgpu_xgmi_get_num_links, peer_adev pointer is dereferenced
> > without any checks: peer_adev->gmc.xgmi.node_id .
>
> What's worse, peer_adev is uninitialized with an undefined value if src
> is NULL. So that code was definitely bogus.
>
> However, I think your patch will result in incorrect results. Currently
> amdgpu_amdkfd_get_xgmi_bandwidth is always called with is_min=true if
> src==NULL, and with is_min=false if src!=NULL. The intention is, that we
> assume a single XGMI link in the case that src==NULL. That means the
> is_min parameter is redundant. What we should do instead is, assume that
> num_links==1 if src==NULL, and drop the is_min parameter.
>
> That would keep things working the way they do now, and prevent
> potential problems in the future.
>
> Regards,
> Felix
>
>
> >
> > Signed-off-by: Grigory Vasilyev <[email protected]>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 ++---
> > 1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > index be1a55f8b8c5..1a1006b18016 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> > @@ -541,11 +541,10 @@ int amdgpu_amdkfd_get_xgmi_bandwidth_mbytes(struct amdgpu_device *dst,
> > struct amdgpu_device *adev = dst, *peer_adev;
> > int num_links;
> >
> > - if (adev->asic_type != CHIP_ALDEBARAN)
> > + if (!src || adev->asic_type != CHIP_ALDEBARAN)
> > return 0;
> >
> > - if (src)
> > - peer_adev = src;
> > + peer_adev = src;
> >
> > /* num links returns 0 for indirect peers since indirect route is unknown. */
> > num_links = is_min ? 1 : amdgpu_xgmi_get_num_links(adev, peer_adev);