2021-03-26 10:11:42

by Qu Huang

Subject: Re: [PATCH] drm/amdkfd: dqm fence memory corruption

On 2021/1/28 5:50, Felix Kuehling wrote:
> On 2021-01-27 at 7:33 a.m., Qu Huang wrote:
>> The amdgpu driver uses a 4-byte data type for the DQM fence memory
>> and passes the GPU address of that memory to the microcode through
>> the query status PM4 message. However, both the query status PM4
>> message definition and the microcode treat the fence as 8 bytes.
>> Only 4 bytes are allocated for the fence, but the microcode writes
>> 8 bytes, which causes memory corruption.
>
> Thank you for pointing out that discrepancy. That's a good catch!
>
> I'd prefer to fix this properly by making dqm->fence_addr a u64 pointer.
> We should probably also fix up the query_status and
> amdkfd_fence_wait_timeout function interfaces to use 64-bit fence
> values everywhere to be consistent.
>
> Regards,
>   Felix
Hi Felix, thanks for your advice. Please check v2 at
https://lore.kernel.org/patchwork/patch/1372584/
Thanks,
Qu.
>
>
>>
>> Signed-off-by: Qu Huang <[email protected]>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> index e686ce2..8b38d0c 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> @@ -1161,7 +1161,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
>>      pr_debug("Allocating fence memory\n");
>>
>>      /* allocate fence memory on the gart */
>> -    retval = kfd_gtt_sa_allocate(dqm->dev, sizeof(*dqm->fence_addr),
>> +    retval = kfd_gtt_sa_allocate(dqm->dev, sizeof(uint64_t),
>>                      &dqm->fence_mem);
>>
>>      if (retval)
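
The size mismatch described in the commit message can be reproduced in user space. The following is a minimal sketch, not driver code: `pool`, `firmware_write_fence` and `neighbour_corrupted` are hypothetical names, the byte array stands in for the GTT sub-allocation pool, and the helper that always stores 8 bytes stands in for the microcode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a GTT sub-allocation pool. */
static uint8_t pool[16];

/* Hypothetical stand-in for the microcode: it always stores 8 bytes. */
static void firmware_write_fence(size_t offset, uint64_t value)
{
	assert(offset + sizeof(value) <= sizeof(pool));
	memcpy(&pool[offset], &value, sizeof(value));
}

/*
 * Place a 4-byte neighbour allocation directly after a fence of
 * fence_size bytes at offset 0, let the "microcode" write the fence,
 * and return 1 if the neighbour was overwritten, 0 otherwise.
 */
static int neighbour_corrupted(size_t fence_size)
{
	const uint32_t sentinel = 0xCAFEF00Du;
	uint32_t after;

	memset(pool, 0, sizeof(pool));
	memcpy(&pool[fence_size], &sentinel, sizeof(sentinel));

	firmware_write_fence(0, 0x1122334455667788ull);

	memcpy(&after, &pool[fence_size], sizeof(after));
	return after != sentinel;
}
```

Calling `neighbour_corrupted(sizeof(uint32_t))` models the allocation before the patch and reports the neighbour as overwritten; `neighbour_corrupted(sizeof(uint64_t))` models the allocation after the patch and leaves it intact.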


2021-03-26 19:25:38

by Felix Kuehling

Subject: Re: [PATCH] drm/amdkfd: dqm fence memory corruption

On 2021-03-26 at 5:38 a.m., Qu Huang wrote:
> On 2021/1/28 5:50, Felix Kuehling wrote:
>> On 2021-01-27 at 7:33 a.m., Qu Huang wrote:
>>> The amdgpu driver uses a 4-byte data type for the DQM fence memory
>>> and passes the GPU address of that memory to the microcode through
>>> the query status PM4 message. However, both the query status PM4
>>> message definition and the microcode treat the fence as 8 bytes.
>>> Only 4 bytes are allocated for the fence, but the microcode writes
>>> 8 bytes, which causes memory corruption.
>>
>> Thank you for pointing out that discrepancy. That's a good catch!
>>
>> I'd prefer to fix this properly by making dqm->fence_addr a u64 pointer.
>> We should probably also fix up the query_status and
>> amdkfd_fence_wait_timeout function interfaces to use 64-bit fence
>> values everywhere to be consistent.
>>
>> Regards,
>>    Felix
> Hi Felix, thanks for your advice. Please check v2 at
> https://lore.kernel.org/patchwork/patch/1372584/

Thank you for the reminder. I somehow missed your v2 patch on the
mailing list. I have reviewed and applied it to amd-staging-drm-next now.

Regards,
  Felix


> Thanks,
> Qu.
>>
>>
>>>
>>> Signed-off-by: Qu Huang <[email protected]>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> index e686ce2..8b38d0c 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> @@ -1161,7 +1161,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
>>>      pr_debug("Allocating fence memory\n");
>>>
>>>      /* allocate fence memory on the gart */
>>> -    retval = kfd_gtt_sa_allocate(dqm->dev, sizeof(*dqm->fence_addr),
>>> +    retval = kfd_gtt_sa_allocate(dqm->dev, sizeof(uint64_t),
>>>                      &dqm->fence_mem);
>>>
>>>      if (retval)
>
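
The suggestion in the thread to use 64-bit fence values everywhere can be illustrated with a small polling helper. This is only a sketch under assumptions: `fence_wait_u64` and its `max_polls` parameter are hypothetical, and it is not the kernel's amdkfd_fence_wait_timeout, whose actual signature is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: poll a 64-bit fence location until it holds the
 * expected value. Returns 0 on success and -1 if max_polls reads pass
 * without a match. The full 64 bits are compared, so a value that only
 * matches in its lower 32 bits does not count as signalled.
 */
static int fence_wait_u64(const volatile uint64_t *fence_addr,
			  uint64_t expected, unsigned long max_polls)
{
	assert(fence_addr != NULL);

	while (max_polls--) {
		if (*fence_addr == expected)
			return 0;
	}
	return -1;
}
```

Declaring the fence location and the expected value with the same 64-bit type keeps the size of the allocation, the size of the write, and the width of the comparison consistent with each other.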