2022-02-02 07:22:35

by Jia-Ju Bai

[permalink] [raw]
Subject: [BUG] gpu: drm: radeon: two possible deadlocks involving locking and waiting

Hello,

My static analysis tool reports a possible deadlock in the radeon driver
in Linux 5.16:

#BUG 1
radeon_dpm_change_power_state_locked()
  mutex_lock(&rdev->ring_lock); --> Line 1133 (Lock A)
  radeon_fence_wait_empty()
    radeon_fence_wait_seq_timeout()
      wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)

radeon_ring_backup()
  mutex_lock(&rdev->ring_lock); --> Line 289(Lock A)
  radeon_fence_count_emitted()
    radeon_fence_process()
      wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)

When radeon_dpm_change_power_state_locked() is executed, "Wait X" is
performed by holding "Lock A". If radeon_ring_backup() is executed at
this time, "Wake X" cannot be performed to wake up "Wait X" in
radeon_dpm_change_power_state_locked(), because "Lock A" has been
already hold by radeon_dpm_change_power_state_locked(), causing a
possible deadlock.
I find that "Wait X" is performed with a timeout MAX_SCHEDULE_TIMEOUT,
to relieve the possible deadlock; but I think this timeout can cause
inefficient execution.

#BUG 2
radeon_ring_lock()
  mutex_lock(&rdev->ring_lock); --> Line 147 (Lock A)
  radeon_ring_alloc()
    radeon_fence_wait_next()
      radeon_fence_wait_seq_timeout()
        wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)

radeon_ring_backup()
  mutex_lock(&rdev->ring_lock); --> Line 289(Lock A)
  radeon_fence_count_emitted()
    radeon_fence_process()
      wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)

When radeon_ring_lock() is executed, "Wait X" is performed by holding
"Lock A". If radeon_ring_backup() is executed at this time, "Wake X"
cannot be performed to wake up "Wait X" in radeon_ring_lock(), because
"Lock A" has been already hold by radeon_ring_lock(), causing a possible
deadlock.
I find that "Wait X" is performed with a timeout MAX_SCHEDULE_TIMEOUT,
to relieve the possible deadlock; but I think this timeout can cause
inefficient execution.

I am not quite sure whether these possible problems are real and how to
fix them if they are real.
Any feedback would be appreciated, thanks :)


Best wishes,
Jia-Ju Bai


2022-02-03 07:25:18

by Christian König

[permalink] [raw]
Subject: Re: [BUG] gpu: drm: radeon: two possible deadlocks involving locking and waiting

Hi Jia-Ju,

interesting that you have found those issues with an automated tool.

And yes that is a well design flaw within the radeon driver which can
happen on hardware faults, e.g. when radeon_ring_backup() needs to be
called.

But that happens so rarely and the driver is not developed further that
we decided to not address this any more.

Regards,
Christian.

Am 01.02.22 um 08:40 schrieb Jia-Ju Bai:
> Hello,
>
> My static analysis tool reports a possible deadlock in the radeon
> driver in Linux 5.16:
>
> #BUG 1
> radeon_dpm_change_power_state_locked()
>   mutex_lock(&rdev->ring_lock); --> Line 1133 (Lock A)
>   radeon_fence_wait_empty()
>     radeon_fence_wait_seq_timeout()
>       wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)
>
> radeon_ring_backup()
>   mutex_lock(&rdev->ring_lock); --> Line 289(Lock A)
>   radeon_fence_count_emitted()
>     radeon_fence_process()
>       wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)
>
> When radeon_dpm_change_power_state_locked() is executed, "Wait X" is
> performed by holding "Lock A". If radeon_ring_backup() is executed at
> this time, "Wake X" cannot be performed to wake up "Wait X" in
> radeon_dpm_change_power_state_locked(), because "Lock A" has been
> already hold by radeon_dpm_change_power_state_locked(), causing a
> possible deadlock.
> I find that "Wait X" is performed with a timeout MAX_SCHEDULE_TIMEOUT,
> to relieve the possible deadlock; but I think this timeout can cause
> inefficient execution.
>
> #BUG 2
> radeon_ring_lock()
>   mutex_lock(&rdev->ring_lock); --> Line 147 (Lock A)
>   radeon_ring_alloc()
>     radeon_fence_wait_next()
>       radeon_fence_wait_seq_timeout()
>         wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)
>
> radeon_ring_backup()
>   mutex_lock(&rdev->ring_lock); --> Line 289(Lock A)
>   radeon_fence_count_emitted()
>     radeon_fence_process()
>       wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)
>
> When radeon_ring_lock() is executed, "Wait X" is performed by holding
> "Lock A". If radeon_ring_backup() is executed at this time, "Wake X"
> cannot be performed to wake up "Wait X" in radeon_ring_lock(), because
> "Lock A" has been already hold by radeon_ring_lock(), causing a
> possible deadlock.
> I find that "Wait X" is performed with a timeout MAX_SCHEDULE_TIMEOUT,
> to relieve the possible deadlock; but I think this timeout can cause
> inefficient execution.
>
> I am not quite sure whether these possible problems are real and how
> to fix them if they are real.
> Any feedback would be appreciated, thanks :)
>
>
> Best wishes,
> Jia-Ju Bai
>

2022-02-05 05:11:00

by Jia-Ju Bai

[permalink] [raw]
Subject: Re: [BUG] gpu: drm: radeon: two possible deadlocks involving locking and waiting

Hi Christian,

Thanks for the reply :)

On 2022/2/1 15:56, Christian König wrote:
> Hi Jia-Ju,
>
> interesting that you have found those issues with an automated tool.
>
> And yes that is a well design flaw within the radeon driver which can
> happen on hardware faults, e.g. when radeon_ring_backup() needs to be
> called.

In fact, my tool finds dozens of similar possible deadlocks caused by
wait_event_timeout() in radeon_fence_wait_seq_timeout().
There are three other examples in Linux 5.16:

#BUG 1
radeon_dpm_change_power_state_locked()
  mutex_lock(&rdev->ring_lock); --> Line 1133 (Lock A)
  radeon_fence_wait_empty()
    radeon_fence_wait_seq_timeout()
      wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)

radeon_fence_driver_fini()
  mutex_lock(&rdev->ring_lock); --> Line 917 (Lock A)
  wake_up_all(&rdev->fence_queue); --> Line 927 (Wake X)

#BUG 2
radeon_set_pm_profile()
  mutex_lock(&rdev->pm.mutex); --> Line 382 (Lock A)
  radeon_pm_set_clocks()
    radeon_fence_wait_empty()
      radeon_fence_wait_seq_timeout()
        wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)

radeon_dynpm_idle_work_handler()
  mutex_lock(&rdev->pm.mutex); --> Line 1861 (Lock A)
  radeon_fence_count_emitted()
    radeon_fence_process()
      wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)

#BUG 3
radeon_pm_fini_old()
  mutex_lock(&rdev->pm.mutex); --> Line 1642 (Lock A)
  radeon_pm_set_clocks()
    radeon_fence_wait_empty()
      radeon_fence_wait_seq_timeout()
        wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)

radeon_dynpm_idle_work_handler()
  mutex_lock(&rdev->pm.mutex); --> Line 1861 (Lock A)
  radeon_fence_count_emitted()
    radeon_fence_process()
      wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)

Thus, to fix these possible deadlocks, we could moditify the code
related to radeon_fence_wait_seq_timeout().
But I am not quite familar with the radeon driver, so I am not sure how
to moditify the code properly.

>
> But that happens so rarely and the driver is not developed further
> that we decided to not address this any more.

Ah, okay.

>
> Regards,
> Christian.
>
> Am 01.02.22 um 08:40 schrieb Jia-Ju Bai:
>> Hello,
>>
>> My static analysis tool reports a possible deadlock in the radeon
>> driver in Linux 5.16:
>>
>> #BUG 1
>> radeon_dpm_change_power_state_locked()
>>   mutex_lock(&rdev->ring_lock); --> Line 1133 (Lock A)
>>   radeon_fence_wait_empty()
>>     radeon_fence_wait_seq_timeout()
>>       wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)
>>
>> radeon_ring_backup()
>>   mutex_lock(&rdev->ring_lock); --> Line 289(Lock A)
>>   radeon_fence_count_emitted()
>>     radeon_fence_process()
>>       wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)
>>
>> When radeon_dpm_change_power_state_locked() is executed, "Wait X" is
>> performed by holding "Lock A". If radeon_ring_backup() is executed at
>> this time, "Wake X" cannot be performed to wake up "Wait X" in
>> radeon_dpm_change_power_state_locked(), because "Lock A" has been
>> already hold by radeon_dpm_change_power_state_locked(), causing a
>> possible deadlock.
>> I find that "Wait X" is performed with a timeout
>> MAX_SCHEDULE_TIMEOUT, to relieve the possible deadlock; but I think
>> this timeout can cause inefficient execution.
>>
>> #BUG 2
>> radeon_ring_lock()
>>   mutex_lock(&rdev->ring_lock); --> Line 147 (Lock A)
>>   radeon_ring_alloc()
>>     radeon_fence_wait_next()
>>       radeon_fence_wait_seq_timeout()
>>         wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)
>>
>> radeon_ring_backup()
>>   mutex_lock(&rdev->ring_lock); --> Line 289(Lock A)
>>   radeon_fence_count_emitted()
>>     radeon_fence_process()
>>       wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)
>>
>> When radeon_ring_lock() is executed, "Wait X" is performed by holding
>> "Lock A". If radeon_ring_backup() is executed at this time, "Wake X"
>> cannot be performed to wake up "Wait X" in radeon_ring_lock(),
>> because "Lock A" has been already hold by radeon_ring_lock(), causing
>> a possible deadlock.
>> I find that "Wait X" is performed with a timeout
>> MAX_SCHEDULE_TIMEOUT, to relieve the possible deadlock; but I think
>> this timeout can cause inefficient execution.
>>
>> I am not quite sure whether these possible problems are real and how
>> to fix them if they are real.
>> Any feedback would be appreciated, thanks :)
>>
>>
>> Best wishes,
>> Jia-Ju Bai
>>
>

2022-02-06 18:00:13

by Christian König

[permalink] [raw]
Subject: Re: [BUG] gpu: drm: radeon: two possible deadlocks involving locking and waiting



Am 05.02.22 um 02:55 schrieb Jia-Ju Bai:
> Hi Christian,
>
> Thanks for the reply :)
>
> On 2022/2/1 15:56, Christian König wrote:
>> Hi Jia-Ju,
>>
>> interesting that you have found those issues with an automated tool.
>>
>> And yes that is a well design flaw within the radeon driver which can
>> happen on hardware faults, e.g. when radeon_ring_backup() needs to be
>> called.
>
> In fact, my tool finds dozens of similar possible deadlocks caused by
> wait_event_timeout() in radeon_fence_wait_seq_timeout().

Those are false positives.

The call to radeon_fence_process() from radeon_fence_count_emitted() for
example is just to speed things up, it's not mandatory for correct
operation and so it doesn't matter if it isn't called because of the
thread is blocked on the pm.mutex.

But I also don't see how your tool should be able to figure that out
automated.

Regards,
Christian.

> There are three other examples in Linux 5.16:
>
> #BUG 1
> radeon_dpm_change_power_state_locked()
>   mutex_lock(&rdev->ring_lock); --> Line 1133 (Lock A)
>   radeon_fence_wait_empty()
>     radeon_fence_wait_seq_timeout()
>       wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)
>
> radeon_fence_driver_fini()
>   mutex_lock(&rdev->ring_lock); --> Line 917 (Lock A)
>   wake_up_all(&rdev->fence_queue); --> Line 927 (Wake X)
>
> #BUG 2
> radeon_set_pm_profile()
>   mutex_lock(&rdev->pm.mutex); --> Line 382 (Lock A)
>   radeon_pm_set_clocks()
>     radeon_fence_wait_empty()
>       radeon_fence_wait_seq_timeout()
>         wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)
>
> radeon_dynpm_idle_work_handler()
>   mutex_lock(&rdev->pm.mutex); --> Line 1861 (Lock A)
>   radeon_fence_count_emitted()
>     radeon_fence_process()
>       wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)
>
> #BUG 3
> radeon_pm_fini_old()
>   mutex_lock(&rdev->pm.mutex); --> Line 1642 (Lock A)
>   radeon_pm_set_clocks()
>     radeon_fence_wait_empty()
>       radeon_fence_wait_seq_timeout()
>         wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)
>
> radeon_dynpm_idle_work_handler()
>   mutex_lock(&rdev->pm.mutex); --> Line 1861 (Lock A)
>   radeon_fence_count_emitted()
>     radeon_fence_process()
>       wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)
>
> Thus, to fix these possible deadlocks, we could moditify the code
> related to radeon_fence_wait_seq_timeout().
> But I am not quite familar with the radeon driver, so I am not sure
> how to moditify the code properly.
>
>>
>> But that happens so rarely and the driver is not developed further
>> that we decided to not address this any more.
>
> Ah, okay.
>
>>
>> Regards,
>> Christian.
>>
>> Am 01.02.22 um 08:40 schrieb Jia-Ju Bai:
>>> Hello,
>>>
>>> My static analysis tool reports a possible deadlock in the radeon
>>> driver in Linux 5.16:
>>>
>>> #BUG 1
>>> radeon_dpm_change_power_state_locked()
>>>   mutex_lock(&rdev->ring_lock); --> Line 1133 (Lock A)
>>>   radeon_fence_wait_empty()
>>>     radeon_fence_wait_seq_timeout()
>>>       wait_event_timeout(rdev->fence_queue, ...) --> Line 504 (Wait X)
>>>
>>> radeon_ring_backup()
>>>   mutex_lock(&rdev->ring_lock); --> Line 289(Lock A)
>>>   radeon_fence_count_emitted()
>>>     radeon_fence_process()
>>>       wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)
>>>
>>> When radeon_dpm_change_power_state_locked() is executed, "Wait X" is
>>> performed by holding "Lock A". If radeon_ring_backup() is executed
>>> at this time, "Wake X" cannot be performed to wake up "Wait X" in
>>> radeon_dpm_change_power_state_locked(), because "Lock A" has been
>>> already hold by radeon_dpm_change_power_state_locked(), causing a
>>> possible deadlock.
>>> I find that "Wait X" is performed with a timeout
>>> MAX_SCHEDULE_TIMEOUT, to relieve the possible deadlock; but I think
>>> this timeout can cause inefficient execution.
>>>
>>> #BUG 2
>>> radeon_ring_lock()
>>>   mutex_lock(&rdev->ring_lock); --> Line 147 (Lock A)
>>>   radeon_ring_alloc()
>>>     radeon_fence_wait_next()
>>>       radeon_fence_wait_seq_timeout()
>>>         wait_event_timeout(rdev->fence_queue, ...) --> Line 504
>>> (Wait X)
>>>
>>> radeon_ring_backup()
>>>   mutex_lock(&rdev->ring_lock); --> Line 289(Lock A)
>>>   radeon_fence_count_emitted()
>>>     radeon_fence_process()
>>>       wake_up_all(&rdev->fence_queue); --> Line 323 (Wake X)
>>>
>>> When radeon_ring_lock() is executed, "Wait X" is performed by
>>> holding "Lock A". If radeon_ring_backup() is executed at this time,
>>> "Wake X" cannot be performed to wake up "Wait X" in
>>> radeon_ring_lock(), because "Lock A" has been already hold by
>>> radeon_ring_lock(), causing a possible deadlock.
>>> I find that "Wait X" is performed with a timeout
>>> MAX_SCHEDULE_TIMEOUT, to relieve the possible deadlock; but I think
>>> this timeout can cause inefficient execution.
>>>
>>> I am not quite sure whether these possible problems are real and how
>>> to fix them if they are real.
>>> Any feedback would be appreciated, thanks :)
>>>
>>>
>>> Best wishes,
>>> Jia-Ju Bai
>>>
>>
>