Hello,
My static analysis tool reports a possible deadlock in the mlx4 driver
in Linux 5.16:
mlx4_xdp_set()
  mutex_lock(&mdev->state_lock); --> Line 2778 (Lock A)
  mlx4_en_try_alloc_resources()
    mlx4_en_alloc_resources()
      mlx4_en_destroy_tx_ring()
        mlx4_qp_free()
          wait_for_completion(&qp->free); --> Line 528 (Wait X)

mlx4_en_reset_config()
  mutex_lock(&mdev->state_lock); --> Line 3522 (Lock A)
  mlx4_en_try_alloc_resources()
    mlx4_en_alloc_resources()
      mlx4_en_destroy_tx_ring()
        mlx4_qp_free()
          complete(&qp->free); --> Line 527 (Wake X)
When mlx4_xdp_set() is executed, "Wait X" is performed while holding "Lock
A". If mlx4_en_reset_config() is executed at this time, "Wake X" cannot
be performed to wake up "Wait X" in mlx4_xdp_set(), because "Lock A" is
already held by mlx4_xdp_set(), causing a possible deadlock.
I am not quite sure whether this possible problem is real and how to fix
it if it is real.
Any feedback would be appreciated, thanks :)
Best wishes,
Jia-Ju Bai
On 2/7/2022 5:16 PM, Jia-Ju Bai wrote:
> Hello,
>
> My static analysis tool reports a possible deadlock in the mlx4 driver
> in Linux 5.16:
>
Hi Jia-Ju,
Thanks for your email.
Which static analysis tool do you use? Is it a standard one?
> mlx4_xdp_set()
>   mutex_lock(&mdev->state_lock); --> Line 2778 (Lock A)
>   mlx4_en_try_alloc_resources()
>     mlx4_en_alloc_resources()
>       mlx4_en_destroy_tx_ring()
>         mlx4_qp_free()
>           wait_for_completion(&qp->free); --> Line 528 (Wait X)
The refcount_dec_and_test(&qp->refcount) in mlx4_qp_free() pairs with
refcount_set(&qp->refcount, 1) in mlx4_qp_alloc().
mlx4_qp_event() increases and then decreases the refcount around the
call to qp->event(qp, event_type), to protect the qp from being freed
while the handler runs.
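
To illustrate the pairing described above, here is a minimal userspace
sketch. It is not the driver code: the model_* names and the struct
layout are hypothetical, the completion is reduced to a counter, and
"would the wait return?" is answered by a predicate instead of actually
blocking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a QP; not the real struct mlx4_qp. */
struct model_qp {
    int refcount;            /* stands in for refcount_t refcount */
    unsigned int free_done;  /* stands in for struct completion free */
};

/* Drop one reference; returns true when the count reaches zero. */
static bool model_ref_dec_and_test(struct model_qp *qp)
{
    return --qp->refcount == 0;
}

/* Allocation: mirrors refcount_set(&qp->refcount, 1). */
static void model_qp_alloc(struct model_qp *qp)
{
    qp->refcount = 1;
    qp->free_done = 0;
}

/* Event path, before calling qp->event(): pin the QP. */
static void model_qp_event_begin(struct model_qp *qp)
{
    qp->refcount++;
}

/* Event path, after qp->event(): unpin, and signal if last reference. */
static void model_qp_event_end(struct model_qp *qp)
{
    if (model_ref_dec_and_test(qp))
        qp->free_done++;     /* complete(&qp->free) */
}

/* Free path, first step: drop the allocation reference. */
static void model_qp_free_drop_ref(struct model_qp *qp)
{
    if (model_ref_dec_and_test(qp))
        qp->free_done++;     /* complete(&qp->free) */
}

/* Free path, second step: would wait_for_completion() return now? */
static bool model_wait_would_return(const struct model_qp *qp)
{
    return qp->free_done > 0;
}
```

In this model, when no event handler is in flight, the free path
completes its own wait. When a handler is in flight, the wait is
released by the matching model_qp_event_end() on the same qp.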
>
> mlx4_en_reset_config()
>   mutex_lock(&mdev->state_lock); --> Line 3522 (Lock A)
>   mlx4_en_try_alloc_resources()
>     mlx4_en_alloc_resources()
>       mlx4_en_destroy_tx_ring()
>         mlx4_qp_free()
>           complete(&qp->free); --> Line 527 (Wake X)
>
> When mlx4_xdp_set() is executed, "Wait X" is performed while holding "Lock
> A". If mlx4_en_reset_config() is executed at this time, "Wake X" cannot
> be performed to wake up "Wait X" in mlx4_xdp_set(), because "Lock A" is
> already held by mlx4_xdp_set(), causing a possible deadlock.
>
> I am not quite sure whether this possible problem is real and how to fix
> it if it is real.
> Any feedback would be appreciated, thanks :)
>
Not possible.
These are two different qps, each maintaining its own instance of the
refcount and the completion, following the behavior I described above.
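
As a rough illustration of why the two call chains do not depend on each
other, the sketch below frees two independent qp objects. It is a
userspace model under stated assumptions: sketch_qp and sketch_qp_free()
are hypothetical names, the completion is reduced to a counter, and no
event handler is assumed to hold a reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: every QP carries its own refcount and completion. */
struct sketch_qp {
    int refcount;        /* per-QP reference count */
    unsigned int done;   /* per-QP completion state */
};

/* Mirrors the state right after allocation: one reference, not completed. */
static void sketch_qp_init(struct sketch_qp *qp)
{
    qp->refcount = 1;
    qp->done = 0;
}

/*
 * Models the two steps at the start of the free path on a single QP.
 * Returns true if the wait would return without help from any other
 * context, false if it would block.
 */
static bool sketch_qp_free(struct sketch_qp *qp)
{
    if (--qp->refcount == 0)
        qp->done++;      /* complete(&qp->free), this QP only */
    if (qp->done == 0)
        return false;    /* wait_for_completion() would block */
    qp->done--;          /* wait_for_completion() consumes the signal */
    return true;
}
```

Freeing one sketch_qp never reads or writes the state of the other, so
in this model neither free path needs the other one to run, whichever
of them holds the lock.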
>
> Best wishes,
> Jia-Ju Bai
Thanks,
Tariq
On 2022/2/9 18:21, Tariq Toukan wrote:
>
>
> On 2/7/2022 5:16 PM, Jia-Ju Bai wrote:
>> Hello,
>>
>> My static analysis tool reports a possible deadlock in the mlx4
>> driver in Linux 5.16:
>>
>
> Hi Jia-Ju,
> Thanks for your email.
>
> Which static analysis tool do you use? Is it a standard one?
Hi Tariq,
Thanks for the reply and explanation :)
I developed this tool by myself, based on LLVM.
>
>> mlx4_xdp_set()
>>   mutex_lock(&mdev->state_lock); --> Line 2778 (Lock A)
>>   mlx4_en_try_alloc_resources()
>>     mlx4_en_alloc_resources()
>>       mlx4_en_destroy_tx_ring()
>>         mlx4_qp_free()
>>           wait_for_completion(&qp->free); --> Line 528 (Wait X)
>
> The refcount_dec_and_test(&qp->refcount) in mlx4_qp_free() pairs with
> refcount_set(&qp->refcount, 1) in mlx4_qp_alloc().
> mlx4_qp_event() increases and then decreases the refcount around the
> call to qp->event(qp, event_type), to protect the qp from being freed
> while the handler runs.
>
>>
>> mlx4_en_reset_config()
>>   mutex_lock(&mdev->state_lock); --> Line 3522 (Lock A)
>>   mlx4_en_try_alloc_resources()
>>     mlx4_en_alloc_resources()
>>       mlx4_en_destroy_tx_ring()
>>         mlx4_qp_free()
>>           complete(&qp->free); --> Line 527 (Wake X)
>>
>> When mlx4_xdp_set() is executed, "Wait X" is performed while holding
>> "Lock A". If mlx4_en_reset_config() is executed at this time, "Wake
>> X" cannot be performed to wake up "Wait X" in mlx4_xdp_set(), because
>> "Lock A" is already held by mlx4_xdp_set(), causing a possible
>> deadlock.
>>
>> I am not quite sure whether this possible problem is real and how to
>> fix it if it is real.
>> Any feedback would be appreciated, thanks :)
>>
>
> Not possible.
> These are two different qps, each maintaining its own instance of the
> refcount and the completion, following the behavior I described above.
Okay, "these are two different qps" should be the reason for this false
positive, and my tool cannot identify this during static analysis...
Best wishes,
Jia-Ju Bai