Hello everyone,
I am seeing a deadlock between cgroup_threadgroup_rwsem and cpu_hotplug_lock in
the 5.4 kernel.
Due to some missing drivers I don't have this test setup for the latest upstream
kernel, but looking at the code the issue seems to be present in the latest
kernel as well. If needed I can provide stack traces and other relevant
information from the vmcore that I obtained from the 5.4 setup.
The description of the problem is as follows (I am using 5.19-rc7 as the
reference below):
__cgroup_procs_write acquires cgroup_threadgroup_rwsem via
cgroup_procs_write_start and then invokes cgroup_attach_task, which can
invoke the following call chain:
cgroup_attach_task --> cgroup_migrate --> cgroup_migrate_execute --> cpuset_attach
Here cpuset_attach tries to take cpu_hotplug_lock.
But by this time, if some other context
1. is already in the middle of CPU hotplug and has acquired cpu_hotplug_lock in
_cpu_up, but
2. has not yet reached the CPUHP_ONLINE state, and
3. one of the intermediate hotplug states (in my case CPUHP_AP_ONLINE_DYN) has
a callback that creates a thread (or invokes copy_process via some other path),
then the invoked copy_process will block on cgroup_threadgroup_rwsem in the
following call chain:
copy_process --> cgroup_can_fork --> cgroup_css_set_fork -->
cgroup_threadgroup_change_begin
I am looking for suggestions to fix this deadlock.
Alternatively, if I am missing something in the above analysis and the scenario
described cannot happen in the latest upstream kernel, please let me know, as
that would help me in backporting the relevant changes to the 5.4 kernel, where
the issue definitely exists.
Thanks,
-- Imran
Hi,
On 7/20/2022 8:57 AM, Imran Khan wrote:
> Hello everyone,
>
> I am seeing a deadlock between cgroup_threadgroup_rwsem and cpu_hotplug_lock in
> 5.4 kernel.
>
> Due to some missing drivers I don't have this test setup for latest upstream
> kernel but looking at the code the issue seems to be present in the latest
> kernel as well. If needed I can provide stack traces and other relevant info
> from the vmcore that I have got from 5.4 setup.
>
> The description of the problem is as follows (I am using 5.19-rc7 as reference
> below):
>
> __cgroup_procs_write acquires cgroup_threadgroup_rwsem via
> cgroup_procs_write_start and then invokes cgroup_attach_task. Now
> cgroup_attach_task can invoke following call chain:
>
> cgroup_attach_task --> cgroup_migrate --> cgroup_migrate_execute --> cpuset_attach
>
> Here cpuset_attach tries to take cpu_hotplug_lock.
>
> But by this time if some other context
>
> 1. is already in the middle of cpu hotplug and has acquired cpu_hotplug_lock in
> _cpu_up but
> 2. has not yet reached CPUHP_ONLINE state and
> 3. one of the intermediate hotplug states (in my case CPUHP_AP_ONLINE_DYN ) has
> a callback which involves creation of a thread (or invocation of copy_process
> via some other path) the invoked copy_process will get blocked on
> cgroup_threadgroup_rwsem in following call chain:
>
> copy_process --> cgroup_can_fork --> cgroup_css_set_fork -->
> cgroup_threadgroup_change_begin
A similar discussion is at [1]; I am not sure about the conclusion.
[1]
https://lore.kernel.org/lkml/[email protected]/
-Mukesh
>
>
> I am looking for suggestions to fix this deadlock.
>
> Or if I am missing something in the above analysis and the above mention
> scenario can't happen in latest upstream kernel, then please let me know as that
> would help me in back porting relevant changes to 5.4 kernel because the issue
> definitely exists in 5.4 kernel.
>
> Thanks,
> -- Imran
Looks like these patches are the fixes:
https://lore.kernel.org/all/[email protected]/#r
I would let Tejun confirm this.
-Mukesh
On 7/20/2022 4:36 PM, Mukesh Ojha wrote:
> Hi,
>
> On 7/20/2022 8:57 AM, Imran Khan wrote:
>> Hello everyone,
>>
>> I am seeing a deadlock between cgroup_threadgroup_rwsem and
>> cpu_hotplug_lock in
>> 5.4 kernel.
>>
>> Due to some missing drivers I don't have this test setup for latest
>> upstream
>> kernel but looking at the code the issue seems to be present in the
>> latest
>> kernel as well. If needed I can provide stack traces and other
>> relevant info
>> from the vmcore that I have got from 5.4 setup.
>>
>> The description of the problem is as follows (I am using 5.19-rc7 as
>> reference
>> below):
>>
>> __cgroup_procs_write acquires cgroup_threadgroup_rwsem via
>> cgroup_procs_write_start and then invokes cgroup_attach_task. Now
>> cgroup_attach_task can invoke following call chain:
>>
>> cgroup_attach_task --> cgroup_migrate --> cgroup_migrate_execute -->
>> cpuset_attach
>>
>> Here cpuset_attach tries to take cpu_hotplug_lock.
>>
>> But by this time if some other context
>>
>> 1. is already in the middle of cpu hotplug and has acquired
>> cpu_hotplug_lock in
>> _cpu_up but
>> 2. has not yet reached CPUHP_ONLINE state and
>> 3. one of the intermediate hotplug states (in my case
>> CPUHP_AP_ONLINE_DYN ) has
>> a callback which involves creation of a thread (or invocation of
>> copy_process
>> via some other path) the invoked copy_process will get blocked on
>> cgroup_threadgroup_rwsem in following call chain:
>>
>> copy_process --> cgroup_can_fork --> cgroup_css_set_fork -->
>> cgroup_threadgroup_change_begin
>
> Similar discussion is at [1], not sure on the conclusion.
>
> [1]
> https://lore.kernel.org/lkml/[email protected]/
>
> -Mukesh
>
>>
>>
>> I am looking for suggestions to fix this deadlock.
>>
>> Or if I am missing something in the above analysis and the above mention
>> scenario can't happen in latest upstream kernel, then please let me
>> know as that
>> would help me in back porting relevant changes to 5.4 kernel because
>> the issue
>> definitely exists in 5.4 kernel.
>>
>> Thanks,
>> -- Imran
On Wed, Jul 20, 2022 at 05:31:51PM +0530, Mukesh Ojha wrote:
> Looks like these patches are the fixes.
>
> https://lore.kernel.org/all/[email protected]/#r
>
> Would let Tejun confirm this .
Yeah, looks like the same issue. I'll write up a patch later this week /
early next unless someone beats me to it.
Thanks.
--
tejun
On Wed, Jul 20, 2022 at 08:05:03AM -1000, Tejun Heo wrote:
> On Wed, Jul 20, 2022 at 05:31:51PM +0530, Mukesh Ojha wrote:
> > Looks like these patches are the fixes.
> >
> > https://lore.kernel.org/all/[email protected]/#r
> >
> > Would let Tejun confirm this .
>
> Yeah, looks like the same issue. I'll write up a patch later this week /
> early next unless someone beats me to it.
https://lore.kernel.org/lkml/[email protected]/ is
the thread with the same issue. Let's follow up there.
Thanks.
--
tejun
Hi Tejun,
On 7/28/2022 1:03 AM, Tejun Heo wrote:
> On Wed, Jul 20, 2022 at 08:05:03AM -1000, Tejun Heo wrote:
>> On Wed, Jul 20, 2022 at 05:31:51PM +0530, Mukesh Ojha wrote:
>>> Looks like these patches are the fixes.
>>>
>>> https://lore.kernel.org/all/[email protected]/#r
>>>
>>> Would let Tejun confirm this .
>>
>> Yeah, looks like the same issue. I'll write up a patch later this week /
>> early next unless someone beats me to it.
>
> https://lore.kernel.org/lkml/[email protected]/ is
> the thread with the same issue. Let's follow up there.
Since I am not part of the above thread, I am commenting here.
Your original patch [1] together with the revert of [2] fixes the issue, and
this is also confirmed here [3].
Can we get the proper fix merged into your tree?
[1] https://lore.kernel.org/lkml/[email protected]/
[2]
https://lore.kernel.org/all/[email protected]/
[3]
https://lore.kernel.org/lkml/CAB8ipk-72V-bYRfL-VcSRSyXTeQqkBVj+1d5MHSVV5CTar9a0Q@mail.gmail.com/
-Mukesh
>
> Thanks.
>
+Cc: Zhao Gongyi <[email protected]>, Zhang Qiao <[email protected]>
On Fri, Aug 12, 2022 at 03:57:00PM +0530, Mukesh Ojha <[email protected]> wrote:
> The original patch of yours [1] and the revert of [2] is fixing the issue
> and it is also confirmed here [3].
> Can we get proper fix merge on your tree?
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> [2]
> https://lore.kernel.org/all/[email protected]/
The revert plus Tejun's patch looks fine with respect to the problem of the
reverted patch (it just moves cpus_read_lock to the upper callers).
I'd just suggest a comment that explicitly documents the lock order that we
stick to; IIUC, it should be:
cpu_hotplug_lock // cpus_read_lock
cgroup_threadgroup_rwsem
cpuset_rwsem
Michal
>
> [3] https://lore.kernel.org/lkml/CAB8ipk-72V-bYRfL-VcSRSyXTeQqkBVj+1d5MHSVV5CTar9a0Q@mail.gmail.com/
>
> -Mukesh
Hi Michal
On Mon, Aug 15, 2022 at 5:06 PM Michal Koutný <[email protected]> wrote:
>
> +Cc: Zhao Gongyi <[email protected]>, Zhang Qiao <[email protected]>
>
> On Fri, Aug 12, 2022 at 03:57:00PM +0530, Mukesh Ojha <[email protected]> wrote:
> > The original patch of yours [1] and the revert of [2] is fixing the issue
> > and it is also confirmed here [3].
> > Can we get proper fix merge on your tree?
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> >
> > [2]
> > https://lore.kernel.org/all/[email protected]/
>
> The revert + Tejun's patch looks fine wrt the problem of the reverted
> patch (just moves cpus_read_lock to upper callers).
Do you mean that the problem should be fixed by [1] plus the revert of [2]?
I have only tested with [2] reverted. Do I need to test with both [1] and [2]?
Thanks!
>
> I'd just suggest a comment that'd explicitly document also the lock
> order that we stick to, IIUC, it should be:
>
> cpu_hotplug_lock // cpus_read_lock
> cgroup_threadgroup_rwsem
> cpuset_rwsem
>
> Michal
>
> >
> > [3] https://lore.kernel.org/lkml/CAB8ipk-72V-bYRfL-VcSRSyXTeQqkBVj+1d5MHSVV5CTar9a0Q@mail.gmail.com/
> >
> > -Mukesh