2022-06-27 07:05:43

by Zhang Qiao

Subject: [Question] The system may get stuck if a cpu cgroup's cpu.cfs_quota_us is very low

Hi all,

I'm working on debugging a problem.
The testcase does the following operations:
1) create a test cgroup and set cpu.cfs_quota_us=2000, cpu.cfs_period_us=100000 (a C sketch of this setup follows the list).
2) run 20 test_fork[1] test processes in the test cgroup.
3) create 100 new containers:
for i in {1..100}; do docker run -itd --health-cmd="ls" --health-interval=1s ubuntu:latest bash; done
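
For reference, here is a minimal C sketch of the cgroup setup in step 1. It assumes a cgroup-v1 cpu controller mounted at /sys/fs/cgroup/cpu, the group name "test" is illustrative, and the program must run as root; none of these paths are taken from the testcase itself:

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF) {
		perror(path);
		exit(1);
	}
	fclose(f);
}

int main(void)
{
	/* 2000us of quota per 100000us period, i.e. 2% of one CPU. */
	mkdir("/sys/fs/cgroup/cpu/test", 0755);
	write_str("/sys/fs/cgroup/cpu/test/cpu.cfs_quota_us", "2000");
	write_str("/sys/fs/cgroup/cpu/test/cpu.cfs_period_us", "100000");
	/* For step 2, processes join the group by having their PIDs
	 * written to /sys/fs/cgroup/cpu/test/tasks. */
	return 0;
}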

All of these operations are expected to succeed, with all 100 containers created. However,
while the containers are being created, the system gets stuck and container creation fails.

After debugging this, I found that the test_fork processes frequently sleep in freezer_fork()->mutex_lock()->might_sleep()
while holding the cgroup_threadgroup_rw_sem read lock, as follows:

copy_process():
  cgroup_can_fork()                 ---> takes cgroup_threadgroup_rw_sem for read
  sched_cgroup_fork()
    ->task_fork_fair()
      ->update_curr()
        ->__account_cfs_rq_runtime()
          ->resched_curr()          ---> the quota is used up; TIF_NEED_RESCHED is set on current
  cgroup_post_fork()
    ->freezer_fork()
      ->mutex_lock()
        ->might_sleep()             ---> schedule(); the current task may be throttled for a long time

    ->cgroup_css_set_put_fork()     ---> releases cgroup_threadgroup_rw_sem


Because the test cgroup's cpu.cfs_quota_us is very small and test_fork's load is very heavy,
test_fork may be throttled for a long time. As a result, the cgroup_threadgroup_rw_sem read lock
is held for a long time and other processes get stuck waiting for it:

1) a task forking a child will wait at copy_process()->cgroup_can_fork();

2) an exiting task will wait at exit_signals();

3) a task writing to a cgroup.procs file will wait at cgroup_file_write()->__cgroup1_procs_write();
...

Eventually the whole system may get stuck.

Does anyone know how to solve this, other than changing cpu.cfs_quota_us?


[1] test_fork.c

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
	pid_t pid;
	int count = 20;

	while (1) {
		/* Fork a batch of short-lived children; each child exits
		 * immediately, so this mostly exercises fork()/exit(). */
		for (int i = 0; i < count; i++) {
			if ((pid = fork()) < 0) {
				printf("fork error");
				return 1;
			} else if (pid == 0) {
				exit(0);
			}
		}

		/* Reap the whole batch, then pause before the next round. */
		for (int i = 0; i < count; i++)
			wait(NULL);
		sleep(1);
	}
	return 0;	/* never reached */
}
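
(For reference: the reproducer can be built with something like gcc -o test_fork test_fork.c; per step 2 above, 20 instances are then started inside the test cgroup.)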

Thanks a lot.
-Qiao


2022-06-27 09:16:27

by Tejun Heo

Subject: Re: [Question] The system may get stuck if a cpu cgroup's cpu.cfs_quota_us is very low

Hello,

On Mon, Jun 27, 2022 at 02:50:25PM +0800, Zhang Qiao wrote:
> Because the test cgroup's cpu.cfs_quota_us is very small and
> test_fork's load is very heavy, test_fork may be throttled for a long
> time. As a result, the cgroup_threadgroup_rw_sem read lock is held
> for a long time and other processes get stuck waiting for it:

Yeah, this is a known problem and it can happen with other locks too. The
solution probably is to throttle only while in, or when about to return
to, userspace. There is one really important and widespread assumption
in the kernel:

If things get blocked on some shared resource, whatever is holding
the resource ends up using more of the system to exit the critical
section faster and thus unblocks others ASAP. IOW, things running in
the kernel are work-conserving.
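
To make the "throttle only when in or returning to userspace" idea concrete, here is a toy user-space model. It is only an illustration of the scheme, not scheduler code; every name in it (account(), return_to_user(), need_throttle) is invented. Overrunning the budget inside a critical section merely sets a flag, and the actual sleep happens only once the lock would have been dropped:

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static long quota_us = 2000;    /* cpu.cfs_quota_us  */
static long period_us = 100000; /* cpu.cfs_period_us */
static long used_us;
static bool need_throttle;      /* plays the role of a TIF_ flag */

/* Charge runtime; on overrun, only mark the task for throttling. */
static void account(long delta_us)
{
	used_us += delta_us;
	if (used_us >= quota_us)
		need_throttle = true;   /* do NOT sleep here */
}

/* Safe point: no shared locks are held any more, so sleeping here
 * cannot block waiters on e.g. cgroup_threadgroup_rw_sem. */
static void return_to_user(void)
{
	if (need_throttle) {
		usleep(period_us);      /* simplified: wait a full period
					 * for the quota to refresh */
		used_us = 0;
		need_throttle = false;
	}
}

int main(void)
{
	for (int i = 0; i < 3; i++) {
		/* simulated syscall body holding a shared lock */
		account(1500);
		account(1500);          /* overruns the 2000us quota */
		/* lock released; now at the kernel->user boundary */
		return_to_user();
		printf("period %d: throttled only at the boundary\n", i);
	}
	return 0;
}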

The cpu bw controller gives userspace a rather easy way to break
this assumption and is thus rather fundamentally broken. This is
basically the same problem we had with the old cgroup freezer
implementation, which trapped threads at random locations in the
kernel.

So, right now, it's rather broken and can easily be used as a DoS
attack vector.

Thanks.

--
tejun

2022-07-01 07:37:25

by Zhang Qiao

Subject: Re: [Question] The system may get stuck if a cpu cgroup's cpu.cfs_quota_us is very low


Hi Tejun,

Thanks for your reply.

On 2022/6/27 16:32, Tejun Heo wrote:
> Hello,
>
> On Mon, Jun 27, 2022 at 02:50:25PM +0800, Zhang Qiao wrote:
>> Because the test cgroup's cpu.cfs_quota_us is very small and
>> test_fork's load is very heavy, test_fork may be throttled for a long
>> time. As a result, the cgroup_threadgroup_rw_sem read lock is held
>> for a long time and other processes get stuck waiting for it:
>
> Yeah, this is a known problem and it can happen with other locks too. The
> solution probably is to throttle only while in, or when about to return
> to, userspace. There is one really important and widespread assumption
> in the kernel:
>
> If things get blocked on some shared resource, whatever is holding
> the resource ends up using more of the system to exit the critical
> section faster and thus unblocks others ASAP. IOW, things running in
> the kernel are work-conserving.
>
> The cpu bw controller gives userspace a rather easy way to break
> this assumption and is thus rather fundamentally broken. This is
> basically the same problem we had with the old cgroup freezer
> implementation, which trapped threads at random locations in the
> kernel.
>

So, if we want to solve this problem completely, is the best way to
change the CFS bandwidth controller's throttling mechanism, for
example, to throttle tasks at a safe location?

Thanks.
Qiao

> So, right now, it's rather broken and can easily be used as a DoS
> attack vector.
>
> Thanks.
>

2022-07-01 20:25:45

by Benjamin Segall

Subject: Re: [Question] The system may get stuck if a cpu cgroup's cpu.cfs_quota_us is very low

Zhang Qiao <[email protected]> writes:

> Hi Tejun,
>
> Thanks for your reply.
>
> On 2022/6/27 16:32, Tejun Heo wrote:
>> Hello,
>>
>> On Mon, Jun 27, 2022 at 02:50:25PM +0800, Zhang Qiao wrote:
>>> Because the test cgroup's cpu.cfs_quota_us is very small and
>>> test_fork's load is very heavy, test_fork may be throttled for a long
>>> time. As a result, the cgroup_threadgroup_rw_sem read lock is held
>>> for a long time and other processes get stuck waiting for it:
>>
>> Yeah, this is a known problem and it can happen with other locks too. The
>> solution probably is to throttle only while in, or when about to return
>> to, userspace. There is one really important and widespread assumption
>> in the kernel:
>>
>> If things get blocked on some shared resource, whatever is holding
>> the resource ends up using more of the system to exit the critical
>> section faster and thus unblocks others ASAP. IOW, things running in
>> the kernel are work-conserving.
>>
>> The cpu bw controller gives userspace a rather easy way to break
>> this assumption and is thus rather fundamentally broken. This is
>> basically the same problem we had with the old cgroup freezer
>> implementation, which trapped threads at random locations in the
>> kernel.
>>
>
> So, if we want to solve this problem completely, is the best way to
> change the CFS bandwidth controller's throttling mechanism, for
> example, to throttle tasks at a safe location?

Yes, fixing (kernel) priority inversion due to CFS_BANDWIDTH requires a
serious reworking of how it works, because it would need to dequeue
tasks individually rather than doing the entire cfs_rq at a time (and
would require some effort to avoid pinging every throttling task to get
it into the kernel).
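
As a rough illustration of that difference (all struct and function names below are invented for this sketch, not the kernel's): today the entire cfs_rq is dequeued at once, freezing a lock holder wherever it happens to be, while the reworked scheme would dequeue tasks one by one, each at a safe point:

#include <stdbool.h>
#include <stdio.h>

struct toy_task {
	const char *name;
	bool in_kernel;   /* inside a kernel critical section right now? */
	bool queued;      /* still runnable? */
};

/* Current scheme: the whole group runqueue is throttled at once,
 * so a lock holder is frozen wherever it happens to be. */
static void throttle_whole_rq(struct toy_task *t, int n)
{
	for (int i = 0; i < n; i++)
		t[i].queued = false;
}

/* Reworked scheme: dequeue tasks individually, and only once they
 * are at a safe point outside any kernel critical section. */
static void throttle_per_task(struct toy_task *t, int n)
{
	for (int i = 0; i < n; i++)
		if (!t[i].in_kernel)
			t[i].queued = false;
}

static void show(const char *scheme, struct toy_task *t, int n)
{
	for (int i = 0; i < n; i++)
		printf("%s: %s is %s\n", scheme, t[i].name,
		       t[i].queued ? "still runnable" : "throttled");
}

int main(void)
{
	struct toy_task a[] = { { "rwsem-holder", true, true },
				{ "userspace-looper", false, true } };
	struct toy_task b[] = { { "rwsem-holder", true, true },
				{ "userspace-looper", false, true } };

	throttle_whole_rq(a, 2);   /* freezes the lock holder too */
	show("whole-rq", a, 2);

	throttle_per_task(b, 2);   /* lock holder keeps running   */
	show("per-task", b, 2);
	return 0;
}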

2022-07-01 20:48:27

by Tejun Heo

Subject: Re: [Question] The system may get stuck if a cpu cgroup's cpu.cfs_quota_us is very low

On Fri, Jul 01, 2022 at 01:08:21PM -0700, Benjamin Segall wrote:
> Yes, fixing (kernel) priority inversion due to CFS_BANDWIDTH requires a
> serious reworking of how it works, because it would need to dequeue
> tasks individually rather than doing the entire cfs_rq at a time (and
> would require some effort to avoid pinging every throttling task to get
> it into the kernel).

Right, I don't have a good idea on evolving the current implementation
into something correct. As you pointed out, we need to account along
the sched_group tree but conditionally enforce on each thread.

Thanks.

--
tejun

2022-07-07 07:02:30

by Zhang Qiao

Subject: Re: [Question] The system may get stuck if a cpu cgroup's cpu.cfs_quota_us is very low



On 2022/7/2 4:15, Tejun Heo wrote:
> On Fri, Jul 01, 2022 at 01:08:21PM -0700, Benjamin Segall wrote:
>> Yes, fixing (kernel) priority inversion due to CFS_BANDWIDTH requires a
>> serious reworking of how it works, because it would need to dequeue
>> tasks individually rather than doing the entire cfs_rq at a time (and
>> would require some effort to avoid pinging every throttling task to get
>> it into the kernel).
>
> Right, I don't have a good idea on evolving the current implementation
> into something correct. As you pointed out, we need to account along
> the sched_group tree but conditionally enforce on each thread.
>
> Thanks.
>

Understood. Thanks for your detailed explanation.

Thanks.
--
Qiao