2022-01-13 01:09:53

by Tejun Heo

Subject: Re: [RFC 15/16] sched/fair: Account kthread runtime debt for CFS bandwidth

Hello,

On Tue, Jan 11, 2022 at 11:29:50AM -0500, Daniel Jordan wrote:
...
> This problem arises with multithreaded jobs, but is also an issue in other
> places. CPU activity from async memory reclaim (kswapd, cswapd?[5]) should be
> accounted to the cgroup that the memory belongs to, and similarly CPU activity
> from net rx should be accounted to the task groups that correspond to the
> packets being received. There are also vague complaints from Android[6].

These are pretty big holes in CPU cycle accounting right now and I think
spend-first-and-backcharge is the right solution for most of them given
experiences from other controllers. That said,

> Each use case has its own requirements[7]. In padata and reclaim, the task
> group to account to is known ahead of time, but net rx has to spend cycles
> processing a packet before its destination task group is known, so any solution
> should be able to work without knowing the task group in advance. Furthermore,
> the CPU controller shouldn't throttle reclaim or net rx in real time since both
> are doing high priority work. These make approaches that run kthreads directly
> in a task group, like cgroup-aware workqueues[8] or a kernel path for
> CLONE_INTO_CGROUP, infeasible. Running kthreads directly in cgroups also has a
> downside for padata because helpers' MAX_NICE priority is "shadowed" by the
> priority of the group entities they're running under.
>
> The proposed solution of remote charging can accrue debt to a task group to be
> paid off or forgiven later, addressing all these issues. A kthread calls the
> interface
>
> void cpu_cgroup_remote_begin(struct task_struct *p,
>                              struct cgroup_subsys_state *css);
>
> to begin remote charging to @css, causing @p's current sum_exec_runtime to be
> updated and saved. The @css arg isn't required and can be removed later to
> facilitate the unknown cgroup case mentioned above. Then the kthread calls
> another interface
>
> void cpu_cgroup_remote_charge(struct task_struct *p,
>                               struct cgroup_subsys_state *css);
>
> to account the sum_exec_runtime that @p has used since the first call.
> Internally, a new field cfs_bandwidth::debt is added to keep track of unpaid
> debt that's only used when the debt exceeds the quota in the current period.
>
> Weight-based control isn't implemented for now since padata helpers run at
> MAX_NICE and so always yield to anything higher priority, meaning they would
> rarely compete with other task groups.

If we're gonna do this, let's please do it right and make weight-based
control work too. Otherwise, its usefulness is pretty limited.

Thanks.

--
tejun


2022-01-13 21:10:02

by Daniel Jordan

Subject: Re: [RFC 15/16] sched/fair: Account kthread runtime debt for CFS bandwidth

On Wed, Jan 12, 2022 at 10:18:16AM -1000, Tejun Heo wrote:
> Hello,

Hi, Tejun.

> On Tue, Jan 11, 2022 at 11:29:50AM -0500, Daniel Jordan wrote:
> ...
> > This problem arises with multithreaded jobs, but is also an issue in other
> > places. CPU activity from async memory reclaim (kswapd, cswapd?[5]) should be
> > accounted to the cgroup that the memory belongs to, and similarly CPU activity
> > from net rx should be accounted to the task groups that correspond to the
> > packets being received. There are also vague complaints from Android[6].
>
> These are pretty big holes in CPU cycle accounting right now and I think
> spend-first-and-backcharge is the right solution for most of them given
> experiences from other controllers. That said,
>
> > Each use case has its own requirements[7]. In padata and reclaim, the task
> > group to account to is known ahead of time, but net rx has to spend cycles
> > processing a packet before its destination task group is known, so any solution
> > should be able to work without knowing the task group in advance. Furthermore,
> > the CPU controller shouldn't throttle reclaim or net rx in real time since both
> > are doing high priority work. These make approaches that run kthreads directly
> > in a task group, like cgroup-aware workqueues[8] or a kernel path for
> > CLONE_INTO_CGROUP, infeasible. Running kthreads directly in cgroups also has a
> > downside for padata because helpers' MAX_NICE priority is "shadowed" by the
> > priority of the group entities they're running under.
> >
> > The proposed solution of remote charging can accrue debt to a task group to be
> > paid off or forgiven later, addressing all these issues. A kthread calls the
> > interface
> >
> > void cpu_cgroup_remote_begin(struct task_struct *p,
> >                              struct cgroup_subsys_state *css);
> >
> > to begin remote charging to @css, causing @p's current sum_exec_runtime to be
> > updated and saved. The @css arg isn't required and can be removed later to
> > facilitate the unknown cgroup case mentioned above. Then the kthread calls
> > another interface
> >
> > void cpu_cgroup_remote_charge(struct task_struct *p,
> >                               struct cgroup_subsys_state *css);
> >
> > to account the sum_exec_runtime that @p has used since the first call.
> > Internally, a new field cfs_bandwidth::debt is added to keep track of unpaid
> > debt that's only used when the debt exceeds the quota in the current period.
> >
> > Weight-based control isn't implemented for now since padata helpers run at
> > MAX_NICE and so always yield to anything higher priority, meaning they would
> > rarely compete with other task groups.
>
> If we're gonna do this, let's please do it right and make weight-based
> control work too. Otherwise, its usefulness is pretty limited.

Ok, understood.

Doing it as presented is an incremental step, and it's all that this
series requires. I figured weight could be added later with the first
user that actually needs it.

I did prototype weight too, though, just to see if it was all gonna work
together. So, given how the discussion elsewhere in the thread is going,
I might respin the scheduler part of this with another use case and
weight-based control included.
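
To make the interface concrete, here's roughly how a padata helper
would bracket its work with the two calls (a minimal sketch; the
pw_css field and help_with_job() are made up for illustration):

	/*
	 * Sketch: charge this kthread's CPU time for a chunk of the
	 * job to the job's cgroup rather than the kthread's own.
	 */
	static void padata_do_chunk(struct padata_work *pw)
	{
		/* hypothetical field holding the job's cgroup */
		struct cgroup_subsys_state *css = pw->pw_css;

		/* snapshot current's sum_exec_runtime */
		cpu_cgroup_remote_begin(current, css);

		help_with_job(pw);	/* hypothetical helper work */

		/* charge the runtime used since _begin to @css */
		cpu_cgroup_remote_charge(current, css);
	}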

I got this far; do the interface and CFS skeleton seem sane? Both are
basically unchanged with weight-based control included; the weight parts
are just more code on top.
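
And the CFS bandwidth piece amounts to roughly the following (a sketch
of the idea only, not the actual patch; locking and period-refresh
details are simplified):

	/*
	 * Sketch: fold @delta ns of remotely-charged runtime into
	 * @cfs_b.  Whatever doesn't fit in the current period's
	 * remaining quota becomes debt, to be paid off or forgiven
	 * in later periods.
	 */
	static void cfs_remote_account(struct cfs_bandwidth *cfs_b, u64 delta)
	{
		raw_spin_lock(&cfs_b->lock);
		if (cfs_b->quota != RUNTIME_INF) {
			if (delta <= cfs_b->runtime) {
				/* fits within this period's quota */
				cfs_b->runtime -= delta;
			} else {
				cfs_b->debt += delta - cfs_b->runtime;
				cfs_b->runtime = 0;
			}
		}
		raw_spin_unlock(&cfs_b->lock);
	}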

Thanks for looking.

2022-01-13 21:12:06

by Daniel Jordan

Subject: Re: [RFC 15/16] sched/fair: Account kthread runtime debt for CFS bandwidth

On Thu, Jan 13, 2022 at 04:08:57PM -0500, Daniel Jordan wrote:
> On Wed, Jan 12, 2022 at 10:18:16AM -1000, Tejun Heo wrote:
> > If we're gonna do this, let's please do it right and make weight-based
> > control work too. Otherwise, its usefulness is pretty limited.
>
> Ok, understood.
>
> Doing it as presented is an incremental step, and it's all that this
> series requires. I figured weight could be added later with the first
> user that actually needs it.
>
> I did prototype weight too, though, just to see if it was all gonna work
> together. So, given how the discussion elsewhere in the thread is going,
> I might respin the scheduler part of this with another use case and
> weight-based control included.
>
> I got this far; do the interface and CFS skeleton seem sane? Both are

s/CFS/CFS bandwidth/

> basically unchanged with weight-based control included; the weight parts
> are just more code on top.
>
> Thanks for looking.