LinuxLists.cc - [PATCH v2] sched: async unthrottling for cfs bandwidth

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

Hey Peter,

On Mon, Oct 31, 2022 at 6:04 AM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Oct 26, 2022 at 03:44:49PM -0700, Josh Don wrote:
> > CFS bandwidth currently distributes new runtime and unthrottles cfs_rq's
> > inline in an hrtimer callback. Runtime distribution is a per-cpu
> > operation, and unthrottling is a per-cgroup operation, since a tg walk
> > is required. On machines with a large number of cpus and large cgroup
> > hierarchies, this cpus*cgroups work can be too much to do in a single
> > hrtimer callback: since IRQ are disabled, hard lockups may easily occur.
> > Specifically, we've found this scalability issue on configurations with
> > 256 cpus, O(1000) cgroups in the hierarchy being throttled, and high
> > memory bandwidth usage.
> >
> > To fix this, we can instead unthrottle cfs_rq's asynchronously via a
> > CSD. Each cpu is responsible for unthrottling itself, thus sharding the
> > total work more fairly across the system, and avoiding hard lockups.
>
> So, TJ has been complaining about us throttling in kernel-space, causing
> grief when we also happen to hold a mutex or some other resource and has
> been prodding us to only throttle at the return-to-user boundary.

Yea, we've been having similar priority inversion issues. It isn't
limited to CFS bandwidth though, such problems are also pretty easy to
hit with configurations of shares, cpumasks, and SCHED_IDLE. I've
chatted with the folks working on the proxy execution patch series,
and it seems like that could be a better generic solution to these
types of issues.

Throttle at return-to-user seems only mildly beneficial, and then only
really with preemptive kernels. Still pretty easy to get inversion
issues, e.g. a thread holding a kernel mutex wake back up into a
hierarchy that is currently throttled, or a thread holding a kernel
mutex exists in the hierarchy being throttled but is currently waiting
to run.

> Would this be an opportune moment to do this? That is, what if we
> replace this CSD with a task_work that's ran on the return-to-user path
> instead?

The above comment is about when we throttle, whereas this patch is
about the unthrottle case. I think you're asking why don't we
unthrottle using e.g. task_work assigned to whatever the current task
is? That would work around the issue of keeping IRQ disabled for long
periods, but still forces one cpu to process everything, which can
take quite a while.

Thanks,
Josh

2022-10-31 21:59:52

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

Hello,

On Mon, Oct 31, 2022 at 02:22:42PM -0700, Josh Don wrote:
> > So, TJ has been complaining about us throttling in kernel-space, causing
> > grief when we also happen to hold a mutex or some other resource and has
> > been prodding us to only throttle at the return-to-user boundary.
>
> Yea, we've been having similar priority inversion issues. It isn't
> limited to CFS bandwidth though, such problems are also pretty easy to
> hit with configurations of shares, cpumasks, and SCHED_IDLE. I've

We need to distinguish between work-conserving and non-work-conserving
control schemes. Work-conserving ones - such as shares and idle - shouldn't
affect the aggregate amount of work the system can perform. There may be
local and temporary priority inversions but they shouldn't affect the
throughput of the system and the scheduler should be able to make the
eventual resource distribution conform to the configured targtes.

CPU affinity and bw control are not work conserving and thus cause a
different class of problems. While it is possible to slow down a system with
overly restrictive CPU affinities, it's a lot harder to do so severely
compared to BW control because no matter what you do, there's still at least
one CPU which can make full forward progress. BW control, it's really easy
to stall the entire system almost completely because we're giving userspace
the ability to stall tasks for an arbitrary amount of time at random places
in the kernel. This is what cgroup1 freezer did which had exactly the same
problems.

> chatted with the folks working on the proxy execution patch series,
> and it seems like that could be a better generic solution to these
> types of issues.

Care to elaborate?

> Throttle at return-to-user seems only mildly beneficial, and then only
> really with preemptive kernels. Still pretty easy to get inversion
> issues, e.g. a thread holding a kernel mutex wake back up into a
> hierarchy that is currently throttled, or a thread holding a kernel
> mutex exists in the hierarchy being throttled but is currently waiting
> to run.

I don't follow. If you only throttle at predefined safe spots, the easiest
place being the kernel-user boundary, you cannot get system-wide stalls from
BW restrictions, which is something the kernel shouldn't allow userspace to
cause. In your example, a thread holding a kernel mutex waking back up into
a hierarchy that is currently throttled should keep running in the kernel
until it encounters such safe throttling point where it would have released
the kernel mutex and then throttle.

Thanks.

--
tejun

2022-10-31 22:02:16

by Benjamin Segall

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

Peter Zijlstra <[email protected]> writes:

> On Wed, Oct 26, 2022 at 03:44:49PM -0700, Josh Don wrote:
>> CFS bandwidth currently distributes new runtime and unthrottles cfs_rq's
>> inline in an hrtimer callback. Runtime distribution is a per-cpu
>> operation, and unthrottling is a per-cgroup operation, since a tg walk
>> is required. On machines with a large number of cpus and large cgroup
>> hierarchies, this cpus*cgroups work can be too much to do in a single
>> hrtimer callback: since IRQ are disabled, hard lockups may easily occur.
>> Specifically, we've found this scalability issue on configurations with
>> 256 cpus, O(1000) cgroups in the hierarchy being throttled, and high
>> memory bandwidth usage.
>>
>> To fix this, we can instead unthrottle cfs_rq's asynchronously via a
>> CSD. Each cpu is responsible for unthrottling itself, thus sharding the
>> total work more fairly across the system, and avoiding hard lockups.
>
> So, TJ has been complaining about us throttling in kernel-space, causing
> grief when we also happen to hold a mutex or some other resource and has
> been prodding us to only throttle at the return-to-user boundary.
>
> Would this be an opportune moment to do this? That is, what if we
> replace this CSD with a task_work that's ran on the return-to-user path
> instead?

This is unthrottle, not throttle, but it would probably be
straightfoward enough to do what you said for throttle. I'd expect this
to not help all that much though, because throttle hits the entire
cfs_rq, not individual threads.

I'm currently trying something more invasive, which doesn't throttle a
cfs_rq while it has any kernel tasks, and prioritizes kernel tasks / ses
containing kernel tasks when a cfs_rq "should" be throttled. "Invasive"
is a key word though, as it needs to do the sort of h_nr_kernel_tasks
tracking on put_prev/set_next in ways we currently only need to do on
enqueue/dequeue.

2022-10-31 23:39:06

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

On Mon, Oct 31, 2022 at 2:50 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Mon, Oct 31, 2022 at 02:22:42PM -0700, Josh Don wrote:
> > > So, TJ has been complaining about us throttling in kernel-space, causing
> > > grief when we also happen to hold a mutex or some other resource and has
> > > been prodding us to only throttle at the return-to-user boundary.
> >
> > Yea, we've been having similar priority inversion issues. It isn't
> > limited to CFS bandwidth though, such problems are also pretty easy to
> > hit with configurations of shares, cpumasks, and SCHED_IDLE. I've
>
> We need to distinguish between work-conserving and non-work-conserving
> control schemes. Work-conserving ones - such as shares and idle - shouldn't
> affect the aggregate amount of work the system can perform. There may be
> local and temporary priority inversions but they shouldn't affect the
> throughput of the system and the scheduler should be able to make the
> eventual resource distribution conform to the configured targtes.
>
> CPU affinity and bw control are not work conserving and thus cause a
> different class of problems. While it is possible to slow down a system with
> overly restrictive CPU affinities, it's a lot harder to do so severely
> compared to BW control because no matter what you do, there's still at least
> one CPU which can make full forward progress. BW control, it's really easy
> to stall the entire system almost completely because we're giving userspace
> the ability to stall tasks for an arbitrary amount of time at random places
> in the kernel. This is what cgroup1 freezer did which had exactly the same
> problems.

Yes, but schemes such as shares and idle can still end up creating
some severe inversions. For example, a SCHED_IDLE thread on a cpu with
many other threads. Eventually the SCHED_IDLE thread will get run, but
the round robin times can easily get pushes out to several hundred ms
(or even into the seconds range), due to min granularity. cpusets
combined with the load balancer's struggle to find low weight tasks
exacerbates such situations.

> > chatted with the folks working on the proxy execution patch series,
> > and it seems like that could be a better generic solution to these
> > types of issues.
>
> Care to elaborate?

https://lwn.net/Articles/793502/ gives some historical context, see
also https://lwn.net/Articles/910302/.

> > Throttle at return-to-user seems only mildly beneficial, and then only
> > really with preemptive kernels. Still pretty easy to get inversion
> > issues, e.g. a thread holding a kernel mutex wake back up into a
> > hierarchy that is currently throttled, or a thread holding a kernel
> > mutex exists in the hierarchy being throttled but is currently waiting
> > to run.
>
> I don't follow. If you only throttle at predefined safe spots, the easiest
> place being the kernel-user boundary, you cannot get system-wide stalls from
> BW restrictions, which is something the kernel shouldn't allow userspace to
> cause. In your example, a thread holding a kernel mutex waking back up into
> a hierarchy that is currently throttled should keep running in the kernel
> until it encounters such safe throttling point where it would have released
> the kernel mutex and then throttle.

Agree except that for the task waking back up, it isn't on cpu, so
there is no "wait to throttle it until it returns to user", since
throttling happens in the context of the entire cfs_rq. We'd have to
treat threads in a bandwidth hierarchy that are also in kernel mode
specially. Mechanically, it is more straightforward to implement the
mechanism to wait to throttle until the cfs_rq has no more threads in
kernel mode, than it is to exclude a woken task from the currently
throttled period of its cfs_rq, though this is incomplete.

What you're suggesting would also require that we find a way to
preempt the current thread to start running the thread that woke up in
kernel (and this becomes more complex when the current thread is also
in kernel, or if there are n other waiting threads that are also in
kernel).

>
> Thanks.
>
> --
> tejun

Best,
Josh

2022-11-01 00:19:30

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

Hello,

On Mon, Oct 31, 2022 at 04:15:54PM -0700, Josh Don wrote:
> On Mon, Oct 31, 2022 at 2:50 PM Tejun Heo <[email protected]> wrote:
> Yes, but schemes such as shares and idle can still end up creating
> some severe inversions. For example, a SCHED_IDLE thread on a cpu with
> many other threads. Eventually the SCHED_IDLE thread will get run, but
> the round robin times can easily get pushes out to several hundred ms
> (or even into the seconds range), due to min granularity. cpusets
> combined with the load balancer's struggle to find low weight tasks
> exacerbates such situations.

Yeah, especially with narrow cpuset (or task cpu affinity) configurations,
it can get pretty bad. Outside that tho, at least I haven't seen a lot of
problematic cases as long as the low priority one isn't tightly entangled
with high priority tasks, mostly because 1. if the resource the low pri one
is holding affects large part of the system, the problem is self-solving as
the system quickly runs out of other things to do 2. if the resource isn't
affecting large part of the system, their blast radius is usually reasonably
confined to things tightly coupled with it. I'm sure there are exceptions
and we definitely wanna improve the situation where it makes sense.

> > > chatted with the folks working on the proxy execution patch series,
> > > and it seems like that could be a better generic solution to these
> > > types of issues.
> >
> > Care to elaborate?
>
> https://lwn.net/Articles/793502/ gives some historical context, see
> also https://lwn.net/Articles/910302/.

Ah, full blown priority inheritance. They're great to pursue but I think we
wanna fix cpu bw control regardless. It's such an obvious and basic issue
and given how much problem we have with actually understanding resource and
control dependencies with all the custom synchronization contstructs in the
kernel, fixing it will be useful even in the future where we have a better
priority inheritance mechanism.

> > I don't follow. If you only throttle at predefined safe spots, the easiest
> > place being the kernel-user boundary, you cannot get system-wide stalls from
> > BW restrictions, which is something the kernel shouldn't allow userspace to
> > cause. In your example, a thread holding a kernel mutex waking back up into
> > a hierarchy that is currently throttled should keep running in the kernel
> > until it encounters such safe throttling point where it would have released
> > the kernel mutex and then throttle.
>
> Agree except that for the task waking back up, it isn't on cpu, so
> there is no "wait to throttle it until it returns to user", since
> throttling happens in the context of the entire cfs_rq. We'd have to

Oh yeah, we'd have to be able to allow threads running in kernel regardless
of cfq_rq throttled state and then force charge the cpu cycles to be paid
later. It would definitely require quite a bit of work.

> treat threads in a bandwidth hierarchy that are also in kernel mode
> specially. Mechanically, it is more straightforward to implement the
> mechanism to wait to throttle until the cfs_rq has no more threads in
> kernel mode, than it is to exclude a woken task from the currently
> throttled period of its cfs_rq, though this is incomplete.

My hunch is that bunching them together is likely gonna create too many
escape scenarios and control artifacts and it'd be better to always push
throttling decisions to the leaves (tasks) so that each task can be
controlled separately. That'd involve architectural changes but the eventual
behavior would be a lot better.

> What you're suggesting would also require that we find a way to
> preempt the current thread to start running the thread that woke up in
> kernel (and this becomes more complex when the current thread is also
> in kernel, or if there are n other waiting threads that are also in
> kernel).

I don't think it needs that. What allows userspace to easily trigger
pathological scenarios is the ability to force the machine idle when there's
something which is ready to run in the kernel. If you take that away, most
of the problems disappear. It's not perfect but reasonable enough and not
worse than a system without cpu bw control.

Thanks.

--
tejun

2022-11-01 01:33:54

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

On Mon, Oct 31, 2022 at 4:53 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Mon, Oct 31, 2022 at 04:15:54PM -0700, Josh Don wrote:
> > On Mon, Oct 31, 2022 at 2:50 PM Tejun Heo <[email protected]> wrote:
> > Yes, but schemes such as shares and idle can still end up creating
> > some severe inversions. For example, a SCHED_IDLE thread on a cpu with
> > many other threads. Eventually the SCHED_IDLE thread will get run, but
> > the round robin times can easily get pushes out to several hundred ms
> > (or even into the seconds range), due to min granularity. cpusets
> > combined with the load balancer's struggle to find low weight tasks
> > exacerbates such situations.
>
> Yeah, especially with narrow cpuset (or task cpu affinity) configurations,
> it can get pretty bad. Outside that tho, at least I haven't seen a lot of
> problematic cases as long as the low priority one isn't tightly entangled
> with high priority tasks, mostly because 1. if the resource the low pri one
> is holding affects large part of the system, the problem is self-solving as
> the system quickly runs out of other things to do 2. if the resource isn't
> affecting large part of the system, their blast radius is usually reasonably
> confined to things tightly coupled with it. I'm sure there are exceptions
> and we definitely wanna improve the situation where it makes sense.

cgroup_mutex and kernfs rwsem beg to differ :) These are shared with
control plane threads, so it is pretty easy to starve those out even
while the system has plenty of work to do.

> > > > chatted with the folks working on the proxy execution patch series,
> > > > and it seems like that could be a better generic solution to these
> > > > types of issues.
> > >
> > > Care to elaborate?
> >
> > https://lwn.net/Articles/793502/ gives some historical context, see
> > also https://lwn.net/Articles/910302/.
>
> Ah, full blown priority inheritance. They're great to pursue but I think we
> wanna fix cpu bw control regardless. It's such an obvious and basic issue
> and given how much problem we have with actually understanding resource and
> control dependencies with all the custom synchronization contstructs in the
> kernel, fixing it will be useful even in the future where we have a better
> priority inheritance mechanism.

Sure, even something like not throttling when there exist threads in
kernel mode (while not a complete solution), helps get some of the way
towards improving that case.

> > > I don't follow. If you only throttle at predefined safe spots, the easiest
> > > place being the kernel-user boundary, you cannot get system-wide stalls from
> > > BW restrictions, which is something the kernel shouldn't allow userspace to
> > > cause. In your example, a thread holding a kernel mutex waking back up into
> > > a hierarchy that is currently throttled should keep running in the kernel
> > > until it encounters such safe throttling point where it would have released
> > > the kernel mutex and then throttle.
> >
> > Agree except that for the task waking back up, it isn't on cpu, so
> > there is no "wait to throttle it until it returns to user", since
> > throttling happens in the context of the entire cfs_rq. We'd have to
>
> Oh yeah, we'd have to be able to allow threads running in kernel regardless
> of cfq_rq throttled state and then force charge the cpu cycles to be paid
> later. It would definitely require quite a bit of work.
>
> > treat threads in a bandwidth hierarchy that are also in kernel mode
> > specially. Mechanically, it is more straightforward to implement the
> > mechanism to wait to throttle until the cfs_rq has no more threads in
> > kernel mode, than it is to exclude a woken task from the currently
> > throttled period of its cfs_rq, though this is incomplete.
>
> My hunch is that bunching them together is likely gonna create too many
> escape scenarios and control artifacts and it'd be better to always push
> throttling decisions to the leaves (tasks) so that each task can be
> controlled separately. That'd involve architectural changes but the eventual
> behavior would be a lot better.

Also a tradeoff, since it is extra overhead to handle individually at
the leaf level vs dequeuing a single cfs_rq.

> > What you're suggesting would also require that we find a way to
> > preempt the current thread to start running the thread that woke up in
> > kernel (and this becomes more complex when the current thread is also
> > in kernel, or if there are n other waiting threads that are also in
> > kernel).
>
> I don't think it needs that. What allows userspace to easily trigger
> pathological scenarios is the ability to force the machine idle when there's
> something which is ready to run in the kernel. If you take that away, most
> of the problems disappear. It's not perfect but reasonable enough and not
> worse than a system without cpu bw control.
>
> Thanks.
>
> --
> tejun

2022-11-01 02:19:22

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

On Mon, Oct 31, 2022 at 06:01:19PM -0700, Josh Don wrote:
> > Yeah, especially with narrow cpuset (or task cpu affinity) configurations,
> > it can get pretty bad. Outside that tho, at least I haven't seen a lot of
> > problematic cases as long as the low priority one isn't tightly entangled
> > with high priority tasks, mostly because 1. if the resource the low pri one
> > is holding affects large part of the system, the problem is self-solving as
> > the system quickly runs out of other things to do 2. if the resource isn't
> > affecting large part of the system, their blast radius is usually reasonably
> > confined to things tightly coupled with it. I'm sure there are exceptions
> > and we definitely wanna improve the situation where it makes sense.
>
> cgroup_mutex and kernfs rwsem beg to differ :) These are shared with
> control plane threads, so it is pretty easy to starve those out even
> while the system has plenty of work to do.

Hahaha yeah, good point. We definitely wanna improve them. There were some
efforts to improve kernfs locking granularity earlier this year. It was
promising but didn't get to the finish line. cgroup_mutex, w/ cgroup2 and
especially with the optimizations around CLONE_INTO_CGROUP, we avoid that in
most hot paths and hopefully that should help quite a bit. If it continues
to be a problem, we definitely wanna further improve it.

Just to better understand the situation, can you give some more details on
the scenarios where cgroup_mutex was in the middle of a shitshow?

Thanks.

--
tejun

2022-11-01 19:16:12

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

On Mon, Oct 31, 2022 at 6:46 PM Tejun Heo <[email protected]> wrote:
>
> On Mon, Oct 31, 2022 at 06:01:19PM -0700, Josh Don wrote:
> > > Yeah, especially with narrow cpuset (or task cpu affinity) configurations,
> > > it can get pretty bad. Outside that tho, at least I haven't seen a lot of
> > > problematic cases as long as the low priority one isn't tightly entangled
> > > with high priority tasks, mostly because 1. if the resource the low pri one
> > > is holding affects large part of the system, the problem is self-solving as
> > > the system quickly runs out of other things to do 2. if the resource isn't
> > > affecting large part of the system, their blast radius is usually reasonably
> > > confined to things tightly coupled with it. I'm sure there are exceptions
> > > and we definitely wanna improve the situation where it makes sense.
> >
> > cgroup_mutex and kernfs rwsem beg to differ :) These are shared with
> > control plane threads, so it is pretty easy to starve those out even
> > while the system has plenty of work to do.
>
> Hahaha yeah, good point. We definitely wanna improve them. There were some
> efforts to improve kernfs locking granularity earlier this year. It was
> promising but didn't get to the finish line. cgroup_mutex, w/ cgroup2 and
> especially with the optimizations around CLONE_INTO_CGROUP, we avoid that in
> most hot paths and hopefully that should help quite a bit. If it continues
> to be a problem, we definitely wanna further improve it.
>
> Just to better understand the situation, can you give some more details on
> the scenarios where cgroup_mutex was in the middle of a shitshow?

There have been a couple, I think one of the main ones has been writes
to cgroup.procs. cpuset modifications also show up since there's a
mutex there.

>
> Thanks.
>
> --
> tejun

2022-11-01 19:34:06

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

Hello,

On Tue, Nov 01, 2022 at 12:11:30PM -0700, Josh Don wrote:
> > Just to better understand the situation, can you give some more details on
> > the scenarios where cgroup_mutex was in the middle of a shitshow?
>
> There have been a couple, I think one of the main ones has been writes
> to cgroup.procs. cpuset modifications also show up since there's a
> mutex there.

If you can, I'd really like to learn more about the details. We've had some
issues with the threadgroup_rwsem because it's such a big hammer but not
necessarily with cgroup_mutex because they are only used in maintenance
operations and never from any hot paths.

Regarding threadgroup_rwsem, w/ CLONE_INTO_CGROUP (userspace support is
still missing unfortunately), the usual worfklow of creating a cgroup,
seeding it with a process and then later shutting it down doesn't involve
threadgroup_rwsem at all, so most of the problems should go away in the
hopefully near future.

Thanks.

--
tejun

2022-11-01 21:04:40

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

On Tue, Nov 1, 2022 at 12:15 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Tue, Nov 01, 2022 at 12:11:30PM -0700, Josh Don wrote:
> > > Just to better understand the situation, can you give some more details on
> > > the scenarios where cgroup_mutex was in the middle of a shitshow?
> >
> > There have been a couple, I think one of the main ones has been writes
> > to cgroup.procs. cpuset modifications also show up since there's a
> > mutex there.
>
> If you can, I'd really like to learn more about the details. We've had some
> issues with the threadgroup_rwsem because it's such a big hammer but not
> necessarily with cgroup_mutex because they are only used in maintenance
> operations and never from any hot paths.
>
> Regarding threadgroup_rwsem, w/ CLONE_INTO_CGROUP (userspace support is
> still missing unfortunately), the usual worfklow of creating a cgroup,
> seeding it with a process and then later shutting it down doesn't involve
> threadgroup_rwsem at all, so most of the problems should go away in the
> hopefully near future.

Maybe walking through an example would be helpful? I don't know if
there's anything super specific. For cgroup_mutex for example, the
same global mutex is being taken for things like cgroup mkdir and
cgroup proc attach, regardless of which part of the hierarchy is being
modified. So, we end up sharing that mutex between random job threads
(ie. that may be manipulating their own cgroup sub-hierarchy), and
control plane threads, which are attempting to manage root-level
cgroups. Bad things happen when the cgroup_mutex (or similar) is held
by a random thread which blocks and is of low scheduling priority,
since when it wakes back up it may take quite a while for it to run
again (whether that low priority be due to CFS bandwidth, sched_idle,
or even just O(hundreds) of threads on a cpu). Starving out the
control plane causes us significant issues, since that affects machine
health. cgroup manipulation is not a hot path operation, but the
control plane tends to hit it fairly often, and so those things
combine at our scale to produce this rare problem.

>
> Thanks.
>
> --
> tejun

2022-11-01 22:14:22

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

Hello,

On Tue, Nov 01, 2022 at 01:56:29PM -0700, Josh Don wrote:
> Maybe walking through an example would be helpful? I don't know if
> there's anything super specific. For cgroup_mutex for example, the
> same global mutex is being taken for things like cgroup mkdir and
> cgroup proc attach, regardless of which part of the hierarchy is being
> modified. So, we end up sharing that mutex between random job threads
> (ie. that may be manipulating their own cgroup sub-hierarchy), and
> control plane threads, which are attempting to manage root-level
> cgroups. Bad things happen when the cgroup_mutex (or similar) is held
> by a random thread which blocks and is of low scheduling priority,
> since when it wakes back up it may take quite a while for it to run
> again (whether that low priority be due to CFS bandwidth, sched_idle,
> or even just O(hundreds) of threads on a cpu). Starving out the
> control plane causes us significant issues, since that affects machine
> health. cgroup manipulation is not a hot path operation, but the
> control plane tends to hit it fairly often, and so those things
> combine at our scale to produce this rare problem.

I keep asking because I'm curious about the specific details of the
contentions. Control plane locking up is obviously bad but they can usually
tolerate some latencies - stalling out multiple seconds (or longer) can be
catastrophic but tens or hundreds or millisecs occasionally usually isn't.

The only times we've seen latency spikes from CPU side which is enough to
cause system-level failures were when there were severe restrictions through
bw control. Other cases sure are possible but unless you grab these mutexes
while IDLE inside a heavily contended cgroup (which is a bit silly) you
gotta push *really* hard.

If most of the problems were with cpu bw control, fixing that should do for
the time being. Otherwise, we'll have to think about finishing kernfs
locking granularity improvements and doing something similar to cgroup
locking too.

Thanks.

--
tejun

2022-11-01 22:15:39

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

On Tue, Nov 1, 2022 at 2:50 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Tue, Nov 01, 2022 at 01:56:29PM -0700, Josh Don wrote:
> > Maybe walking through an example would be helpful? I don't know if
> > there's anything super specific. For cgroup_mutex for example, the
> > same global mutex is being taken for things like cgroup mkdir and
> > cgroup proc attach, regardless of which part of the hierarchy is being
> > modified. So, we end up sharing that mutex between random job threads
> > (ie. that may be manipulating their own cgroup sub-hierarchy), and
> > control plane threads, which are attempting to manage root-level
> > cgroups. Bad things happen when the cgroup_mutex (or similar) is held
> > by a random thread which blocks and is of low scheduling priority,
> > since when it wakes back up it may take quite a while for it to run
> > again (whether that low priority be due to CFS bandwidth, sched_idle,
> > or even just O(hundreds) of threads on a cpu). Starving out the
> > control plane causes us significant issues, since that affects machine
> > health. cgroup manipulation is not a hot path operation, but the
> > control plane tends to hit it fairly often, and so those things
> > combine at our scale to produce this rare problem.
>
> I keep asking because I'm curious about the specific details of the
> contentions. Control plane locking up is obviously bad but they can usually
> tolerate some latencies - stalling out multiple seconds (or longer) can be
> catastrophic but tens or hundreds or millisecs occasionally usually isn't.
>
> The only times we've seen latency spikes from CPU side which is enough to
> cause system-level failures were when there were severe restrictions through
> bw control. Other cases sure are possible but unless you grab these mutexes
> while IDLE inside a heavily contended cgroup (which is a bit silly) you
> gotta push *really* hard.
>
> If most of the problems were with cpu bw control, fixing that should do for
> the time being. Otherwise, we'll have to think about finishing kernfs
> locking granularity improvements and doing something similar to cgroup
> locking too.

Oh we've easily hit stalls measured in multiple seconds. We
extensively use cpu.idle to group batch tasks. One of the memory
bandwidth mitigations implemented in userspace is cpu jailing, which
can end up pushing lots and lots of these batch threads onto a small
number of cpus. 5ms min gran * 200 threads is already one second :)
We're in the process of transitioning to using bw instead for this
instead in order to maintain parallelism. Fixing bw is definitely
going to be useful, but I'm afraid we'll still likely have some issues
from low throughput for non-bw reasons (some of which we can't
directly control, since arbitrary jobs can spin up and configure their
hierarchy/threads in antagonistic ways, in effect pushing out the
latency of some of their threads).

>
> Thanks.
>
> --
> tejun

2022-11-01 23:02:00

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

Hello,

(cc'ing Michal, Christian and Li for context)

On Tue, Nov 01, 2022 at 02:59:56PM -0700, Josh Don wrote:
> > If most of the problems were with cpu bw control, fixing that should do for
> > the time being. Otherwise, we'll have to think about finishing kernfs
> > locking granularity improvements and doing something similar to cgroup
> > locking too.
>
> Oh we've easily hit stalls measured in multiple seconds. We
> extensively use cpu.idle to group batch tasks. One of the memory
> bandwidth mitigations implemented in userspace is cpu jailing, which
> can end up pushing lots and lots of these batch threads onto a small
> number of cpus. 5ms min gran * 200 threads is already one second :)

Ah, I see.

> We're in the process of transitioning to using bw instead for this
> instead in order to maintain parallelism. Fixing bw is definitely
> going to be useful, but I'm afraid we'll still likely have some issues
> from low throughput for non-bw reasons (some of which we can't
> directly control, since arbitrary jobs can spin up and configure their
> hierarchy/threads in antagonistic ways, in effect pushing out the
> latency of some of their threads).

Yeah, thanks for the explanation. Making the lock more granular is tedious
but definitely doable. I don't think I can work on it in the near future but
will keep it on mind. If anyone's interested in attacking it, please be my
guest.

Thanks.

--
tejun

2022-11-02 09:05:23

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

On Mon, Oct 31, 2022 at 02:56:13PM -0700, Benjamin Segall wrote:
> Peter Zijlstra <[email protected]> writes:
>
> > On Wed, Oct 26, 2022 at 03:44:49PM -0700, Josh Don wrote:
> >> CFS bandwidth currently distributes new runtime and unthrottles cfs_rq's
> >> inline in an hrtimer callback. Runtime distribution is a per-cpu
> >> operation, and unthrottling is a per-cgroup operation, since a tg walk
> >> is required. On machines with a large number of cpus and large cgroup
> >> hierarchies, this cpus*cgroups work can be too much to do in a single
> >> hrtimer callback: since IRQ are disabled, hard lockups may easily occur.
> >> Specifically, we've found this scalability issue on configurations with
> >> 256 cpus, O(1000) cgroups in the hierarchy being throttled, and high
> >> memory bandwidth usage.
> >>
> >> To fix this, we can instead unthrottle cfs_rq's asynchronously via a
> >> CSD. Each cpu is responsible for unthrottling itself, thus sharding the
> >> total work more fairly across the system, and avoiding hard lockups.
> >
> > So, TJ has been complaining about us throttling in kernel-space, causing
> > grief when we also happen to hold a mutex or some other resource and has
> > been prodding us to only throttle at the return-to-user boundary.
> >
> > Would this be an opportune moment to do this? That is, what if we
> > replace this CSD with a task_work that's ran on the return-to-user path
> > instead?
>
> This is unthrottle, not throttle, but it would probably be

Duh..

2022-11-02 17:30:19

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

On Wed, Nov 02, 2022 at 06:10:49PM +0100, Michal Koutn? wrote:
> On Tue, Nov 01, 2022 at 12:38:23PM -1000, Tejun Heo <[email protected]> wrote:
> > > We're in the process of transitioning to using bw instead for this
> > > instead in order to maintain parallelism. Fixing bw is definitely
> > > going to be useful, but I'm afraid we'll still likely have some issues
> > > from low throughput for non-bw reasons (some of which we can't
> > > directly control, since arbitrary jobs can spin up and configure their
> > > hierarchy/threads in antagonistic ways, in effect pushing out the
> > > latency of some of their threads).
> >
> > Yeah, thanks for the explanation. Making the lock more granular is tedious
> > but definitely doable. I don't think I can work on it in the near future but
> > will keep it on mind. If anyone's interested in attacking it, please be my
> > guest.
>
> From my experience, throttling while holding kernel locks (not just
> cgroup_mutex) causes more trouble than plain cgroup_mutex scalability
> currently.

Oh yeah, absolutely. Low cpu bw config + any shared kernel resource is a
nightmare and this thread was originally about addressing that.

Thanks.

--
tejun

2022-11-02 17:33:03

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

On Tue, Nov 01, 2022 at 12:38:23PM -1000, Tejun Heo <[email protected]> wrote:
> (cc'ing Michal, Christian and Li for context)

Thanks.

> > We're in the process of transitioning to using bw instead for this
> > instead in order to maintain parallelism. Fixing bw is definitely
> > going to be useful, but I'm afraid we'll still likely have some issues
> > from low throughput for non-bw reasons (some of which we can't
> > directly control, since arbitrary jobs can spin up and configure their
> > hierarchy/threads in antagonistic ways, in effect pushing out the
> > latency of some of their threads).
>
> Yeah, thanks for the explanation. Making the lock more granular is tedious
> but definitely doable. I don't think I can work on it in the near future but
> will keep it on mind. If anyone's interested in attacking it, please be my
> guest.

From my experience, throttling while holding kernel locks (not just
cgroup_mutex) causes more trouble than plain cgroup_mutex scalability
currently.
But I acknowledge the latter issue too.

Michal

Attachments:

(No filename) (1.04 kB)
signature.asc (235.00 B)
Digital signature Download all attachments

2022-11-02 17:34:15

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

Hello.

On Wed, Oct 26, 2022 at 03:44:49PM -0700, Josh Don <[email protected]> wrote:
> To fix this, we can instead unthrottle cfs_rq's asynchronously via a
> CSD. Each cpu is responsible for unthrottling itself, thus sharding the
> total work more fairly across the system, and avoiding hard lockups.

FIFO behavior of the cfs_b->throttled_cfs_rq is quite important to
ensure fairness of throttling (historically when it FIFO wasn't honored,
it caused some cfs_rq starving issues).

Despite its name, distribute_cfs_runtime() doesn't distribute the
runtime, the time is pulled inside assign_cfs_rq_runtime() (but that's
already on target cpu).
Currently, it's all synchronized under cfs_b->lock but with your change,
throttled cfs_rq would be dissolved among cpus that'd run concurrently
(assign_cfs_rq_runtime() still takes cfs_b->lock but it won't be
necessarily in the unthrottling order).

Have you observed any such fairness issues? [1][2]

> +static inline void __unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
> [...]
> + if (rq == this_rq()) {
> + unthrottle_cfs_rq(cfs_rq);
> + return;
> + }

It was pointed out to me that generic_exec_single() does something
similar.
Wouldn't the flow bandwidth control code be simpler relying on that?

Also, can a particular cfs_rq be on both cfs_b->throttled_csd_list and
cfs_b->throttled_cfs_rq lists at any moment?
I wonder if having a single list_head node in cfs_rq would be feasible
(and hence enforcing this constraint in data).

Regards,
Michal

[1] I'm not familiar with IPIs, just to illustrate the concurrency: the
fairness could be skewed towards CPUs that are on same "NUMA" node
as the timer callback if closer CPUs received them sooner.

[2] Currently, I don't think it's a prohibitive issue because with my
reasoning even the current code relies on cfs_b->lock being a queued
spinlock to ensure the FIFO of cfs_b->throttled_cfs_rq.

Attachments:

(No filename) (1.92 kB)
signature.asc (235.00 B)
Digital signature Download all attachments

2022-11-03 01:46:16

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

Hi Michal,

Thanks for taking a look.

On Wed, Nov 2, 2022 at 9:59 AM Michal Koutný <[email protected]> wrote:
>
> Hello.
>
> On Wed, Oct 26, 2022 at 03:44:49PM -0700, Josh Don <[email protected]> wrote:
> > To fix this, we can instead unthrottle cfs_rq's asynchronously via a
> > CSD. Each cpu is responsible for unthrottling itself, thus sharding the
> > total work more fairly across the system, and avoiding hard lockups.
>
> FIFO behavior of the cfs_b->throttled_cfs_rq is quite important to
> ensure fairness of throttling (historically when it FIFO wasn't honored,
> it caused some cfs_rq starving issues).
>
> Despite its name, distribute_cfs_runtime() doesn't distribute the
> runtime, the time is pulled inside assign_cfs_rq_runtime() (but that's
> already on target cpu).
> Currently, it's all synchronized under cfs_b->lock but with your change,
> throttled cfs_rq would be dissolved among cpus that'd run concurrently
> (assign_cfs_rq_runtime() still takes cfs_b->lock but it won't be
> necessarily in the unthrottling order).

I don't think my patch meaningfully regresses this; the prior state
was also very potentially unfair in a similar way.

Without my patch, distribute_cfs_runtime() will unthrottle the
cfs_rq's, and as you point out, it doesn't actually give them any real
quota, it lets assign_cfs_rq_runtime() take care of that. But this
happens asynchronously on those cpus. If they are idle, they wait for
an IPI from the resched_curr() in unthrottled_cfs_rq(), otherwise they
simply wait until potentially the next rescheduling point. So we are
currently far from ever being guaranteed that the order the cpus pull
actual quota via assign_cfs_rq_runtime() matches the order they were
unthrottled from the list.

> > +static inline void __unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
> > [...]
> > + if (rq == this_rq()) {
> > + unthrottle_cfs_rq(cfs_rq);
> > + return;
> > + }
>
> It was pointed out to me that generic_exec_single() does something
> similar.
> Wouldn't the flow bandwidth control code be simpler relying on that?

We already hold rq lock so we couldn't rely on the
generic_exec_single() special case since that would double lock.

> Also, can a particular cfs_rq be on both cfs_b->throttled_csd_list and
> cfs_b->throttled_cfs_rq lists at any moment?
> I wonder if having a single list_head node in cfs_rq would be feasible
> (and hence enforcing this constraint in data).

That's an interesting idea, this could be rewritten so that
distribute() pulls the entity off this list and moves it to the
throttled_csd_list; we never have an actual need to have entities on
both lists at the same time.

I'll wait to see if Peter has any comments, but that could be made in
a v3 for this patch.

Best,
Josh

2022-11-03 11:05:01

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

On Wed, Nov 02, 2022 at 05:10:08PM -0700, Josh Don <[email protected]> wrote:
> Without my patch, distribute_cfs_runtime() will unthrottle the
> cfs_rq's, and as you point out, it doesn't actually give them any real
> quota, it lets assign_cfs_rq_runtime() take care of that. But this
> happens asynchronously on those cpus. If they are idle, they wait for
> an IPI from the resched_curr() in unthrottled_cfs_rq(), otherwise they
> simply wait until potentially the next rescheduling point. So we are
> currently far from ever being guaranteed that the order the cpus pull
> actual quota via assign_cfs_rq_runtime() matches the order they were
> unthrottled from the list.

Thanks for breaking it down, I agree your change won't affect the
fairness substantially and my other points are clarified too.

Michal

Attachments:

(No filename) (828.00 B)
signature.asc (235.00 B)
Digital signature Download all attachments

2022-11-11 00:20:43

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

Hey Peter,

Any other thoughts on this patch? Right now the only thing I have to
change is to eliminate the redundant list_head, per Michal's
suggestion. If the general idea here looks good to you, I can go ahead
and do that and send the v3.

Thanks,
Josh

2022-11-16 03:51:38

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

On Wed, Nov 2, 2022 at 9:59 AM Michal Koutný <[email protected]> wrote:
>
> Also, can a particular cfs_rq be on both cfs_b->throttled_csd_list and
> cfs_b->throttled_cfs_rq lists at any moment?
> I wonder if having a single list_head node in cfs_rq would be feasible
> (and hence enforcing this constraint in data).

After more thought, I realized that we can't reuse the throttled_list
list_head, since that would potentially break the lockless traversal
of a concurrent list_for_each_entry_rcu() (ie. if we removed the
element from the throttled list and then added it to the CSD list).

- Josh

2022-11-16 10:31:20

[permalink] [raw]

Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

On Tue, Nov 15, 2022 at 07:01:31PM -0800, Josh Don <[email protected]> wrote:
> After more thought, I realized that we can't reuse the throttled_list
> list_head, since that would potentially break the lockless traversal
> of a concurrent list_for_each_entry_rcu() (ie. if we removed the
> element from the throttled list and then added it to the CSD list).

I see, the concurrent RCU traversal is a valid point for the two heads.

What does it mean for SCHED_WARN_ON in __unthrottle_cfs_rq_async()?

IIUC, if the concurrency of cfs_b->throttled_cfs_rq list is
expected (hence I'm not sure about the SCHED_WARN_ON), then it may
happen that __unthrottle_cfs_rq_async is called on cfs_rq that's already
on rq->cfsb_csd_list (there's still rq lock but it's only help inside
cfs_b->throttled_cfs_rq iteration).

Thanks,
Michal

Attachments:

(No filename) (842.00 B)
signature.asc (235.00 B)
Digital signature Download all attachments

2022-11-16 21:49:42