2021-04-01 17:53:59

by Peter Zijlstra

Subject: [PATCH 0/9] sched: Core scheduling interfaces

Hi,

This is a rewrite of the core sched interface bits, and mostly replaces patches
2-5 from this set here:

https://lkml.kernel.org/r/[email protected]

The task interface is extended to include PR_SCHED_CORE_GET, because the
selftest needs it. Otherwise the task interface is much the same, except
the code is completely new.

The cgroup interface now uses a 'core_sched' file, which still takes 0,1. It is
however changed such that you can have nested tags. Then for any given task, the
first parent with a cookie is the effective one. The rationale is that this way
you can delegate subtrees and still allow them some control over grouping.

The cgroup thing also '(ab)uses' cgroup_mutex for serialization because it
needs to ensure continuity between ss->can_attach() and ss->attach() for the
memory allocation. If the prctl() were allowed to interleave it might steal the
memory.

Using cgroup_mutex feels icky, but is not without precedent,
kernel/bpf/cgroup.c does the same thing afaict.

TJ, can you please have a look at this?

The last patch implements the prctl() / cgroup interaction, up until that point
each task carries the cookie set last between either interface, which is not
desirable. It really isn't the nicest thing ever, but it does keep the
scheduling core from having to consider multiple cookies.

Also, I still hate the kernel/sched/core_sched.c filename, but short of using
gibberish names to make tab-completion easier I simply cannot come up with
a remotely sane alternative :/

The code seems to not insta-crash, and I can run the prctl() selftest while
in a cgroup and have it pass without leaking any references etc. But it's
otherwise lightly tested code. Please read carefully etc.

Also of note; I didn't seem to need the css_offline and css_exit handlers the
other set added.

FWIW, I have a 4 day weekend ahead :-)


2021-04-04 23:48:25

by Tejun Heo

Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

cc'ing Michal and Christian who've been spending some time on cgroup
interface issues recently and Li Zefan for cpuset.

On Thu, Apr 01, 2021 at 03:10:12PM +0200, Peter Zijlstra wrote:
> The cgroup interface now uses a 'core_sched' file, which still takes 0,1. It is
> however changed such that you can have nested tags. Then for any given task, the
> first parent with a cookie is the effective one. The rationale is that this way
> you can delegate subtrees and still allow them some control over grouping.

I find it difficult to like the proposed interface from the name (the term
"core" is really confusing given how the word tends to be used internally)
to the semantics (it isn't like anything else) and even the functionality
(we're gonna have fixed processors at some point, right?).

Here are some preliminary thoughts:

* Are both prctl and cgroup based interfaces really necessary? I could be
being naive but given that we're (hopefully) working around hardware
deficiencies which will go away in time, I think there's a strong case for
minimizing at least the interface to the bare minimum.

Given how cgroups are set up (membership operations happening only for
seeding, especially with the new clone interface), it isn't too difficult
to synchronize process tree and cgroup hierarchy where it matters - ie.
given the right per-process level interface, restricting configuration for
a cgroup sub-hierarchy may not need any cgroup involvement at all. This
also nicely gets rid of the interaction between prctl and cgroup bits.

* If we *have* to have cgroup interface, I wonder whether this would fit a
lot better as a part of cpuset. If you squint just right, this can be
viewed as some dynamic form of cpuset. Implementation-wise, it probably
won't integrate with the rest but I think the feature will be less jarring
as a part of cpuset, which already is a bit of a kitchen sink anyway.

> The cgroup thing also '(ab)uses' cgroup_mutex for serialization because it
> needs to ensure continuity between ss->can_attach() and ss->attach() for the
> memory allocation. If the prctl() were allowed to interleave it might steal the
> memory.
>
> Using cgroup_mutex feels icky, but is not without precedent,
> kernel/bpf/cgroup.c does the same thing afaict.
>
> TJ, can you please have a look at this?

Yeah, using cgroup_mutex for stabilizing cgroup hierarchy for consecutive
operations is fine. It might be worthwhile to break that out into a proper
interface but that's the least of concerns here.

Can someone point me to a realistic and concrete usage scenario for this
feature?

Thanks.

--
tejun

2021-04-06 08:46:07

by Joel Fernandes

Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

Hi TJ, Peter,

On Sun, Apr 4, 2021 at 7:39 PM Tejun Heo <[email protected]> wrote:
>
> cc'ing Michal and Christian who've been spending some time on cgroup
> interface issues recently and Li Zefan for cpuset.
>
> On Thu, Apr 01, 2021 at 03:10:12PM +0200, Peter Zijlstra wrote:
> > The cgroup interface now uses a 'core_sched' file, which still takes 0,1. It is
> > however changed such that you can have nested tags. Then for any given task, the
> > first parent with a cookie is the effective one. The rationale is that this way
> > you can delegate subtrees and still allow them some control over grouping.
>
> I find it difficult to like the proposed interface from the name (the term
> "core" is really confusing given how the word tends to be used internally)
> to the semantics (it isn't like anything else) and even the functionality
> (we're gonna have fixed processors at some point, right?).
>
> Here are some preliminary thoughts:
>
> * Are both prctl and cgroup based interfaces really necessary? I could be
> being naive but given that we're (hopefully) working around hardware
> deficiencies which will go away in time, I think there's a strong case for
> minimizing at least the interface to the bare minimum.

I don't think these issues are going away, as new SMT-related exploits
are constantly coming out. Further, core scheduling is not only for SMT -
there are other use cases as well (such as improving VM performance by
preventing vCPU threads from core-sharing).

>
> Given how cgroups are set up (membership operations happening only for
> seeding, especially with the new clone interface), it isn't too difficult
> to synchronize process tree and cgroup hierarchy where it matters - ie.
> given the right per-process level interface, restricting configuration for
> a cgroup sub-hierarchy may not need any cgroup involvement at all. This
> also nicely gets rid of the interaction between prctl and cgroup bits.
> * If we *have* to have cgroup interface, I wonder whether this would fit a
> lot better as a part of cpuset. If you squint just right, this can be
> viewed as some dynamic form of cpuset. Implementation-wise, it probably
> won't integrate with the rest but I think the feature will be less jarring
> as a part of cpuset, which already is a bit of a kitchen sink anyway.

I think both interfaces are important for different reasons. Could you
take a look at the initial thread I started a few months ago? I tried to
elaborate on the use cases in detail:
http://lore.kernel.org/r/[email protected]

Also, in ChromeOS we can't use cgroups for this purpose. The cgroup
hierarchy does not fit well with the threads we are tagging. Also, we
use cgroup v1, and since cgroups cannot overlap, this is cumbersome at
best and in places impossible. Still, a cgroup interface to core
scheduling is useful for people using containers who want to
core-schedule each container separately (+Hao Luo can elaborate more
on that, but I did describe it in the link above).

> > The cgroup thing also '(ab)uses' cgroup_mutex for serialization because it
> > needs to ensure continuity between ss->can_attach() and ss->attach() for the
> > memory allocation. If the prctl() were allowed to interleave it might steal the
> > memory.
> >
> > Using cgroup_mutex feels icky, but is not without precedent,
> > kernel/bpf/cgroup.c does the same thing afaict.
> >
> > TJ, can you please have a look at this?
>
> Yeah, using cgroup_mutex for stabilizing cgroup hierarchy for consecutive
> operations is fine. It might be worthwhile to break that out into a proper
> interface but that's the least of concerns here.
>
> Can someone point me to a realistic and concrete usage scenario for this
> feature?

Yeah, it's at http://lore.kernel.org/r/[email protected]
as mentioned above; let me know if you need any more details about the
use case.

About the file name, how about kernel/sched/smt.c? That definitely
provides more information than 'core_sched.c'.

Thanks,
- Joel

2021-04-07 05:08:54

by Tejun Heo

Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

Hello,

On Mon, Apr 05, 2021 at 02:46:09PM -0400, Joel Fernandes wrote:
> Yeah, it's at http://lore.kernel.org/r/[email protected]
> as mentioned above, let me know if you need any more details about
> usecase.

Except for the unspecified reason in usecase 4, I don't see why cgroup is in
the picture at all. This doesn't really have much to do with hierarchical
resource distribution. Besides, yes, you can use cgroup for logical
structuring and identification purposes but in those cases the interactions
and interface should be with the original subsystem while using cgroup IDs
or paths as parameters - see tracing and bpf for examples.

Thanks.

--
tejun

2021-04-07 07:11:16

by Tejun Heo

Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

Hello,

On Tue, Apr 06, 2021 at 05:32:04PM +0200, Peter Zijlstra wrote:
> > I find it difficult to like the proposed interface from the name (the term
> > "core" is really confusing given how the word tends to be used internally)
> > to the semantics (it isn't like anything else) and even the functionality
> > (we're gonna have fixed processors at some point, right?).
>
> Core is the topological name for the thing that hosts the SMT threads.
> Can't really help that.

I find the name pretty unfortunate given how overloaded the term is,
both generally and in the kernel, but oh well...

> > Here are some preliminary thoughts:
> >
> > * Are both prctl and cgroup based interfaces really necessary? I could be
> > being naive but given that we're (hopefully) working around hardware
> > deficiencies which will go away in time, I think there's a strong case for
> > minimizing at least the interface to the bare minimum.
>
> I'm not one for cgroups much, so I'll let others argue that case, except
> that per systemd and all the other new fangled shit, people seem to use
> cgroups a lot to group tasks. So it makes sense to also expose this
> through cgroups in some form.

All the new fangled things follow a certain usage pattern which makes
aligning parts of process tree with cgroup layout trivial, so when
restrictions can be applied along the process tree like this and there isn't
a particular need for dynamic hierarchical control, there isn't much need
for direct cgroup interface.

> That said; I've had requests from lots of non security folks about this
> feature to help mitigate the SMT interference.
>
> Consider for example Real-Time. If you have an active SMT sibling, the
> CPU performance is much less than it would be when the SMT sibling is
> idle. Therefore, for the benefit of determinism, it would be very nice
> if RT tasks could force-idle their SMT siblings, and voila, this
> interface allows exactly that.
>
> The same is true for other workloads that care about interference.

I see.

> > Given how cgroups are set up (membership operations happening only for
> > seeding, especially with the new clone interface), it isn't too difficult
> > to synchronize process tree and cgroup hierarchy where it matters - ie.
> > given the right per-process level interface, restricting configuration for
> > a cgroup sub-hierarchy may not need any cgroup involvement at all. This
> > also nicely gets rid of the interaction between prctl and cgroup bits.
>
> I've no idea what you mean :/ The way I use cgroups (when I have to, for
> testing) is to echo the pid into /cgroup/foo/tasks. No clone or anything
> involved.

The usage pattern is creating a new cgroup, seeding it with a process
(either writing to tasks or using CLONE_INTO_CGROUP) and let it continue
only on that sub-hierarchy, so cgroup hierarchy usually partially overlays
process trees.

> None of my test machines come up with cgroupfs mounted, and any and all
> cgroup setup is under my control.
>
> > * If we *have* to have cgroup interface, I wonder whether this would fit a
> > lot better as a part of cpuset. If you squint just right, this can be
> > viewed as some dynamic form of cpuset. Implementation-wise, it probably
> > won't integrate with the rest but I think the feature will be less jarring
> as a part of cpuset, which already is a bit of a kitchen sink anyway.
>
> Not sure I agree, we do not change the affinity of things, we only
> control who's allowed to run concurrently on SMT siblings. There could
> be a cpuset partition split between the siblings and it would still work
> fine.

I see. Yeah, if we really need it, I'm not sure it fits in cgroup interface
proper. As I wrote elsewhere, these things are usually implemented on the
originating subsystem interface with cgroup ID as a parameter.

Thanks.

--
tejun

2021-04-07 09:15:42

by Peter Zijlstra

Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

On Sun, Apr 04, 2021 at 07:39:03PM -0400, Tejun Heo wrote:
> cc'ing Michal and Christian who've been spending some time on cgroup
> interface issues recently and Li Zefan for cpuset.
>
> On Thu, Apr 01, 2021 at 03:10:12PM +0200, Peter Zijlstra wrote:
> > The cgroup interface now uses a 'core_sched' file, which still takes 0,1. It is
> > however changed such that you can have nested tags. Then for any given task, the
> > first parent with a cookie is the effective one. The rationale is that this way
> > you can delegate subtrees and still allow them some control over grouping.
>
> I find it difficult to like the proposed interface from the name (the term
> "core" is really confusing given how the word tends to be used internally)
> to the semantics (it isn't like anything else) and even the functionality
> (we're gonna have fixed processors at some point, right?).

Core is the topological name for the thing that hosts the SMT threads.
Can't really help that.

> Here are some preliminary thoughts:
>
> * Are both prctl and cgroup based interfaces really necessary? I could be
> being naive but given that we're (hopefully) working around hardware
> deficiencies which will go away in time, I think there's a strong case for
> minimizing at least the interface to the bare minimum.

I'm not one for cgroups much, so I'll let others argue that case, except
that per systemd and all the other new fangled shit, people seem to use
cgroups a lot to group tasks. So it makes sense to also expose this
through cgroups in some form.

That said; I've had requests from lots of non security folks about this
feature to help mitigate the SMT interference.

Consider for example Real-Time. If you have an active SMT sibling, the
CPU performance is much less than it would be when the SMT sibling is
idle. Therefore, for the benefit of determinism, it would be very nice
if RT tasks could force-idle their SMT siblings, and voila, this
interface allows exactly that.

The same is true for other workloads that care about interference.

> Given how cgroups are set up (membership operations happening only for
> seeding, especially with the new clone interface), it isn't too difficult
> to synchronize process tree and cgroup hierarchy where it matters - ie.
> given the right per-process level interface, restricting configuration for
> a cgroup sub-hierarchy may not need any cgroup involvement at all. This
> also nicely gets rid of the interaction between prctl and cgroup bits.

I've no idea what you mean :/ The way I use cgroups (when I have to, for
testing) is to echo the pid into /cgroup/foo/tasks. No clone or anything
involved.

None of my test machines come up with cgroupfs mounted, and any and all
cgroup setup is under my control.

> * If we *have* to have cgroup interface, I wonder whether this would fit a
> lot better as a part of cpuset. If you squint just right, this can be
> viewed as some dynamic form of cpuset. Implementation-wise, it probably
> won't integrate with the rest but I think the feature will be less jarring
> as a part of cpuset, which already is a bit of a kitchen sink anyway.

Not sure I agree, we do not change the affinity of things, we only
control who's allowed to run concurrently on SMT siblings. There could
be a cpuset partition split between the siblings and it would still work
fine.

2021-04-07 21:37:43

by Michal Koutný

Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

Hello.

IIUC, the premise is that tasks that have different cookies never share
a core.

On Thu, Apr 01, 2021 at 03:10:12PM +0200, Peter Zijlstra wrote:
> The cgroup interface now uses a 'core_sched' file, which still takes 0,1. It is
> however changed such that you can have nested tags. Then for any given task, the
> first parent with a cookie is the effective one. The rationale is that this way
> you can delegate subtrees and still allow them some control over grouping.

Given the existence of prctl and clone APIs, I don't see the reason to
have a separate cgroup-bound interface too (as argued by Tejun). The
potential speciality is the possibility to re-tag whole groups of
processes at runtime (but the listed use cases [1] don't require that
and it's not such a good idea given its asynchronicity).

Also, I would find it useful to have some more explanation of how the
hierarchical behavior is supposed to work. In my understanding the task is
either
allowed to request its own isolation or not. The cgroups could be used
to restrict this privilege, however, that doesn't seem to be the case
here.

My two cents,
Michal

[1] https://lore.kernel.org/lkml/[email protected]/



2021-04-07 21:53:10

by Peter Zijlstra

Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

On Wed, Apr 07, 2021 at 06:50:32PM +0200, Michal Koutný wrote:
> Hello.
>
> IIUC, the premise is that tasks that have different cookies never share
> a core.

Correct.

> On Thu, Apr 01, 2021 at 03:10:12PM +0200, Peter Zijlstra wrote:
> > The cgroup interface now uses a 'core_sched' file, which still takes 0,1. It is
> > however changed such that you can have nested tags. Then for any given task, the
> > first parent with a cookie is the effective one. The rationale is that this way
> > you can delegate subtrees and still allow them some control over grouping.
>
> Given the existence of prctl and clone APIs, I don't see the reason to
> have a separate cgroup-bound interface too (as argued by Tejun).

IMO as long as cgroups have that tasks file, you get to support people
using it. That means that tasks joining your cgroup need to 'inherit'
cgroup properties.

That's not something covered by either prctl or clone.

> The potential speciality is the possibility to re-tag whole groups of
> processes at runtime (but the listed use cases [1] don't require that
> and it's not such a good idea given its asynchronicity).

That seems to be the implication of having that tasks file. Tasks can
join a cgroup, so you get to deal with that.

You can't just say, don't do that then.

> Also, I would find it useful to have some more explanation of how the
> hierarchical behavior is supposed to work. In my understanding the task is
> either
> allowed to request its own isolation or not. The cgroups could be used
> to restrict this privilege, however, that doesn't seem to be the case
> here.

Given something like:

        R
       / \
      A   B
         / \
        C   D

B group can set core_sched=1 and then all its (and its descendants') tasks
get to have the same (group) cookie and cannot share with others.

If however B is a delegate and has a subgroup D that is security
sensitive and must not share core resources with the rest of B, then it
can also set D.core_sched=1, such that D (and its descendants) will have
another (group) cookie.

On top of this, say C has a Real-Time task that wants to limit SMT
interference, then it can set a (task/prctl) cookie on itself, such that
it will not share the core with the rest of the tasks of B.


In that scenario the D subtree is a restriction (doesn't share) with the
B subtree.

And all of B is a restriction on all its tasks, including the Real-Time
task that set a task cookie, in that none of them can share with tasks
outside of B (including system tasks which are in R), irrespective of
what they do with their task cookie.

2021-04-07 21:53:30

by Peter Zijlstra

Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

On Tue, Apr 06, 2021 at 12:08:50PM -0400, Tejun Heo wrote:

> I see. Yeah, if we really need it, I'm not sure it fits in cgroup interface
> proper. As I wrote elsewhere, these things are usually implemented on the
> originating subsystem interface with cgroup ID as a parameter.

This would be something like:

prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, cgroup-fd, PIDTYPE_CGROUP, NULL);

right? Where we assign to self the cookie from the cgroup.

The problem I see with this is that a task can trivially undo/circumvent
this by calling PR_SCHED_CORE_CLEAR on itself, at which point it can
share with system tasks again.

Also, it doesn't really transfer well to the group/tasks thing. When a
task joins a cgroup, it doesn't automagically gain the cgroup
properties. Whoever does the transition will then also have to prctl()
this, which nobody will do.

2021-04-08 13:27:34

by Michal Koutný

Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

On Wed, Apr 07, 2021 at 08:34:24PM +0200, Peter Zijlstra <[email protected]> wrote:
> IMO as long as cgroups have that tasks file, you get to support people
> using it. That means that tasks joining your cgroup need to 'inherit'
> cgroup properties.
The tasks file is a consequence of binding this to cgroups; I'm taking one
step back. Why make "core isolation" a cgroup property?

(I understand this could help "visualize" what the common domains are if
cgroups were the only API but with prctl the structure can be
arbitrarily modified anyway.)


> Given something like:
>
>         R
>        / \
>       A   B
>          / \
>         C   D
Thanks for the example.

> B group can set core_sched=1 and then all its (and its descendants') tasks
> get to have the same (group) cookie and cannot share with others.
The same could be achieved with the first task of group B allocating its
new cookie, which would be inherited by its descendants.

> If however B is a delegate and has a subgroup D that is security
> sensitive and must not share core resources with the rest of B, then it
> can also set D.core_sched=1, such that D (and its descendants) will have
> another (group) cookie.
If there is such a sensitive descendant task, it could allocate a new
cookie (same way as the first one in B did).

> On top of this, say C has a Real-Time task that wants to limit SMT
> interference, then it can set a (task/prctl) cookie on itself, such that
> it will not share the core with the rest of the tasks of B.
(IIUC, in this particular example it'd be redundant if B had no inner
tasks since D isolated itself already.)
Yes, so this is again the same pattern as the tasks above have done.

> In that scenario the D subtree is a restriction (doesn't share) with the
> B subtree.
This implies D's isolation from everything else too, not just B's
members, no?

> And all of B is a restriction on all its tasks, including the Real-Time
> task that set a task cookie, in that none of them can share with tasks
> outside of B (including system tasks which are in R), irrespective of
> what they do with their task cookie.
IIUC, the equivalent restriction could be achieved with the PTRACE-like
check in the prctl API too (with respectively divided uids).

I'm curious whether the cgroup API actually simplifies things that are
possible with the clone/prctl API or allows anything that wouldn't be
otherwise possible.

Regards,
Michal



2021-04-08 16:51:01

by Peter Zijlstra

Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

On Thu, Apr 08, 2021 at 03:25:52PM +0200, Michal Koutný wrote:
> On Wed, Apr 07, 2021 at 08:34:24PM +0200, Peter Zijlstra <[email protected]> wrote:
> > IMO as long as cgroups have that tasks file, you get to support people
> > using it. That means that tasks joining your cgroup need to 'inherit'
> > cgroup properties.
> The tasks file is a consequence of binding this to cgroups; I'm taking one
> step back. Why make "core isolation" a cgroup property?

Yeah, dunno, people asked for it. I'm just proposing an implementation
that, when given the need, seems to make sense and is internally
consistent.

> (I understand this could help "visualize" what the common domains are if
> cgroups were the only API but with prctl the structure can be
> arbitrarily modified anyway.)
>
>
> > Given something like:
> >
> >         R
> >        / \
> >       A   B
> >          / \
> >         C   D
> Thanks for the example.
>
> > B group can set core_sched=1 and then all its (and its descendants') tasks
> > get to have the same (group) cookie and cannot share with others.
> The same could be achieved with the first task of group B allocating its
> new cookie, which would be inherited by its descendants.

Except then the task can CLEAR its own cookie and escape the constraint.

> > In that scenario the D subtree is a restriction (doesn't share) with the
> > B subtree.
> This implies D's isolation from everything else too, not just B's
> members, no?

Correct. Look at it as a constraint on co-scheduling: you can never,
whatever you do, share an SMT sibling with someone outside your subtree.

> > And all of B is a restriction on all its tasks, including the Real-Time
> > task that set a task cookie, in that none of them can share with tasks
> > outside of B (including system tasks which are in R), irrespective of
> > what they do with their task cookie.
> IIUC, the equivalent restriction could be achieved with the PTRACE-like
> check in the prctl API too (with respectively divided uids).

I'm not sure I understand; if tasks in A and B are of the same user,
then ptrace will not help anything. And per the above, you always have
ptrace on yourself so you can escape your constraint per the above.

> I'm curious whether the cgroup API actually simplifies things that are
> possible with the clone/prctl API or allows anything that wouldn't be
> otherwise possible.

With the cgroup API it is impossible for a task to escape the cgroup
constraint. It can never share a core with anything not in the subtree.

This is not possible with just the task interface.

If this is actually needed I've no clue, IMO all of cgroups is not
needed :-) Clearly other people feel differently about that.


Much of this would go away if CLEAR were not possible I suppose. But
IIRC the idea was to let a task isolate itself temporarily, while doing
some sensitive thing (eg. encrypt an email) but otherwise not be
constrained. But I'm not sure I can remember all the various things
people wanted this crud for :/

2021-04-09 00:17:43

by Josh Don

Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

On Thu, Apr 8, 2021 at 9:47 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Apr 08, 2021 at 03:25:52PM +0200, Michal Koutný wrote:
>
> > I'm curious whether the cgroup API actually simplifies things that are
> > possible with the clone/prctl API or allows anything that wouldn't be
> > otherwise possible.
>
> With the cgroup API it is impossible for a task to escape the cgroup
> constraint. It can never share a core with anything not in the subtree.
>
> This is not possible with just the task interface.
>
> If this is actually needed I've no clue, IMO all of cgroups is not
> needed :-) Clearly other people feel differently about that.

The cgroup interface seems very nice from a management perspective;
moving arbitrary tasks around in the cgroup hierarchy will handle the
necessary cookie adjustments to ensure that nothing shares with an
unexpected task. It also makes auditing the core scheduling groups
very straightforward; anything in a cookie'd cgroup's tasks file will
be guaranteed isolated from the rest of the system, period.

Further, if a system management thread wants two arbitrary tasks A and
B to share a cookie, this seems more painful with the PRCTL interface.
The management thread would need to do something like:
- PR_SCHED_CORE_CREATE
- PR_SCHED_CORE_SHARE_TO A
- PR_SCHED_CORE_SHARE_TO B
- PR_SCHED_CORE_CLEAR

2021-04-18 01:38:34

by Joel Fernandes

Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

On Tue, Apr 06, 2021 at 10:16:12AM -0400, Tejun Heo wrote:
> Hello,
>
> On Mon, Apr 05, 2021 at 02:46:09PM -0400, Joel Fernandes wrote:
> > Yeah, it's at http://lore.kernel.org/r/[email protected]
> > as mentioned above, let me know if you need any more details about
> > usecase.
>
> Except for the unspecified reason in usecase 4, I don't see why cgroup is in
> the picture at all. This doesn't really have much to do with hierarchical
> resource distribution. Besides, yes, you can use cgroup for logical
> structuring and identification purposes but in those cases the interactions
> and interface should be with the original subsystem while using cgroup IDs
> or paths as parameters - see tracing and bpf for examples.

Personally for ChromeOS, we need only the per-task interface. Considering
that the second argument of this prctl is a command, I don't see why we
cannot add a new command PR_SCHED_CORE_CGROUP_SHARE to do what Tejun is
saying (in the future).

In order to not block ChromeOS and other "per-task interface" usecases, I
suggest we keep the CGroup interface for a later time (whether that's
through prctl or the CGroups FS way which Tejun dislikes) and move forward
with per-task interface only initially.

Peter, any thoughts on this?

thanks,

- Joel

2021-04-19 11:19:00

by Peter Zijlstra

Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

On Sat, Apr 17, 2021 at 09:35:07PM -0400, Joel Fernandes wrote:
> On Tue, Apr 06, 2021 at 10:16:12AM -0400, Tejun Heo wrote:
> > Hello,
> >
> > On Mon, Apr 05, 2021 at 02:46:09PM -0400, Joel Fernandes wrote:
> > > Yeah, it's at http://lore.kernel.org/r/[email protected]
> > > as mentioned above, let me know if you need any more details about
> > > usecase.
> >
> > Except for the unspecified reason in usecase 4, I don't see why cgroup is in
> > the picture at all. This doesn't really have much to do with hierarchical
> > resource distribution. Besides, yes, you can use cgroup for logical
> > structuring and identification purposes but in those cases the interactions
> > and interface should be with the original subsystem while using cgroup IDs
> > or paths as parameters - see tracing and bpf for examples.
>
> Personally for ChromeOS, we need only the per-task interface. Considering
> that the second argument of this prctl is a command, I don't see why we
> cannot add a new command PR_SCHED_CORE_CGROUP_SHARE to do what Tejun is
> saying (in the future).
>
> In order to not block ChromeOS and other "per-task interface" usecases, I
> suggest we keep the CGroup interface for a later time (whether that's
> through prctl or the CGroups FS way which Tejun dislikes) and move forward
> with per-task interface only initially.

Josh, you being on the other Google team, the one that actually uses the
cgroup interface AFAIU, can you fight the good fight with TJ on this?

> Peter, any thoughts on this?

Adding CGROUP_SHARE is not sufficient to close the hole against CLEAR.
So we either then have to 'tweak' the meaning of CLEAR or replace it
entirely, neither seem attractive.


I'd love to make some progress on all this.

2021-04-19 13:03:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

Hello,

Sorry about late reply.

On Wed, Apr 07, 2021 at 08:34:24PM +0200, Peter Zijlstra wrote:
> > Given the existence of prctl and clone APIs, I don't see the reason to
> > have a separate cgroup-bound interface too (as argued by Tejun).
>
> IMO as long as cgroups have that tasks file, you get to support people
> using it. That means that tasks joining your cgroup need to 'inherit'

This argument doesn't really make sense to me. We don't just add things to
make interfaces orthogonal. It can be a consideration but not the only or
even one of the most important ones. There are many cases where we say not to
shoot oneself in the foot, and also many interfaces which are fading away or
in the process of being deprecated.

I'm not planning to deprecate the dynamic migration interfaces given the
history and usefulness in testing but the usage model of cgroup2 is clearly
defined and documented in this regard - whether the initial population of
the cgroup happens through CLONE_INTO_CGROUP or migration, for resource
tracking and control purposes, cgroup2 does not support dynamic migration
with the exception of migrations within threaded domains.

Anything is a possibility but given how this requirement is intertwined with
achieving comprehensive resource control, a core goal of cgroup2, and
widely adopted by all the newfangled things as you put it, changing this
wouldn't be easy. Not just because some people including me are against it
but because there are inherent technical challenges and overheads to
supporting dynamic charge migration for stateful controllers and the
cost-benefit balance doesn't come out in favor.

So, the current "policy" is something like this:

* cgroupfs interface is for cgroup core features of organizing cgroups and
processes and configuring resource configurations which preferably follow
one of the control models defined in the doc.

* The hierarchical organization is semi-static in the sense that once a
cgroup is populated, tasks shouldn't be moved in or out of the cgroup with
the exception of threaded domains.

* Non-resource control usages can hook into cgroup for identification /
tagging purposes but should use the originating interface for
interactions.

This has been consistently applied over the years now. There of course can
be exceptions but I haven't seen anything outstanding in this round of
discussion so am pretty skeptical. The actual use cases don't seem to need
it, and the only argument for it is that it'd be nice to have, which involves
violating the usage model.

My suggestion is going ahead with the per-process interface with cgroup
extension in mind in case actual use cases arise. Also, when planning cgroup
integration, putting dynamic migration front and center likely isn't a good
idea.

Thanks.

--
tejun

2021-04-20 01:19:06

by Josh Don

[permalink] [raw]
Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

On Mon, Apr 19, 2021 at 2:01 AM Peter Zijlstra <[email protected]> wrote:
>
> Josh, you being on the other Google team, the one that actually uses the
> cgroup interface AFAIU, can you fight the good fight with TJ on this?

A bit of extra context is in
https://lore.kernel.org/lkml/CABk29NtTScu2HO7V9Di+Fh2gv8zu5xiC5iNRwPFCLhpD+DKP0A@mail.gmail.com.

On the management/auditing side, the cgroup interface gives a clear
indication of which tasks share a cookie. It is a bit less attractive
to add a prctl interface for enumerating this.

Also on the management side, I illustrated in the above message how a
thread would potentially group together other threads. One limitation
of the current prctl interface is that the share_{to, from} always
operates on the current thread. Granted we can work around this as
described, and also potentially extend the prctl interface to operate
on two tasks.

So I agree that the cgroup interface here isn't strictly necessary,
though it seems convenient. I will double-check with internal teams
that would be using the interface to see if there are any other
considerations I'm missing.

On Mon, Apr 19, 2021 at 4:30 AM Tejun Heo <[email protected]> wrote:
>
> My suggestion is going ahead with the per-process interface with cgroup
> extension on mind in case actual use cases arise. Also, when planning cgroup
> integration, putting dynamic migration front and center likely isn't a good
> idea.

Tasks would not be frequently moved around; I'd expect security
configuration to remain mostly static. Or maybe I'm misunderstanding
your emphasis here?


If you feel the above is not strong enough (ie. there should be a use
case not feasible with prctl), I'd support that we move forward with
the prctl stuff for now, since the cgroup interface is independent.

Thanks,
Josh

2021-04-21 18:11:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

On Mon, Apr 19, 2021 at 11:00:57AM +0200, Peter Zijlstra wrote:
> On Sat, Apr 17, 2021 at 09:35:07PM -0400, Joel Fernandes wrote:

> > Peter, any thoughts on this?
>
> Adding CGROUP_SHARE is not sufficient to close the hole against CLEAR.
> So we either then have to 'tweak' the meaning of CLEAR or replace it
> entirely, neither seem attractive.
>
>
> I'd love to make some progress on all this.

Can I comment out CLEAR so we can sort that out later? I suppose people
can still do temp cookies simply by using a temp task.

2021-04-21 19:15:03

by Chris Hyser

[permalink] [raw]
Subject: Re: [PATCH 0/9] sched: Core scheduling interfaces

On 4/21/21 9:35 AM, Peter Zijlstra wrote:
> On Mon, Apr 19, 2021 at 11:00:57AM +0200, Peter Zijlstra wrote:
>> On Sat, Apr 17, 2021 at 09:35:07PM -0400, Joel Fernandes wrote:
>
>>> Peter, any thoughts on this?
>>
>> Adding CGROUP_SHARE is not sufficient to close the hole against CLEAR.
>> So we either then have to 'tweak' the meaning of CLEAR or replace it
>> entirely, neither seem attractive.
>>
>>
>> I'd love to make some progress on all this.
>
> Can I comment out CLEAR so we can sort that out later? I suppose people

I merely added CLEAR for completeness. Ultimately, I think having to kill a
process because a cookie got set by mistake is bad, but it can absolutely be
sorted out later.

-chrish