Hello!
Core-scheduling aims to make it safe for more than one task that trust
each other to share hyperthreads within a CPU core [1]. This results in a
performance improvement for workloads that can benefit from hyperthreading,
while limiting core-sharing when it is not safe.
Currently, no universally agreed-upon interface exists, and companies have
been hacking up their own interfaces to make use of the patches. This post
aims to list use cases I gathered after talking to various people at Google
and Oracle; actual development of code to add interfaces can then follow.
The text below uses the terms cookie and tag interchangeably. Further, a
cookie of 0 is assumed to indicate a trusted process - such as kernel
threads or system daemons. By default, if nothing is tagged then everything
is considered trusted, since the scheduler assumes all tasks are a match
for each other.
Usecase 1: Google's cloud group tags CGroups with a 32-bit integer. This
int32 is split into 2 parts: the color and the id. The color can only be
set by privileged processes, while the id can be set by anyone. The CGroup
structure looks like:
      A             B
     / \          / | \
    C   D        E  F  G
Here, A and B are container CGroups for 2 jobs and are assigned a color by
a privileged daemon. The job itself has more sub-CGroups within (for
example, B has E, F and G). When these sub-CGroups are spawned, they
inherit the color from the parent. An unprivileged user can then set an id
for a sub-CGroup, without the knowledge of the privileged daemon, if it
desires to add further isolation. Setting the id can be an unprivileged
operation because the root daemon has already isolated A and B.
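For illustration, the 32-bit tag could be laid out as below; the exact bit
split between color and id is not specified here, so the field widths are
only an assumption:

    #include <stdint.h>

    /* Illustrative layout only -- the actual split is not specified above. */
    struct cloud_core_tag {
            uint32_t color : 8;     /* set only by the privileged daemon */
            uint32_t id    : 24;    /* set by unprivileged users */
    };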
Usecase 2: Chrome browser - tagging renderers. In Chrome, each opened tab
spawns a renderer. A renderer is a sandboxed process and it is assumed it
could run arbitrary code (JavaScript, etc.). When a renderer is created, a
prctl call is made to tag the renderer. Every thread spawned by the
renderer is also tagged. Essentially this turns SMT off for the renderer,
but still gives a performance boost due to privileged system threads being
able to share a core. The tagging also forbids the renderer from sharing a
core with privileged system processes. In the future, we plan to allow such
threads to share a core as well (especially once syscall isolation is
upstreamed; patches were posted recently for this [2]).
Usecase 3: ChromeOS VMs - each vCPU thread created by the VMM is tagged,
thus disallowing core-sharing between the vCPU thread and any other thread
on the system. This is because such VMs may run arbitrary user code and
attack both the guest and the host systems sharing the core.
Usecase 4: Oracle - setting a sub-CGroup as trusted (cookie 0). Chris Hyser
mentioned to me on IRC that in a CGroup hierarchy, some CGroups should be
allowed to not share their parent CGroup's tag. In fact, it should be
possible to untag a child CGroup if needed, thus allowing it to share a
core with trusted tasks. Others have had similar requirements.
Proposal for tagging
--------------------
We have to support both CGroup and non-CGroup users. CGroup may be overkill
for some, and the CGroup v2 unified hierarchy may be too inflexible.
Regardless, we must support CGroup due to its ease of use and existing
users.
For Usecase #1
--------------
Usecase #1 requires a 2-level tagging mechanism. I propose adding 2 new
files to the CPU controller:
- tag : a boolean (0/1). If set, this CGroup and all sub-CGroups will be
tagged. (In the kernel, the cookie will be derived from the pointer value
of a ref-counted cookie object.) If cleared, the CGroup will inherit the
parent CGroup's cookie, if there is one.
- color : The ref-counted object will be aligned, say, to a 256-byte
boundary, so the lower 8 bits of the pointer can be used to specify the
color. Together, the pointer and the color form the cookie used by the
scheduler.
Note that if 2 CGroups belong to 2 different tagged hierarchies, setting
their color to the same value does not imply that the 2 groups will share a
core. This is key. Also, to support usecase #4, we could add a third tag
value -- 2, along with the usual 0 and 1 -- to indicate that the CGroup can
share a core with cookie-0 tasks (Chris Hyser, feel free to add any more
comments here).
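To make the cookie derivation concrete, here is a minimal kernel-side
sketch, assuming a hypothetical struct core_cookie; the 256-byte alignment
guarantees the low 8 bits of the object's address are zero, so they can
carry the color:

    #include <linux/refcount.h>

    /* Hypothetical object; name and layout are illustrative only. */
    struct core_cookie {
            refcount_t refcount;
            /* ... */
    } __attribute__((aligned(256)));

    static inline unsigned long make_cookie(struct core_cookie *obj,
                                            unsigned int color)
    {
            /* Low 8 bits of obj are zero due to alignment; OR in the color. */
            return (unsigned long)obj | (color & 0xffUL);
    }

Two CGroups in different tagged hierarchies use different cookie objects,
so even an identical color yields distinct cookies -- which is why equal
colors alone do not imply core sharing.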
For Usecase #2
--------------
We could add an interface that Peter suggested where 2 PIDs, A and B, want
to share a core. If A wants to share a core with B, it issues
prctl(SET_CORE_SHARE, B). ptrace_may_access() can be used to restrict
access. For renderers, though, we likely want to allow a renderer to share
a core exclusively with threads within that renderer and no one else. To
support this, renderer A could simply issue prctl(SET_CORE_SHARE, A).
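A sketch of what the renderer-side call could look like; note that
SET_CORE_SHARE is only a proposed prctl command, so the constant below is a
placeholder, not an existing ABI value:

    #include <sys/prctl.h>
    #include <unistd.h>

    #define SET_CORE_SHARE 59  /* placeholder; proposed, not an existing prctl */

    /* Renderer isolates itself: share a core only with its own threads. */
    static int isolate_renderer(void)
    {
            return prctl(SET_CORE_SHARE, getpid(), 0, 0, 0);
    }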
For Usecase #3
--------------
By default, all threads within a process will share a core. This makes the
most sense because threads in a process share the same virtual address
space. However, for virtual machines in ChromeOS, we would like vCPU
threads not to share a core with other vCPU threads, as mentioned above. To
support this, when a vCPU thread is forked, a new clone flag -
CLONE_NEW_CORE_TAG - could be introduced to cause the forked thread to not
share a core with its parent. This could also support usecase #2 in the
future (instead of prctl, a newly forked renderer can simply be passed
CLONE_NEW_CORE_TAG, which will tag the forked process or thread even if the
forking process is not tagged).
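A sketch of how a VMM might spawn such a vCPU thread, assuming the proposed
(not yet existing) CLONE_NEW_CORE_TAG flag; the flag value below is a
placeholder:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdlib.h>
    #include <sys/types.h>

    #define CLONE_NEW_CORE_TAG 0x80000000  /* placeholder bit, proposed only */

    static int vcpu_main(void *arg)
    {
            /* ... run the vCPU loop ... */
            return 0;
    }

    static pid_t spawn_vcpu(void)
    {
            const size_t stack_sz = 1 << 20;
            char *stack = malloc(stack_sz);

            if (!stack)
                    return -1;

            /* Thread-style clone, plus the proposed flag so this vCPU
             * gets its own tag and never core-shares with its parent. */
            return clone(vcpu_main, stack + stack_sz,
                         CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                         CLONE_THREAD | CLONE_NEW_CORE_TAG, NULL);
    }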
Other considerations:
- To share a core anyway even if tags don't match: If we assume that the
only purpose of core-scheduling is to enforce security, then if the kernel
knows that the CPUs are not vulnerable, cores can be shared anyway, whether
the tasks are tagged or not (suggested by PeterZ).
- Addition of a new CGroup controller: Instead of the CPU controller, it
may be better to add a new CGroup controller, in case the CPU controller is
not attached to some parts of the hierarchy where the CGroup interface is
still desired for core tagging.
- Co-existence of CGroup with prctl/clone: The prctl/clone tagging should
always override CGroup tagging. For this purpose, I propose a new
'tasks_no_cg_tag' (or similarly named) file in the CGroup controller. This
file will list all tasks that do not associate with the CGroup's tag. NOTE:
I am not sure yet how this new file will work with prctl/clone-tagging of
individual threads in a non-thread-mode CGroup v2 usage.
- Differences in tagging of a forked task (!CLONE_THREAD): If a process
that is part of a CGroup is forked, the child process is automatically
added to that CGroup. If such a CGroup was tagged before, then the child is
automatically tagged. However, it may be desirable to give the child its
own tag. In this case too, the earlier CLONE_NEW_CORE_TAG flag can be used
to achieve this behavior. If the forking process was not part of a CGroup
but got a tag through other means before, then by default a !CLONE_THREAD
fork would imply CLONE_NEW_CORE_TAG. However, to turn this off, a
CLONE_CORE_TAG flag can be added (the forking process's tag will be
inherited by the child).
Let me know your thoughts; looking forward to a good LPC MC discussion!
thanks,
- Joel
[1] https://lwn.net/Articles/780703/
[2] https://lwn.net/Articles/828889/
> Let me know your thoughts; looking forward to a good LPC MC discussion!
>
Nice write-up Joel, thanks for taking the time to compile this in such
great detail!
After going through the details of the interface proposal using cgroup v2
controllers, and based on our discussion offline, I would like to note down
this idea about a new pseudo-filesystem interface for core scheduling. We
could include this as well in the API discussion during the core scheduler
MC.
coreschedfs: pseudo filesystem interface for Core Scheduling
-------------------------------------------------------------
The basic requirement of core scheduling is simple - we need to group a set
of tasks into a trust group that can share a core. So we don't really need
a nested hierarchy for the trust groups. Cgroup v2 follows a unified nested
hierarchy model that causes considerable confusion if the trusted tasks are
at different levels of the hierarchy and we need to allow them to share a
core. Cgroup v2's single-hierarchy model makes it difficult to regroup
tasks at different levels of nesting for core scheduling. As noted in this
mail, we could use a multi-file approach and other interfaces like prctl to
overcome this limitation.
The idea proposed here to overcome the above limitation is to come up with
a new pseudo filesystem - "coreschedfs". This is basically a flat
filesystem with a maximum nesting level of 1. That means the root directory
can have sub-directories for sub-groups, but those sub-directories cannot
have further sub-directories representing trust groups. The root directory
represents the system-wide trust group and the sub-directories represent
trusted groups. Each directory, including the root directory, has the
following set of files/directories:
- cookie_id: User-exposed id for a cookie. This can be compared to a file
descriptor. It could be used in a programmatic API to join/leave a group.

- properties: An interface to specify how child tasks of this group should
behave. It can also be used for specifying future flag requirements.
The current list of properties includes:
    NEW_COOKIE_FOR_CHILD: every fork() by tasks in this group results in
        the creation of a new trust group
    SAME_COOKIE_FOR_CHILD: every fork() by tasks in this group ends up in
        this same group
    ROOT_COOKIE_FOR_CHILD: every fork() by tasks in this group goes to the
        root group

- tasks: Lists the tasks in this group. The main interface for adding and
removing tasks in a group.

- <pid>: A directory per task that is a member of this trust group.

- <pid>/properties: The same as the parent properties file, but used to
override the group setting.
This pseudo filesystem can be mounted anywhere in the root filesystem; I
propose the default to be "/sys/kernel/coresched".

When coresched is enabled, the kernel internally creates the framework for
this filesystem. The filesystem gets mounted at the default location, and
the admin can change this if needed. All tasks are in the root group by
default. The admin or programs can then create trusted groups on top of
this filesystem.
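For example, a manager process might set up a trust group as sketched
below, assuming the layout proposed above (none of these paths or property
strings exist today):

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    static int write_str(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (!f)
                    return -1;
            fprintf(f, "%s\n", val);
            return fclose(f);
    }

    int make_trust_group(pid_t pid)
    {
            char buf[16];

            /* New sub-directory == new trust group. */
            if (mkdir("/sys/kernel/coresched/group1", 0755))
                    return -1;

            /* Move the task in, then make its forks get fresh cookies. */
            snprintf(buf, sizeof(buf), "%d", pid);
            write_str("/sys/kernel/coresched/group1/tasks", buf);
            return write_str("/sys/kernel/coresched/group1/properties",
                             "NEW_COOKIE_FOR_CHILD");
    }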
Hooks will be placed in fork() and exit() to make sure that the
filesystem's view of tasks is up-to-date with the system. APIs manipulating
core scheduling trusted groups should also make sure that the filesystem's
view is updated.
Note: The above idea is very similar to cgroup v1. Since there is no
unified hierarchy in cgroup v1, most of the features of coreschedfs could
be implemented as a cgroup v1 controller. But as no new v1 controllers are
allowed, I feel the best alternative for a simple API is to come up with a
new filesystem - coreschedfs.
The advantages of this approach are:
- It is detached from the cgroup unified hierarchy, so the very simple
requirement of core scheduling can be easily materialized.
- Admins can have fine-grained control of groups using the shell and
scripting.
- Programmatic access is possible using existing APIs like mkdir, rmdir,
write and read. Or we can come up with new APIs using the cookie_id, which
can wrap the above Linux APIs, or use a new system call for core
scheduling.
- Fine-grained permission control using Linux filesystem permissions and
ACLs.
Disadvantages are:
- Yet another pseudo filesystem.
- It is very similar to cgroup v1 and might be re-implementing features
that are already provided by cgroup v1.
Use Cases
---------

Usecase 1: Google cloud
-----------------------

Since we no longer depend on cgroup v2 hierarchies, there will not be any
issue of nesting and sharing. The main daemon can create trusted groups in
the filesystem and provide the required permissions for each group. Then
the init process for each job can be added to its respective group so it
can create child tasks as needed. Multiple jobs under the same customer
that need to share a core can be housed in one group.
Usecase 2: Chrome browser
-------------------------

We start with one group for the first task and then set its properties to
NEW_COOKIE_FOR_CHILD.

Usecase 3: Chrome VMs
---------------------

Similar to the Chrome browser, the VM task can put each vCPU in its own
group.
Usecase 4: Oracle use case
--------------------------

This is also similar to use case 1 with this interface. All tasks that need
to be in the root group can be easily added by the admin.

Use case 5: General virtualization
----------------------------------

The requirement is that each VM should be isolated. This can easily be done
by creating a new group per VM.

Please have a look at the above proposal and let us know your thoughts. We
shall include this as well during the interface discussion at the core
scheduling MC.
Thanks,
Vineeth
On Fri, Aug 21, 2020 at 8:01 PM Joel Fernandes <[email protected]> wrote:
>
> [...]
>
> Usecase 4: Oracle - setting a sub-CGroup as trusted (cookie 0). Chris Hyser
> mentioned to me on IRC that in a CGroup hierarchy, some CGroups should be
> allowed to not share their parent CGroup's tag. In fact, it should be
> possible to untag a child CGroup if needed, thus allowing it to share a
> core with trusted tasks. Others have had similar requirements.
>
Just to augment this. This doesn't necessarily need to be cgroup
based. We do have a need where certain processes want to be tagged
separately from others that are in the same cgroup hierarchy. The
standard mechanism for this is nested cgroups. With a unified
hierarchy, and with cgroup tagging, I am unsure what this really
means. Consider
root
|- A
   |- A1
   |- A2
If A is tagged, can processes in A1 and A2 share a core? Should they
share a core? In some cases we might be OK with them sharing cores
just to get some of the performance back. Core scheduling really needs
to be limited to just the processes that we want to protect.
> [...]
>
> Note that if 2 CGroups belong to 2 different tagged hierarchies, setting
> their color to the same value does not imply that the 2 groups will share
> a core. This is key.
Why? If I tag them the same, the expectation is that they can share a core.
As you said, the colour is set by a privileged process.
> Also, to support usecase #4, we could add a third tag
> value -- 2, along with the usual 0 and 1 -- to indicate that the CGroup
> can share a core with cookie-0 tasks (Chris Hyser, feel free to add any
> more comments here).
Would this not be the same as colour = null or something like that?
>
> [...]
>
> Other considerations:
> - To share a core anyway even if tags don't match: If we assume that the
> only purpose of core-scheduling is to enforce security, then if the kernel
> knows that the CPUs are not vulnerable, cores can be shared anyway,
> whether the tasks are tagged or not (suggested by PeterZ).
>
Who knows which CPUs are vulnerable or not? Is there a way to have a
"paranoid" mode, where you don't trust the "vulnerable" list?
> - Addition of a new CGroup controller: Instead of the CPU controller, it
> may be better to add a new CGroup controller, in case the CPU controller
> is not attached to some parts of the hierarchy where the CGroup interface
> is still desired for core tagging.
>
I agree with this. However, how this will work in a unified cgroup v2
hierarchy is an open question for me. There are lots of questions
about what happens with various interactions, specifically as we start
using nested cgroups. (Goes to take a Tylenol to make head stop
hurting.)
> - Co-existence of CGroup with prctl/clone: The prctl/clone tagging should
> always override CGroup tagging. For this purpose, I propose a new
> 'tasks_no_cg_tag' (or similarly named) file in the CGroup controller. This
> file will list all tasks that do not associate with the CGroup's tag.
> NOTE: I am not sure yet how this new file will work with
> prctl/clone-tagging of individual threads in a non-thread-mode CGroup v2
> usage.
>
Umm. I think this should follow the same semantics as for cpuset /
taskset. cpuset is the group mechanism and taskset is the individual
thread mechanism. If I were to reset it in the cgroup later on, it
would be very confusing as to why a task did not get coloured the
right way.
> - Differences in tagging of a forked task (!CLONE_THREAD): If a process
> that is part of a CGroup is forked, the child process is automatically
> added to that CGroup. If such a CGroup was tagged before, then the child
> is automatically tagged. However, it may be desirable to give the child
> its own tag. In this case too, the earlier CLONE_NEW_CORE_TAG flag can be
> used to achieve this behavior. If the forking process was not part of a
> CGroup but got a tag through other means before, then by default a
> !CLONE_THREAD fork would imply CLONE_NEW_CORE_TAG. However, to turn this
> off, a CLONE_CORE_TAG flag can be added (the forking process's tag will be
> inherited by the child).
>
I feel like the cgroup exceptions can be punted to another day. I have
no objections to the forking mechanism outside cgroup though.
Dhaval
On Mon, Aug 24, 2020 at 4:32 AM Vineeth Pillai <[email protected]> wrote:
>
> [...]
>
> Please have a look at the above proposal and let us know your thoughts. We
> shall include this as well during the interface discussion at the core
> scheduling MC.
>
I am inclined to say no to this. Yet another FS interface :-(. We are
just reinventing the wheel here. Let's try to stick with cgroupfs
first and see if we can make it work there.
Dhaval
On 8/21/20 11:01 PM, Joel Fernandes wrote:
> [...]
>
> For Usecase #1
> --------------
> Usecase #1 requires a 2-level tagging mechanism. I propose adding 2 new
> files to the CPU controller:
> - tag : a boolean (0/1). If set, this CGroup and all sub-CGroups will be
> tagged. (In the kernel, the cookie will be derived from the pointer value
> of a ref-counted cookie object.) If cleared, the CGroup will inherit the
> parent CGroup's cookie, if there is one.
>
> - color : The ref-counted object will be aligned, say, to a 256-byte
> boundary, so the lower 8 bits of the pointer can be used to specify the
> color. Together, the pointer and the color form the cookie used by the
> scheduler.
>
> Note that if 2 CGroups belong to 2 different tagged hierarchies, setting
> their color to the same value does not imply that the 2 groups will share
> a core. This is key. Also, to support usecase #4, we could add a third tag
> value -- 2, along with the usual 0 and 1 -- to indicate that the CGroup
> can share a core with cookie-0 tasks (Chris Hyser, feel free to add any
> more comments here).
Let me think about this. This looks like it would support delegation of a
cgroup subtree, which I suppose containers are going to want eventually.
That seems to be the advantage over just allowing setting the entire
cookie. Anyway, I look forward to tomorrow, and thanks for putting this
together.
-chrish
Hey Dhaval,
On Mon, Aug 24, 2020 at 3:50 PM Dhaval Giani <[email protected]> wrote:
>
> On Fri, Aug 21, 2020 at 8:01 PM Joel Fernandes <[email protected]> wrote:
> >
> > [...]
> >
>
> Just to augment this. This doesn't necessarily need to be cgroup
> based. We do have a need where certain processes want to be tagged
> separately from others that are in the same cgroup hierarchy. The
> standard mechanism for this is nested cgroups. With a unified
> hierarchy, and with cgroup tagging, I am unsure what this really
> means. Consider
>
> root
> |- A
>    |- A1
>    |- A2
>
> If A is tagged, can processes in A1 and A2 share a core? Should they
> share a core? In some cases we might be OK with them sharing cores
> just to get some of the performance back. Core scheduling really needs
> to be limited to just the processes that we want to protect.
Yeah, this is exactly why Vineeth was suggesting a separate FS without
nested hierarchies. The CGroup v2 unified hierarchy may be too restrictive
for picking and choosing which nested or non-nested sub-hierarchies should
share a core if the root of any (sub)hierarchy is tagged. As mentioned in
the CGroup v2 documentation, someone thought it was a good idea to kill
CGroup v1's non-unified flexibility, so here we are. Don't get me wrong, it
has advantages, but the lack of flexibility results in issues like these.
BTW, there are other usecases at my employer where CGroup v2 has proven to
be inflexible.

The other option is a new system call to share a core between pid A and
pid B, with which the user can create arbitrary relationships as they
choose.
>
> > [...]
> >
> > Note that if 2 CGroups belong to 2 different tagged hierarchies, setting
> > their color to the same value does not imply that the 2 groups will
> > share a core. This is key.
>
> Why? If I tag them the same, the expectation is that they can share a
> core. As you said, the colour is set by a privileged process.
The color can be used along with 'tag' to provide a second level of
isolation. So, for example, a container can decide to further create
isolation within itself, while the system daemon that created the container
need not worry, since the container is already isolated from other
containers owing to the first level of isolation via the tag file.
> > Also, to support usecase #4, we could add a third tag
> > value -- 2, along with the usual 0 and 1 -- to indicate that the CGroup
> > can share a core with cookie-0 tasks (Chris Hyser, feel free to add any
> > more comments here).
>
> Would this not be the same as colour = null or something like that?
That could be done.
> > [...]
> >
> > Other considerations:
> > - To share a core anyway even if tags don't match: If we assume that
> > the only purpose of core-scheduling is to enforce security, then if the
> > kernel knows that the CPUs are not vulnerable, cores can be shared
> > anyway, whether the tasks are tagged or not (suggested by PeterZ).
> >
>
> Who knows which CPUs are vulnerable or not? Is there a way to have a
> "paranoid" mode, where you don't trust the "vulnerable" list?
That's up to the arch, right? I haven't dug too much into how this works,
but I know sysfs shows whether a CPU has certain vulnerabilities, and
whether microcode is present to do flushing.
> > - Addition of a new CGroup controller: Instead of the CPU controller,
> > it may be better to add a new CGroup controller, in case the CPU
> > controller is not attached to some parts of the hierarchy where the
> > CGroup interface is still desired for core tagging.
> >
>
> I agree with this. However, how this will work in a unified cgroup v2
> hierarchy is an open question for me. There are lots of questions
> about what happens with various interactions, specifically as we start
> using nested cgroups. (Goes to take a Tylenol to make head stop
> hurting.)
Exactly. Consider my reply to the first paragraph as repeated here.
> > - Co-existence of CGroup with prctl/clone: The prctl/clone tagging
> > should always override CGroup tagging. For this purpose, I propose a
> > new 'tasks_no_cg_tag' (or similarly named) file in the CGroup
> > controller. This file will list all tasks that do not associate with
> > the CGroup's tag. NOTE: I am not sure yet how this new file will work
> > with prctl/clone-tagging of individual threads in a non-thread-mode
> > CGroup v2 usage.
> >
>
> Umm. I think this should follow the same semantics as for cpuset /
> taskset. cpuset is the group mechanism and taskset is the individual
> thread mechanism. If I were to reset it in the cgroup later on, it
> would be very confusing as to why a task did not get coloured the
> right way.
For cpuset, other mechanisms like sched_setaffinity will obey the cpuset
rules. I am not sure that will make sense for tagging, though. Are you
saying the prctl should fail to tag something if the cgroup has already
tagged it?
> > - Differences in tagging of a forked task (!CLONE_THREAD): If a process
> > that is part of a CGroup is forked, the child process is automatically
> > added to that CGroup. If such a CGroup was tagged before, then the
> > child is automatically tagged. However, it may be desirable to give the
> > child its own tag. In this case too, the earlier CLONE_NEW_CORE_TAG
> > flag can be used to achieve this behavior. If the forking process was
> > not part of a CGroup but got a tag through other means before, then by
> > default a !CLONE_THREAD fork would imply CLONE_NEW_CORE_TAG. However,
> > to turn this off, a CLONE_CORE_TAG flag can be added (the forking
> > process's tag will be inherited by the child).
> >
>
> I feel like the cgroup exceptions can be punted to another day. I have
> no objections to the forking mechanism outside cgroup though.
Yeah, this part of my email was just about how forking of a task in a
cgroup behaves with respect to tagging. Since it gets a bit weird, I was
suggesting the use of the earlier-mentioned flags.
Thanks.