2014-10-16 18:44:23

by Shivappa Vikas

Subject: Cache Allocation Technology Design

Hi all, we have put together a draft design document for cache
allocation technology below. Please review it and let us know of any
feedback.

Please make sure you Cc [email protected] when replying.

Thanks,
Vikas

What is Cache Allocation Technology (CAT)
-------------------------------------------

Cache Allocation Technology provides a way for software (OS/VMM) to
restrict cache allocation to a defined 'subset' of the cache, which
may overlap with other 'subsets'. This feature takes effect when a
line is allocated in the cache, i.e. when new data is pulled into the
cache. The hardware is programmed via MSRs.

The different cache subsets are identified by a CLOS (Class of
Service) identifier, and each CLOS has a CBM (Cache Bit Mask). The CBM
is a contiguous set of bits which defines the amount of cache resource
available to each 'subset'.
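
As a hedged illustration of the underlying mechanism only (the
proposed kernel interface hides this behind cgroup files), the
equivalent programming can be done from userspace with the msr-tools
'wrmsr' utility. The IA32_L3_MASK_n MSRs start at 0xc90 per the SDM;
the CLOS numbers and mask values below are made-up examples:

# Sketch only (assumes msr-tools and CAT-capable hardware).
# IA32_L3_MASK_n lives at MSR 0xc90 + n; each holds the CBM for CLOS n.
wrmsr -a 0xc91 0x000f    # CLOS 1 -> least significant 4 bits of a 16-bit CBM
wrmsr -a 0xc92 0xff00    # CLOS 2 -> most significant 8 bits of a 16-bit CBM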

Why is CAT (cache allocation technology) needed
------------------------------------------------

CAT enables more cache resources to be made available to
higher-priority applications, based on guidance from the execution
environment.

The architecture also allows these subsets to be changed dynamically
at runtime, to further optimize the performance of the higher-priority
application with minimal degradation to lower-priority applications.
Additionally, resources can be rebalanced for system-wide throughput
benefit. (Refer to Section 17.15 in the Intel SDM:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf)

This technique may be useful in managing large computer systems with a
large LLC (last-level cache), for example large servers running
instances of web servers or database servers. In such complex systems,
these subsets allow more careful placement of the available cache
resources.

The CAT kernel patch provides a basic kernel framework for users to
implement such cache subsets.


Kernel implementation Overview
-------------------------------

The kernel implements a cgroup subsystem to support Cache Allocation.

Creating a CAT cgroup creates a new CLOS <-> CBM mapping. Each cgroup
has one CBM and represents exactly one cache 'subset'.

The user is allowed to create as many directories as there are CLOSs
defined by the hardware. If the user tries to create more than the
available CLOSs, -ENOSPC is returned. Currently only one level of
directory is supported, i.e. directories can be created only under the
root.
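
For instance, on hypothetical hardware exposing 4 CLOSs (the exact
count is hardware-dependent, and the root consumes one), the fourth
child mkdir would fail:

mkdir group1 group2 group3   # succeeds: consumes CLOS 1-3
mkdir group4                 # fails: mkdir: cannot create directory 'group4': No space left on device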

Two modes are supported:

1. Affinitized mode: Each CAT cgroup is affinitized to a set of CPUs
specified by the 'cpus' file. The tasks in the CAT cgroup are
constrained to run only on the CPUs in the 'cpus' file. The CPUs in
this file are used exclusively by this cgroup. Requests made by a task
via sched_setaffinity() are filtered through the task's 'cpus'.

These tasks get to fill the portion of the LLC represented by the
cgroup's 'cbm' file. 'cpus' is a cpumask and works the same way as
the existing cpumask data structure.

2. Non-affinitized mode: Each CAT cgroup (in turn, a 'subset') applies
to a group of tasks. There is no 'cpus' file, and the CPUs that the
tasks run on are not restricted by the CAT cgroup.


Assignment of CBM, CLOS and modes
---------------------------------

The root directory has all bits set in its 'cbm' file by default.

The cbm_max file in the root defines the maximum number of bits
describing the available cache units. For example, if cbm_max is 16,
the 'cbm' cannot have more than 16 bits set.

The 'affinitized' file is either 0 or 1, representing the two modes.
The system boots in affinitized mode with all bits set in every 'cbm',
meaning all CPUs have 100% of the cache (effectively, cache allocation
is not in effect).

The 'cbm' file is restricted to having no more than its cbm_max least
significant bits set. Any contiguous subset of these bits may be set
to indicate the desired cache mapping. The 'cbm' of two directories
may overlap. The 'cbm' represents the cache 'subset' of the CAT
cgroup. For example, on a system with a 16-bit maximum cbm, if a
directory has the least significant 4 bits set in its 'cbm' file, it
is allocated the right quarter of the last-level cache, meaning the
tasks belonging to this CAT cgroup can fill the right quarter of the
cache. If it has the most significant 8 bits set, it is allocated the
left half of the cache (8 bits out of 16 represent 50%).
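
For concreteness, a few example 'cbm' values on the hypothetical
16-bit system described above:

0x000f   bits 0-3 set    -> right quarter of the LLC (25%)
0x00ff   bits 0-7 set    -> right half (50%)
0xff00   bits 8-15 set   -> left half (50%)
0xffff   all 16 bits set -> the whole cache (the root default)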

In affinitized mode, the cache subset is affinitized to a set of CPUs.
The CPUs to which the allocation is affinitized are listed in the
'cpus' file. The 'cpus' of a directory must be mutually exclusive with
the 'cpus' of other directories.

The cache portion defined in the 'cbm' file is available to all tasks
within the CAT cgroup, and these tasks are not allowed to allocate
space in other parts of the cache.

The 'cbm' file is used in both modes, whereas the 'cpus' file is
relevant only in affinitized mode and disappears in non-affinitized
mode.


Scheduling and Context Switch
------------------------------

In affinitized mode, the cache 'subset' and the tasks in a CAT cgroup
are affinitized to the CPUs represented by the CAT cgroup's 'cpus'
file, i.e. when the user sets 'cbm' to a portion, 'cpus' to c, and
'tasks' to t, the tasks 't' are always scheduled on CPUs 'c' and get
to fill the allocated portion of the last-level cache.

As noted above, in affinitized mode the tasks in a CAT cgroup are also
affinitized to the CPUs in the directory's 'cpus' file. The following
kernel hooks are required to implement this (along the lines of the
cpuset code):
- in sched_setaffinity, to mask the requested cpumask with the task's
'cpus'
- in migrate_task, to migrate tasks only to CPUs in the 'cpus' file,
if possible
- in select_task_rq

In non-affinitized mode, 'affinitized' is 0 and the 'tasks' file
indicates the tasks the cache subset applies to. When the user adds
tasks to the tasks file, those tasks get to fill the cache subset
represented by the CAT cgroup's 'cbm' file.

During a context switch, the kernel implements this by writing the
corresponding CLOSid (maintained internally by the kernel) of the CAT
cgroup to the CPU's IA32_PQR_ASSOC MSR.
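
For illustration only, the write the kernel performs on a context
switch is roughly equivalent to this userspace sketch with msr-tools
(IA32_PQR_ASSOC is MSR 0xc8f and carries the CLOS in bits 32-63 per
the SDM; the CPU number and CLOSid below are made up):

# Sketch only: IA32_PQR_ASSOC (0xc8f), CLOS in bits 32-63, RMID in bits 0-9.
wrmsr -p 2 0xc8f 0x100000000   # on CPU 2, switch to CLOSid 1 (1 << 32)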

Usage and Example
-----------------


The following mounts the cache allocation cgroup subsystem and creates
two directories. Please refer to Documentation/cgroups/cgroups.txt for
details about how to use cgroups.

cd /sys/fs/cgroup
mkdir cachealloc
mount -t cgroup -ocachealloc cachealloc /sys/fs/cgroup/cachealloc
cd cachealloc

Create two CAT cgroups:

mkdir group1
mkdir group2

These are some of the files in the directory:

ls
cachealloc.cbm
cachealloc.cpus (the cpus file only appears in affinitized mode)
cgroup.procs
tasks
cbm_max (root only)
affinitized (root only; the default is affinitized mode)

Say the cache is 2MB and the cbm supports 16 bits; then the setting
below allocates the right quarter (512KB) of the cache to group2.

Edit the CBM for group2 to set the least significant 4 bits. This
allocates the right quarter of the cache:

cd group2
/bin/echo 0xf > cachealloc.cbm

Change the cpus of the directory:

/bin/echo 1-4 > cachealloc.cpus

Edit the CBM for group2 to set the least significant 8 bits. This
allocates the right half of the cache to group2:

/bin/echo 0xff > cachealloc.cbm

Assign tasks to group2:

/bin/echo PID1 > tasks
/bin/echo PID2 > tasks

Threads PID1 and PID2 now run on CPUs 1-4 and get to fill the right
half of the cache. Tasks PID1 and PID2 can only have a cpu affinity
that is a subset of the affinity defined in the 'cpus' file.

Set 'affinitized' to 0; the mode is changed in the root directory:

cd ..
/bin/echo 0 > cachealloc.affinitized

Now the tasks and the cache allocation are no longer affinitized to
the CPUs, and the tasks' cpu affinity is not restricted to a subset of
the 'cpus' cpumask.






2014-10-20 16:19:00

by Matt Fleming

Subject: Re: Cache Allocation Technology Design

(Cc'ing Peter Zijlstra for comments)

On Thu, 16 Oct, at 11:44:10AM, vikas wrote:
> [design document quoted in full; snipped]

--
Matt Fleming, Intel Open Source Technology Center

2014-10-24 10:53:12

by Peter Zijlstra

Subject: Re: Cache Allocation Technology Design

On Mon, Oct 20, 2014 at 05:18:55PM +0100, Matt Fleming wrote:
> > What is Cache Allocation Technology ( CAT )
> > -------------------------------------------

It's a horrible name is what it is; please consider using the old
name, that at least was clear in purpose.

> > Kernel implementation Overview
> > -------------------------------
> >
> > Kernel implements a cgroup subsystem to support Cache Allocation.
> >
> > Creating a CAT cgroup would create a new CLOS <-> CBM mapping. Each
> > cgroup would have one CBM and would just represent one cache 'subset'.
> >
> > The user would be allowed to create as many directories as there are
> > CLOSs defined by the h/w. If user tries to create more than the
> > available CLOSs , -ENOSPC is returned. Currently we support only one
> > level of directory, ie directory can be created only under the root.

NAK, cgroups must support full hierarchies, simply enforce that the
child cgroup's mask is a subset of the parent's.

> > There are 2 modes supported
> >
> > 1. Affinitized mode : Each CAT cgroup is affinitized to a set of CPUs
> > specified by the 'cpus' file. The tasks in the CAT cgroup would be
> > constrained only on the CPUs in the 'cpus' file. The CPUs in this file
> > are exclusively used for this cgroup. Requests by task
> > using the sched_setaffinity() would be filtered through the tasks
> > 'cpus'.

NAK, we will not have yet another cgroup mucking about with task
affinities.

> > These tasks would get to fill the LLC cache represented by the
> > cgroup's 'cbm' file. 'cpus' is a cpumask and works the same way as
> > the existing cpumask datastructure.
> >
> > 2. Non Affinitized mode : Each CAT cgroup(inturn 'subset') would be
> > for a group of tasks. There is no 'cpus' file and the CPUs that the
> > tasks run are not restricted by the CAT cgroup

It appears to me this 'mode' thing is entirely superfluous and can be
constructed by voluntary operation of this and cpusets or manual
affinity calls.

> > Assignment of CBM,CLOS and modes
> > ---------------------------------
> >
> > Root directory would have all bits in 'cbm' file by default.
> >
> > The cbm_max file in the root defines the maximum number of bits
> > describing the available cache units. Say if cbm_max is 16 then the
> > 'cbm' cannot have more than 16 bits.

This seems redundant, if you've already stated that the root cbm is the
full set, there is no need to further provide this.

> > The 'cbm' file is restricted to having no more than its cbm_max least
> > significant bits set. Any contiguous subset of these bits maybe set to
> > indication the cache mapping desired. The 'cbm' between 2 directories
> > can overlap. The 'cbm' would represent the cache 'subset' of the CAT
> > cgroup.

This would follow from the hierarchy requirement/conditions.

> > Scheduling and Context Switch
> > ------------------------------

> > In non-affinitized mode the 'affinitized' is 0 , and the 'tasks' file
> > indicate the tasks the cache subset is affinitized to. When user adds
> > tasks to the tasks file , the tasks would get to fill the cache subset
> > represented by the CAT cgroup's 'cbm' file.
> >
> > During context switch kernel implements this by writing the
> > corresponding CLOSid (internally maintained by kernel) of the CAT
> > cgroup to the CPU's IA32_PQR_ASSOC MSR.

Right.

2014-10-28 23:22:20

by Matt Fleming

Subject: Re: Cache Allocation Technology Design

On Fri, 24 Oct, at 12:53:06PM, Peter Zijlstra wrote:
>
> NAK, cgroups must support full hierarchies, simply enforce that the
> child cgroup's mask is a subset of the parent's.

For the specific case of Cache Allocation, if we're creating hierarchies
from bitmasks, there's a very clear limit to how we can divide up the
bits - we can't support an indefinite number of cgroup directories.

What do you mean by "full hierarchies"?

--
Matt Fleming, Intel Open Source Technology Center

2014-10-29 08:16:53

by Peter Zijlstra

Subject: Re: Cache Allocation Technology Design

On Tue, Oct 28, 2014 at 11:22:15PM +0000, Matt Fleming wrote:
> On Fri, 24 Oct, at 12:53:06PM, Peter Zijlstra wrote:
> >
> > NAK, cgroups must support full hierarchies, simply enforce that the
> > child cgroup's mask is a subset of the parent's.
>
> For the specific case of Cache Allocation, if we're creating hierarchies
> from bitmasks, there's a very clear limit to how we can divide up the
> bits - we can't support an indefinite number of cgroup directories.
>
> What do you mean by "full hierarchies"?

Ah, so one way around that is to only assign a (what's the CQE equivalent
of RMIDs again?) once you stick a task in.

But basically it means you need to allow things like:

root/virt/more/crap/hostA
                   /hostB
          /sanityA
     /random/other/yunk

Now, the root will have the entire bitmask set, any child, say
virt/more/crap can also have them all set, and you can maybe only start
differentiating in the /host[AB] bits.

Whether or not it makes sense, libvirt likes to create these pointless
deep hierarchies, as do a lot of other people for that matter.
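
A hedged sketch of what that could look like under the subset rule
(the paths, masks, and the 'cachealloc.cbm' file name are illustrative
assumptions, not part of the posted patch):

# Sketch: each child's cbm must be a subset of its parent's.
mkdir -p virt/more/crap/hostA virt/more/crap/hostB
/bin/echo 0xffff > virt/cachealloc.cbm                   # same as root; no differentiation yet
/bin/echo 0xffff > virt/more/crap/cachealloc.cbm         # still the full mask
/bin/echo 0x00ff > virt/more/crap/hostA/cachealloc.cbm   # start differentiating here
/bin/echo 0xff00 > virt/more/crap/hostB/cachealloc.cbm   # disjoint halves, both subsets of the parent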

2014-10-29 12:48:40

by Matt Fleming

Subject: Re: Cache Allocation Technology Design

On Wed, 29 Oct, at 09:16:40AM, Peter Zijlstra wrote:
>
> Ah, so one way around that is to only assign a (whats the CQE equivalent
> of RMIDs again?) once you stick a task in.

I think you're after "Class of Service" (CLOS) ID.

Yeah we can do the CLOS ID assignment on-demand but what we can't do
on-demand is the cache bitmask assignment, i.e. how we carve up the LLC.
These need to persist irrespective of which task is running. And it's
the cache bitmask that I'm specifically talking about not allowing
arbitrarily deep nesting.

So say I create a cgroup directory with a mask of 0x3 in the root
cgroup directory for CAT (meow), then create two sub-directories and
split my 0x3 bitmask into 0x2 and 0x1; it's impossible to nest any
further, i.e.

/sys/fs/cgroup/cacheqe      0xffffffff
           |
           |
         meow               0x3
         /  \
        /    \
     sub1    sub2           0x1, 0x2

Of course the pathological case is creating a cgroup directory with
bitmask 0x1, so you can't have sub-directories because you can't split
the cache allocation at all.

Does this fly in the face of "full hierarchies"? Or is this a reasonable
limitation?

> But basically it means you need to allow things like:
>
> root/virt/more/crap/hostA
> /hostB
> /sanityA
> /random/other/yunk
>
> Now, the root will have the entire bitmask set, any child, say
> virt/more/crap can also have them all set, and you can maybe only start
> differentiating in the /host[AB] bits.
>
> Whether or not it makes sense, libvirt likes to create these pointless
> deep hierarchies, as do a lot of other people for that matter.

OK, this is something I hadn't considered; that you may *not* want to
split the cache bitmask as you move down the hierarchy.

I think that's something we could do without too much pain, though
actually programming that from a user perspective makes my head hurt.

--
Matt Fleming, Intel Open Source Technology Center

2014-10-29 13:45:34

by Peter Zijlstra

Subject: Re: Cache Allocation Technology Design

On Wed, Oct 29, 2014 at 12:48:34PM +0000, Matt Fleming wrote:
> On Wed, 29 Oct, at 09:16:40AM, Peter Zijlstra wrote:
> >
> > Ah, so one way around that is to only assign a (whats the CQE equivalent
> > of RMIDs again?) once you stick a task in.
>
> I think you're after "Class of Service" (CLOS) ID.
>
> Yeah we can do the CLOS ID assignment on-demand but what we can't do
> on-demand is the cache bitmask assignment, i.e. how we carve up the LLC.
> These need to persist irrespective of which task is running. And it's
> the cache bitmask that I'm specifically talking about not allowing
> arbitrarly deep nesting.
>
> So if I create a cgroup directory with a mask of 0x3 in the root cgroup
> directory for CAT (meow).

All we now need is a DOG to go woof :-) and they can have a party.

> Then, create two sub-directories, and split my
> 0x3 bitmask into 0x2 and 0x1, it's impossible to nest any further, i.e.
>
> /sys/fs/cgroup/cacheqe 0xffffffff
> |
> |
> meow 0x3
> / \
> / \
> sub1 sub2 0x1, 0x2
>
> Of course the pathological case is creating a cgroup directory with
> bitmask 0x1, so you can't have sub-directories because you can't split
> the cache allocation at all.
>
> Does this fly in the face of "full hierarchies"? Or is this a reasonable
> limitation?

I don't see a reason why we should not allow further children of sub1,
they'll all have to have 0x1, but that should be fine, pointless
perhaps, but perfectly consistent.

> > But basically it means you need to allow things like:
> >
> > root/virt/more/crap/hostA
> > /hostB
> > /sanityA
> > /random/other/yunk
> >
> > Now, the root will have the entire bitmask set, any child, say
> > virt/more/crap can also have them all set, and you can maybe only start
> > differentiating in the /host[AB] bits.
> >
> > Whether or not it makes sense, libvirt likes to create these pointless
> > deep hierarchies, as do a lot of other people for that matter.
>
> OK, this is something I hadn't considered; that you may *not* want to
> split the cache bitmask as you move down the hierarchy.
>
> I think that's something we could do without too much pain, though
> actually programming that from a user perspective makes my head hurt.

Right, also note that in the libvirt case, most of the intermediate
groups are empty (of tasks) and would thus not actually instantiate a
CLOS thingy.

2014-10-29 16:33:16

by Auld, Will

Subject: RE: Cache Allocation Technology Design

I may be repeating what Peter has just said, but for elements in the hierarchy whose mask is the same as the parent's mask there is no need for a separate CLOS, even when there are tasks in the group. So we can inherit the parent's CLOS until both the mask differs from the parent's and there are tasks in the group.
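
(A hedged sketch of that inheritance rule, with illustrative names and
the 'cachealloc.cbm' file name assumed from the earlier usage example;
a new CLOS would only be needed at the last step:)

# Sketch: a child inherits the parent's CLOS while its mask is unchanged.
mkdir parent parent/child
/bin/echo 0xff > parent/cachealloc.cbm         # parent gets its own CLOS
/bin/echo PID > parent/child/tasks             # mask still equals parent's -> reuse parent's CLOS
/bin/echo 0x0f > parent/child/cachealloc.cbm   # mask differs AND tasks present -> allocate a CLOS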

Thanks,

Will

> [Peter's message quoted in full; snipped]

2014-10-29 17:26:18

by Shivappa Vikas

Subject: Re: Cache Allocation Technology Design



On Fri, 24 Oct 2014, Peter Zijlstra wrote:

> On Mon, Oct 20, 2014 at 05:18:55PM +0100, Matt Fleming wrote:
>>> What is Cache Allocation Technology ( CAT )
>>> -------------------------------------------
>
> Its a horrible name is what it is, please consider using the old name,
> that at least was clear in purpose.
>
>>> Kernel implementation Overview
>>> -------------------------------
>>>
>>> Kernel implements a cgroup subsystem to support Cache Allocation.
>>>
>>> Creating a CAT cgroup would create a new CLOS <-> CBM mapping. Each
>>> cgroup would have one CBM and would just represent one cache 'subset'.
>>>
>>> The user would be allowed to create as many directories as there are
>>> CLOSs defined by the h/w. If user tries to create more than the
>>> available CLOSs , -ENOSPC is returned. Currently we support only one
>>> level of directory, ie directory can be created only under the root.
>
> NAK, cgroups must support full hierarchies, simply enforce that the
> child cgroup's mask is a subset of the parent's.
>
>>> There are 2 modes supported
>>>
>>> 1. Affinitized mode : Each CAT cgroup is affinitized to a set of CPUs
>>> specified by the 'cpus' file. The tasks in the CAT cgroup would be
>>> constrained only on the CPUs in the 'cpus' file. The CPUs in this file
>>> are exclusively used for this cgroup. Requests by task
>>> using the sched_setaffinity() would be filtered through the tasks
>>> 'cpus'.
>
> NAK, we will not have yet another cgroup mucking about with task
> affinities.
>
>>> These tasks would get to fill the LLC cache represented by the
>>> cgroup's 'cbm' file. 'cpus' is a cpumask and works the same way as
>>> the existing cpumask datastructure.
>>>
>>> 2. Non Affinitized mode : Each CAT cgroup(inturn 'subset') would be
>>> for a group of tasks. There is no 'cpus' file and the CPUs that the
>>> tasks run are not restricted by the CAT cgroup
>
> It appears to me this 'mode' thing is entirely superfluous and can be
> constructed by voluntary operation of this and cpusets or manual
> affinity calls.

Do you mean the user would just use cpusets for cpu affinity and the
CAT cgroup for cache allocation, as shown in the example below?

In other words, affinitize PID1 and PID2 to CPUs 1 and 2 and then set
the desired cache allocation as below; then we have the desired cpu
affinity and cache allocation for these PIDs:

cd /sys/fs/cgroup/cpuset

mkdir group1_specialuse
cd group1_specialuse
/bin/echo 1-2 > cpuset.cpus
/bin/echo PID1 > tasks
/bin/echo PID2 > tasks

Now come to CAT and do the cache allocation for the same tasks PID1
and PID2.

cd /sys/fs/cgroup/cat (the CAT cgroup)

mkdir group1_specialuse (keeping the same name just for clarity)
cd group1_specialuse
/bin/echo 0xf > cat.cbm (set the cache bit mask)
/bin/echo PID1 > tasks
/bin/echo PID2 > tasks


> [remainder of quote snipped]

2014-10-29 17:28:50

by Peter Zijlstra

Subject: Re: Cache Allocation Technology Design

On Wed, Oct 29, 2014 at 04:32:04PM +0000, Auld, Will wrote:
> I maybe repeating what Peter has just said but for elements in the
> hierarchy where the mask is the same as its parents mask there is no
> need for a separate CLOS even in the case where there are tasks in the
> group. So we can inherit the CLOS of the parent until which time both
> the mask is different than the parent and there are tasks in the
> group.

I did not state that explicitly, but I did think about that. We could
still wait to allocate a CLOS until at least one such group acquires a
task.

2014-10-29 17:41:49

by Shivappa Vikas

Subject: Re: Cache Allocation Technology Design



On Wed, 29 Oct 2014, Peter Zijlstra wrote:

> On Wed, Oct 29, 2014 at 04:32:04PM +0000, Auld, Will wrote:
>> I maybe repeating what Peter has just said but for elements in the
>> hierarchy where the mask is the same as its parents mask there is no
>> need for a separate CLOS even in the case where there are tasks in the
>> group. So we can inherit the CLOS of the parent until which time both
>> the mask is different than the parent and there are tasks in the
>> group.
>
> I did not state that explicitly, but I did think about that. We could
> still wait to allocate a CLOS until at least one such group acquires a
> task.
>

I was wondering whether it is a requirement of the 'full hierarchy'
that the child inherit the cbm of the parent.
Alternatively, we could allocate the CLOSid when a cgroup is created
and have an empty cbm, but not let tasks be added until the user
assigns a cbm. Cpuset does something similar, where it is necessary to
set the cpu mask (empty by default) of a cgroup before adding tasks.

2014-10-29 18:16:59

by Peter Zijlstra

Subject: Re: Cache Allocation Technology Design

On Wed, Oct 29, 2014 at 10:26:16AM -0700, Vikas Shivappa wrote:
> >It appears to me this 'mode' thing is entirely superfluous and can be
> >constructed by voluntary operation of this and cpusets or manual
> >affinity calls.
>
> Do you mean the user would just use cpusets for cpu affinity and the CAT
> cgroup for cache allocation, as shown in the example below?
>
> In other words say affinitize the PID1 and PID2 to CPUs 1 and 2
> and then set the desired cache allocation as well like below - then we have
> the desired cpu affinity and cache allocation for these PIDs..
>
> cd /sys/fs/cgroup/cpuset
>
> mkdir group1_specialuse
> /bin/echo 1-2 > cpuset.cpus
> /bin/echo PID1 > tasks
> /bin/echo PID2 > tasks
>
> Now come to CAT and do the cache allocation for the same tasks PID1 and
> PID2.
>
> cd /sys/fs/cgroup/cat (CAT cgroup)
>
> mkdir group1_specialuse (keeping same name just for understanding)
> /bin/echo 0xf > cat.cbm (set the cache bit mask)
> /bin/echo PID1 > tasks
> /bin/echo PID2 > tasks
>

Yah, except I have a strong urge to mount cpusets under /dog when you
put it like that ;-)

Or co-mount cpusets and pets and do it that way.

2014-10-29 18:22:39

by Tejun Heo

Subject: Re: Cache Allocation Technology Design

On Wed, Oct 29, 2014 at 10:41:47AM -0700, Vikas Shivappa wrote:
> Was wondering if it is a requirement of the 'full hierarchy' for the child
> to inherit the cbm of parent ? .
> Alternately we can allocate the CLOSid when a cgroup is created and have an
> empty cbm - but dont let the tasks to be added unless the user assigns a

Please don't do that. All controllers must be fully hierarchical,
shouldn't fail task migration and always allow execution of member
tasks.

Thanks.

--
tejun

2014-10-30 07:07:36

by Peter Zijlstra

Subject: Re: Cache Allocation Technology Design

On Wed, Oct 29, 2014 at 02:22:34PM -0400, Tejun Heo wrote:
> On Wed, Oct 29, 2014 at 10:41:47AM -0700, Vikas Shivappa wrote:
> > Was wondering if it is a requirement of the 'full hierarchy' for the child
> > to inherit the cbm of parent ? .
> > Alternately we can allocate the CLOSid when a cgroup is created and have an
> > empty cbm - but dont let the tasks to be added unless the user assigns a
>
> Please don't do that. All controllers must be fully hierarchical,

With you so far.

> shouldn't fail task migration

If this means echo $tid > tasks, then sorry we can't do. There is a
limited number of hardware resources backing this thing. At some point
they're consumed and something must give.

So either we fail mkdir, but that means allocating CLOS IDs for possibly
empty cgroups, or we allocate on demand which means failing task
assignment.

The same -- albeit for a different reason -- is true of the RT sched
groups: we simply cannot instantiate them such that tasks can join;
sysadmins _have_ to configure them before we can add tasks to them.

> and always allow execution of member tasks.

If we accept tasks, they'll run.

2014-10-30 07:14:31

by Peter Zijlstra

Subject: Re: Cache Allocation Technology Design

On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote:
> > and always allow execution of member tasks.

This too, btw, is not strictly speaking possible for all controllers.
Almost all sched controllers live by the grace of forcing tasks not to
run at times (e.g. the bandwidth controls), falsifying the 'always'.

2014-10-30 12:43:44

by Tejun Heo

Subject: Re: Cache Allocation Technology Design

Hello, Peter.

On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote:
> If this means echo $tid > tasks, then sorry we can't do. There is a
> limited number of hardware resources backing this thing. At some point
> they're consumed and something must give.

And that something shouldn't be disallowing task migration across
cgroups. This simply doesn't work with co-mounting or unified
hierarchy. cpuset automatically takes on the nearest ancestor's
configuration which has enough execution resources. Maybe that can be
an option for this too?

One of the problems is that, in a lot of places, the kernel generally
assumes that a task can run at some point in time, and we can't just
not run a task indefinitely because it's in a cgroup configured a
certain way.

> So either we fail mkdir, but that means allocating CLOS IDs for possibly
> empty cgroups, or we allocate on demand which means failing task
> assignment.

Can't fail mkdir or css enabling either. Again, co-mounting and
unified hierarchy. Also, the behavior is just horrible to use from
userland.

> The same -- albeit for a different reason -- is true of the RT sched
> groups, we simply cannot instantiate them such that tasks can join,
> sysads _have_ to configure them before we can add tasks to them.

Yeah, RT is one of the main items which is problematic, more so
because it's currently coupled with the normal sched controller and
the default config doesn't have any RT slice. Do we completely block
RT tasks w/o a slice? Is that okay?

Thanks.

--
tejun

2014-10-30 12:44:46

by Tejun Heo

Subject: Re: Cache Allocation Technology Design

On Thu, Oct 30, 2014 at 08:14:24AM +0100, Peter Zijlstra wrote:
> On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote:
> > > and always allow execution of member tasks.
>
> This too btw is not strictly speaking possible for all controllers. Most
> all sched controllers live by the grace of forcing tasks not to run at
> times (eg. the bandwidth controls), falsifying the 'always'.

Oh sure, a task just has to run in the foreseeable future, or
rather, a task must not be blocked indefinitely requiring userland
intervention to become executable again.

Thanks.

--
tejun

2014-10-30 13:18:53

by Peter Zijlstra

Subject: Re: Cache Allocation Technology Design

On Thu, Oct 30, 2014 at 08:43:33AM -0400, Tejun Heo wrote:
> Hello, Peter.
>
> On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote:
> > If this means echo $tid > tasks, then sorry we can't do. There is a
> > limited number of hardware resources backing this thing. At some point
> > they're consumed and something must give.
>
> And that something shouldn't be disallowing task migration across
> cgroups. This simply doesn't work with co-mounting or unified
> hierarchy. cpuset automatically takes on the nearest ancestor's
> configuration which has enough execution resources. Maybe that can be
> an option for this too?

It will give very random and nondeterministic behaviour and basically
destroy the entire purpose of the controller (which are the very same
reasons I detest that 'new' behaviour in cpusets).

> One of the problems is that we generally assume that a task can run
> some point in time in a lot of places in the kernel and can't just not
> run a task indefinitely because it's in a cgroup configured certain
> way.

Refusing tasks into a previously empty cgroup creates no such problems.
It's already in a cgroup (wherever its parent was) and it can run
there; failing to move it to another does not affect things.

> > So either we fail mkdir, but that means allocating CLOS IDs for possibly
> > empty cgroups, or we allocate on demand which means failing task
> > assignment.
>
> Can't fail mkdir or css enabling either. Again, co-mounting and
> unified hierarchy. Also, the behavior is just horrible to use from
> userland.

In order to fix the co-mounting and unified hierarchy I still need to
hear a proposal for that tasks vs processes thing.

Traditionally the cgroups were task based, but many controllers are
process based (simply because what they control is process wide, not per
task), and there was talk (2-3 years ago or so) about making the entire
cgroup thing per process, which obviously fails for all scheduler
related cgroups.

> > The same -- albeit for a different reason -- is true of the RT sched
> > groups, we simply cannot instantiate them such that tasks can join,
> > sysads _have_ to configure them before we can add tasks to them.
>
> Yeah, RT is one of the main items which is problematic, more so
> because it's currently coupled with the normal sched controller and
> the default config doesn't have any RT slice.

Simply because you cannot give a slice on creation; or if you did that
would mean failing mkdir when a new cgroup would exceed the available
time.

Also any !0 slice is wrong because it will not match the requirements of
the proposed workload, the administrator will have to set it to match
the workload.

Therefore 0.

> Do we completely block RT task w/o slice? Is that okay?

We will not allow an RT task in, the write to the tasks file will fail.

The same will be true for deadline tasks, we'll fail entry into a cgroup
when the combined requirements of the tasks exceed the provisions of the
group.

There is just no way around that and still provide sane semantics.

2014-10-30 13:19:16

by Peter Zijlstra

Subject: Re: Cache Allocation Technology Design

On Thu, Oct 30, 2014 at 08:44:40AM -0400, Tejun Heo wrote:
> On Thu, Oct 30, 2014 at 08:14:24AM +0100, Peter Zijlstra wrote:
> > On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote:
> > > > and always allow execution of member tasks.
> >
> > This too btw is not strictly speaking possible for all controllers. Most
> > all sched controllers live by the grace of forcing tasks not to run at
> > times (eg. the bandwidth controls), falsifying the 'always'.
>
> Oh sure, the a task just has to run in a foreseeable future, or
> rather, a task must not be blocked indefinitely requiring userland
> intervention to become executable again.

Like the freezer cgroup you mean? ;-)

2014-10-30 14:20:18

by Matt Fleming

Subject: Re: Cache Allocation Technology Design

On Thu, 30 Oct, at 08:43:33AM, Tejun Heo wrote:
> Hello, Peter.
>
> On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote:
> > If this means echo $tid > tasks, then sorry we can't do. There is a
> > limited number of hardware resources backing this thing. At some point
> > they're consumed and something must give.
>
> And that something shouldn't be disallowing task migration across
> cgroups. This simply doesn't work with co-mounting or unified
> hierarchy. cpuset automatically takes on the nearest ancestor's
> configuration which has enough execution resources. Maybe that can be
> an option for this too?

Oh, you can always add more tasks to a cgroup, or move tasks between
cgroups. What you can't always do is create more cgroups.

--
Matt Fleming, Intel Open Source Technology Center

2014-10-30 15:25:08

by Tejun Heo

Subject: Re: Cache Allocation Technology Design

Hello, Peter.

On Thu, Oct 30, 2014 at 02:19:04PM +0100, Peter Zijlstra wrote:
> On Thu, Oct 30, 2014 at 08:44:40AM -0400, Tejun Heo wrote:
> > On Thu, Oct 30, 2014 at 08:14:24AM +0100, Peter Zijlstra wrote:
> > > On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote:
> > > > > and always allow execution of member tasks.
> > >
> > > This too btw is not strictly speaking possible for all controllers. Most
> > > all sched controllers live by the grace of forcing tasks not to run at
> > > times (eg. the bandwidth controls), falsifying the 'always'.
> >
> > Oh sure, the a task just has to run in a foreseeable future, or
> > rather, a task must not be blocked indefinitely requiring userland
> > intervention to become executable again.
>
> Like the freezer cgroup you mean? ;-)

Oh yeah, that's horribly broken. Merging it with jobctl stop is a
todo item. This "stuck in a random place in the kernel" thing made
sense for suspend/hibernation only because the kernel wasn't gonna run
anymore. The fact that this got exposed to userland on a running
system just shows how little we were thinking while implementing all
the controllers. It should be equivalent to a layered job control
stop, so that what's prevented from running is the userland part, not
the kernel.

Thanks.

--
tejun

2014-10-30 17:03:37

by Tejun Heo

Subject: Re: Cache Allocation Technology Design

Hey, Peter.

On Thu, Oct 30, 2014 at 02:18:45PM +0100, Peter Zijlstra wrote:
> On Thu, Oct 30, 2014 at 08:43:33AM -0400, Tejun Heo wrote:
> > And that something shouldn't be disallowing task migration across
> > cgroups. This simply doesn't work with co-mounting or unified
> > hierarchy. cpuset automatically takes on the nearest ancestor's
> > configuration which has enough execution resources. Maybe that can be
> > an option for this too?
>
> It will give very random and nondeterministic behaviour and basically
> destroy the entire purpose of the controller (which are the very same
> reasons I detest that 'new' behaviour in cpusets).

I agree with you that this is a corner case behavior which deviates
from the usual behavior; however, the deviation is inherent. This
stems from the fact that the kernel in general doesn't allow tasks
which cannot be run. You say that you detest the new behaviors of
cpuset; however, the old behaviors were just as sucky - bouncing tasks
to an ancestor cgroup forcefully and without any indication or way to
restore the previous configuration. What's different with the new
behavior is that it explicitly distinguishes between the configured
and effective configurations, as the kernel isn't capable of actually
enforcing a certain subset of configurations.

So, the inherent problem is always there no matter what we do and the
question is that of a policy to deal with it. One of the main issues
I see with failing cgroup-level operations for controller specific
reasons is lack of visibility. All you can get out of a failed
operation is a single error return and there's no good way to
communicate why something isn't working, well not even who's the
culprit. Having "effective" vs "configured" makes it explicit that
the kernel isn't capable of honoring all configurations and makes the
details of the situation visible.

Another part is inconsistencies across controllers. This sure is
worse when there are multiple controllers involved but inconsistent
behaviors across different hierarchies are annoying all the same with
single controller multiple hierarchies. Userland often manages some
of those hierarchies together and it can get horribly confusing. No
matter what, we need to settle on a single policy and having effective
configuration seems like the better one.

> > One of the problems is that we generally assume that a task can run
> > some point in time in a lot of places in the kernel and can't just not
> > run a task indefinitely because it's in a cgroup configured certain
> > way.
>
> Refusing tasks into a previously empty cgroup creates no such problems.
> Its already in a cgroup (wherever its parent was) and it can run there,
> failing to move it to another does not affect things.

Yeah, sure, hard failing can work too. It didn't work well for cpuset
because a runnable configuration may become not so if the system
config changes afterwards but this probably doesn't have an issue like
that. I'm not saying something like the above won't work. It would,
but I don't think that's the right place to fail.

This controller might not even require the distinction between
configured and effective tho? Can't a new child just inherit the
parent's configuration and never allow the config to become completely
empty? The problem cpuset faces is that of underlying hardware
configuration changing. This one doesn't have that.

> > > So either we fail mkdir, but that means allocating CLOS IDs for possibly
> > > empty cgroups, or we allocate on demand which means failing task
> > > assignment.
> >
> > Can't fail mkdir or css enabling either. Again, co-mounting and
> > unified hierarchy. Also, the behavior is just horrible to use from
> > userland.
>
> In order to fix the co-mounting and unified hierarchy I still need to
> hear a proposal for that tasks vs processes thing.
>
> Traditionally the cgroups were task based, but many controllers are
> process based (simply because what they control is process wide, not per
> task), and there was talk (2-3 years ago or so) about making the entire
> cgroup thing per process, which obviously fails for all scheduler
> related cgroups.

Yeah, it needs to be a separate interface where a given userland task
can access its own knobs in a race-free way (cgroup interface can't
even do that) whether that's a pseudo filesystem, say,
/proc/self/BLAHBLAH or new syscalls. This one is necessary regardless
of what happens with cgroup. cgroup simply isn't a suitable mechanism
to expose these types of knobs to individual userland threads.

> > Yeah, RT is one of the main items which is problematic, more so
> > because it's currently coupled with the normal sched controller and
> > the default config doesn't have any RT slice.
>
> Simply because you cannot give a slice on creation; or if you did that
> would mean failing mkdir when a new cgroup would exceed the available
> time.
>
> Also any !0 slice is wrong because it will not match the requirements of
> the proposed workload, the administrator will have to set it to match
> the workload.
>
> Therefore 0.

As long as RT is separate from normal sched controller, this *could*
be fine. The main problem now is that userland which wants to use the
cpu controller but doesn't want to fully manage RT slices end up
disabling RT slices. It might work if a new child can share the
parent's slice till explicitly configured. Another problem is when
you wanna change the configuration after the hierarchy is already
populated. I don't know. I'd even be happy with cgroup not having
anything to do with RT slice distribution. Do you have any ideas
which can make RT slice distribution more palatable? If we can't
decouple the two, we'd be effectively requiring whoever is managing
the cpu controller to also become a full-fledged RT slice arbitrator,
which might actually work too.

> > Do we completely block RT task w/o slice? Is that okay?
>
> We will not allow an RT task in, the write to the tasks file will fail.
>
> The same will be true for deadline tasks, we'll fail entry into a cgroup
> when the combined requirements of the tasks exceed the provisions of the
> group.
>
> There is just no way around that and still provide sane semantics.

Can't a task just lose RT / deadline properties when migrating into a
different RT / deadline domain? We already modify task properties on
migration for cpuset after all. It'd be far simpler that way.

Thanks.

--
tejun

2014-10-30 17:12:52

by Tejun Heo

Subject: Re: Cache Allocation Technology Design

On Thu, Oct 30, 2014 at 07:58:34AM -0700, Tim Hockin wrote:
> Another reason unified hierarchy is a bad model.

Things wrong with this message.

1. Top posted. It isn't clear which part you're referring to and this
was pointed out to you multiple times in the past.

2. No real thoughts or technical details. Maybe you had some in your
head but nothing was elaborated. This forces me to guess what you
had in mind when you produced the above sentence and of course me
not being you this takes a considerable amount of brain cycle and
I'd still end up with multiple alternative scenarios that I'll have
to cover.

3. Needlessly loaded expression, which forces me to respond.

Combined, this is just rude and you've been showing this type of
behavior multiple times. Behave yourself.

--
tejun

2014-10-30 21:44:11

by Peter Zijlstra

Subject: Re: Cache Allocation Technology Design

On Thu, Oct 30, 2014 at 01:03:31PM -0400, Tejun Heo wrote:
> Hey, Peter.
>
> On Thu, Oct 30, 2014 at 02:18:45PM +0100, Peter Zijlstra wrote:
> > On Thu, Oct 30, 2014 at 08:43:33AM -0400, Tejun Heo wrote:
> > > And that something shouldn't be disallowing task migration across
> > > cgroups. This simply doesn't work with co-mounting or unified
> > > hierarchy. cpuset automatically takes on the nearest ancestor's
> > > configuration which has enough execution resources. Maybe that can be
> > > an option for this too?
> >
> > It will give very random and nondeterministic behaviour and basically
> > destroy the entire purpose of the controller (which are the very same
> > reasons I detest that 'new' behaviour in cpusets).
>
> I agree with you that this is a corner case behavior which deviates
> from the usual behavior; however, the deviation is inherent. This
> stems from the fact that the kernel in general doesn't allow tasks
> which cannot be run. You say that you detest the new behaviors of
> cpuset; however, the old behaviors were just as sucky - bouncing tasks
> to an ancestor cgroup forcifully and without any indication or way to
> restore the previous configuration. What's different with the new
> behavior is that it explicitly distinguishes between the configured
> and effective configurations as the kernel isn't capable for actually
> enforcing certain subset of configurations.

If a cpu bounces (by accident or whatever) then there is no trace left
behind that the system didn't in fact observe/obey its constraints. It
should have provided an error or failed the hotplug. But we digress,
let's not have this discussion (again :) and focus on the new thing.

> So, the inherent problem is always there no matter what we do and the
> question is that of a policy to deal with it. One of the main issues
> I see with failing cgroup-level operations for controller specific
> reasons is lack of visibility. All you can get out of a failed
> operation is a single error return and there's no good way to
> communicate why something isn't working, well not even who's the
> culprit. Having "effective" vs "configured" makes it explicit that
> the kernel isn't capable of honoring all configurations and makes the
> details of the situation visible.

Right, so that is a shortcoming of the co-mount idea. Your effective vs
configured thing is misleading and surprising though. Operations might
'succeed' and still have failed, without any clear
indication/notification of change.

> Another part is inconsistencies across controllers. This sure is
> worse when there are multiple controllers involved but inconsistent
> behaviors across different hierarchies are annoying all the same with
> single controller multiple hierarchies. Userland often manages some
> of those hierarchies together and it can get horribly confusing. No
> matter what, we need to settle on a single policy and having effective
> configuration seems like the better one.

I'm not entirely sure I follow. Without co-mounting it's entirely obvious
which one is failing.

Also, per the previous point, since you need a notification channel
anyway, you might as well do the expected fail and report more details
through that.

> > > One of the problems is that we generally assume that a task can run
> > > at some point in time in a lot of places in the kernel and can't just not
> > > run a task indefinitely because it's in a cgroup configured a certain
> > > way.
> >
> > Refusing tasks into a previously empty cgroup creates no such problems.
> > It's already in a cgroup (wherever its parent was) and it can run there,
> > failing to move it to another does not affect things.
>
> Yeah, sure, hard failing can work too. It didn't work well for cpuset
> because a runnable configuration may become not so if the system
> config changes afterwards but this probably doesn't have an issue like
> that. I'm not saying something like the above won't work. It'd work, but
> I don't think that's the right place to fail.

Right, this thing doesn't suffer that particular problem: if it's
good it stays good.

> This controller might not even require the distinction between
> configured and effective tho? Can't a new child just inherit the
> parent's configuration and never allow the config to become completely
> empty?

It can do that. But that still has a problem, there is a mapping in
hardware which restricts the number of active configurations. The total
configuration space is larger than the supported active configurations.

So _something_ must fail. The initial proposal was mkdir failing when
there were more cgroup directories than hardware supported active
configs. The alternative was on-demand activation where we only
allocate the hardware resource when the first task gets moved into the
group -- which then clearly can fail.
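
For illustration, a minimal sketch of the on-demand variant (all names
here are made up, this is not actual kernel code): the CLOSid is only
allocated when the first task enters the group, so it is the task move
that can return the error.

/* illustrative sketch only -- hypothetical names, not kernel API */
struct cat_group {
	u64 cbm;	/* configured cache bit mask */
	int closid;	/* -1 until the first task arrives */
	int nr_tasks;
};

static int cat_attach_task(struct cat_group *cg)
{
	if (cg->closid < 0) {
		int id = alloc_hw_closid(cg->cbm); /* limited by the h/w */

		if (id < 0)
			return -ENOSPC;	/* the task move fails here */
		cg->closid = id;
	}
	cg->nr_tasks++;
	return 0;
}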

> > Traditionally the cgroups were task based, but many controllers are
> > process based (simply because what they control is process wide, not per
> > task), and there was talk (2-3 years ago or so) about making the entire
> > cgroup thing per process, which obviously fails for all scheduler
> > related cgroups.
>
> Yeah, it needs to be a separate interface where a given userland task
> can access its own knobs in a race-free way (cgroup interface can't
> even do that) whether that's a pseudo filesystem, say,
> /proc/self/BLAHBLAH or new syscalls. This one is necessary regardless
> of what happens with cgroup. cgroup simply isn't a suitable mechanism
> to expose these types of knobs to individual userland threads.

I'm not sure what you're saying there. You want to replace the
task-controllers with another pseudo filesystem that does it differently
but still is a hierarchical controller? How is that different from just
not co-mounting the task and process based controllers? Either way you
end up with 2 separate hierarchies.

> > > Yeah, RT is one of the main items which is problematic, more so
> > > because it's currently coupled with the normal sched controller and
> > > the default config doesn't have any RT slice.
> >
> > Simply because you cannot give a slice on creation; or if you did that
> > would mean failing mkdir when a new cgroup would exceed the available
> > time.
> >
> > Also any !0 slice is wrong because it will not match the requirements of
> > the proposed workload, the administrator will have to set it to match
> > the workload.
> >
> > Therefore 0.
>
> As long as RT is separate from normal sched controller, this *could*
> be fine. The main problem now is that userland which wants to use the
> cpu controller but doesn't want to fully manage RT slices ends up
> disabling RT slices.

I don't get this; who but the admin manages things, and how would you
accidentally have an RT app and not know about it? And if you're in that
situation you're screwed anyhow, since you've no f'ing clue how to
configure your system for it. At which point you're in deep.

> It might work if a new child can share the
> parent's slice till explicitly configured.

Principle of least surprise. That's surprising behaviour. Why move it in
the first place?

> Another problem is when
> you wanna change the configuration after the hierarchy is already
> populated.

We fail the configuration change. For RR/FIFO we won't allow you to set
the slice to 0 if there are tasks. For deadline we would fail everything
that tries to lower things below the utilization required by the tasks
(and child groups).

> I don't know. I'd even be happy with cgroup not having
> anything to do with RT slice distribution. Do you have any ideas
> which can make RT slice distribution more palatable? If we can't
> decouple the two, we'd be effectively requiring whoever is managing
> the cpu controller to also become a full-fledged RT slice arbitrator,
> which might actually work too.

The admin you mean? He had better know what the heck he's doing if he's
running RT apps, great fail is otherwise fairly deterministic in his
future.

The thing is, you cannot arbitrate this stuff; RR/FIFO are horrible pieces
of shit interfaces, they don't describe nearly enough. People need to be
involved.

> > > Do we completely block RT task w/o slice? Is that okay?
> >
> > We will not allow an RT task in, the write to the tasks file will fail.
> >
> > The same will be true for deadline tasks, we'll fail entry into a cgroup
> > when the combined requirements of the tasks exceed the provisions of the
> > group.
> >
> > There is just no way around that and still provide sane semantics.
>
> Can't a task just lose RT / deadline properties when migrating into a
> different RT / deadline domain? We already modify task properties on
> migration for cpuset after all. It'd be far simpler that way.

Again, why move it in the first place? This all sounds like whoever is
doing this is clueless. You don't move RT tasks about if you're not
intimately aware of them and their requirements.

2014-10-30 22:22:41

by Tejun Heo

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design

Hello,

On Thu, Oct 30, 2014 at 10:43:53PM +0100, Peter Zijlstra wrote:
> If a cpu bounces (by accident or whatever) then there is no trace left
> behind that the system didn't in fact observe/obey its constraints. It
> should have provided an error or failed the hotplug. But we digress,
> let's not have this discussion (again :) and focus on the new thing.

Oh, we sure can have notifications / persistent markers to track
deviation from the configuration. It's not like the old scheme did
much better in this respect. It just wrecked the configuration
without telling anyone. If this matters enough, we need error
recording / reporting no matter which way we choose. I'm not against
that at all.

> > So, the inherent problem is always there no matter what we do and the
> > question is that of a policy to deal with it. One of the main issues
> > I see with failing cgroup-level operations for controller specific
> > reasons is lack of visibility. All you can get out of a failed
> > operation is a single error return and there's no good way to
> > communicate why something isn't working, well not even who's the
> > culprit. Having "effective" vs "configured" makes it explicit that
> > the kernel isn't capable of honoring all configurations and makes the
> > details of the situation visible.
>
> Right, so that is a shortcoming of the co-mount idea. Your effective vs
> configured thing is misleading and surprising though. Operations might
> 'succeed' and still have failed, without any clear
> indication/notification of change.

Hmmm... it gets more pronounced w/ co-mounting but it's a problem
with isolated hierarchies too. How is changing configuration
irreversibly without any notification any less surprising? It's the
same end result. The only difference is that there's no way to go
back when the resource which went offline comes back. I really don't
think configuration being silently changed counts as a valid
notification mechanism to userland.

> > Another part is inconsistencies across controllers. This sure is
> > worse when there are multiple controllers involved but inconsistent
> > behaviors across different hierarchies are annoying all the same with
> > single controller multiple hierarchies. Userland often manages some
> > of those hierarchies together and it can get horribly confusing. No
> > matter what, we need to settle on a single policy and having effective
> > configuration seems like the better one.
>
> I'm not entirely sure I follow. Without co-mounting it's entirely obvious
> which one is failing.

Sure, "which" is easier w/o co-mounting. Why can still be hard tho as
migration is an "apply all the configs" event.

> Also, per the previous point, since you need a notification channel
> anyway, you might as well do the expected fail and report more details
> through that.

How do you match the failure to the specific migration attempt tho? I
really can't think of a good and simple interface for that given the
interface that we have. For most controllers, it is fairly
straightforward to avoid controller specific migration failures. Sure, cpuset
is special but it has to be special one way or the other.

> > This controller might not even require the distinction between
> > configured and effective tho? Can't a new child just inherit the
> > parent's configuration and never allow the config to become completely
> > empty?
>
> It can do that. But that still has a problem, there is a mapping in
> hardware which restricts the number of active configurations. The total
> configuration space is larger than the supported active configurations.
>
> So _something_ must fail. The initial proposal was mkdir failing when
> there were more than the hardware supported active config cgroup
> directories. The alternative was on-demand activation where we only
> allocate the hardware resource when the first task gets moved into the
> group -- which then clearly can fail.

Hmmm... why can't it just refuse to create a different configuration
when its config space is full? Make children inherit the parent's
configuration and refuse config writes which require it to create a
new one if the config space is full. Seems pretty straightforward.
What am I missing?
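
A sketch of what that could look like (hypothetical helpers, not
actual kernel code): mkdir always succeeds by sharing the parent's
CLOSid, and only a 'cbm' write which needs a new hardware slot can
fail.

/* illustrative sketch only -- hypothetical names, not kernel API */
static int cat_mkdir(struct cat_group *parent, struct cat_group *child)
{
	child->cbm = parent->cbm;	/* inherit the configuration */
	child->closid = parent->closid;	/* share the hardware slot */
	closid_get(child->closid);	/* refcount the shared CLOSid */
	return 0;			/* never fails */
}

static int cat_write_cbm(struct cat_group *cg, u64 new_cbm)
{
	/* reuse an existing CLOSid with an identical cbm if possible */
	int id = closid_find_or_alloc(new_cbm);

	if (id < 0)
		return -ENOSPC;		/* config space is full */
	closid_put(cg->closid);
	cg->closid = id;
	cg->cbm = new_cbm;
	return 0;
}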

> > Yeah, it needs to be a separate interface where a given userland task
> > can access its own knobs in a race-free way (cgroup interface can't
> > even do that) whether that's a pseudo filesystem, say,
> > /proc/self/BLAHBLAH or new syscalls. This one is necessary regardless
> > of what happens with cgroup. cgroup simply isn't a suitable mechanism
> > to expose these types of knobs to individual userland threads.
>
> I'm not sure what you're saying there. You want to replace the
> task-controllers with another pseudo filesystem that does it differently
> but still is a hierarchical controller? How is that different from just
> not co-mounting the task and process based controllers? Either way you
> end up with 2 separate hierarchies.

It doesn't have much to do with co-mounting.

The process itself often has to be involved in assigning different
properties to its threads. It requires intimate knowledge of which
one is doing what, meaning that accessing self's knobs is the most
common use case rather than an external entity reaching inside. This
means that this should be a programmable interface accessible from
each binary. cgroup is horrible for this. A process has to read the path
from /proc/self/cgroups and then access the cgroup that it's in, which
BTW could have changed in between.

It really needs a proper programmable interface which guarantees self
access. I don't know what the exact form should be. It can be an
extension to sched_setattr(), a new syscall or a pseudo filesystem
scoped to the process.

> > I don't know. I'd even be happy with cgroup not having
> > anything to do with RT slice distribution. Do you have any ideas
> > which can make RT slice distribution more palatable? If we can't
> > decouple the two, we'd be effectively requiring whoever is managing
> > the cpu controller to also become a full-fledged RT slice arbitrator,
> > which might actually work too.
>
> The admin you mean? He had better know what the heck he's doing if he's

Resource management is automated in a lot of cases and it's only gonna
be more so in the future. It's about having behaviors which are more
palatable to that but please read on.

> running RT apps, great fail is otherwise fairly deterministic in his
> future.
>
> The thing is, you cannot arbitrate this stuff; RR/FIFO are horrible pieces
> of shit interfaces, they don't describe nearly enough. People need to be
> involved.

So, I think it'd be best if RT/deadline stuff can be separated out so
that grouping the usual BE scheduling doesn't affect them, but if
that's not feasible, yeah, I agree the only thing which we can do is
to require the entity which is controlling the cpu hierarchy, which may
be a human admin or whatever manager, to distribute them explicitly.
There doesn't seem to be any way around it.

> > Can't a task just lose RT / deadline properties when migrating into a
> > different RT / deadline domain? We already modify task properties on
> > migration for cpuset after all. It'd be far simpler that way.
>
> Again, why move it in the first place? This all sounds like whoever is
> doing this is clueless. You don't move RT tasks about if you're not
> intimately aware of them and their requirements.

Oh, seriously, if I could build this thing from ground up, I'd just
tie it to process hierarchy and make the associations static. It's
just that we can't do that at this point and I'm trying to find a
behaviorally simple and acceptable way to deal with task migrations so
that neither the kernel nor userland has to be too complex. So, behaviors
which blow configs across migrations and consider them as "fresh" are
completely fine by me. I mostly wanna avoid requiring complicated
failure handling from the users which most likely won't be tested a
lot and will crap out when something exceptional happens. If it blows
RT/deadline settings reliably on each and every migration and refuses
RT priorities or cpu controller configs which can lead to invalid
configs, it'd be perfect.

This whole thing is really about having consistent behavior patterns
which avoid obscure failure modes whenever possible. Unified
hierarchy does build on top of those but we do want these
consistencies regardless of that.

Thanks.

--
tejun

2014-10-30 22:36:08

by Tim Hockin

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design

On Thu, Oct 30, 2014 at 10:12 AM, Tejun Heo <[email protected]> wrote:
> On Thu, Oct 30, 2014 at 07:58:34AM -0700, Tim Hockin wrote:
>> Another reason unified hierarchy is a bad model.
>
> Things wrong with this message:
>
> 1. Top posted. It isn't clear which part you're referring to and this
> was pointed out to you multiple times in the past.

I occasionally fall victim to gmail's defaults. I apologize for that.

> 2. No real thoughts or technical details. Maybe you had some in your
> head but nothing was elaborated. This forces me to guess what you
> had in mind when you produced the above sentence, and of course, me
> not being you, this takes a considerable amount of brain cycles and
> I'd still end up with multiple alternative scenarios that I'll have
> to cover.

I think the conversation is well enough understood by the people for
whom this bit of snark was intended that reading my mind was not that
hard. That said, it was overly snark-tastic, and sent in haste.

My point, of course, was that here is an example of something which
maps very well to the idea of cgroups (a set of processes that share
some controller) but DOES NOT map well to the unified hierarchy model.
It must be managed more carefully than arbitrary hierarchy can
enforce. The result is the mish-mash of workarounds proposed in this
thread to force it into arbitrary hierarchy mode, including this
no-win situation of running out of hardware resources - it is going to
fail. Will it fail at cgroup creation time (doesn't scale to
arbitrary hierarchy) or will it fail when you add processes to it
(awkward at best) or will it fail when you flip some control file to
enable the feature?

I know the unified hierarchy ship has sailed, so there's no
non-snarky way to argue the point any further, but this is such an
obvious case, to me, that I had to say something.

Tim

2014-10-30 22:47:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design


Let me reply to just this one, I'll do the rest tomorrow, need sleeps.

On Thu, Oct 30, 2014 at 06:22:36PM -0400, Tejun Heo wrote:

> > > This controller might not even require the distinction between
> > > configured and effective tho? Can't a new child just inherit the
> > > parent's configuration and never allow the config to become completely
> > > empty?
> >
> > It can do that. But that still has a problem, there is a mapping in
> > hardware which restricts the number of active configurations. The total
> > configuration space is larger than the supported active configurations.
> >
> > So _something_ must fail. The initial proposal was mkdir failing when
> > there were more than the hardware supported active config cgroup
> > directories. The alternative was on-demand activation where we only
> > allocate the hardware resource when the first task gets moved into the
> > group -- which then clearly can fail.
>
> Hmmm... why can't it just refuse to create a different configuration
> when its config space is full? Make children inherit the parent's
> configuration and refuse config writes which require it to create a
> new one if the config space is full. Seems pretty straightforward.
> What am I missing?

We could do that I suppose; there is the one corner case that would not
allow: intermediate directories with a restricted config that also have
priv restrictions but no actual tasks. Not sure that makes sense though.

Are there any other cases I might have missed?

2014-10-30 23:19:34

by Shivappa Vikas

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design





On Thu, 30 Oct 2014, Tejun Heo wrote:

> Hello, Peter.
>
> On Thu, Oct 30, 2014 at 08:07:25AM +0100, Peter Zijlstra wrote:
>> If this means echo $tid > tasks, then sorry we can't do. There is a
>> limited number of hardware resources backing this thing. At some point
>> they're consumed and something must give.
>
> And that something shouldn't be disallowing task migration across
> cgroups. This simply doesn't work with co-mounting or unified
> hierarchy. cpuset automatically takes on the nearest ancestor's
> configuration which has enough execution resources. Maybe that can be
> an option for this too?


One way to do it is to merge the CAT cgroup into cpuset. In essence
there is no separate CAT cgroup and we just have a new file 'cbm' in the
cpuset. This would be visible only when the system has Cache Allocation
support, and the user can manipulate the cache bit mask here.
The user can use the already existing cpu_exclusive file in the cpuset
to mark the cgroups to use exclusive CPUs.
That way we simplify and reuse the cpuset code/hierarchy?

Thanks,
Vikas



>
> One of the problems is that we generally assume that a task can run
> at some point in time in a lot of places in the kernel and can't just not
> run a task indefinitely because it's in a cgroup configured a certain
> way.
>
>> So either we fail mkdir, but that means allocating CLOS IDs for possibly
>> empty cgroups, or we allocate on demand which means failing task
>> assignment.
>
> Can't fail mkdir or css enabling either. Again, co-mounting and
> unified hierarchy. Also, the behavior is just horrible to use from
> userland.
>
>> The same -- albeit for a different reason -- is true of the RT sched
>> groups, we simply cannot instantiate them such that tasks can join,
>> sysads _have_ to configure them before we can add tasks to them.
>
> Yeah, RT is one of the main items which is problematic, more so
> because it's currently coupled with the normal sched controller and
> the default config doesn't have any RT slice. Do we completely block
> RT task w/o slice? Is that okay?
>
> Thanks.
>
> --
> tejun
>

2014-10-31 13:08:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design

On Thu, Oct 30, 2014 at 06:22:36PM -0400, Tejun Heo wrote:
> Hello,
>
> On Thu, Oct 30, 2014 at 10:43:53PM +0100, Peter Zijlstra wrote:
> > If a cpu bounces (by accident or whatever) then there is no trace left
> > behind that the system didn't in fact observe/obey its constraints. It
> > should have provided an error or failed the hotplug. But we digress,
> > let's not have this discussion (again :) and focus on the new thing.
>
> Oh, we sure can have notifications / persistent markers to track
> deviation from the configuration. It's not like the old scheme did
> much better in this respect. It just wrecked the configuration
> without telling anyone. If this matters enough, we need error
> recording / reporting no matter which way we choose. I'm not against
> that at all.

True; then again, hotplug isn't a magical thing, you do it yourself --
with the suspend case being special, I'll grant you that.

> > > So, the inherent problem is always there no matter what we do and the
> > > question is that of a policy to deal with it. One of the main issues
> > > I see with failing cgroup-level operations for controller specific
> > > reasons is lack of visibility. All you can get out of a failed
> > > operation is a single error return and there's no good way to
> > > communicate why something isn't working, well not even who's the
> > > culprit. Having "effective" vs "configured" makes it explicit that
> > > the kernel isn't capable of honoring all configurations and makes the
> > > details of the situation visible.
> >
> > Right, so that is a shortcoming of the co-mount idea. Your effective vs
> > configured thing is misleading and surprising though. Operations might
> > 'succeed' and still have failed, without any clear
> > indication/notification of change.
>
> Hmmm... it gets more pronounced w/ co-mounting but it's a problem
> with isolated hierarchies too. How is changing configuration
> irreversibly without any notification any less surprising? It's the
> same end result. The only difference is that there's no way to go
> back when the resource which went offline comes back. I really don't
> think configuration being silently changed counts as a valid
> notification mechanism to userland.

I think we're talking past one another here. You said the problem with
failing migrate is that you've no clue which controller failed in
the co-mount case. With isolated hierarchies you do know.

But then you continue to talk about cpuset and hotplug. Now the thing with
that is, the only one doing hotplug is the admin (I know there's a few
kernel side hotplug but they're BUGs and I even NAKed a few, which
didn't stop them from being merged) -- the exception being suspend,
suspend is special because 1) there's a guarantee the CPU will actually
come back and 2) it's unobservable, userspace never sees the CPUs go away
and come back because it's frozen.

The only real way to hotplug is if you do it your damn self, and it's
also you who set up the cpuset, so it's fully on you if shit happens.

No real magic there. Except now people seem to want to wrap it into
magic and hide it all from the admin, pretend it's not there and make it
uncontrollable.

Kernel side hotplug is broken for a myriad of reasons, but let's not
diverge too far here.

> > > Another part is inconsistencies across controllers. This sure is
> > > worse when there are multiple controllers involved but inconsistent
> > > behaviors across different hierarchies are annoying all the same with
> > > single controller multiple hierarchies. Userland often manages some
> > > of those hierarchies together and it can get horribly confusing. No
> > > matter what, we need to settle on a single policy and having effective
> > > configuration seems like the better one.
> >
> > I'm not entirely sure I follow. Without co-mounting it's entirely obvious
> > which one is failing.
>
> Sure, "which" is easier w/o co-mounting. Why can still be hard tho as
> migration is an "apply all the configs" event.

Typically controllers don't control too many configs at once and the
specific return error could be a good hint there.

> > Also, per the previous point, since you need a notification channel
> > anyway, you might as well do the expected fail and report more details
> > through that.
>
> How do you match the failure to the specific migration attempt tho? I
> really can't think of a good and simple interface for that given the
> interface that we have. For most controllers, it is fairly
> straightforward to avoid controller specific migration failures. Sure, cpuset
> is special but it has to be special one way or the other.

You can include in the msg the pid that was just attempted, in the
pid namespace of the observer; if the pid is not available in that
namespace, discard the message since the observer could not possibly have
done the deed.

> It doesn't have much to do with co-mounting.
>
> The process itself often has to be involved in assigning different
> properties to its threads. It requires intimate knowledge of which
> one is doing what, meaning that accessing self's knobs is the most
> common use case rather than an external entity reaching inside. This
> means that this should be a programmable interface accessible from
> each binary. cgroup is horrible for this. A process has to read the path
> from /proc/self/cgroups and then access the cgroup that it's in, which
> BTW could have changed in between.
>
> It really needs a proper programmable interface which guarantees self
> access. I don't know what the exact form should be. It can be an
> extension to sched_setattr(), a new syscall or a pseudo filesystem
> scoped to the process.

That's an entirely separate issue; and I don't see that solving the task
vs process issue at all.

> > The admin you mean? He had better know what the heck he's doing if he's
>
> Resource management is automated in a lot of cases and it's only gonna
> be more so in the future. It's about having behaviors which are more
> palatable to that but please read on.
>
> > running RT apps, great fail is otherwise fairly deterministic in his
> > future.
> >
> > The thing is, you cannot arbitrate this stuff; RR/FIFO are horrible pieces
> > of shit interfaces, they don't describe nearly enough. People need to be
> > involved.
>
> So, I think it'd be best if RT/deadline stuff can be separated out so
> that grouping the usual BE scheduling doesn't affect them, but if
> that's not feasible, yeah, I agree the only thing which we can do is
> to require the entity which is controlling the cpu hierarchy, which may
> be a human admin or whatever manager, to distribute them explicitly.
> There doesn't seem to be any way around it.

Automation is nice and all, but RT is about providing determinism and
guarantees. Unless you morph into a full blown RT aware middleware and
have all your RT apps communicate their requirements to it (i.e.
rewrite them all), this is a non starter.

Given that the RR/FIFO APIs are not communicating enough and we need to
support them anyhow, human intervention it is.

> > > Can't a task just lose RT / deadline properties when migrating into a
> > > different RT / deadline domain? We already modify task properties on
> > > migration for cpuset after all. It'd be far simpler that way.
> >
> > Again, why move it in the first place? This all sounds like whoever is
> > doing this is clueless. You don't move RT tasks about if you're not
> > intimately aware of them and their requirements.
>
> Oh, seriously, if I could build this thing from ground up, I'd just
> tie it to process hierarchy and make the associations static.

This thing being cgroups? I'm not sure static associations cater for the
various use cases that people have.

> It's
> just that we can't do that at this point and I'm trying to find a
> behaviorally simple and acceptable way to deal with task migrations so
> that neither the kernel nor userland has to be too complex.

Sure simple and consistent is all good, but we should also not make it
too simple and thereby exclude useful things.

> So, behaviors
> which blow configs across migrations and consider them as "fresh" are
> completely fine by me.

It's not by me; it's completely surprising and counterintuitive.

> I mostly wanna avoid requiring complicated
> failure handling from the users which most likely won't be tested a
> lot and will crap out when something exceptional happens.

Smells like you just want to pretend nothing bad happens when you do
stupid. I prefer to fail early and fail hard over pretend happy and
surprise behaviour any day.

> This whole thing is really about having consistent behavior patterns
> which avoid obscure failure modes whenever possible. Unified
> hierarchy does build on top of those but we do want these
> consistencies regardless of that.

I'm all for consistency, but I abhor make-believe. And while I like the
unified hierarchy thing conceptually, I'm by now fairly sure reality is
about to ruin it.

2014-10-31 15:58:30

by Tejun Heo

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design

Hello, Peter.

On Fri, Oct 31, 2014 at 02:07:38PM +0100, Peter Zijlstra wrote:
> I think we're talking past one another here. You said the problem with
> failing migrate is that you've no clue which controller failed in
> the co-mount case. With isolated hierarchies you do know.

Yes, with co-mounting, the issue becomes worse but I think it's still
not ideal even without co-mounting because the error reporting ends up
conflating task organization operations and the application of
configurations. More on this later.

> But then you continue to talk about cpuset and hotplug. Now the thing with
> that is, the only one doing hotplug is the admin (I know there's a few
> kernel side hotplug but they're BUGs and I even NAKed a few, which
> didn't stop them from being merged) -- the exception being suspend,
> suspend is special because 1) there's a guarantee the CPU will actually
> come back and 2) it's unobservable, userspace never sees the CPUs go away
> and come back because it's frozen.
>
> The only real way to hotplug is if you do it your damn self, and it's
> also you who set up the cpuset, so it's fully on you if shit happens.
>
> No real magic there. Except now people seem to want to wrap it into
> magic and hide it all from the admin, pretend it's not there and make it
> uncontrollable.

Hmmm... I think a difference is how we perceive userspace is composed
and interacts with the various aspects of kernel. But even in the
presence of a competent admin that you're suggesting, interactions of
different aspects of a system are often compartmentalized. e.g. an
admin configuring cpuset to accommodate a given set of persistent and
important workload isn't too likely to expect a memory unit soft
failure in several weeks and the need to hot-swap the memory module.
It just isn't cost-effective enough to lump those two planes of
planning into the same activity especially if the admin is
hand-crafting the configuration. The issue that I see with the
current method is that a much rarer exception condition ends up messing
up configurations which are on a different plane and that there's no
recourse once that happens. If the said workload keeps forking,
there's no easy way to recover the previous configuration.

Both ways of handling the situation have components of surprise but as
I wrote before that surprise is inherent and comes from the fact that
the kernel can't afford tasks which aren't runnable. As a policy of
handling the surprising situation, having explicit configured /
effective settings seems like a better option to me because 1. it
makes it explicit that the effective configuration may differ from the
requested one 2. it makes handling exception cases easier. I think #1
is important because hard errors which rarely but do happen are very
difficult to deal with properly because they're usually nearly invisible.

> > Sure, "which" is easier w/o co-mounting. Why can still be hard tho as
> > migration is an "apply all the configs" event.
>
> Typically controllers don't control too many configs at once and the
> specific return error could be a good hint there.

Usually, yeah. I still end up scratching my head with migration
rejections w/ cpuset or blkcg tho.

> > > Also, per the previous point, since you need a notification channel
> > > anyway, you might as well do the expected fail and report more details
> > > through that.
> >
> > How do you match the failure to the specific migration attempt tho? I
> > really can't think of a good and simple interface for that given the
> > interface that we have. For most controllers, it is fairly straight
> > forward to avoid controller specific migration failures. Sure, cpuset
> > is special but it has to be special one way or the other.
>
> You can include in the msg the pid that was just attempted, in the
> pid namespace of the observer; if the pid is not available in that
> namespace, discard the message since the observer could not possibly have
> done the deed.

I don't know. Is that a good interface? If a human admin is echoing
and dmesg'ing afterwards, it should work but scraping the log for an
unstructured plain text error usually isn't a very good interface to
build tools around.

For example, for CAT and its limit on the number of possible
configurations, it can technically be made to work by reporting errors
on mkdir or task migration; however, it is *far* better and clearer to
report, say, -ENOSPC when you're actually trying to change the
configuration. The error is directly tied to the operation requested.
That's just how it should be whenever possible.

> > It really needs a proper programmable interface which guarantees self
> > access. I don't know what the exact form should be. It can be an
> > extension to sched_setattr(), a new syscall or a pseudo filesystem
> > scoped to the process.
>
> That's an entirely separate issue; and I don't see that solving the task
> vs process issue at all.

Hmm... I don't see it that way tho. In-process configuration is
primarily something to be done by the process while cgroup management
is to be done by an external adminy entity. They are on different
planes. Individual binaries accessing their own cgroups doesn't make
a lot of sense and is actually broken. Likewise, an external management
entity meddling with individual threads of a process is at best
cumbersome. It can be allowed but that's often not how it's useful.
I really don't see why cgroup would be involved with per-thread
settings.

> Automation is nice and all, but RT is about providing determinism and
> guarantees. Unless you morph into a full blown RT aware middleware and
> have all your RT apps communicate their requirements to it (i.e.
> rewrite them all), this is a non starter.
>
> Given that the RR/FIFO APIs are not communicating enough and we need to
> support them anyhow, human intervention it is.

Yeah, I fully agree with you there. The issue is not that RR/FIFO
requires explicit actions from userland but that they're currently
tied to BE scheduling. Conceptually, they don't have to be but
they're in practice and that ends up requiring whoever, be that an
admin or automated tool, is managing the BE grouping to also manage
RT/FIFO slices, which isn't ideal but should be workable. I was
mostly curious whether they can be separated with a reasonable amount
of effort. That's a no, right?

> > Oh, seriously, if I could build this thing from ground up, I'd just
> > tie it to process hierarchy and make the associations static.
>
> This thing being cgroups? I'm not sure static associations cater for the
> various use cases that people have.

Sure, we have no chance of changing it at this point, but I'm pretty
sure if we started by tying it to the process hierarchy, we and the
userland would have been able to achieve about the same set of
functionalities without all this migration business.

> > It's
> > just that we can't do that at this point and I'm trying to find a
> > behaviorally simple and acceptable way to deal with task migrations so
> > that neither kernel or userland has to be too complex.
>
> Sure simple and consistent is all good, but we should also not make it
> too simple and thereby exclude useful things.

What are we excluding tho? Previously, cgroup didn't have rules,
policies or conventions. It just had these skeletal features to group
tasks and every controller did its own thing, diverging in the way they
treat hierarchies, errors, migrations, configurations, notifications
and so on. It didn't put in the effort to actually identify the
required functionalities or characterize what belongs where. Every
controller was doing its own Brownian motion in the design space.

Most of the properties being identified and policies being set up are
actually fundamental and inherent. e.g. Creating a subhierarchy and
organizing the children in it is fundamentally a task
sub-categorizing operation. Conceptually, doing so shouldn't be
impeded by or affect the resources configured for the parent of that
subhierarchy, and for most controllers this can be achieved in a
straightforward manner by making children put no further
restrictions on the resources from their parent on creation. This is a
rule which should be inherent, and this type of convention ultimately
leads to better designs and implementations.

I think this is evident for the controller in question being discussed
on this thread. Task organization - creating cgroups and moving tasks
between them - is an inherently different operation from
configuring each controller. They shouldn't be conflated. It doesn't
make any sense to fail creation of a cgroup or fail task migration
later because a controller can't be configured a certain way. They should
be orthogonal as much as possible. If there's a restriction on
controller configuration, that should be enforced on controller
configuration.

> > So, behaviors
> > which blow configs across migrations and consider them as "fresh" is
> > completely fine by me.
>
> It's not by me; it's completely surprising and counterintuitive.

I don't get it. This is one of the few cases where the controller is
distributing hard-walled resources and as you said userland
intervention is a must in facilitating such distribution. Isn't this
pretty well in line with what you've been saying? The admin is moving
an RT / deadline task into a different scheduling domain and if such an
operation always requires setting scheduling policies again, what's
surprising about it?

It makes conceptual sense - the task is moving across two scheduling
domains with different sets of hard resources. It'd work well and
reliably too in practice, and userland has one less vector of
failure while achieving the same thing.

> > I mostly wanna avoid requiring complicated
> > failure handling from the users which most likely won't be tested a
> > lot and crap out when something exceptional happens.
>
> Smells like you just want to pretend nothing bad happens when you do
> stupid. I prefer to fail early and fail hard over pretend happy and
> surprise behaviour any day.

But where am I losing anything? I'm not saying everything is always
better this way but if I look at the overall compromises, it seems
like a clear win to me.

> > This whole thing is really about having consistent behavior patterns
> > which avoid obscure failure modes whenever possible. Unified
> > hierarchy does build on top of those but we do want these
> > consistencies regardless of that.
>
> I'm all for consistency, but I abhor make-believe. And while I like the
> unified hierarchy thing conceptually, I'm by now fairly sure reality is
> about to ruin it.

Hmm... I get exactly the opposite feeling. A lot of fundamental
properties are being identified and things mostly fall into place.

Thanks.

--
tejun

2014-10-31 16:57:56

by Tejun Heo

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design

Hello, Tim.

On Thu, Oct 30, 2014 at 03:35:44PM -0700, Tim Hockin wrote:
> I think the conversation is well enough understood by the people for
> whom this bit of snark was intended that reading my mind was not that

I really don't think it is. cgroups in general isn't that well
understood and while some may be familiar with what they've been
working on, most aren't too well acquainted with what changes are made
and why. I surely am responsible for not being better at communicating
but it took me quite a while and I'm still in the process of
crystallizing those myself.

> hard. That said, it was overly snark-tastic, and sent in haste.

The problem with this type of snarky one-liner is that it undermines
the fundamentals of technical discussions on the mailing list. It
requires too much effort from the other party for speculation and if the
other party doesn't respond, the snark comment succeeds at establishing
the vague negativity that it carried. If you have a technical
opinion, form and communicate it properly so that it can be analyzed
and discussed properly. I think my wording in my previous messages
was too strong and apologize for that but please don't do this.

> My point, of course, was that here is an example of something which
> maps very well to the idea of cgroups (a set of processes that share
> some controller) but DOES NOT map well to the unified hierarchy model.

I'm pretty sure that conclusion is premature. As I wrote in my reply
to Peter, I strongly believe that a set of reasonable constraints and
conventions leads to a much better and more functional design,
interface and implementation. It sure can feel like an annoyance if
one was accustomed to doing whatever and now has to follow
these new constraints but we were paying heavily elsewhere for the
lack of consistency and, in general, sense.

I could have communicated it more clearly but the fundamental issue that I
see with the original proposal is that it conflates task organization
and controller configuration. They belong to different planes of
control and should be orthogonal as much as possible. This shows up
evidently, for example, in how errors are reported. A write to a knob
of the involved controller failing with the proper error code is a far
superior way compared to failing mkdir or task migration. The only
reason we even think that doing anything else is fine is because we've
never thought about what's the right thing to do all along and just
did whatever is convenient in terms of immediate implementation for
each individual case.

> It must be managed more carefully than arbitrary hierarchy can
> enforce. The result is the mish-mash of workarounds proposed in this
> thread to force it into arbitrary hierarchy mode, including this
> no-win situation of running out of hardware resources - it is going to
> fail. Will it fail at cgroup creation time (doesn't scale to
> arbitrary hierarchy) or will it fail when you add processes to it
> (awkward at best) or will it fail when you flip some control file to
> enable the feature?

Please see above. It's more of the process of finding the *right*
place to put operations and their failures. Task migration sure can
fail due to memory pressure or basic cgroup organizational constraints;
however, it's outright wrong to fail it because a given controller can
support only a limited number of configurations. Again, being able to
do whatever one wants to do often doesn't lead to a good design.

> I know the unified hierarchy ship has sailed, so there's no
> non-snarky way to argue the point any further, but this is such an
> obvious case, to me, that I had to say something.

If you properly compose your ideas and concerns, I can think about and
discuss them and make adjustments where appropriate and it seems to me
that your impression at least in this instance isn't very well
warranted. The snark comment can achieve none of the productive
things which can come from proper discussions. All it can do is
aggravate the tone of the discussion, so, again, please refrain from
it in the future.

Thanks.

--
tejun

2014-11-03 23:32:18

by Shivappa Vikas

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design


Hello All,

Thanks for all the feedback so far. Below is the modified 'Kernel
Implementation' section for review. The rest of the sections are the
same as before, with just some changes in text to match the changed
implementation, so they can be ignored as well.

Also adding Peter Anvin, Thomas Gleixner, and Ingo Molnar for comments.

Kernel implementation Overview
-------------------------------

Kernel adds a file 'cbm' (cache bit mask) to the existing cpuset cgroup
subsystem to support Cache Allocation.

A CLOS (Class of Service) is represented by a CLOSid. The CLOSid is
internal to the kernel and not exposed to the user. Each cgroup would
have one CBM and would just represent one cache 'subset'.

The cgroup follows the cpuset hierarchy; mkdir and adding tasks to the
cgroup never fail (as was already the case in cpuset). When a
child cgroup is created it inherits the CLOSid and the CBM from its
parent. When a user changes the default CBM for a cgroup, a new
CLOSid is allocated. Changing the 'cbm' may fail once the kernel
runs out of the maximum number of CLOSids it can support.
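
For example (hypothetical session; the number of CLOSids is model
specific), on hardware with only 4 CLOSids, a 'cbm' write which would
require a 5th distinct CLOSid fails:

/bin/echo 0x3 > group5/cpuset.cbm
(the write fails with -ENOSPC once all hardware CLOSids are in use)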

The tasks in the cgroup would get to fill the LLC cache represented by
the cgroup's 'cbm' file.

The user can use the existing 'cpu_exclusive' file in the cpuset cgroup
to affinitize the tasks in a cgroup to an exclusive set of CPUs.

The root directory would have all bits set in the 'cbm' file by default.
Since all the children inherit the parent's 'cbm', this effectively makes
the feature not take effect until the user changes a 'cbm' - or in other
words, the 'cbm' for all the cgroups created would be all 1s if the user
never modifies any 'cbm' file. That means all the tasks get to fill in
all the cache, and hence cache allocation is not in effect.

Assignment of CBM, CLOS
---------------------------------


The 'cbm' needs to be a subset of the parent node's 'cbm'.
Any contiguous subset of these bits may be set to
indicate the cache mapping desired. The 'cbm' between 2 directories
can overlap. The 'cbm' would represent the cache 'subset' of the CAT
cgroup. For ex: on a system with 16 bits of max cbm bits, if the
directory has the least significant 4 bits set in its 'cbm'
file (meaning the 'cbm' is just 0xf), it
would be allocated the right quarter of the last level cache, which
means the tasks belonging to this CAT cgroup can use the right quarter
of the cache to fill. If it has the most significant 8 bits set, it
would be allocated the left half of the cache (8 bits out of 16
represents 50%).

The cache portion defined in the CBM file is available to all tasks
within the cgroup to fill, and these tasks are not allowed to allocate
space in other parts of the cache.
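
A minimal sketch of the checks a 'cbm' write would have to make
(hypothetical helper, not the actual patch; assumes cbm_max < 64):

/* illustrative sketch only */
static int validate_cbm(u64 cbm, u64 parent_cbm, int cbm_max)
{
	u64 v;

	if (!cbm || (cbm & ~((1ULL << cbm_max) - 1)))
		return -EINVAL;	/* empty or beyond cbm_max bits */
	if (cbm & ~parent_cbm)
		return -EINVAL;	/* must be a subset of the parent */
	v = cbm >> __ffs64(cbm);	/* strip trailing zero bits */
	if (v & (v + 1))
		return -EINVAL;	/* set bits must be contiguous */
	return 0;
}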


Scheduling and Context Switch
------------------------------

During a context switch the kernel implements this by writing the
CLOSid (internally maintained by the kernel) of the cgroup to which the
task belongs into the CPU's IA32_PQR_ASSOC MSR.
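
A rough sketch of that hook (helper names and the per-cpu cache are
made up; per the SDM the CLOSid is assumed to live in the upper 32
bits of IA32_PQR_ASSOC, with the RMID in the low bits):

/* illustrative sketch only */
static inline void cat_sched_in(struct task_struct *next)
{
	u32 closid = task_closid(next);	/* from the task's cgroup */

	/* skip the costly MSR write when the CLOSid is unchanged */
	if (closid != this_cpu_read(cpu_closid)) {
		this_cpu_write(cpu_closid, closid);
		wrmsr(MSR_IA32_PQR_ASSOC, 0, closid); /* lo=RMID, hi=CLOS */
	}
}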

Usage and Example
-----------------


With this patch the cpuset cgroup would show a new file cpuset.cbm.

cd /sys/fs/cgroup/cpuset

Create 2 cpuset cgroups

mkdir group1
mkdir group2

Following are some of the files in the directory

ls
cpuset.cpus
cpuset.cpu_exclusive
cpuset.mems
cpuset.mem_exclusive
...

cpuset.cbm

...


Say if the cache is 2MB and cbm supports 16 bits, then setting the
below allocates the 'right 1/4th (512KB)' of the cache to group2

Assign cpus and a memory node to group2.

cd group2
/bin/echo 1-2 > cpuset.cpus
/bin/echo 0 > cpuset.mems

Make the CPUs exclusive for the cgroup
/bin/echo 1 > cpuset.cpu_exclusive

Edit the CBM for group2 to set the least significant 4 bits. This
allocates the 'right quarter' of the cache.

/bin/echo 0xf > cpuset.cbm

Change cpus in the directory.

/bin/echo 1-4 > cpuset.cpus

Edit the CBM for group2 to set the least significant 8 bits. This
allocates the right half of the cache to 'group2'.

cd group2
/bin/echo 0xff > cpuset.cbm

Assign tasks to the group2

/bin/echo PID1 > tasks
/bin/echo PID2 > tasks

Meaning threads
PID1 and PID2 now run on CPUs 1-4, and get to fill the 'right half' of
the cache.



Thanks,
Vikas




On Thu, 16 Oct 2014, vikas wrote:

> Hi All , We have put together a draft design document for cache
> allocation technology below. Please review the same and let us know any
> feedback.
>
> Make sure you cc my email [email protected] when replying
>
> Thanks,
> Vikas
>
> What is Cache Allocation Technology ( CAT )
> -------------------------------------------
>
> Cache Allocation Technology provides a way for the Software (OS/VMM)
> to restrict cache allocation to a defined 'subset' of cache which may
> be overlapping with other 'subsets'. This feature is used when
> allocating a line in cache ie when pulling new data into the cache.
> The programming of the h/w is done via programming MSRs.
>
> The different cache subsets are identified by CLOS identifier (class
> of service) and each CLOS has a CBM (cache bit mask). The CBM is a
> contiguous set of bits which defines the amount of cache resource that
> is available for each 'subset'.
>
> Why is CAT (cache allocation technology) needed
> ------------------------------------------------
>
> The CAT enables more cache resources to be made available for higher
> priority applications based on guidance from the execution
> environment.
>
> The architecture also allows dynamically changing these subsets during
> runtime to further optimize the performance of the higher priority
> application with minimal degradation to the low priority app.
> Additionally, resources can be rebalanced for system throughput
> benefit. (Refer to Section 17.15 in the Intel SDM
> http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf)
>
> This technique may be useful in managing large computer systems which
> large LLC. Examples may be large servers running instances of
> webservers or database servers. In such complex systems, these subsets
> can be used for more careful placing of the available cache
> resources.
>
> The CAT kernel patch would provide a basic kernel framework for users
> to be able to implement such cache subsets.
>
>
> Kernel implementation Overview
> -------------------------------
>
> Kernel implements a cgroup subsystem to support Cache Allocation.
>
> Creating a CAT cgroup would create a new CLOS <-> CBM mapping. Each
> cgroup would have one CBM and would just represent one cache 'subset'.
>
> The user would be allowed to create as many directories as there are
> CLOSs defined by the h/w. If user tries to create more than the
> available CLOSs , -ENOSPC is returned. Currently we support only one
> level of directory, ie directory can be created only under the root.
>
> There are 2 modes supported
>
> 1. Affinitized mode : Each CAT cgroup is affinitized to a set of CPUs
> specified by the 'cpus' file. The tasks in the CAT cgroup would be
> constrained only on the CPUs in the 'cpus' file. The CPUs in this file
> are exclusively used for this cgroup. Requests by task
> using the sched_setaffinity() would be filtered through the tasks
> 'cpus'.
>
> These tasks would get to fill the LLC cache represented by the
> cgroup's 'cbm' file. 'cpus' is a cpumask and works the same way as
> the existing cpumask datastructure.
>
> 2. Non Affinitized mode : Each CAT cgroup(inturn 'subset') would be
> for a group of tasks. There is no 'cpus' file and the CPUs that the
> tasks run are not restricted by the CAT cgroup
>
>
> Assignment of CBM,CLOS and modes
> ---------------------------------
>
> Root directory would have all bits in 'cbm' file by default.
>
> The cbm_max file in the root defines the maximum number of bits
> describing the available cache units. Say if cbm_max is 16 then the
> 'cbm' cannot have more than 16 bits.
>
> The 'affinitized' file is either 0 or 1 which represent the two modes.
> System would boot with affinitized mode and all CPUs would have all
> bits in cbm set meaning all CPUs have 100% cache(effectively cache
> allocation is not in effect).
>
> The 'cbm' file is restricted to having no more than its cbm_max least
> significant bits set. Any contiguous subset of these bits may be set to
> indicate the cache mapping desired. The 'cbm' between 2 directories
> can overlap. The 'cbm' would represent the cache 'subset' of the CAT
> cgroup. For ex: on a system with 16 bits of max cbm bits , if the
> directory has the least significant 4 bits set in its 'cbm' file, it
> would be allocated the right quarter of the Last level cache which
> means the tasks belonging to this CAT cgroup can use the right quarter
> of the cache to fill. If it has the most significant 8 bits set ,it
> would be allocated the left half of the cache(8 bits out of 16
> represents 50%).
>
> The cache subset would be affinitized to a set of cpus in affinitized
> mode. The CPUs to which this allocation is affinitized to is
> represented by the 'cpus' file. The 'cpus' need to be mutually
> exclusive from cpus of other directories.
>
> The cache portion defined in the CBM file is available to all tasks
> within the CAT group and these task are not allowed to allocate space
> in other parts of the cache.
>
> 'cbm' file is used in both modes where as the 'cpus' file is relevant
> in affinitized mode and would disappear in non-affinitized mode.
>
>
> Scheduling and Context Switch
> ------------------------------
>
> In affinitized mode , the cache 'subset' and the tasks in a CAT cgroup
> are affinitized to the CPUs represented by the CAT cgroup's 'cpus'
> file i.e when user sets the 'cbm' to 'portion' and 'cpus' to c and
> 'tasks' to t, the tasks 't' would always be scheduled on cpus 'c' and
> will get to fill in the allocated 'portion' in last level cache.
>
> As noted above ,in the affinitized mode the tasks in a CAT cgroup
> would also be affinitized to the CPUs in the 'cpus' file of the
> directory. Following hooks in the kernel are required to implement
> this (on the lines of cpuset code)
> - in sched_setaffinity to mask the requested cpu mask with what is
> present in the task's 'cpus'
> - in migrate_task to migrate the tasks only to those CPUs in the
> 'cpus' file if possible.
> - in select_task_rq
>
> In non-affinitized mode the 'affinitized' is 0 , and the 'tasks' file
> indicate the tasks the cache subset is affinitized to. When user adds
> tasks to the tasks file , the tasks would get to fill the cache subset
> represented by the CAT cgroup's 'cbm' file.
>
> During context switch kernel implements this by writing the
> corresponding CLOSid (internally maintained by kernel) of the CAT
> cgroup to the CPU's IA32_PQR_ASSOC MSR.
>
> Usage and Example
> -----------------
>
>
> Following would mount the cache allocation cgroup subsystem and create
> 2 directories. Please refer to Documentation/cgroups/cgroups.txt on
> details about how to use cgroups.
>
> cd /sys/fs/cgroup
> mkdir cachealloc
> mount -t cgroup -ocachealloc cachealloc /sys/fs/cgroup/cachealloc
> cd cachealloc
>
> Create 2 cat cgroups
>
> mkdir group1
> mkdir group2
>
> Following are some of the Files in the directory
>
> ls
> cachea.cbm
> cachea.cpus (cpus file only appears in the affinitized mode)
> cgroup.procs
> tasks
> cbm_max (root only)
> affinitized (root only) (by default it's affinitized mode)
>
> Say the cache is 2MB and the cbm supports 16 bits; then the setting
> below allocates the right quarter (512KB) of the cache to group2.
>
> Edit the CBM for group2 to set the least significant 4 bits. This
> allocates the 'right quarter' of the cache:
>
> cd group2
> /bin/echo 0xf > cachealloc.cbm
>
> Assign the cpus for the directory:
>
> /bin/echo 1-4 > cachealloc.cpus
>
> Edit the CBM for group2 to set the least significant 8 bits instead.
> This allocates the right half of the cache to 'group2' (still in the
> group2 directory):
>
> /bin/echo 0xff > cachealloc.cbm
>
> Assign tasks to group2:
>
> /bin/echo PID1 > tasks
> /bin/echo PID2 > tasks
> Now the threads PID1 and PID2 run on CPUs 1-4 and get to fill the
> 'right half' of the cache. The tasks PID1 and PID2 can only have a
> subset of the cpu affinity defined in the 'cpus' file.
>
> Change 'affinitized' to 0. The mode is changed in the root directory:
>
> cd ..
> /bin/echo 0 > cachealloc.affinitized
>
> Now the tasks and the cache allocation are not affinitized to the
> CPUs, and the tasks' cpu affinity is no longer restricted to a subset
> of the 'cpus' cpumask.
>

2014-11-04 13:14:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design

On Fri, Oct 31, 2014 at 11:58:06AM -0400, Tejun Heo wrote:
> > No real magic there. Except now people seem to want to wrap it into
> > magic and hide it all from the admin, pretend it's not there and make it
> > uncontrollable.
>
> Hmmm... I think a difference is how we perceive userspace is composed
> and interacts with the various aspects of kernel. But even in the
> presence of a competent admin that you're suggesting, interactions of
> different aspects of a system are often compartmentalized. e.g. an
> admin configuring cpuset to accommodate a given set of persistent and
> important workloads isn't too likely to expect a memory unit soft
> failure in several weeks and the need to hot-swap the memory module.
> It just isn't cost-effective enough to lump those two planes of
> planning into the same activity especially if the admin is
> hand-crafting the configuration. The issue that I see with the
> current method is that a much rarer exception condition ends up
> messing up configurations which are on a different plane, and that
> there's no recourse once that happens. If the said workload keeps
> forking, there's no easy way to recover the previous configuration.
>
> Both ways of handling the situation have components of surprise but,
> as I wrote before, that surprise is inherent and comes from the fact
> that the kernel can't afford tasks which aren't runnable. As a policy
> of handling the surprising situation, having explicit configured /
> effective settings seems like a better option to me because 1) it
> makes it explicit that the effective configuration may differ from the
> requested one and 2) it makes handling exception cases easier. I
> think #1 is important because hard errors which are rare but do happen
> are very difficult to deal with properly because they're usually
> nearly invisible.

So there are scenarios where you want to hard fail the machine if the
constraints are not met. It's better to just give up than to pretend.

This effective/requested split is policy, a hardcoded kernel policy. One
that doesn't work for a number of cases. Fail and let userspace sort it
out is a much safer option.

Some people want hard guarantees, if you're not willing to cater to them
with cgroups they'll go off and invent yet more muck :/

Do you want to shut down the saw, or pretend it's still controlled and
lose your fingers because it missed a deadline?

Even HPC might not want to pretend continue, they might want to notify
the jobs scheduler and get a different job split, rather than continue
half-arsed. A persistent delay on the job completion barrier is way bad
for them.

> > Typically controllers don't control too many configs at once and the
> > specific return error could be a good hint there.
>
> Usually, yeah. I still end up scratching my head with migration
> rejections w/ cpuset or blkcg tho.

This means you already need to deal with this, so how about we try and
make that work instead of saying we cannot fail migration.

> > You can include in the msg the pid that was just attempted, in the
> > pid namespace of the observer; if the pid is not available in that
> > namespace, discard the message since the observer could not possibly
> > have done the deed.
>
> I don't know. Is that a good interface? If a human admin is echoing
> and dmesg'ing afterwards, it should work but scraping the log for an
> unstructured plain text error usually isn't a very good interface to
> build tools around.
>
> For example, for CAT and its limit on the numbers of possible
> configurations, it can technically be made to work by reporting errors
> on mkdir or task migration; however, it is *far* better and clearer to
> report, say, -ENOSPC when you're actually trying to change the
> configuration. The error is directly tied to the operation requested.
> That's just how it should be whenever possible.

I never suggested dmesg, I was thinking of a cgroup.notifier file that
reports all 'events' for that cgroup.

If you listen to it while performing your operation, you get the msgs:

$ cat cgroup.notifier & echo $pid > tasks ; kill -INT $!

Or something like that. Seeing how the entire cgroup thing is text
based, this would end up spewing text like:

$cgroup-path failed attach $pid: $reason

Where everything is in the namespace of the observer; and if there is
no namespace translation possible, drop the event, because you can't
have seen or done anything anyhow.

> > That's an entirely separate issue; and I don't see that solving the task
> > vs process issue at all.
>
> Hmm... I don't see it that way tho. In-process configuration is
> primarily something to be done by the process while cgroup management
> is to be done by external adminy entity. They are on different
> planes. Individual binaries accessing their own cgroups doesn't make
> a lot of sense and is actually broken. Likewise, external management
> entity meddling with individual threads of a process is at best
> cumbersome. It can be allowed but that's often not how it's useful.
> I really don't see why cgroup would be involved with per-thread
> settings.

Well, people are doing it now. And it 'works' if you assume nobody is
going to do 'crazy' things behind your back, which is a fair assumption
(most of the time).

It's just that some people seem hell bent on doing crazy things behind
your back in the name of progress or whatnot ;-) Take one would be
making sure this background crap can be shot in the head.

I'm not arguing against an atomic interface, I'm just saying its not
required for useful things.

> > Automation is nice and all, but RT is about providing determinism and
> > guarantees. Unless you morph into a full blown RT aware muddleware and
> > have all your RT apps communicate their requirements to it (ie. rewrite
> > them all) to it, this is a non starter.
> >
> > Given that the RR/FIFO APIs are not communicating enough and we need to
> > support them anyhow, human intervention it is.
>
> Yeah, I fully agree with you there. The issue is not that RT/FIFO
> requires explicit actions from userland but that they're currently
> tied to BE scheduling. Conceptually, they don't have to be but
> they're in practice and that ends up requiring whoever, be that an
> admin or automated tool, is managing the BE grouping to also manage
> RT/FIFO slices, which isn't ideal but should be workable. I was
> mostly curious whether they can be separated with a reasonable amount
> of effort. That's a no, right?

What's a BE? Separating them is technically possible (painful maybe),
but doesn't make any kind of sense to me.

> > > Oh, seriously, if I could build this thing from ground up, I'd just
> > > tie it to process hierarchy and make the associations static.
> >
> > This thing being cgroups? I'm not sure static associations cater for the
> > various use cases that people have.
>
> Sure, we have no chance of changing it at this point, but I'm pretty
> sure if we started by tying it to the process hierarchy, we and the
> userland would have been able to achieve about the same set of
> functionalities without all these migration business.

How would we do things like per-cgroup workqueues? We'd need to somehow
spawn kthreads outside of the normal kthreadd hierarchy.

(this btw is something we need to sort, but let's not have that
discussion here -- this email is getting too big as is).

> > Sure simple and consistent is all good, but we should also not make it
> > too simple and thereby exclude useful things.
>
> What are we excluding tho?

Hard guarantees it seems.

> Previously, cgroup didn't have rules,
> policies or conventions. It just had these skeletal features to group
> tasks and every controller did its own thing diverging the way they
> treat hierarchies, errors, migrations, configurations, notifications
> and so on. It didn't put in the effort to actually identify the
> required functionalities or characterize what belongs where. Every
> controller was doing its own brownian motion in the design space.

Sure, agreed, we need more sanity there. I do however think we need to
put in the effort to map out all use cases.

> Most of the properties being identified and policies being set up are
> actually fundamental and inherent. e.g. Creating a subhierarchy and
> organizing the children in them is fundamentally a task
> sub-categorizing operation.

> Conceptually, doing so shouldn't be
> impeded by or affect the resource configured for the parent of that
> sub hierarchy

Uh what? No you want exactly that in a hierarchy. You want children to
submit to the configuration of the parent.

> and for most controllers this can be achieved in a
> straight-forward manner by making children not put further
> restrictions on the resources from their parent on creation.

The other way around, children can only put further restrictions on,
they cannot relax restrictions from the parent.

> I think this is evident for the controller in question being discussed
> on this thread. Task organization - creating cgroups and moving tasks
> around between them - is an inherently different operation from
> configuring each controller. They shouldn't be conflated. It doesn't
> make any sense to fail creation of a cgroup or failing task migration
> later because controller can't be configured certain way. They should
> be orthogonal as much as possible. If there's restriction on
> controller configuration, that should be enforced on controller
> configuration.

I'd mostly agree with that, but note how you put it in relative terms
:-)

I did give one (probably strained) example where putting the fail on the
config side was more constrained than placing it at the migrate.

> > > So, behaviors
> > > which blow configs across migrations and consider them as "fresh" is
> > > completely fine by me.
> >
> > It's not by me, it's completely surprising and counterintuitive.
>
> I don't get it. This is one of few cases where controller is
> distributing hard-walled resources and as you said userland
> intervention is a must in facilitating such distribution. Isn't this
> pretty well in line with what you've been saying? The admin is moving
> a RT / deadline task into a different scheduling domain and if such
> operation always requires setting scheduling policies again, what's
> surprising about it?

It would make cgroups useless. It would break running applications.
You might as well not allow migration at all.

But the very fact that migration would destroy configuration of an
existing task would surprise me, I would -- like stated before -- much
rather refuse the migration than destroy existing state.

> It makes conceptual sense - the task is moving across two scheduling
> domains with different set of hard resources. It'd work well and
> reliably too in practice and userland only has one less vector of
> failure while achieving the same thing.

No, it's absolutely certified insane is what. It introduces a massive
ton of fail. Tasks that were running fine and predictably are then all
of a sudden a complete trainwreck.

> > Smells like you just want to pretend nothing bad happens when you do
> > stupid. I prefer to fail early and fail hard over pretend happy and
> > surprise behaviour any day.
>
> But where am I losing anything? I'm not saying everything is always
> better this way but if I look at the overall compromises, it seems
> like a clear win to me.

You allow the creation of fail and want to mop up the pieces afterwards
-- if at all possible. I want to avoid the creation of fail.

By allowing an effective config different from the requested -- be it
using fewer CPUs than specified, a different scheduling policy, or the
forced use of remote memory -- you could have lost your finger before
you can fix up.

Would it not be better to keep your finger?

2014-11-04 13:17:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design

On Thu, Oct 30, 2014 at 04:18:33PM -0700, Vikas Shivappa wrote:
> One way to do it is to merge the CAT cgroups into the cpuset. In
> essence there is no separate CAT cgroup and we just have a new file
> 'cbm' in the cpuset. This would be visible only when the system has
> Cache Allocation support, and the user can manipulate the cache bit
> mask there.
> The user can use the already existing cpu_exclusive file in the cpuset
> to mark the cgroups to use exclusive CPUs.
> That way we simplify and reuse the cpuset code/hierarchy.. ?

I don't like extending cpusets further. It's already a weird and too
big controller.

What is wrong with having a specific CQM controller and using it
together with cpusets where desired?

2014-11-05 20:41:27

by Tejun Heo

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design

Hello, Peter.

On Tue, Nov 04, 2014 at 02:13:50PM +0100, Peter Zijlstra wrote:
> So there are scenarios where you want to hard fail the machine if the
> constraints are not met. It's better to just give up than to pretend.
>
> This effective/requested split is policy, a hardcoded kernel policy. One
> that doesn't work for a number of cases. Fail and let userspace sort it
> out is a much safer option.

cpuset simply never implemented hard failing. The old policy wasn't a
hard fail. It did the same thing as applying the effective setting.
The only difference is that the process was irreversible. The kind of
hard fail you're talking about would be rejecting a CPU down command
if downing a CPU would create a non-executable cpuset, which would be
a silly conflation of layers.

> Some people want hard guarantees, if you're not willing to cater to them
> with cgroups they'll go off and invent yet more muck :/
>
> Do you want to shut down the saw, or pretend it's still controlled and
> lose your fingers because it missed a deadline?
>
> Even HPC might not want to pretend continue, they might want to notify
> the jobs scheduler and get a different job split, rather than continue
> half-arsed. A persistent delay on the job completion barrier is way bad
> for them.

Again, we never had hard failures for cpuset. The old behavior was
*more* surprising than the new one in that it was all implicit and the
actions taken were out of the ordinary (no other controller action moves
tasks to other cgroups) and irreversible. I agree with your point
that things should be as little surprising as possible but the facts
you're using aren't in support of that point.

One thing which is debatable is whether to allow configuring cpumasks
which make the effective set empty. I don't think we fail that now
but failing that is completely fine and doesn't create discrepancies
with having configured and effective settings.

> > > Typically controllers don't control too many configs at once and the
> > > specific return error could be a good hint there.
> >
> > Usually, yeah. I still end up scratching my head with migration
> > rejections w/ cpuset or blkcg tho.
>
> This means you already need to deal with this, so how about we try and
> make that work instead of saying we cannot fail migration.

My point is that failing these types of things at configuration time
is a lot better approach. Everything sure is a trade-off but the
benefits here seem pretty clear to me.

> I never suggested dmesg, I was thinking of a cgroup.notifier file that
> reports all 'events' for that cgroup.
>
> If you listen to it while performing your operation, you get the msgs:
>
> $ cat cgroup.notifier & echo $pid > tasks ; kill -INT $!
>
> Or something like that. Seeing how the entire cgroup thing is text
> based, this would end up spewing text like:
>
> $cgroup-path failed attach $pid: $reason
>
> Where everything is in the namespace of the observer; and if there is
> no namespace translation possible, drop the event, because you can't
> have seen or done anything anyhow.

Technically, we can do that or any number of other complex schemes, but
isn't it obviously better if we can confine controller configuration
failures to actual configuration attempts? Simple -errno failures
would be enough.

> > Yeah, I fully agree with you there. The issue is not that RT/FIFO
> > requires explicit actions from userland but that they're currently
> > tied to BE scheduling. Conceptually, they don't have to be but
> > they're in practice and that ends up requiring whoever, be that an
> > admin or automated tool, is managing the BE grouping to also manage
> > RT/FIFO slices, which isn't ideal but should be workable. I was
> > mostly curious whether they can be separated with a reasonable amount
> > of effort. That's a no, right?
>
> What's a BE? Separating them is technically possible (painful maybe),
> but doesn't make any kind of sense to me.

Oops, best effort. I was using a term from io scheduling. Sorry
about that. I meant fair_sched_class.

At least conceptually, the hierarchies of different scheduling classes
are orthogonal, so I was wondering whether separating them out would
be possible. If that's not practically feasible, I don't think it's a
big problem. Userland would just have to adapt to it.

> > Sure, we have no chance of changing it at this point, but I'm pretty
> > sure if we started by tying it to the process hierarchy, we and the
> > userland would have been able to achieve about the same set of
> > functionalities without all these migration business.
>
> How would we do things like per-cgroup workqueues? We'd need to somehow
> spawn kthreads outside of the normal kthreadd hierarchy.

We can either have proxy kthreadd's or just reparent tasks once
they're created. We already reparent after all.

> (this btw is something we need to sort, but let's not have that
> discussion here -- this email is getting too big as is).

I don't think discussing this is meaningful. This train has left a
long time ago and I don't see any realistic chance of backtracking to
this route.

> Sure, agreed, we need more sanity there. I do however think we need to
> put in the effort to map out all use cases.

I've been doing that for over a year now. I haven't mapped out *all*
use cases but I do have pretty clear ideas on what matters in
achieving the core functionalities.

> > Conceptually, doing so shouldn't be
> > impeded by or affect the resource configured for the parent of that
> > sub hierarchy
>
> Uh what? No you want exactly that in a hierarchy. You want children to
> submit to the configuration of the parent.

You misunderstood. Yes, children should submit to the configuration
of the parent but the act of merely creating a new child or moving
tasks there shouldn't deviate the configuration from what the parent
has. Using CAT as an example, creating a child shouldn't create a new
configuration. It should in effect have the same configuration as its
parent. As such, moving tasks in there shouldn't fail as long as
tasks can be moved to the parent, which is a property we want to
maintain. This is really fundamental - the operation of
sub-categorization shouldn't affect controller configuration. They
should and can remain orthogonal.

> > and for most controllers this can be achieved in a
> > straight-forward manner by making children not put further
> > restrictions on the resources from their parent on creation.
>
> The other way around, children can only put further restrictions on,
> they cannot relax restrictions from the parent.

I meant on creation. Putting further restrictions is the only thing a
child can do but on creation it should have the same effective
configuration as its parent.

> > I think this is evident for the controller in question being discussed
> > on this thread. Task organization - creating cgroups and moving tasks
> > around between them - is an inherently different operation from
> > configuring each controller. They shouldn't be conflated. It doesn't
> > make any sense to fail creation of a cgroup or to fail task migration
> > later because controller can't be configured certain way. They should
> > be orthogonal as much as possible. If there's restriction on
> > controller configuration, that should be enforced on controller
> > configuration.
>
> I'd mostly agree with that, but note how you put it in relative terms
> :-)

But everything is relative. At the moment we lose sight of that, we
lose the ability to make sensible and healthy trade-offs. I could
have written the above in absolutes but I actively avoid that whenever
possible.

> I did give one (probably strained) example where putting the fail on the
> config side was more constrained than placing it at the migrate.

If you're referring to cpuset, it wasn't a good example.

> > I don't get it. This is one of few cases where controller is
> > distributing hard-walled resources and as you said userland
> > intervention is a must in facilitating such distribution. Isn't this
> > pretty well in line with what you've been saying? The admin is moving
> > a RT / deadline task into a different scheduling domain and if such
> > operation always requires setting scheduling policies again, what's
> > surprising about it?
>
> It would make cgroups useless. It would break running applications.
> You might as well not allow migration at all.

Task migrations will be a low-priority managerial operation. It's
mostly used to set up the initial hierarchy. Tasks should be put in a
logical structure on startup and resource control changes should
happen through specific controller enable/disable and configuration
changes. This is inherent in the unified hierarchy design and the
reason why controllers are individually enabled and disabled at each
level. Task categorization is an orthogonal operation to resource
restriction. Tasks are logically organized and resource controls are
dynamically configured over the logical structure.

So, yes, the role of migration is diminished in the unified hierarchy
and that's by design. We can't go full static process hierarchy at
this point but this way we can get reasonably close while
accommodating gradual transition.

> But the very fact that migration would destroy configuration of an
> existing task would surprise me, I would -- like stated before -- much
> rather refuse the migration than destroy existing state.

I suppose this depends on the perspective but if the RT config is
reliably reset on migration, I don't see why it'd be surprising. It's
a well-defined behavior which happens without exception, and we already
have a precedent of changing per-task settings according to a task's
cgroup membership - cpuset overrides the cpu and node masks on
migration.

> By allowing an effective config different from the requested -- be it
> either using less CPUs than specified, or a different scheduling policy
> or the forced use of remote memory, you could have lost your finger
> before you can fix up.

I don't get why you're lumping the cpuset and cpu situations together.
They're different and cpu doesn't deal with any "effective" settings.

Thanks.

--
tejun

2014-11-06 16:27:19

by Matt Fleming

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design

On Thu, 30 Oct, at 11:47:40PM, Peter Zijlstra wrote:
>
> Let me reply to just this one, I'll do the rest tomorrow, need sleeps.
>
> On Thu, Oct 30, 2014 at 06:22:36PM -0400, Tejun Heo wrote:
>
> > > > This controller might not even require the distinction between
> > > > configured and effective tho? Can't a new child just inherit the
> > > > parent's configuration and never allow the config to become completely
> > > > empty?
> > >
> > > It can do that. But that still has a problem, there is a mapping in
> > > hardware which restricts the number of active configurations. The total
> > > configuration space is larger than the supported active configurations.
> > >
> > > So _something_ must fail. The initial proposal was mkdir failing when
> > > there were more than the hardware supported active config cgroup
> > > directories. The alternative was on-demand activation where we only
> > > allocate the hardware resource when the first task gets moved into the
> > > group -- which then clearly can fail.
> >
> > Hmmm... why can't it just refuse creating a different configuration
> > when its config space is full? Make children inherit the parent's
> > configuration and refuse config writes which require it to create a
> > new one if the config space is full. Seems pretty straight-forward.
> > What am I missing?
>
> We could do that I suppose; there is one corner case that it would not
> allow: intermediate directories with a restricted config that also have
> priv restrictions but no actual tasks. Not sure that makes sense though.

Could you elaborate on this configuration?

> Are there any other cases I might have missed?

I don't think so.

So, for the specific CAT case, what you're proposing is to make the
failure case happen when writing to the cache bitmask file instead of
failing mkdir() or echo $tid > tasks?

I think that's OK. If we've run out of CLOS ids I would expect to see
-ENOSPC returned, whereas if we try and set an invalid bitmask we'd get
-EINVAL.
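
For illustration, a minimal sketch of that write-path split; the struct
and helper names here are hypothetical, it only illustrates the
-EINVAL vs -ENOSPC contract:

	#include <linux/errno.h>

	/* Hypothetical handler for a write to the cbm file. */
	static int cat_cbm_write(struct cat_cgroup *cg, unsigned long cbm)
	{
		if (!cbm_is_valid(cbm))		/* assumed validity check */
			return -EINVAL;		/* invalid bitmask */

		if (!closid_exists_for(cbm) && closids_exhausted())
			return -ENOSPC;		/* out of CLOS ids */

		return cat_apply_cbm(cg, cbm);	/* assumed */
	}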

Vikas, Will?

--
Matt Fleming, Intel Open Source Technology Center

2014-11-06 17:03:30

by Matt Fleming

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design

On Tue, 04 Nov, at 02:17:14PM, Peter Zijlstra wrote:
>
> I don't like extending cpusets further. It's already a weird and too big
> controller.
>
> What is wrong with having a specific CQM controller and using it
> together with cpusets where desired?

The specific problem that conflating cpusets and the CAT controller is
trying to solve is that on some platforms the CLOS ID doesn't move with
data that travels up the cache hierarchy, i.e. we lose the CLOS ID when
data moves from LLC to L2.

I think the idea with pinning CLOS IDs to a specific cpu and any tasks
that are using that ID is that it works around this problem out of the
box, rather than requiring sysadmins to configure things.

--
Matt Fleming, Intel Open Source Technology Center

2014-11-06 17:21:55

by Shivappa Vikas

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design



On Thu, 6 Nov 2014, Matt Fleming wrote:

> On Thu, 30 Oct, at 11:47:40PM, Peter Zijlstra wrote:
>>
>> Let me reply to just this one, I'll do the rest tomorrow, need sleeps.
>>
>> On Thu, Oct 30, 2014 at 06:22:36PM -0400, Tejun Heo wrote:
>>
>>>>> This controller might not even require the distinction between
>>>>> configured and effective tho? Can't a new child just inherit the
>>>>> parent's configuration and never allow the config to become completely
>>>>> empty?
>>>>
>>>> It can do that. But that still has a problem, there is a mapping in
>>>> hardware which restricts the number of active configurations. The total
>>>> configuration space is larger than the supported active configurations.
>>>>
>>>> So _something_ must fail. The initial proposal was mkdir failing when
>>>> there were more than the hardware supported active config cgroup
>>>> directories. The alternative was on-demand activation where we only
>>>> allocate the hardware resource when the first task gets moved into the
>>>> group -- which then clearly can fail.
>>>
>>> Hmmm... why can't it just refuse creating a different configuration
>>> when its config space is full? Make children inherit the parent's
>>> configuration and refuse config writes which require it to create a
>>> new one if the config space is full. Seems pretty straight-forward.
>>> What am I missing?
>>
>> We could do that I suppose; there is one corner case that it would not
>> allow: intermediate directories with a restricted config that also have
>> priv restrictions but no actual tasks. Not sure that makes sense though.
>
> Could you elaborate on this configuration?
>
>> Are there any other cases I might have missed?
>
> I don't think so.
>
> So, for the specific CAT case, what you're proposing is to make the
> failure case happen when writing to the cache bitmask file instead of
> failing mkdir() or echo $tid > tasks?
>
> I think that's OK. If we've run out of CLOS ids I would expect to see
> -ENOSPC returned, whereas if we try and set an invalid bitmask we'd get
> -EINVAL.
>
> Vikas, Will?

Yes, that is correct. You can always create more cgroups; the new
cgroup just inherits the mask from the parent and uses the same CLOSid
as its parent, so it won't fail for lack of CLOSids.

The only case of failure, as you said, is when the user tries to modify
a cbm to a different one.
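
For illustration, a minimal sketch of that inheritance on cgroup
creation; the struct and helper names are hypothetical:

	/*
	 * On mkdir, share the parent's CBM and CLOSid; a new CLOSid
	 * would only be allocated later, if the child's cbm is changed
	 * to a mask no existing CLOSid already carries.
	 */
	static void cat_cgroup_create(struct cat_cgroup *child,
				      struct cat_cgroup *parent)
	{
		child->cbm = parent->cbm;
		child->closid = parent->closid;
		closid_get(child->closid);	/* assumed refcount get */
	}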

>
> --
> Matt Fleming, Intel Open Source Technology Center
>

2014-11-10 15:50:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Cache Allocation Technology Design

On Thu, Nov 06, 2014 at 05:03:23PM +0000, Matt Fleming wrote:
> On Tue, 04 Nov, at 02:17:14PM, Peter Zijlstra wrote:
> >
> > I don't like extending cpusets further. It's already a weird and too big
> > controller.
> >
> > What is wrong with having a specific CQM controller and using it
> > together with cpusets where desired?
>
> The specific problem that conflating cpusets and the CAT controller is
> trying to solve is that on some platforms the CLOS ID doesn't move with
> data that travels up the cache hierarchy, i.e. we lose the CLOS ID when
> data moves from LLC to L2.
>
> I think the idea with pinning CLOS IDs to a specific cpu and any tasks
> that are using that ID is that it works around this problem out of the
> box, rather than requiring sysadmins to configure things.

So either the user needs to set that mode _and_ set cpu masks, or the
user needs to use cpusets and set masks, same difference to me.