Over the last couple of months, we have seen a number of proposals for
resource management infrastructure/controllers and also good discussions
surrounding those proposals. These discussions have resulted in a few
consensus points and a few other points that are still being debated.
This RFC is an attempt to:
o summarize various proposals to date for infrastructure
o summarize consensus/debated points for infrastructure
o (more importantly) get the various stakeholders to agree on what is a good
  compromise for the infrastructure going forward
A couple of questions that I am trying to address in this RFC:
- Do we wait till controllers are worked out before merging
any infrastructure?
IMHO, it's good if we can merge some basic infrastructure now
and incrementally enhance it and add controllers based on it.
This perspective leads to the second question below ..
- Paul Menage's patches present a rework of existing code, which makes
it simpler to get it in. Does it meet container (Openvz/Linux
VServer) and resource management requirements?
Paul has ported over the CKRM code on top of his patches. So I
am optimistic that it meets resource management requirements in
general.
One shortcoming I have seen in it is that it lacks an
efficient method to retrieve tasks associated with a group.
This may be needed by a few controller implementations if they
have to support, say, a change of resource limits. This, however,
I think could be addressed quite easily (by a linked list
hanging off each container structure).
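A rough sketch of that linked-list idea (all names here - container,
cn_tasks, cn_link and the helpers - are hypothetical illustrations, not
taken from any of the posted patches; the list node would have to be
added to task_struct):

/*
 * Hypothetical sketch: keep the tasks attached to a container on a
 * list so a controller can walk them cheaply when, say, a limit
 * changes.  Assumes a "struct list_head cn_link;" member added to
 * task_struct.
 */
#include <linux/list.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

struct container {
        struct list_head cn_tasks;      /* tasks in this container */
        spinlock_t cn_lock;             /* protects cn_tasks */
        /* ... per-controller state ... */
};

static void container_attach_task(struct container *cn,
                                  struct task_struct *tsk)
{
        spin_lock(&cn->cn_lock);
        list_add_tail(&tsk->cn_link, &cn->cn_tasks);
        spin_unlock(&cn->cn_lock);
}

/* Walk the member tasks, e.g. when a resource limit is changed. */
static void container_for_each_task(struct container *cn,
                                    void (*fn)(struct task_struct *))
{
        struct task_struct *tsk;

        spin_lock(&cn->cn_lock);
        list_for_each_entry(tsk, &cn->cn_tasks, cn_link)
                fn(tsk);
        spin_unlock(&cn->cn_lock);
}
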
Resource Management - Goals
---------------------------
Develop mechanisms for isolating the use of shared resources like CPU and memory
between various applications. This includes:
- mechanism to group tasks by some attribute (ex: containers,
CKRM/RG class, cpuset etc)
- mechanism to monitor and control usage of a variety of resources by
such groups of tasks
Resources to be managed:
- Memory, CPU and disk I/O bandwidth (of high interest perhaps)
- network bandwidth, number of tasks/file-descriptors/sockets etc.
Proposals to date for infrastructure
------------------------------------
- CKRM/RG
- UBC
- Container implementation (by Paul Menage) based on generalization of
cpusets.
A. Class-based Kernel Resource Management/Resource Groups
Framework to monitor/control use of various resources by a group of
tasks as per specified guarantee/limits.
Provides a config-fs based interface to:
- create/delete task-groups
- allow a task to change its (or some other task's) association
from one group to other (provided it has the right
privileges). New children of the affected task inherit the
same group association.
- list tasks present in a group (A group can exist without any
tasks associated with it)
- specify group's min/max use of various resources. A special
value "DONT_CARE" specifies that the group doesn't care for
how much resource it gets.
- obtain resource usage statistics
- Supports a hierarchy depth of 1 (??)
In addition to this user-interface, it provides a framework for
controllers to:
- register/deregister themselves
- be intimated about changes in resource allocation for a group
- be intimated about task movements between groups
- be intimated about creation/deletion of groups
- know which group a task belongs to
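In rough C terms, the controller-side framework described above boils
down to a set of callbacks like the following (purely illustrative; these
are not the actual CKRM/RG symbol names):

struct resource_group;                  /* opaque task-group object */
struct task_struct;

struct rg_controller {
        const char *name;

        /* the group's min/max allocation was changed from user space */
        void (*set_share)(struct resource_group *rg, int min, int max);

        /* a task moved from one group to another */
        void (*move_task)(struct task_struct *tsk,
                          struct resource_group *from,
                          struct resource_group *to);

        /* group lifetime */
        int  (*group_create)(struct resource_group *rg);
        void (*group_destroy)(struct resource_group *rg);
};

int rg_register_controller(struct rg_controller *ctlr);
int rg_unregister_controller(struct rg_controller *ctlr);

/* which group does a task currently belong to? */
struct resource_group *rg_group_of(struct task_struct *tsk);
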
B. UBC
Framework to account and limit usage of various resources by a
container (group of tasks).
Provides a system call based interface to:
- set a task's beancounter id. If the id does not exist, a new
beancounter object is created
- change a task's association from one beancounter to other
- return beancounter id to which the calling task belongs
- set limits of consumption of a particular resource by a
beancounter
- return statistics information for a given beancounter and
resource.
Provides a framework for controllers to:
- register various resources
- lookup beancounter object given a particular id
- charge/uncharge usage of some resource to a beancounter by
some amount
- also know if the resulting usage is above the allowed
soft/hard limit.
- change a task's accounting beancounter (useful in, say,
interrupt handling)
- know when the resource limits change for a beancounter
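A controller built on such a framework would use it roughly as follows
(function and constant names are made up for illustration; the real UBC
symbols differ):

#include <linux/errno.h>

struct beancounter;                     /* opaque, looked up by numeric id */

enum bc_resource { BC_KMEMSIZE, BC_PHYSPAGES, BC_NR_RESOURCES };

struct beancounter *bc_lookup(unsigned int id);

/* returns 0 on success, -ENOMEM if the charge would exceed the limit */
int  bc_charge(struct beancounter *bc, enum bc_resource res,
               unsigned long val, int strict);
void bc_uncharge(struct beancounter *bc, enum bc_resource res,
                 unsigned long val);

/* example hook: charge pages strictly (i.e. against the hard limit) */
static int example_alloc_hook(struct beancounter *bc, unsigned long pages)
{
        if (bc_charge(bc, BC_PHYSPAGES, pages, 1))
                return -ENOMEM;         /* over the limit: refuse */
        /* on a later failure path: bc_uncharge(bc, BC_PHYSPAGES, pages); */
        return 0;
}
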
C. Paul Menage's container patches
Provides a generic hierarchical process grouping mechanism based on
cpusets, which can be used for resource management purposes.
Provides a filesystem-based interface to:
- create/destroy containers
- change a task's association from one container to other
- retrieve all the tasks associated with a container
- know which container a task belongs to (from /proc)
- know when the last task belonging to a container has exited
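From user space, driving such a filesystem interface looks roughly like
this (the /dev/container mount point and the cpuset-style 'tasks' file
name are assumptions used only for illustration):

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        FILE *f;

        /* create a container: a directory in the mounted filesystem */
        if (mkdir("/dev/container/mygroup", 0755) && errno != EEXIST) {
                perror("mkdir");
                return 1;
        }

        /* move this process into it by writing its pid to 'tasks' */
        f = fopen("/dev/container/mygroup/tasks", "w");
        if (!f) {
                perror("tasks");
                return 1;
        }
        fprintf(f, "%d\n", getpid());
        fclose(f);

        /* reading 'tasks' back lists the member pids */
        return 0;
}
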
Consensus/Debated Points
------------------------
Consensus:
- Provide resource control over a group of tasks
- Support movement of task from one resource group to another
- Don't support hierarchy for now
- Support limits (soft and/or hard depending on the resource
  type) in controllers. The guarantee feature could be indirectly
  met through limits.
Debated:
- syscall vs configfs interface
- Interaction of resource controllers, containers and cpusets
- Should we support, for instance, creation of resource
groups/containers under a cpuset?
- Should we have different groupings for different resources?
- Support movement of all threads of a process from one group
to another atomically?
--
Regards,
vatsa
We've seen a lot of discussion lately on the memory controller. The RFC below
provides a summary of the discussions so far. The goal of this RFC is to bring
together the thoughts so far, build consensus and agree on a path forward.
NOTE: I have tried to keep the information as accurate and current as possible.
Please bring out any omissions/corrections if you notice them. I would like to
keep this summary document accurate, current and live.
Summary of Memory Controller Discussions and Patches
1. Accounting
The patches submitted so far agree that the following memory
should be accounted for
Reclaimable memory
(i) Anonymous pages - Anonymous pages are pages allocated by user space;
they are mapped into the user page tables, but not backed by a file.
(ii) File mapped pages - File mapped pages map a portion of a file
(iii) Page Cache Pages - Consists of the following
(a) Pages used during IPC using shmfs
(b) Pages of a user mode process that are swapped out
(c) Pages from block read/write operations
(d) Pages from file read/write operations
Non Reclaimable memory
This memory is not reclaimable until it is explicitly released by the
allocator. Examples of such memory include slab allocated memory and
memory allocated by the kernel components in process context. mlock()'ed
memory is also considered as non-reclaimable, but it is usually handled
as a separate resource.
(i) Slabs
(ii) Kernel pages and page_tables allocated on behalf of a task.
2. Control considerations for the memory controller
Control can be implemented using either
(i) Limits
Limits cap the usage of the resource at the specified value. If the
resource usage crosses the limit, then the group might be penalized
or restricted. Soft limits can be exceeded by the group as long as
the resource is still available. Hard limits are usually the cut-off point:
no additional resources may be allocated beyond the hard limit.
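Expressed as a controller's charge check, the soft/hard distinction is
roughly the following (hypothetical code, only meant to pin down the
semantics; 'resource_free' stands in for a resource-specific
availability check):

enum charge_result { CHARGE_OK, CHARGE_OVER_SOFT, CHARGE_REFUSED };

struct res_limit {
        unsigned long usage;
        unsigned long soft_limit;
        unsigned long hard_limit;       /* the cut-off point */
};

static enum charge_result try_charge(struct res_limit *r,
                                     unsigned long amount,
                                     int resource_free)
{
        if (r->usage + amount > r->hard_limit)
                return CHARGE_REFUSED;  /* never exceed the hard limit */

        /* the soft limit only stretches while the resource is still free */
        if (r->usage + amount > r->soft_limit && !resource_free)
                return CHARGE_REFUSED;

        r->usage += amount;
        return r->usage > r->soft_limit ? CHARGE_OVER_SOFT : CHARGE_OK;
}
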
(ii) Guarantees
Guarantees come in two forms:
(a) A soft guarantee is a best-effort service to provide the group
with the specified guarantee of resource availability. In this form,
resources can be shared among groups (the unutilized resources of one
group can be used by other groups) and groups are allowed to
exceed their guarantee when the resource is available (i.e., there is
no other group unable to meet its guarantee). When a group is unable
to meet its guarantee, the system tries to provide it with its
guaranteed resources by trying to reclaim from other groups that
have exceeded their guarantee. If, in spite of its best effort, the
system is unable to meet the specified guarantee, the guarantee-failed
statistic of the group is incremented. This form of guarantee
is best suited for non-reclaimable resources.
(b) A hard guarantee is a more deterministic method of providing QoS.
Resources need to be allocated in advance to ensure that the group
is always able to meet its guarantee. This form is undesirable as
it leads to resource under-utilization. Another approach is to
allow sharing of resources, but when a group is unable to meet its
guarantee, the system will OOM kill a group that exceeds its
guarantee. Hard guarantees are more difficult to provide for
non-reclaimable resources, but might be easier to provide for
reclaimable resources.
NOTE: It has been argued that guarantees can be implemented using
limits. See http://wiki.openvz.org/Guarantees_for_resources
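The arithmetic behind that argument is simply that capping every other
group indirectly reserves the remainder of the resource for a given
group (an illustrative sketch only):

/*
 * If every other group is capped by a limit, whatever is left of the
 * total resource is effectively guaranteed to group 'me'.
 */
static unsigned long implied_guarantee(unsigned long total,
                                       const unsigned long *limit,
                                       int ngroups, int me)
{
        unsigned long others = 0;
        int i;

        for (i = 0; i < ngroups; i++)
                if (i != me)
                        others += limit[i];

        return others >= total ? 0 : total - others;
}
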
3. Memory Controller Alternatives
(i) Beancounters
(ii) Containers
(iii) Resource groups (aka CKRM)
(iv) Fake Nodes
+----+---------+------+---------+------------+----------------+-----------+
| No |Guarantee| Limit| User I/F| Controllers| New Controllers|Statistics |
+----+---------+------+---------+------------+----------------+-----------+
| i | No | Yes | syscall | Memory | No framework | Yes |
| | | | | | to write new | |
| | | | | | controllers | |
+----+---------+------+---------+------------+----------------+-----------+
|ii | No | Yes | configfs| Memory, | Plans to | Yes |
| | | | | task limit.| provide a | |
| | | | | Plans to | framework | |
| | | | | allow | to write new | |
| | | | | CPU and I/O| controllers | |
+----+---------+------+---------+------------+----------------+-----------+
|iii | Yes | Yes | configfs| CPU, task | Provides a | Yes |
| | | | | limit & | framework to | |
| | | | | Memory | add new | |
| | | | | controller.| controllers | |
| | | | | I/O contr | | |
| | | | | oller for | | |
| | | | | older | | |
| | | | | revisions | | |
+----+---------+------+---------+------------+----------------+-----------+
4. Existing accounting
a. Beancounters currently account for the following resources
(i) kmemsize - memory obtained through alloc_pages() with __GFP_BC flag set.
(ii) physpages - Resident set size of the tasks in the group.
Reclaim support is provided for this resource.
(iii) lockedpages - User pages locked in memory
(iv) slabs - slabs allocated with kmem_cache_alloc_bc are accounted and
controlled.
Beancounters provides some support for event notification (limit/barrier hit).
b. Containers account for the following resources
(i) mapped pages
(ii) anonymous pages
(iii) file pages (from the page cache)
(iv) active pages
There is some support for reclaiming pages; the code is in the early stages of
development.
c. CKRM/RG Memory Controller
(i) Tracks active pages
(ii) Supports reclaim of LRU pages
(iii) Shared pages are not tracked
This controller provides its own res_zone, to aid reclaim and tracking of pages.
d. Fake NUMA Nodes
This approach was suggested while discussing the memory controller
Advantages
(i) Accounting for zones is already present
(ii) Reclaim code can directly deal with zones
Disadvantages
(i) The approach leads to hard partitioning of memory.
(ii) It's complex to resize a node. Resizing is required to allow
changing limits for resource management.
(iii) Addition/deletion of a resource group would require memory hotplug
support to add/delete a node. On deletion of a node, its memory is
not utilized until a new node of the same or lesser size is created.
Addition of a node requires reserving memory for it upfront.
5. Open issues
(i) Can we allow threads belonging to the same process to belong
to two different resource groups? Does this mean we need to do per-thread
VM accounting now?
(ii) There is an overhead associated with adding a pointer in struct page.
Can this be reduced/avoided? One solution suggested is to use a
mirror mem_map.
(iii) How do we distribute the remaining resources among resource-hungry
groups? The Resource Group implementation distributed them in proportion
to the groups' limits (see the sketch after this list).
(iv) How do we account for shared pages? Should it be charged to the first
container which touches the page or should it be charged equally among
all containers sharing the page?
(v) Definition of RSS (see http://lkml.org/lkml/2006/10/10/130)
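For open issue (iii), the limit-ratio idea amounts to the following
(an illustrative sketch, not the actual Resource Group code; integer
division leaves a small undistributed remainder):

/* give each group a share of 'spare' proportional to its limit */
static void distribute_by_limit_ratio(unsigned long spare,
                                      const unsigned long *limit,
                                      unsigned long *extra,
                                      int ngroups)
{
        unsigned long total_limit = 0;
        int i;

        for (i = 0; i < ngroups; i++)
                total_limit += limit[i];

        for (i = 0; i < ngroups; i++)
                extra[i] = total_limit ?
                        spare * limit[i] / total_limit : 0;
}
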
6. Going forward
(i) Agree on requirements (there has been some agreement already, please
see http://lkml.org/lkml/2006/9/6/102 and the BOF summary [7])
(ii) Agree on minimum accounting and hooks in the kernel. It might be
a good idea to take this up in phases
phase 1 - account for user space memory
phase 2 - account for kernel memory allocated on behalf of the user/task
(iii) Infrastructure - There is a separate RFC on that.
7. References
1. http://www.openvz.org
2. http://lkml.org/lkml/2006/9/19/283 (Containers patches)
3. http://lwn.net/Articles/200073/ (Another Container Implementation)
4. http://ckrm.sf.net (Resource Groups)
5. http://lwn.net/Articles/197433/ (Resource Beancounters)
6. http://lwn.net/Articles/182369/ (CKRM Rebranded)
7. http://lkml.org/lkml/2006/7/26/237 (OLS BoF on Resource Management (NOTES))
vatsa wrote:
> C. Paul Menage's container patches
>
> Provides a generic heirarchial ...
>
> Consensus/Debated Points
> ------------------------
>
> Consensus:
> ...
> - Dont support heirarchy for now
Looks like this item can be dropped from the consensus ... ;).
I for one would recommend getting the hierarchy right from the
beginning.
Though I can appreciate that others were trying to "keep it simple"
and postpone dealing with such complications, I don't agree.
Such stuff as this deeply affects all that sits on it. Get the
basic data shape presented by the kernel-user API right up front.
The rest will follow, much easier.
Good review of the choices - thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On 10/30/06, Srivatsa Vaddagiri <[email protected]> wrote:
>
> - Paul Menage's patches present a rework of existing code, which makes
> it simpler to get it in. Does it meet container (Openvz/Linux
> VServer) and resource management requirements?
>
> Paul has ported over the CKRM code on top of his patches. So I
> am optimistic that it meets resource management requirements in
> general.
>
> One shortcoming I have seen in it is that it lacks an
> efficient method to retrieve tasks associated with a group.
> This may be needed by few controllers implementations if they
> have to support, say, change of resource limits. This however
> I think could be addressed quite easily (by a linked list
> hanging off each container structure).
The cpusets code which this was based on simply locked the task list,
and traversed it to find threads in the cpuset of interest; you could
do the same thing in any other resource controller.
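For reference, that scan is essentially the following (a sketch only;
the name of the per-task container pointer is an assumption):

#include <linux/sched.h>

struct container;

static void for_each_task_in_container(struct container *cn,
                                       void (*fn)(struct task_struct *))
{
        struct task_struct *g, *p;

        read_lock(&tasklist_lock);
        do_each_thread(g, p) {
                if (p->container == cn)         /* assumed field name */
                        fn(p);
        } while_each_thread(g, p);
        read_unlock(&tasklist_lock);
}
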
Not keeping a list of tasks in the container makes fork/exit more
efficient, and I assume is the reason that cpusets made that design
decision. If we really wanted to keep a list of tasks in a container
it wouldn't be hard, but should probably be conditional on at least
one of the registered resource controllers to avoid unnecessary
overhead when none of the controllers actually care (in a similar
manner to the fork/exit callbacks, which only take the container
callback mutex if some container subsystem is interested in fork/exit
events).
>
> - register/deregister themselves
> - be intimated about changes in resource allocation for a group
> - be intimated about task movements between groups
> - be intimated about creation/deletion of groups
> - know which group a task belongs to
Apart from the deregister, my generic containers patch provides all of
these as well.
How important is it for controllers/subsystems to be able to
deregister themselves, do you think? I could add it relatively easily,
but it seemed unnecessary in general.
>
> B. UBC
>
> Framework to account and limit usage of various resources by a
> container (group of tasks).
>
> Provides a system call based interface to:
>
> - set a task's beancounter id. If the id does not exist, a new
> beancounter object is created
> - change a task's association from one beancounter to other
> - return beancounter id to which the calling task belongs
> - set limits of consumption of a particular resource by a
> beancounter
> - return statistics information for a given beancounter and
> resource.
I've not really played with it yet, but I don't see any reason why the
beancounter resource control concept couldn't also be built over
generic containers. The user interface would be different, of course
(filesystem vs syscall), but maybe even that could be emulated if
there was a need for backwards compatibility.
>
> Consensus:
>
> - Provide resource control over a group of tasks
> - Support movement of task from one resource group to another
> - Dont support heirarchy for now
Both CKRM/RG and generic containers support a hierarchy.
> - Support limit (soft and/or hard depending on the resource
> type) in controllers. Guarantee feature could be indirectly
> met thr limits.
That's an issue for resource controllers, rather than the underlying
infrastructure, I think.
>
> Debated:
> - syscall vs configfs interface
> - Interaction of resource controllers, containers and cpusets
> - Should we support, for instance, creation of resource
> groups/containers under a cpuset?
> - Should we have different groupings for different resources?
I've played around with the idea of having the hierarchies of resource
controller entities be distinct from the hierarchy of process
containers.
The simplest form of this would be that at each level in the hierarchy
the user could indicate, for each resource controller, whether child
containers would inherit the same resource entity for that controller,
or would have a new one created. E.g. you could determine, when you
create a child container, whether tasks in that container would be in
the same cpuset as the parent, or in a fresh cpuset; this would be
independent of whether they were in the same disk I/O scheduling
domain, or in a fresh child domain, etc. This would be an extension of
the "X_enabled" files that appear in the top-level container directory
for each container subsystem in my current patch.
At a more complex level, the resource controller entity tree for each
resource controller could be independent, and the mapping from
containers to resource controller nodes could be arbitrary and
different for each controller - so every process would belong to
exactly one container, but the user could pick e.g. any cpuset and any
disk I/O scheduling domain for each container.
Both of these seem a little complex for a first cut of the code, though.
Paul
On 10/30/06, Balbir Singh <[email protected]> wrote:
> +----+---------+------+---------+------------+----------------+-----------+
> |ii | No | Yes | configfs| Memory, | Plans to | Yes |
> | | | | | task limit.| provide a | |
> | | | | | Plans to | framework | |
> | | | | | allow | to write new | |
> | | | | | CPU and I/O| controllers | |
I have a port of Rohit's memory controller to run over my generic containers.
>
> d. Fake NUMA Nodes
>
> This approach was suggested while discussing the memory controller
>
> Advantages
>
> (i) Accounting for zones is already present
> (ii) Reclaim code can directly deal with zones
>
> Disadvantages
>
> (i) The approach leads to hard partitioning of memory.
> (ii) It's complex to
> resize the node. Resizing is required to allow change of limits for
> resource management.
> (ii) Addition/Deletion of a resource group would require memory hotplug
> support for add/delete a node. On deletion of node, its memory is
> not utilized until a new node of a same or lesser size is created.
> Addition of node, requires reserving memory for it upfront.
A much simpler way of adding/deleting/resizing resource groups is to
partition the system at boot time into a large number of fake numa
nodes (say one node per 64MB in the system) and then use cpusets to
assign the appropriate number of nodes to each group. We're finding a few
inefficiencies in the current code when using such a large number of
small nodes (e.g. slab alien node caches), but we're confident that we
can iron those out.
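To make the arithmetic concrete, a group's memory size then maps to a
count of fake nodes written into its cpuset (user-space sketch; the
/dev/cpuset mount point, the 'mems' file name and the 64MB node size are
assumptions taken from this discussion):

#include <stdio.h>

#define NODE_MB 64      /* assumed fake-node granularity */

/* e.g. set_group_memory("/dev/cpuset/jobA", 0, 1024) writes "0-15" */
static int set_group_memory(const char *cpuset_path, int first_node,
                            unsigned long megabytes)
{
        unsigned long nodes = (megabytes + NODE_MB - 1) / NODE_MB;
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "%s/mems", cpuset_path);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%d-%lu\n", first_node, first_node + nodes - 1);
        return fclose(f);
}
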
> (iv) How do we account for shared pages? Should it be charged to the first
> container which touches the page or should it be charged equally among
> all containers sharing the page?
A third option is to allow inodes to be associated with containers in
their own right, and charge all pages for those inodes to the
associated container. So if several different containers are sharing a
large data file, you can put that file in its own container, and you
then have an exact count of how many pages are in use in that shared
file.
This is cheaper than having to keep track of multiple users of a page,
and is also useful when you're trying to do scheduling, to decide who
to evict. Suppose you have two jobs each allocating 100M of anonymous
memory and each accessing all of a 1G shared file, and you need to
free up 500M of memory in order to run a higher-priority job.
If you charge the first user, then it will appear that the first job
is using 1.1G of memory and the second is using 100M of memory. So you
might evict the first job, thinking it would free up 1.1G of memory -
but it would actually only free up 100M of memory, since the shared
pages would still be in use by the second job.
If you share the charge between both users, then it would appear that
each job is using 600M of memory - but it's still the case that
evicting either one would only free up 100M of memory.
If you can see that the shared file that they're both using is
accounting for 1G of the memory total, and that they're each using
100M of anon memory, then it's easier to see that you'd need to evict
*both* jobs in order to free up 500M of memory.
Paul
Paul M wrote:
> The cpusets code which this was based on simply locked the task list,
> and traversed it to find threads in the cpuset of interest; you could
> do the same thing in any other resource controller.
I get away with this in the cpuset code because:
1) I have the cpuset pointer directly in 'task_struct', so don't
have to chase down anything, for each task, while scanning the
task list. I just have to ask, for each task, if its cpuset
pointer points to the cpuset of interest.
2) I don't care if I get an inconsistent answer, so I don't have
to lock each task, nor do I even lockout the rest of the cpuset
code. All I know, at the end of the scan, is that each task that
I claim is attached to the cpuset in question was attached to it at
some point during my scan, not necessarily all at the same time.
3) It's not a flaming disaster if the kmalloc() of enough memory
to hold all the pids I collect in a single array fails. That
just means that some hapless user's open for read of a cpuset
'tasks' file failed, -ENOMEM. Oh well ...
If someone is actually trying to manage system resources accurately,
they probably can't get away with being as fast and loose as this.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Paul M wrote:
> I've played around with the idea where the hierarchies of resource
> controller entities was distinct from the hierarchy of process
> containers.
It would be nice, me thinks, if the underlying container technology
didn't really care whether we had one hierarchy or seven. Let the
users (such as CKRM/RG, cpusets, ...) of this container infrastructure
determine when and where they need separate hierarchies, and when and
where they are better off sharing the same hierarchy.
The question of one or more separate hierarchies is one of those long
term questions that should be driven by the basic semantics of what we
are trying to model, not by transient infrastructure expediencies.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On 10/30/06, Paul Jackson <[email protected]> wrote:
> It would be nice, me thinks, if the underlying container technology
> didn't really care whether we had one hierarchy or seven. Let the
> users (such as CKRM/RG, cpusets, ...)
I was thinking that it would be even better if the actual (human)
users could determine this; have the container infrastructure make it
practical to have flexible hierarchy mappings, and have the resource
controller subsystems not have to care about how they were being used.
Paul
On 10/30/06, Paul Jackson <[email protected]> wrote:
> I get away with this in the cpuset code because:
> 1) I have the cpuset pointer directly in 'task_struct', so don't
> have to chase down anything, for each task, while scanning the
> task list. I just have to ask, for each task, if its cpuset
> pointer points to the cpuset of interest.
That's the same when it's transferred to containers - each task_struct
now has a container pointer, and you can just see whether the
container pointer matches the container that you're interested in.
> 2) I don't care if I get an inconsistent answer, so I don't have
> to lock each task, nor do I even lockout the rest of the cpuset
> code. All I know, at the end of the scan, is that each task that
> I claim is attached to the cpuset in question was attached to it at
> some point during my scan, not necessarilly all at the same time.
Well, anything that can be accomplished while holding tasklist_lock
can get a consistent result without any additional lists or
synchronization - it seems that it would be good to come up with a
real-world example of something that *can't* make do with this before
adding extra book-keeping.
Paul
Paul M wrote:
> I was thinking that it would be even better if the actual (human)
> users could determine this; have the container infrastructure make it
You mean let the system admin, say, of a system determine
whether or not CKRM/RG and cpusets have one shared, or two
separate, hierarchies?
Wow - I think my brain just exploded.
Oh well ... I'll have to leave it an open issue for the moment;
I'm focusing on something else right now.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Paul M wrote:
> it seems that it would be good to come up with a
> real-world example of something that *can't* make do with this before
> adding extra book-keeping.
that seems reasonable enough ...
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Paul Menage wrote:
> On 10/30/06, Balbir Singh <[email protected]> wrote:
>> +----+---------+------+---------+------------+----------------+-----------+
>> |ii | No | Yes | configfs| Memory, | Plans to | Yes |
>> | | | | | task limit.| provide a | |
>> | | | | | Plans to | framework | |
>> | | | | | allow | to write new | |
>> | | | | | CPU and I/O| controllers | |
>
> I have a port of Rohit's memory controller to run over my generic containers.
Cool!
>
>> d. Fake NUMA Nodes
>>
>> This approach was suggested while discussing the memory controller
>>
>> Advantages
>>
>> (i) Accounting for zones is already present
>> (ii) Reclaim code can directly deal with zones
>>
>> Disadvantages
>>
>> (i) The approach leads to hard partitioning of memory.
>> (ii) It's complex to
>> resize the node. Resizing is required to allow change of limits for
>> resource management.
>> (ii) Addition/Deletion of a resource group would require memory hotplug
>> support for add/delete a node. On deletion of node, its memory is
>> not utilized until a new node of a same or lesser size is created.
>> Addition of node, requires reserving memory for it upfront.
>
> A much simpler way of adding/deleting/resizing resource groups is to
> partition the system at boot time into a large number of fake numa
> nodes (say one node per 64MB in the system) and then use cpusets to
> assign the appropriate number of nodes each group. We're finding a few
> ineffiencies in the current code when using such a large number of
> small nodes (e.g. slab alien node caches), but we're confident that we
> can iron those out.
>
You'll also end up with per-zone page cache pools and a list of
active/inactive pages per zone (which will split up the global LRU list).
What about the hard partitioning? If a container/cpuset is not using its full
64MB of a fake node, can some other container use it? Also, won't you end up
with a big zonelist?
>> (iv) How do we account for shared pages? Should it be charged to the first
>> container which touches the page or should it be charged equally among
>> all containers sharing the page?
>
> A third option is to allow inodes to be associated with containers in
> their own right, and charge all pages for those inodes to the
> associated container. So if several different containers are sharing a
> large data file, you can put that file in its own container, and you
> then have an exact count of how many pages are in use in that shared
> file.
>
> This is cheaper than having to keep track of multiple users of a page,
> and is also useful when you're trying to do scheduling, to decide who
> to evict. Suppose you have two jobs each allocating 100M of anonymous
> memory and each accessing all of a 1G shared file, and you need to
> free up 500M of memory in order to run a higher-priority job.
>
> If you charge the first user, then it will appear that the first job
> is using 1.1G of memory and the second is using 100M of memory. So you
> might evict the first job, thinking it would free up 1.1G of memory -
> but it would actually only free up 100M of memory, since the shared
> pages would still be in use by the second job.
>
> If you share the charge between both users, then it would appear that
> each job is using 600M of memory - but it's still the case that
> evicting either one would only free up 100M of memory.
>
> If you can see that the shared file that they're both using is
> accounting for 1G of the memory total, and that they're each using
> 100M of anon memory, then it's easier to see that you'd need to evict
> *both* jobs in order to free up 500M of memory.
Consider the other side of the story. Let's say we have a shared lib shared
among quite a few containers. We limit the usage of the inode containing
the shared library to 50M. Tasks A and B use some part of the library
and cause the container "C" to reach the limit. Container C is charged
for all usage of the shared library. Now no other task, irrespective of which
container it belongs to, can touch any new pages of the shared library.
We might also be interested in limiting the page cache usage of a container.
In such cases, this solution might not work out to be the best.
What you are suggesting is to virtually group the inodes by container rather
than task. It might make sense in some cases, but not all.
We could consider implementing the controllers in phases
1. RSS control (anon + mapped pages)
2. Page Cache control
3. Kernel accounting and control
--
Cheers,
Balbir Singh,
Linux Technology Center,
IBM Software Labs
[snip]
>
> Consensus/Debated Points
> ------------------------
>
> Consensus:
>
> - Provide resource control over a group of tasks
> - Support movement of task from one resource group to another
> - Dont support heirarchy for now
> - Support limit (soft and/or hard depending on the resource
> type) in controllers. Guarantee feature could be indirectly
> met thr limits.
>
> Debated:
> - syscall vs configfs interface
1. One of the major configfs ideas is that the lifetime of
the objects is completely driven by userspace.
A resource controller shouldn't have to live as long as the
user wants. It "may", but not "must"! As you have seen from
our (beancounters) patches, beancounters disappeared
as soon as the last reference was dropped. Removing
configfs entries on a beancounter's automatic destruction
is possible, but it breaks the logic of configfs.
2. Having configfs as the only interface doesn't allow
people to have the resource control facility without configfs.
The resource controller must not depend on any "feature".
3. Configfs may be easily implemented later as an additional
interface. I propose the following solution:
- First we make an interface via any common kernel
facility (syscall, ioctl, etc);
- Later we may extend this with configfs. This will
allow one to have the configfs interface built as a module.
> - Interaction of resource controllers, containers and cpusets
> - Should we support, for instance, creation of resource
> groups/containers under a cpuset?
> - Should we have different groupings for different resources?
This breaks the idea of group isolation.
> - Support movement of all threads of a process from one group
> to another atomically?
This is not a critical question. It is just the difference between
- move_task_to_container(task);
+ do_each_thread_all(g, p) {
+         if (p->mm == task->mm)
+                 move_task_to_container(p);
+ } while_each_thread_all(g, p);
or similar. If we have an infrastructure for accounting and for
moving one task_struct into a group, then the decision of how many
tasks to move in one syscall can be made later, but not the other
way round.
I also add [email protected] to Cc. Please keep it on your replies.
Paul Jackson wrote:
> vatsa wrote:
>> C. Paul Menage's container patches
>>
>> Provides a generic heirarchial ...
>>
>> Consensus/Debated Points
>> ------------------------
>>
>> Consensus:
>> ...
>> - Dont support heirarchy for now
>
> Looks like this item can be dropped from the concensus ... ;).
Agree.
>
> I for one would recommend getting the hierarchy right from the
> beginning.
>
> Though I can appreciate that others were trying to "keep it simple"
> and postpone dealing with such complications. I don't agree.
>
> Such stuff as this deeply affects all that sits on it. Get the
I can share our experience with it.
Hierarchy support over beancounters was done in one patch.
This patch altered only three places - charge/uncharge routines,
beancounter creation/destruction code and BC's /proc entry.
All the rest of the code was not modified.
My point is that a good infrastructure doesn't care whether
or not a beancounter (group controller) has a parent.
> basic data shape presented by the kernel-user API right up front.
> The rest will follow, much easier.
>
> Good review of the choices - thanks.
>
Pavel wrote:
> 1. One of the major configfs ideas is that lifetime of
> the objects is completely driven by userspace.
> Resource controller shouldn't live as long as user
> want. It "may", but not "must"!
I had trouble understanding what you are saying here.
What does the phrase "live as long as user want" mean?
> 2. Having configfs as the only interface doesn't alow
> people having resource controll facility w/o configfs.
> Resource controller must not depend on any "feature".
>
> 3. Configfs may be easily implemented later as an additional
> interface. I propose the following solution:
> - First we make an interface via any common kernel
> facility (syscall, ioctl, etc);
> - Later we may extend this with configfs. This will
> alow one to have configfs interface build as a module.
So you would add bloat to the kernel, with two interfaces
to the same facility, because you don't want the resource
controller to depend on configfs.
I am familiar with what is wrong with kernel bloat.
Can you explain to me what is wrong with having resource
groups depend on configfs? Is there something wrong with
configfs that would be a significant problem for some systems
needing resource groups?
It is better where possible, I would think, to reuse common
infrastructure and minimize redundancy. If there is something
wrong with configfs that makes this a problem, perhaps we
should fix that.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Pavel wrote:
> My point is that a good infrastrucure doesn't care wether
> or not beancounter (group controller) has a parent.
I am far more interested in the API, including the shape
of the data model, that we present to the user across the
kernel-user boundary.
Getting one, good, stable API for the long haul is worth a lot.
Whether or not some substantial semantic change in this, such
as going from a flat to a tree shape, can be done in a single
line of kernel code, or a thousand lines, is less important.
What is the right long term kernel-user API and data model?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Paul Jackson wrote:
> Pavel wrote:
>> 1. One of the major configfs ideas is that lifetime of
>> the objects is completely driven by userspace.
>> Resource controller shouldn't live as long as user
>> want. It "may", but not "must"!
>
> I had trouble understanding what you are saying here.
>
> What does the phrase "live as long as user want" mean?
What if a user creates a controller (configfs directory)
and doesn't remove it at all? Should the controller stay in memory
even if nobody uses it?
>
>
>> 2. Having configfs as the only interface doesn't alow
>> people having resource controll facility w/o configfs.
>> Resource controller must not depend on any "feature".
>>
>> 3. Configfs may be easily implemented later as an additional
>> interface. I propose the following solution:
>> - First we make an interface via any common kernel
>> facility (syscall, ioctl, etc);
>> - Later we may extend this with configfs. This will
>> alow one to have configfs interface build as a module.
>
> So you would add bloat to the kernel, with two interfaces
> to the same facility, because you don't want the resource
> controller to depend on configfs.
>
> I am familiar with what is wrong with kernel bloat.
>
> Can you explain to me what is wrong with having resource
> groups depend on configfs? Is there something wrong with
The resource controller has nothing in common with configfs.
That's the same as if we made netfilter depend on procfs.
> configfs that would be a significant problem for some systems
> needing resource groups?
Why do we need to add a dependency if we can avoid it?
> It is better where possible, I would think, to reuse common
> infrastructure and minimize redundancy. If there is something
> wrong with configfs that makes this a problem, perhaps we
> should fix that.
The same can be said about the system call interface, can't it?
Pavel wrote:
> >> 3. Configfs may be easily implemented later as an additional
> >> interface. I propose the following solution:
> >> ...
> >
> Resource controller has nothing common with confgifs.
> That's the same as if we make netfilter depend on procfs.
Well ... if you used configfs as an interface to resource
controllers, as you said was easily done, then they would
have something to do with each other, right ;)?
Choose the right data structure for the job, and then reuse
what fits for that choice.
Neither avoiding nor encouraging code reuse is the key question.
What's the best fit, long term, for the style of kernel-user
API, for this use? That's the key question.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Paul Jackson wrote:
> Pavel wrote:
>>>> 3. Configfs may be easily implemented later as an additional
>>>> interface. I propose the following solution:
>>>> ...
>> Resource controller has nothing common with confgifs.
>> That's the same as if we make netfilter depend on procfs.
>
> Well ... if you used configfs as an interface to resource
> controllers, as you said was easily done, then they would
> have something to do with each other, right ;)?
Right. We'll create a dependency that is not needed.
> Choose the right data structure for the job, and then reuse
> what fits for that choice.
>
> Neither avoid nor encouraging code reuse is the key question.
>
> What's the best fit, long term, for the style of kernel-user
> API, for this use? That's the key question.
I agree, but you've cut some important questions away,
so I ask them again:
> What if if user creates a controller (configfs directory)
> and doesn't remove it at all. Should controller stay in
> memory even if nobody uses it?
This is important to solve now - whether or not we want to
keep "empty" beancounters in memory. If we do not, then configfs
usage is not acceptable.
> The same can be said about system calls interface, isn't it?
I haven't seen any objections against system calls yet.
[snip]
> Reclaimable memory
>
> (i) Anonymous pages - Anonymous pages are pages allocated by the user space,
> they are mapped into the user page tables, but not backed by a file.
I do not agree with such a classification.
When one maps a file, the kernel can remove the page from the address
space, as there is already space on disk for it. When one
maps an anonymous page, the kernel cannot be sure it can remove the
page, as the system may simply be configured to be swapless.
I also remind you that the beancounter code keeps all the logic
of memory classification in one place, so changing this
would require minimal changes.
[snip]
>
> (i) Slabs
> (ii) Kernel pages and page_tables allocated on behalf of a task.
I'd pay more attention to kernel memory accounting and less
to the user-space one, as having a kmemsize resource actually protects
the system from DoS attacks. Accounting and limiting user
pages doesn't protect the system from anything.
[snip]
> (b) Hard guarantees is a more deterministic method of providing QoS.
> Resources need to be allocated in advance, to ensure that the group
> is always able to meet its guarantee. This form is undesirable as
How would you allocate memory on NUMA in advance?
[snip]
> +----+---------+------+---------+------------+----------------+-----------+
> | No |Guarantee| Limit| User I/F| Controllers| New Controllers|Statistics |
> +----+---------+------+---------+------------+----------------+-----------+
> | i | No | Yes | syscall | Memory | No framework | Yes |
> | | | | | | to write new | |
> | | | | | | controllers | |
The latest beancounter patches do provide a framework for
new controllers.
[snip]
> a. Beancounters currently account for the following resources
>
> (i) kmemsize - memory obtained through alloc_pages() with __GFP_BC flag set.
> (ii) physpages - Resident set size of the tasks in the group.
> Reclaim support is provided for this resource.
> (iii) lockedpages - User pages locked in memory
> (iv) slabs - slabs allocated with kmem_cache_alloc_bc are accounted and
> controlled.
This is also not true now. The latest beancounter code accounts for:
1. kmemsize - this includes slab and vmalloc objects and "raw" pages
allocated directly from the buddy allocator.
2. unreclaimable memory - this accounts for the total length of
mappings of a certain type. It is _mappings_ that are
accounted, since limiting mappings limits memory usage and allows
graceful rejects (-ENOMEM returned from sys_mmap), whereas with
unlimited mappings you may limit memory usage with SIGKILL only.
3. physical pages - this includes pages mapped in page faults;
hitting the physpages limit starts page reclamation.
[snip]
> 5. Open issues
>
> (i) Can we allow threads belonging to the same process belong
> to two different resource groups? Does this mean we need to do per-thread
> VM accounting now?
Solving this question is the same as solving "how would we account for
pages that are shared between several processes?".
> (ii) There is an overhead associated with adding a pointer in struct page.
> Can this be reduced/avoided? One solution suggested is to use a
> mirror mem_map.
This does not affect the infrastructure, right? E.g. the current beancounter
code uses the page_bc() macro to get the BC pointer from a page. Changing it
from
#define page_bc(page) ((page)->page_bc)
to
#define page_bc(page) (bc_mmap[page_to_pfn(page)])
or similar may be done at any moment.
We may decide that "each page has an associated BC pointer" and
go on with further discussion (e.g. which interface to use). The decision of
where to store this pointer can be made after we agree on all the
rest.
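To make the mirror variant concrete, the setup could look like this
(hypothetical code that ignores sparse/discontiguous memory; the
boot-time allocation shown is just one plausible way to size the array):

#include <linux/bootmem.h>
#include <linux/init.h>
#include <linux/mm.h>

static struct beancounter **bc_mmap;    /* one pointer per pfn */

void __init bc_mmap_init(void)
{
        /* alloc_bootmem() returns zeroed memory */
        bc_mmap = alloc_bootmem(max_pfn * sizeof(*bc_mmap));
}

#define page_bc(page)   (bc_mmap[page_to_pfn(page)])
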
Since we're not going to discuss right now what kind of locking
we are going to have, let's delay the discussion of anything that
is code-dependent.
> (iii) How do we distribute the remaining resources among resource hungry
> groups? The Resource Group implementation used the ratio of the limits
> to decide on the ratio according to which they are distributed.
> (iv) How do we account for shared pages? Should it be charged to the first
> container which touches the page or should it be charged equally among
> all containers sharing the page?
> (v) Definition of RSS (see http://lkml.org/lkml/2006/10/10/130)
>
> 6. Going forward
>
> (i) Agree on requirements (there has been some agreement already, please
> see http://lkml.org/lkml/2006/9/6/102 and the BOF summary [7])
> (ii) Agree on minimum accounting and hooks in the kernel. It might be
> a good idea to take this up in phases
> phase 1 - account for user space memory
> phase 2 - account for kernel memory allocated on behalf of the user/task
I'd raise the priority of kernel memory accounting.
I see that everyone agrees that we want to see three resources:
1. kernel memory
2. unreclaimable memory
3. reclaimable memory
If this is right, then let's save it somewhere
(e.g. http://wiki.openvz.org/Containers/UBC_discussion)
and go on discussing the next question - the interface.
Right now this is the most difficult one and there are two
candidates - syscalls and configfs. I've pointed out my objections
against configfs and haven't seen any against system calls...
On Mon, Oct 30, 2006 at 02:43:20AM -0800, Paul Jackson wrote:
> > Consensus:
> > ...
> > - Dont support heirarchy for now
>
> Looks like this item can be dropped from the concensus ... ;).
>
> I for one would recommend getting the hierarchy right from the
> beginning.
>
> Though I can appreciate that others were trying to "keep it simple"
> and postpone dealing with such complications. I don't agree.
>
> Such stuff as this deeply affects all that sits on it. Get the
> basic data shape presented by the kernel-user API right up front.
> The rest will follow, much easier.
Hierarchy has implications not just for the kernel-user API, but also for
the controller design. I would prefer to progressively enhance the
controller, not supporting hierarchy in the beginning.
However, you do have a valid concern that, if we don't design the user-kernel
API keeping hierarchy in mind, then we may break this interface when we
later add hierarchy support, which would be bad.
One possibility is to design a user-kernel interface that supports hierarchy
but not allow creating hierarchy depths of more than 1 in the initial
versions. Would that work?
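That could be as little as a depth check in the group-creation path
until deeper nesting is supported (illustrative only, hypothetical
names):

#include <linux/errno.h>

#define RG_MAX_DEPTH    1       /* raise later when hierarchy is supported */

struct resource_group {
        struct resource_group *parent;  /* NULL for the root group */
        int depth;                      /* root group is depth 0 */
        /* ... */
};

static int rg_check_depth(struct resource_group *parent)
{
        if (parent && parent->depth >= RG_MAX_DEPTH)
                return -EPERM;          /* refuse deeper nesting for now */
        return 0;
}
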
--
Regards,
vatsa
On Monday 30 October 2006 11:09 am, Srivatsa Vaddagiri wrote:
> Hierarchy has implications in not just the kernel-user API, but also on
> the controller design. I would prefer to progressively enhance the
> controller, not supporting hierarchy in the begining.
>
> However you do have a valid concern that, if we dont design the user-kernel
> API keeping hierarchy in mind, then we may break this interface when we
> latter add hierarchy support, which will be bad.
>
> One possibility is to design the user-kernel interface that supports
> hierarchy but not support creating hierarchical depths more than 1 in the
> initial versions. Would that work?
Is there any user demand for hierarchy right now? I agree that we should
design the API to allow hierarchy, but unless there is a current need for it
I think we should not support actually creating hierarchies. In addition to
the reduction in code complexity, it will simplify the paradigm presented to
the users. I'm a firm believer in not giving users options they will never
use.
Dave McCracken
Pavel Emelianov wrote:
> [snip]
>
>> Reclaimable memory
>>
>> (i) Anonymous pages - Anonymous pages are pages allocated by the user space,
>> they are mapped into the user page tables, but not backed by a file.
>
> I do not agree with such classification.
> When one maps file then kernel can remove page from address
> space as there is already space on disk for it. When one
> maps an anonymous page then kernel won't remove this page
> for sure as system may simply be configured to be swapless.
Yes, I agree that if there is no swap space, then anonymous memory is pinned.
Assuming that we'll end up using an abstraction on top of the
existing reclaim mechanism, the mechanism would know whether a particular
type of memory is reclaimable or not.
But your point is well taken.
[snip]
>> (i) Slabs
>> (ii) Kernel pages and page_tables allocated on behalf of a task.
>
> I'd pay more attention to kernel memory accounting and less
> to user one as having kmemsize resource actually protects
> the system from DoS attacks. Accounting and limiting user
> pages doesn't protect system from anything.
>
Please see my comments at the end
[snip]
>
>> +----+---------+------+---------+------------+----------------+-----------+
>> | No |Guarantee| Limit| User I/F| Controllers| New Controllers|Statistics |
>> +----+---------+------+---------+------------+----------------+-----------+
>> | i | No | Yes | syscall | Memory | No framework | Yes |
>> | | | | | | to write new | |
>> | | | | | | controllers | |
>
> The lattest Beancounter patches do provide framework for
> new controllers.
>
I'll update the RFC.
> [snip]
>
>> a. Beancounters currently account for the following resources
>>
>> (i) kmemsize - memory obtained through alloc_pages() with __GFP_BC flag set.
>> (ii) physpages - Resident set size of the tasks in the group.
>> Reclaim support is provided for this resource.
>> (iii) lockedpages - User pages locked in memory
>> (iv) slabs - slabs allocated with kmem_cache_alloc_bc are accounted and
>> controlled.
>
> This is also not true now. The latest beancounter code accounts for
> 1. kmemsie - this includes slab and vmalloc objects and "raw" pages
> allocated directly from buddy allocator.
This is what I said - pages marked with __GFP_BC. So far on i386 I see
slab, vmalloc, PTE & LDT entries marked with the flag.
> 2. unreclaimable memory - this accounts for the total length of
> mappings of a certain type. These are _mappings_ that are
> accounted since limiting mapping limits memory usage and alows
> a grace rejects (-ENOMEM returned from sys_mmap), but with
> unlimited mappings you may limit memory usage with SIGKILL only.
ok, I'll add this too.
> 3. physical pages - these includes pages mapped in page faults and
> hitting the pyspages limit starts pages reclamation.
>
Yep, that's what I said.
> [snip]
>
>> 5. Open issues
>>
>> (i) Can we allow threads belonging to the same process belong
>> to two different resource groups? Does this mean we need to do per-thread
>> VM accounting now?
>
> Solving this question is the same as solving "how would we account for
> pages that are shared between several processes?".
>
Yes and that's an open issue too :)
>> (ii) There is an overhead associated with adding a pointer in struct page.
>> Can this be reduced/avoided? One solution suggested is to use a
>> mirror mem_map.
>
> This does not affect infrastructure, right? E.g. current beancounter
> code uses page_bc() macro to get BC pointer from page. Changing it
> from
> #define page_bc(page) ((page)->page_bc)
> to
> #define page_bc(page) ((bc_mmap[page_to_pfn(page)])
> or similar may be done at any moment.
The goal of the RFC is to discuss the controller. In the OLS BOF
on resource management, it was agreed that the controllers should be
discussed and designed first, so that the proper infrastructure
could be put in place. Please see http://lkml.org/lkml/2006/7/26/237.
>
> We may deside that "each page has an associated BC pointer" and
> go on further discussion (e.g. which interface to use). The solution
> where to store this pointer may be taken after we agree on all the
> rest.
Yes, what you point out is an abstraction mechanism that abstracts
out the implementation detail for now. I think it's a good starting
point for further discussion.
[snip]
>>
>> 6. Going forward
>>
>> (i) Agree on requirements (there has been some agreement already, please
>> see http://lkml.org/lkml/2006/9/6/102 and the BOF summary [7])
>> (ii) Agree on minimum accounting and hooks in the kernel. It might be
>> a good idea to take this up in phases
>> phase 1 - account for user space memory
>> phase 2 - account for kernel memory allocated on behalf of the user/task
>
> I'd raised the priority of kernel memory accounting.
>
> I see that everyone agree that we want to see three resources:
> 1. kernel memory
> 2. unreclaimable memory
> 3. reclaimable memory
> if this is right then let's save it somewhere
> (e.g. http://wiki.openvz.org/Containers/UBC_discussion)
> and go on discussing the next question - interface.
I understand that kernel memory accounting is the first priority for
containers, but accounting kernel memory requires too many changes
to the VM core, hence I was hesitant to put it up as first priority.
But in general I agree, these are the three important resources for
accounting and control
[snip]
--
Balbir Singh,
Linux Technology Center,
IBM Software Labs
On 10/30/06, Paul Jackson <[email protected]> wrote:
>
> You mean let the system admin, say, of a system determine
> whether or not CKRM/RG and cpusets have one shared, or two
> separate, hierarchies?
Yes - let the sysadmin define the process groupings, and how those
groupings get associated with resource control entities. The default
should be that all the hierarchies are the same, since I think that's
likely to be the common case.
Paul
On 10/30/06, Pavel Emelianov <[email protected]> wrote:
> > Debated:
> > - syscall vs configfs interface
>
> 1. One of the major configfs ideas is that lifetime of
> the objects is completely driven by userspace.
> Resource controller shouldn't live as long as user
> want. It "may", but not "must"! As you have seen from
> our (beancounters) patches beancounters disapeared
> as soon as the last reference was dropped.
Why is this an important feature for beancounters? All the other
resource control approaches seem to prefer having userspace handle
removing empty/dead groups/containers.
> 2. Having configfs as the only interface doesn't alow
> people having resource controll facility w/o configfs.
> Resource controller must not depend on any "feature".
Why is depending on a feature like configfs worse than depending on a
feature of being able to extend the system call interface?
> > - Interaction of resource controllers, containers and cpusets
> > - Should we support, for instance, creation of resource
> > groups/containers under a cpuset?
> > - Should we have different groupings for different resources?
>
> This breaks the idea of groups isolation.
That's fine - some people don't want total isolation. If we're looking
for a solution that fits all the different requirements, then we need
that flexibility. I agree that the default would probably want to be
that the groupings be the same for all resource controllers /
subsystems.
Paul
On 10/30/06, Dave McCracken <[email protected]> wrote:
>
> Is there any user demand for heirarchy right now? I agree that we should
> design the API to allow heirarchy, but unless there is a current need for it
> I think we should not support actually creating heirarchies. In addition to
> the reduction in code complexity, it will simplify the paradigm presented to
> the users. I'm a firm believer in not giving users options they will never
> use.
The current CPUsets code supports hierarchies, and I believe that
there are people out there who depend on them (right, PaulJ?). Since
CPUsets are at heart a form of resource controller, it would be nice
to have them use the same resource control infrastructure as other
resource controllers (see the generic container patches that I sent
out as an example of this). So that would be at least one user that
requires a hierarchy.
Paul
Balbir Singh wrote:
[snip]
>
>> I see that everyone agree that we want to see three resources:
>> 1. kernel memory
>> 2. unreclaimable memory
>> 3. reclaimable memory
>> if this is right then let's save it somewhere
>> (e.g. http://wiki.openvz.org/Containers/UBC_discussion)
>> and go on discussing the next question - interface.
>
> I understand that kernel memory accounting is the first priority for
> containers, but accounting kernel memory requires too many changes
> to the VM core, hence I was hesitant to put it up as first priority.
>
> But in general I agree, these are the three important resources for
> accounting and control
I forgot to mention: I hope you were including the page cache in
your definition of reclaimable memory.
>
> [snip]
>
--
Balbir Singh,
Linux Technology Center,
IBM Software Labs
On 10/30/06, Balbir Singh <[email protected]> wrote:
>
> You'll also end up with per zone page cache pools for each zone. A list of
> active/inactive pages per zone (which will split up the global LRU list).
Yes, these are some of the inefficiencies that we're ironing out.
> What about the hard-partitioning. If a container/cpuset is not using its full
> 64MB of a fake node, can some other node use it?
No. So the granularity at which you can divide up the system depends
on how big your fake nodes are. For our purposes, we figure that 64MB
granularity should be fine.
> Also, won't you end up
> with a big zonelist?
Yes - but PaulJ's recent patch to speed up the zone selection helped
reduce the overhead of this a lot.
>
> Consider the other side of the story. lets say we have a shared lib shared
> among quite a few containers. We limit the usage of the inode containing
> the shared library to 50M. Tasks A and B use some part of the library
> and cause the container "C" to reach the limit. Container C is charged
> for all usage of the shared library. Now no other task, irrespective of which
> container it belongs to, can touch any new pages of the shared library.
Well, if the pages aren't mlocked then presumably some of the existing
pages can be flushed out to disk and replaced with other pages.
>
> What you are suggesting is to virtually group the inodes by container rather
> than task. It might make sense in some cases, but not all.
Right - I think it's an important feature to be able to support, but I
agree that it's not suitable for all situations.
>
> We could consider implementing the controllers in phases
>
> 1. RSS control (anon + mapped pages)
> 2. Page Cache control
Page cache control is actually more essential than RSS control, in our
experience - it's pretty easy to track RSS values from userspace, and
react reasonably quickly to kill things that go over their limit, but
determining page cache usage (i.e. determining which job on the system
is flooding the page cache with dirty buffers) is pretty much
impossible currently.
Paul
On 10/30/06, Pavel Emelianov <[email protected]> wrote:
> and go on discussing the next question - interface.
>
> Right now this is the most difficult one and there are two
> candidates - syscalls and configfs. I've pointed my objections
> against configfs and haven't seen any against system calls...
>
Some objections:
- they require touching every architecture to add the new system calls
- they're harder to debug from userspace, since you can't use useful
tools such as echo and cat
- changing the interface is harder since it's (presumably) a binary API
Paul
> Yes - let the sysadmin define the process groupings, and how those
> groupings get associated with resource control entities. The default
> should be that all the hierarchies are the same, since I think that's
> likely to be the common case.
Ah - I had thought earlier you were saying let the user define whether
or not (speaking metaphorically) their car had multiple gears in its
transmission, or just one gear. That would have been kind of insane.
You meant we deliver a car with multiple gears, and it's up to the user
when and if to ever shift. That makes more sense.
In other words you are recommending delivering a system that internally
tracks separate hierarchies for each resource control entity, but where
the user can conveniently overlap some of these hierarchies and deal
with them as a single hierarchy.
What you are suggesting goes beyond the question of whether the kernel
has just and exactly and nevermore than one hierarchy, to suggest that
not only should the kernel support multiple hierarchies for different
resource control entities, but furthermore the kernel should make it
convenient for users to "bind" two or more of these hierarchies and
treat them as one.
Ok. Sounds useful.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
> I believe that
> there are people out there who depend on them (right, PaulJ?)
Yes. For example a common usage pattern has the system admin carve
off a big chunk of CPUs and Memory Nodes into a cpuset for the batch
scheduler to manage, within which the batch scheduler creates child
cpusets, roughly one for each job under its control.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On 10/30/06, Paul Jackson <[email protected]> wrote:
>
> In other words you are recommending delivering a system that internally
> tracks separate hierarchies for each resource control entity, but where
> the user can conveniently overlap some of these hierarchies and deal
> with them as a single hierarchy.
More or less. More concretely:
- there is a single hierarchy of process containers
- each process is a member of exactly one process container
- for each resource controller, there's a hierarchy of resource "nodes"
- each process container is associated with exactly one resource node
of each type
- by default, the process container hierarchy and the resource node
hierarchies are isomorphic, but that can be controlled by userspace.
Paul
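To make the shape of that model easier to see, here is a minimal, self-contained C sketch; the type and field names (container, resource_node, RC_CPU, ...) are invented for illustration and are not taken from any posted patch.

/*
 * One container hierarchy; one resource-node hierarchy per controller;
 * each container points at exactly one node per controller.
 */
#include <stdio.h>

enum { RC_CPU, RC_MEM, RC_MAX };            /* example controller types */

struct resource_node {
    const char *name;
    struct resource_node *parent;           /* per-controller hierarchy */
};

struct container {
    const char *name;
    struct container *parent;               /* single container hierarchy */
    struct resource_node *node[RC_MAX];     /* one node per controller */
};

int main(void)
{
    struct resource_node cpu_root  = { "cpu-root",  NULL };
    struct resource_node mem_root  = { "mem-root",  NULL };
    struct resource_node mem_batch = { "mem-batch", &mem_root };

    struct container root  = { "/",      NULL,  { &cpu_root, &mem_root } };
    /* "batch" shares the root cpu node but has its own memory node,
     * i.e. the hierarchies need not stay isomorphic. */
    struct container batch = { "/batch", &root, { &cpu_root, &mem_batch } };

    printf("%s: cpu=%s mem=%s\n", batch.name,
           batch.node[RC_CPU]->name, batch.node[RC_MEM]->name);
    return 0;
}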
> More concretely:
>
> - there is a single hierarchy of process containers
> - each process is a member of exactly one process container
>
> - for each resource controller, there's a hierarchy of resource "nodes"
> - each process container is associated with exactly one resource node
> of each type
>
> - by default, the process container hierarchy and the resource node
> hierarchies are isomorphic, but that can be controlled by userspace.
nice
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On 10/30/06, Paul Menage <[email protected]> wrote:
>
> - there is a single hierarchy of process containers
> - each process is a member of exactly one process container
>
> - for each resource controller, there's a hierarchy of resource "nodes"
> - each process container is associated with exactly one resource node
> of each type
>
> - by default, the process container hierarchy and the resource node
> hierarchies are isomorphic, but that can be controlled by userspace.
A simpler alternative that I thought about would be to restrict the
resource controller hierarchies to be strict subtrees of the process
container hierarchy - so at each stage in the hierarchy, a container
could either inherit its parent's node for a given resource or have a
new child node (i.e. be in the same cpuset or be in a fresh child
cpuset).
This is a much simpler abstraction to present to userspace (simply one
flag for each resource controller in each process container) and might
be sufficient for all reasonable scenarios.
Paul
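For comparison, a toy sketch of this strict-subtree variant, again with invented names: each container only records, per controller, whether it starts a fresh resource node or inherits its parent's.

#include <stdbool.h>
#include <stdio.h>

enum { RC_CPU, RC_MEM, RC_MAX };

struct container {
    const char *name;
    struct container *parent;
    bool own_node[RC_MAX];      /* true: fresh child node for this controller */
};

/* Walk up until we find the container that owns the controller's node. */
static const struct container *node_owner(const struct container *c, int rc)
{
    while (c->parent && !c->own_node[rc])
        c = c->parent;
    return c;
}

int main(void)
{
    struct container root  = { "/",      NULL,  { true,  true } };
    struct container batch = { "/batch", &root, { false, true } };

    printf("cpu node of %s owned by %s\n", batch.name,
           node_owner(&batch, RC_CPU)->name);   /* "/" */
    printf("mem node of %s owned by %s\n", batch.name,
           node_owner(&batch, RC_MEM)->name);   /* "/batch" */
    return 0;
}

The userspace-visible state is then just one flag per controller per container, as described above.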
> - they require touching every architecture to add the new system calls
> - they're harder to debug from userspace, since you can't use useful
> tools such as echo and cat
> - changing the interface is harder since it's (presumably) a binary API
To my mind these are rather secondary selection criteria.
If say we were adding a single, per-thread scalar value that each
thread could query of and perhaps modify for itself, then a system call
would be an alternative worthy of further consideration.
Representing complicated, nested, structured information via custom
system calls is a pain. We have more luck using classic file system
structures, and abstracting the representation a layer up. Of course
there are still system calls, but they are the classic Unix calls such
as mkdir, chdir, rmdir, creat, unlink, open, read, write and close.
The same thing happens in designing network and web services. There
are always low level protocols, such as physical and link and IP.
And sometimes these have to be extended, such as IPv4 versus IPv6.
But we don't code AJAX down at that level - AJAX sits on top of things
like Javascript and XML, higher up in the protocol stack.
And we didn't start coding AJAX as a series of IP hacks, saying we can
add a higher level protocol alternative later on. That would have been
useless.
Figuring out where in the protocol stack one is targeting ones new
feature work is a foundation decision. Get it right, up front,
forevermore, or risk ending up in Documentation/ABI/obsolete or
Documentation/ABI/removed in a few years, if your work doesn't
just die sooner without a whimper.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
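To illustrate the point, this is roughly what driving such an interface from userspace looks like using only the classic calls; the /dev/rc mount point and the memory_limit file name are assumptions made up for the example, not an existing API.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    const char *limit = "104857600\n";          /* 100 MB, written as text */
    int fd;

    /* Create a group: just a directory. */
    if (mkdir("/dev/rc/batch", 0755) == -1)
        perror("mkdir");

    /* Set a limit: open() + write() on an ordinary-looking file. */
    fd = open("/dev/rc/batch/memory_limit", O_WRONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    if (write(fd, limit, strlen(limit)) == -1)
        perror("write");
    close(fd);

    /* Tear down: rmdir() removes the (empty) group again. */
    if (rmdir("/dev/rc/batch") == -1)
        perror("rmdir");
    return 0;
}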
On Mon, 2006-10-30 at 18:26 +0300, Pavel Emelianov wrote:
> Paul Jackson wrote:
> > Pavel wrote:
> >>>> 3. Configfs may be easily implemented later as an additional
> >>>> interface. I propose the following solution:
> >>>> ...
> >> Resource controller has nothing in common with configfs.
> >> That's the same as if we make netfilter depend on procfs.
> >
> > Well ... if you used configfs as an interface to resource
> > controllers, as you said was easily done, then they would
> > have something to do with each other, right ;)?
>
> Right. We'll create a dependency that is not needed.
>
> > Choose the right data structure for the job, and then reuse
> > what fits for that choice.
> >
> > Neither avoiding nor encouraging code reuse is the key question.
> >
> > What's the best fit, long term, for the style of kernel-user
> > API, for this use? That's the key question.
>
> I agree, but you've cut some important questions away,
> so I ask them again:
>
> > What if a user creates a controller (configfs directory)
> > and doesn't remove it at all. Should controller stay in
> > memory even if nobody uses it?
Yes. The controller should stay in memory until userspace decides that
control of the resource is no longer desired. Though not all controllers
should be removable since that may impose unreasonable restrictions on
what useful/performant controllers can be implemented.
That doesn't mean that the controller couldn't reclaim memory it uses
when it's no longer needed.
<snip>
Cheers,
-Matt Helsley
Paul Menage wrote:
> On 10/30/06, Pavel Emelianov <[email protected]> wrote:
>> > Debated:
>> > - syscall vs configfs interface
>>
>> 1. One of the major configfs ideas is that lifetime of
>> the objects is completely driven by userspace.
>> Resource controller shouldn't live as long as user
>> want. It "may", but not "must"! As you have seen from
>> our (beancounters) patches beancounters disappeared
>> as soon as the last reference was dropped.
>
> Why is this an important feature for beancounters? All the other
> resource control approaches seem to prefer having userspace handle
> removing empty/dead groups/containers.
That's functionality a user may want. I agree that some users
may want to create some kind of "persistent" beancounters, but
this must not be the only way to control them. I like the way
TUN devices are done. Each has a TUN_PERSIST flag controlling
whether or not to destroy the device right on closing. I think that
we may have something similar - a BC_PERSISTENT flag to keep
beancounters with a zero refcount in memory so they can be reused.
Objections?
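A toy sketch of that lifetime rule follows; the BC_PERSISTENT flag and the struct layout here are illustrative rather than the actual beancounter code. On the last put the object is freed unless it was marked persistent.

#include <stdio.h>
#include <stdlib.h>

#define BC_PERSISTENT 0x1

struct beancounter {
    int id;
    unsigned int flags;
    unsigned int refcount;
};

static void bc_put(struct beancounter *bc)
{
    if (--bc->refcount > 0)
        return;

    if (bc->flags & BC_PERSISTENT) {
        /* Empty but kept in memory so it can be reused later. */
        printf("bc %d: empty but persistent, kept for reuse\n", bc->id);
        return;
    }
    printf("bc %d: last reference dropped, freeing\n", bc->id);
    free(bc);
}

int main(void)
{
    struct beancounter *bc = malloc(sizeof(*bc));

    bc->id = 42;
    bc->flags = BC_PERSISTENT;      /* userspace asked to keep it around */
    bc->refcount = 1;
    bc_put(bc);                     /* stays in memory for reuse */
    return 0;
}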
>> 2. Having configfs as the only interface doesn't allow
>> people having resource control facility w/o configfs.
>> Resource controller must not depend on any "feature".
>
> Why is depending on a feature like configfs worse than depending on a
> feature of being able to extend the system call interface?
Because configfs is a _feature_, while the system call interface is
a mandatory part of the kernel. Since "resource beancounters" is a
core thing it shouldn't depend on "optional" kernel stuff. E.g.
procfs is the way userspace gets information about running tasks,
but disabling procfs doesn't disable such core functionality
as fork-ing and execve-ing.
Moreover, I hope you agree that beancounters can't be made a
module. If so, the user will have to build configfs in, and thus
CONFIG_CONFIGFS_FS essentially becomes a "bool", not a "tristate".
I have nothing against using configfs as an additional, optional
interface, but I do object to using it as the only window into
the BC world.
>> > - Interaction of resource controllers, containers and cpusets
>> > - Should we support, for instance, creation of resource
>> > groups/containers under a cpuset?
>> > - Should we have different groupings for different resources?
>>
>> This breaks the idea of groups isolation.
>
> That's fine - some people don't want total isolation. If we're looking
> for a solution that fits all the different requirements, then we need
> that flexibility. I agree that the default would probably want to be
> that the groupings be the same for all resource controllers /
> subsystems.
Hm... OK, I don't mind, although I don't see any reasonable use for it.
Thus we add one more point to our "agreement" list
http://wiki.openvz.org/Containers/UBC_discussion
- all resource groups are independent
[snip]
> Yes. The controller should stay in memory until userspace decides that
> control of the resource is no longer desired. Though not all controllers
> should be removable since that may impose unreasonable restrictions on
> what useful/performant controllers can be implemented.
>
> That doesn't mean that the controller couldn't reclaim memory it uses
> when it's no longer needed.
>
I've already answered Paul Menage about this. In short:
... I agree that some users may want to create some
kind of "persistent" beancounters, but this must not be
the only way to control them...
... I think that we may have something [like this] - a flag
BC_PERSISTENT to keep beancounters with zero refcounter in
memory to reuse them...
... I have nothing against using configfs as additional,
optional interface, but I do object using it as the only
window inside BC world...
Please, refer to my full reply for comments.
Balbir Singh wrote:
> Pavel Emelianov wrote:
>> [snip]
>>
>>> Reclaimable memory
>>>
>>> (i) Anonymous pages - Anonymous pages are pages allocated by the user space,
>>> they are mapped into the user page tables, but not backed by a file.
>> I do not agree with such classification.
>> When one maps file then kernel can remove page from address
>> space as there is already space on disk for it. When one
>> maps an anonymous page then kernel won't remove this page
>> for sure as system may simply be configured to be swapless.
>
> Yes, I agree if there is no swap space, then anonymous memory is pinned.
> Assuming that we'll end up using an abstraction on top of the
> existing reclaim mechanism, the mechanism would know if a particular
> type of memory is reclaimable or not.
If memory is considered to be unreclaimable then actions should be
taken at mmap() time, not later! Rejecting mmap() is the only way to
limit a user's unreclaimable memory consumption.
> But your point is well taken.
Thank you.
[snip]
>> This is also not true now. The latest beancounter code accounts for
>> 1. kmemsize - this includes slab and vmalloc objects and "raw" pages
>> allocated directly from buddy allocator.
>
> This is what I said, pages marked with __GFP_BC, so far on i386 I see
> slab, vmalloc, PTE & LDT entries marked with the flag.
Yes. I just wanted to keep all the things together.
[snip]
> I understand that kernel memory accounting is the first priority for
> containers, but accounting kernel memory requires too many changes
> to the VM core, hence I was hesitant to put it up as first priority.
Among all the kernel-code-intrusive patches in the BC patch set
the kmemsize hooks are the most "conservative" - only one place
is heavily patched, and that is the slab allocator. The buddy
allocator is patched too, but the change is _significantly_ smaller.
The rest of the patch adds __GFP_BC flags to some allocations and
SLAB_BC to some kmem_caches.
The user memory controlling patch is much heavier...
I'd set priorities of development that way:
1. core infrastructure (mainly headers)
2. interface
3. kernel memory hooks and accounting
4. mappings hooks and accounting
5. physical pages hooks and accounting
6. user pages reclamation
7. moving threads between beancounters
8. make beancounter persistent
[snip]
>> But in general I agree, these are the three important resources for
>> accounting and control
>
> I missed out to mention, I hope you were including the page cache in
> your definition of reclaimable memory.
As far as page cache is concerned my opinion is the following.
(If I misunderstood you, please correct me.)
The page cache is designed to keep as many pages in memory as
possible to optimize performance. If we start limiting the page
cache usage we cut the performance. What is to be controlled is
_used_ resources (touched pages, opened file descriptors, mapped
areas, etc), but not the cached ones. I see nothing bad if a
page that belongs to a file, but is not used by ANY task in a BC,
stays in memory. I think this is normal. If the kernel wants to,
it can push this page out easily; it won't even need to
try_to_unmap() it. So cached pages must not be accounted.
I've also noticed that you've [snip]-ed on one of my questions.
> How would you allocate memory on NUMA in advance?
Please, clarify this.
Pavel Emelianov wrote:
> [snip]
>
>>> But in general I agree, these are the three important resources for
>>> accounting and control
>> I missed out to mention, I hope you were including the page cache in
>> your definition of reclaimable memory.
>
> As far as page cache is concerned my opinion is the following.
> (If I misunderstood you, please correct me.)
>
> Page cache is designed to keep in memory as much pages as
> possible to optimize performance. If we start limiting the page
> cache usage we cut the performance. What is to be controlled is
> _used_ resources (touched pages, opened file descriptors, mapped
> areas, etc), but not the cached ones. I see nothing bad if the
> page that belongs to a file, but is not used by ANY task in BC,
> stays in memory. I think this is normal. If kernel wants it may
> push this page out easily it won't event need to try_to_unmap()
> it. So cached pages must not be accounted.
>
The idea behind limiting the page cache is this:
1. Let's say one container fills up the page cache.
2. The other containers will not be able to allocate memory (even
though they are within their limits) without the overhead of having
to flush the page cache and free up the occupied cache. The kernel
will have to pageout() the dirty pages in the page cache.
Since it is easy to push the page out (as you said), it should be
easy to impose a limit on the page cache usage of a container.
>
> I've also noticed that you've [snip]-ed on one of my questions.
>
> > How would you allocate memory on NUMA in advance?
>
> Please, clarify this.
I am not quite sure I understand the question. Could you please rephrase
it and highlight some of the difficulty?
--
Balbir Singh,
Linux Technology Center,
IBM Software Labs
Balbir Singh wrote:
> Pavel Emelianov wrote:
>> [snip]
>>
>>>> But in general I agree, these are the three important resources for
>>>> accounting and control
>>> I missed out to mention, I hope you were including the page cache in
>>> your definition of reclaimable memory.
>> As far as page cache is concerned my opinion is the following.
>> (If I misunderstood you, please correct me.)
>>
>> Page cache is designed to keep in memory as much pages as
>> possible to optimize performance. If we start limiting the page
>> cache usage we cut the performance. What is to be controlled is
>> _used_ resources (touched pages, opened file descriptors, mapped
>> areas, etc), but not the cached ones. I see nothing bad if the
>> page that belongs to a file, but is not used by ANY task in BC,
>> stays in memory. I think this is normal. If kernel wants it may
>> push this page out easily it won't event need to try_to_unmap()
>> it. So cached pages must not be accounted.
>>
>
> The idea behind limiting the page cache is this
>
> 1. Lets say one container fills up the page cache.
> 2. The other containers will not be able to allocate memory (even
> though they are within their limits) without the overhead of having
> to flush the page cache and freeing up occupied cache. The kernel
> will have to pageout() the dirty pages in the page cache.
>
> Since it is easy to push the page out (as you said), it should be
> easy to impose a limit on the page cache usage of a container.
If a group is limited in its memory _consumption_ it won't fill
the page cache...
>> I've also noticed that you've [snip]-ed on one of my questions.
>>
>> > How would you allocate memory on NUMA in advance?
>>
>> Please, clarify this.
>
> I am not quite sure I understand the question. Could you please rephrase
> it and highlight some of the difficulty?
I'd like to provide a guarantee for a newly created group. According
to your idea I have to preallocate some pages in advance. OK. How to
select a NUMA node to allocate them from?
On Tue, 31 Oct 2006 14:49:12 +0530
Balbir Singh <[email protected]> wrote:
> The idea behind limiting the page cache is this
>
> 1. Lets say one container fills up the page cache.
> 2. The other containers will not be able to allocate memory (even
> though they are within their limits) without the overhead of having
> to flush the page cache and freeing up occupied cache. The kernel
> will have to pageout() the dirty pages in the page cache.
There's a vast difference between clean pagecache and dirty pagecache in this
context. It is terribly imprecise to use the term "pagecache". And it would be
a poor implementation which failed to distinguish between clean pagecache and
dirty pagecache.
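A rough sketch of accounting that keeps this distinction, using hypothetical structures rather than any existing controller: per-group counters for anonymous, clean-pagecache and dirty-pagecache pages, with easily-droppable clean cache left out of the limit check.

#include <stdio.h>

enum page_kind { PG_ANON, PG_CACHE_CLEAN, PG_CACHE_DIRTY, PG_KINDS };

struct mem_group {
    const char *name;
    unsigned long pages[PG_KINDS];
    unsigned long limit;                /* limit on charged pages */
};

/* Charge one page; clean pagecache is not counted against the limit. */
static int mem_group_charge(struct mem_group *g, enum page_kind kind)
{
    unsigned long charged = g->pages[PG_ANON] + g->pages[PG_CACHE_DIRTY];

    if (kind != PG_CACHE_CLEAN && charged + 1 > g->limit)
        return -1;                      /* over limit: caller must reclaim */
    g->pages[kind]++;
    return 0;
}

int main(void)
{
    struct mem_group g = { "container-A", { 0, 0, 0 }, 2 };

    printf("anon:        %d\n", mem_group_charge(&g, PG_ANON));        /* 0 */
    printf("clean cache: %d\n", mem_group_charge(&g, PG_CACHE_CLEAN)); /* 0 */
    printf("dirty cache: %d\n", mem_group_charge(&g, PG_CACHE_DIRTY)); /* 0 */
    printf("dirty cache: %d\n", mem_group_charge(&g, PG_CACHE_DIRTY)); /* -1 */
    return 0;
}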
Pavel Emelianov wrote:
> Balbir Singh wrote:
>> Pavel Emelianov wrote:
>>> [snip]
>>>
>>>>> But in general I agree, these are the three important resources for
>>>>> accounting and control
>>>> I missed out to mention, I hope you were including the page cache in
>>>> your definition of reclaimable memory.
>>> As far as page cache is concerned my opinion is the following.
>>> (If I misunderstood you, please correct me.)
>>>
>>> Page cache is designed to keep in memory as much pages as
>>> possible to optimize performance. If we start limiting the page
>>> cache usage we cut the performance. What is to be controlled is
>>> _used_ resources (touched pages, opened file descriptors, mapped
>>> areas, etc), but not the cached ones. I see nothing bad if the
>>> page that belongs to a file, but is not used by ANY task in BC,
>>> stays in memory. I think this is normal. If kernel wants it may
>>> push this page out easily it won't event need to try_to_unmap()
>>> it. So cached pages must not be accounted.
>>>
>> The idea behind limiting the page cache is this
>>
>> 1. Lets say one container fills up the page cache.
>> 2. The other containers will not be able to allocate memory (even
>> though they are within their limits) without the overhead of having
>> to flush the page cache and freeing up occupied cache. The kernel
>> will have to pageout() the dirty pages in the page cache.
>>
>> Since it is easy to push the page out (as you said), it should be
>> easy to impose a limit on the page cache usage of a container.
>
> If a group is limited with memory _consumption_ it won't fill
> the page cache...
>
So you mean the memory _consumption_ limit is already controlling
the page cache? That's what we need - the ability for a container
not to fill up the page cache :)
I don't remember correctly, but do you account for dirty page cache usage in
the latest patches of BC?
>>> I've also noticed that you've [snip]-ed on one of my questions.
>>>
>>> > How would you allocate memory on NUMA in advance?
>>>
>>> Please, clarify this.
>> I am not quite sure I understand the question. Could you please rephrase
>> it and highlight some of the difficulty?
>
> I'd like to provide a guarantee for a newly created group. According
> to your idea I have to preallocate some pages in advance. OK. How to
> select a NUMA node to allocate them from?
The idea of pre-allocation was discussed as a possibility in the case
that somebody needed hard guarantees, but most of us don't need it.
It was in the RFC for the sake of completeness.
Coming back to your question
Why do you need to select a NUMA node? For performance?
--
Balbir Singh,
Linux Technology Center,
IBM Software Labs
[snip]
>>> Since it is easy to push the page out (as you said), it should be
>>> easy to impose a limit on the page cache usage of a container.
>> If a group is limited with memory _consumption_ it won't fill
>> the page cache...
>>
>
> So you mean the memory _consumption_ limit is already controlling
> the page cache? That's what we need the ability for a container
> not to fill up the page cache :)
I mean page cache limiting is not needed. We need to make
sure a group eats less than N physical pages. That can be
achieved by controlling page faults, setup_arg_pages(), etc.
Page cache is not to be touched.
> I don't remember correctly, but do you account for dirty page cache usage in
> the latest patches of BC?
We do not account for page cache itself. We track only
physical pages regardless of where they are.
[snip]
> The idea of pre-allocation was discussed as a possibility in the case
> that somebody needed hard guarantees, but most of us don't need it.
> It was in the RFC for the sake of completeness.
>
> Coming back to your question
>
> Why do you need to select a NUMA node? For performance?
Of course! Otherwise why would we need kmem_cache_alloc_node() etc
calls in the kernel?
The second question is - what if two processes from different
beancounters try to share one page? I remember that the current
solution is to take the page from the first user's reserve. OK.
Consider then that this first user stops using the page. When
this happens one page must be put back into its reserve, right?
But where do we get this page from?
Note that making a guarantee through limiting doesn't care about
where the page is taken from.
Andrew Morton wrote:
> On Tue, 31 Oct 2006 14:49:12 +0530
> Balbir Singh <[email protected]> wrote:
>
>> The idea behind limiting the page cache is this
>>
>> 1. Lets say one container fills up the page cache.
>> 2. The other containers will not be able to allocate memory (even
>> though they are within their limits) without the overhead of having
>> to flush the page cache and freeing up occupied cache. The kernel
>> will have to pageout() the dirty pages in the page cache.
>
> There's a vast difference between clean pagecache and dirty pagecache in this
> context. It is terribly imprecise to use the term "pagecache". And it would be
> a poor implementation which failed to distinguish between clean pagecache and
> dirty pagecache.
>
Yes, I agree, it will be a good idea to distinguish between the two.
--
Balbir Singh,
Linux Technology Center,
IBM Software Labs
Pavel Emelianov wrote:
> Balbir Singh wrote:
>> Pavel Emelianov wrote:
>>> [snip]
>>>
>>>> Reclaimable memory
>>>>
>>>> (i) Anonymous pages - Anonymous pages are pages allocated by the user space,
>>>> they are mapped into the user page tables, but not backed by a file.
>>> I do not agree with such classification.
>>> When one maps file then kernel can remove page from address
>>> space as there is already space on disk for it. When one
>>> maps an anonymous page then kernel won't remove this page
>>> for sure as system may simply be configured to be swapless.
>> Yes, I agree if there is no swap space, then anonymous memory is pinned.
>> Assuming that we'll end up using an abstraction on top of the
>> existing reclaim mechanism, the mechanism would know if a particular
>> type of memory is reclaimable or not.
>
> If memory is considered to be unreclaimable then actions should be
> taken at mmap() time, not later! Rejecting mmap() is the only way to
> limit user in unreclaimable memory consumption.
That's like disabling memory over-commit in the regular kernel.
Don't you think this should again be based on the system's
over-commit configuration?
[snip]
>
>> I understand that kernel memory accounting is the first priority for
>> containers, but accounting kernel memory requires too many changes
>> to the VM core, hence I was hesitant to put it up as first priority.
>
> Among all the kernel-code-intrusive patches in BC patch set
> kmemsize hooks are the most "conservative" - only one place
> is heavily patched - this is slab allocator. Buddy is patched,
> but _significantly_ smaller. The rest of the patch adds __GFP_BC
> flags to some allocations and SLAB_BC to some kmem_caches.
>
> User memory controlling patch is much heavier...
>
Please see how Rohit's memory controller patches the user-level
paths. It seems much simpler.
> I'd set priorities of development that way:
>
> 1. core infrastructure (mainly headers)
> 2. interface
> 3. kernel memory hooks and accounting
> 4. mappings hooks and accounting
> 5. physical pages hooks and accounting
> 6. user pages reclamation
> 7. moving threads between beancounters
> 8. make beancounter persistent
I would prefer a different set
1 & 2, for now we could use any interface and then start developing the
controller. As we develop the new controller, we are likely to find the
need to add/enhance the interface, so freezing in on 1 & 2 might not be
a good idea.
I would put 4, 5 and 6 ahead of 3, based on the changes I see in Rohit's
memory controller.
Then take up the rest.
--
Balbir Singh,
Linux Technology Center,
IBM Software Labs
[snip]
> That's like disabling memory over-commit in the regular kernel.
Nope. We limit only unreclaimable mappings. Allowing a user
to break limits defeats the point of having a limit.
Or do you not agree that allowing unlimited unreclaimable
mappings leaves you no graceful way to cut groups back?
[snip]
> Please see the patching of Rohit's memory controller for user
> level patching. It seems much simpler.
Could you send me a URL to get the patch from, please,
or the patch itself directly to me. Thank you.
[snip]
> I would prefer a different set
>
> 1 & 2, for now we could use any interface and then start developing the
> controller. As we develop the new controller, we are likely to find the
> need to add/enhance the interface, so freezing in on 1 & 2 might not be
> a good idea.
Paul Menage won't agree. He believes that the interface must come first.
I also remind you that the latest beancounter patch provides all the
stuff we're discussing. It can move tasks, limit all three resources
discussed, reclaim memory and so on. And a configfs interface could be
attached easily.
> I would put 4, 5 and 6 ahead of 3, based on the changes I see in Rohit's
> memory controller.
>
> Then take up the rest.
I'll review Rohit's patches and comment.
On Mon, Oct 30, 2006 at 12:47:59PM -0800, Paul Menage wrote:
> On 10/30/06, Paul Jackson <[email protected]> wrote:
> >
> >In other words you are recommending delivering a system that internally
> >tracks separate hierarchies for each resource control entity, but where
> >the user can conveniently overlap some of these hierarchies and deal
> >with them as a single hierarchy.
>
> More or less. More concretely:
>
> - there is a single hierarchy of process containers
> - each process is a member of exactly one process container
>
> - for each resource controller, there's a hierarchy of resource "nodes"
> - each process container is associated with exactly one resource node
> of each type
>
> - by default, the process container hierarchy and the resource node
> hierarchies are isomorphic, but that can be controlled by userspace.
For the case where resource node hierarchy is different from process
container hierarchy, I am trying to make sense of "why do we need to
maintain two hierarchies" - one the actual hierarchy used for resource
control purpose, another the process container hierarchy. What purpose
does maintaining the process container hierarchy (in addition to the
resource controller hierarchy) solve?
I am thinking we can avoid maintaining these two hierarchies by
something along these lines:
mkdir /dev/cpu
mount -t container -ocpu container /dev/cpu
-> Represents a hierarchy for cpu control purpose.
tsk->cpurc = represent the node in the cpu
controller hierarchy. Also maintains
resource allocation information for
this node.
tsk->cpurc->parent = parent node.
mkdir /dev/mem
mount -t container -omem container /dev/mem
-> Represents a hierarchy for mem control purpose.
tsk->memrc = represent the node in the mem
controller hierarchy. Also maintains
resource allocation information for
this node.
tsk->memrc->parent = parent node.
mkdir /dev/containers
mount -t container -ocontainer container /dev/container
-> Represents a (mostly flat?) hierarchy for the real
container (virtualization) purpose.
tsk->container = represent the node in the container
hierarchy. Also maintains relavant
container information for this node.
tsk->container->parent = parent node.
I suspect this may simplify the "container" filesystem, since it doesn't
have to track multiple hierarchies at the same time, and improve lock
contention too (modifying the cpu controller hierarchy can take a different
lock than the mem controller hierarchy).
--
Regards,
vatsa
Pavel Emelianov wrote:
>> That's like disabling memory over-commit in the regular kernel.
>
> Nope. We limit only unreclaimable mappings. Allowing user
> to break limits breaks the sense of limit.
>
> Or you do not agree that allowing unlimited unreclaimable
> mappings doesn't alow you the way to cut groups gracefully?
>
A quick code review showed that most of the accounting is the
same.
I see that most of the mmap accounting code seems to do
the equivalent of security_vm_enough_memory() when VM_ACCOUNT
is set. Maybe we could merge the accounting code to handle
containers as well.
I looked at
do_mmap_pgoff
acct_stack_growth
__do_brk
do_mremap
> [snip]
>
>> Please see the patching of Rohit's memory controller for user
>> level patching. It seems much simpler.
>
> Could you send me an URL where to get the patch from, please.
> Or the patch itself directly to me. Thank you.
Please see http://lkml.org/lkml/2006/9/19/283
>
> [snip]
>
>> I would prefer a different set
>>
>> 1 & 2, for now we could use any interface and then start developing the
>> controller. As we develop the new controller, we are likely to find the
>> need to add/enhance the interface, so freezing in on 1 & 2 might not be
>> a good idea.
>
> Paul Menage won't agree. He believes that interface must come first.
> I also remind you that the latest beancounter patch provides all the
> stuff we're discussing. It may move tasks, limit all three resources
> discussed, reclaim memory and so on. And configfs interface could be
> attached easily.
>
I think the interface should depend on the controllers and not
the other way around. I fear that the infrastructure discussion might
hold us back and no fruitful work will happen on the controllers.
Once we add and agree on the controller, we can then look at the
interface requirements (like persistence if kernel memory is being
tracked, etc). What do you think?
>> I would put 4, 5 and 6 ahead of 3, based on the changes I see in Rohit's
>> memory controller.
>>
>> Then take up the rest.
>
> I'll review Rohit's patches and comment.
ok
--
Thanks,
Balbir Singh,
Linux Technology Center,
IBM Software Labs
On Tue, Oct 31, 2006 at 05:23:43PM +0530, Srivatsa Vaddagiri wrote:
> mount -t container -ocpu container /dev/cpu
>
> -> Represents a hierarchy for cpu control purpose.
>
> tsk->cpurc = represent the node in the cpu
> controller hierarchy. Also maintains
> resource allocation information for
> this node.
I suspect this will lead to code like:
if (something->..->options == cpu)
tsk->cpurc = ..
else if (something->..->options == mem)
tsk->memrc = ..
Don't know enough about filesystems atm to say if such code is avoidable.
--
Regards,
vatsa
[snip]
> A quick code review showed that most of the accounting is the
> same.
>
> I see that most of the mmap accounting code, it seems to do
> the equivalent of security_vm_enough_memory() when VM_ACCOUNT
> is set. May be we could merge the accounting code to handle
> even containers.
>
> I looked at
>
> do_mmap_pgoff
> acct_stack_growth
> __do_brk (
> do_mremap
I'm sure this is possible. I'll take this into account
in the next patch series. Thank you.
>> [snip]
>>
>>> Please see the patching of Rohit's memory controller for user
>>> level patching. It seems much simpler.
>> Could you send me an URL where to get the patch from, please.
>> Or the patch itself directly to me. Thank you.
>
> Please see http://lkml.org/lkml/2006/9/19/283
Thanks. I'll review it in a couple of days and comment.
[snip]
> I think the interface should depend on the controllers and not
> the other way around. I fear that the infrastructure discussion might
> hold us back and no fruitful work will happen on the controllers.
> Once we add and agree on the controller, we can then look at the
> interface requirements (like persistence if kernel memory is being
> tracked, etc). What do you think?
I do agree with you. But we have to reach an agreement with
Paul on this too...
On Mon, Oct 30, 2006 at 05:08:03PM +0300, Pavel Emelianov wrote:
> 1. One of the major configfs ideas is that lifetime of
> the objects is completely driven by userspace.
> Resource controller shouldn't live as long as user
> want. It "may", but not "must"! As you have seen from
> our (beancounters) patches beancounters disappeared
> as soon as the last reference was dropped. Removing
> configfs entries on beancounter's automatic destruction
> is possible, but it breaks the logic of configfs.
Cpusets have a neat flag called notify_on_release. If set, some userspace
agent is invoked when the last task exits from a cpuset.
Can't we use a similar flag as a configfs file and (if set) invoke a
userspace agent (to clean up) upon the last reference drop? How would this
violate the logic of configfs?
> 2. Having configfs as the only interface doesn't allow
> people having resource control facility w/o configfs.
> Resource controller must not depend on any "feature".
One flexibility configfs (and any fs-based interface) offers is, as Matt
had pointed out some time back, the ability to delegate management of a
sub-tree to a particular user (without requiring root permission).
For ex:
/
|
-----------------
| |
vatsa (70%) linux (20%)
|
----------------------------------
| | |
browser (10%) compile (50%) editor (10%)
In this, group 'vatsa' has been allotted a 70% share of the cpu. Also user
'vatsa' has been given permissions to manage this share as he wants. If
the cpu controller supports hierarchy, user 'vatsa' can create further
sub-groups (browser, compile ..etc) -without- requiring root access.
Also it is convenient to manipulate the resource hierarchy/parameters
through a shell script if it is fs-based.
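A small POSIX sketch of the delegation step; the /dev/rc layout and the uid are assumptions for illustration only. Root creates the subtree and chown()s it to the user, who can then mkdir() and write() below it without root.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    uid_t vatsa_uid = 1000;             /* assumed uid of user "vatsa" */
    gid_t vatsa_gid = 1000;

    /* Root creates the group (its share would be set the same way as
     * any other parameter, by writing a file under it) ... */
    if (mkdir("/dev/rc/vatsa", 0755) == -1)
        perror("mkdir /dev/rc/vatsa");

    /* ... and then delegates the whole subtree to the user. */
    if (chown("/dev/rc/vatsa", vatsa_uid, vatsa_gid) == -1)
        perror("chown /dev/rc/vatsa");

    /* From here on, user "vatsa" can create /dev/rc/vatsa/browser,
     * /dev/rc/vatsa/compile, ... and write their shares, no root needed. */
    return 0;
}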
> 3. Configfs may be easily implemented later as an additional
> interface. I propose the following solution:
Ideally we should have one interface - either syscall or configfs - and
not both.
Assuming your requirement of auto-deleting objects in configfs can be
met through something similar to cpuset's notify_on_release, what other
killer problem do you think configfs will pose?
> > - Should we have different groupings for different resources?
>
> This breaks the idea of groups isolation.
Sorry, I don't get you here. Are you saying we should support different
groupings for different controllers?
> > - Support movement of all threads of a process from one group
> > to another atomically?
>
> This is not a critical question. This is something that
> has difference in
It can be a significant pain for some workloads. I have heard that
workload management products often encounter processes with anywhere
between 200 and 700 threads. Moving all those threads one by
one from user-space can suck.
--
Regards,
vatsa
Pavel Emelianov wrote:
> Paul Jackson wrote:
> I agree, but you've cut some important questions away,
> so I ask them again:
>
> > What if a user creates a controller (configfs directory)
> > and doesn't remove it at all. Should controller stay in
> > memory even if nobody uses it?
>
> This is important to solve now - whether we want or not to
> keep "empty" beancounters in memory. If we do not then configfs
> usage is not acceptable.
I can certainly see scenarios where we would want to keep "empty"
beancounters around.
For instance, I move all the tasks out of a group but still want to be
able to obtain stats on how much cpu time the group has used.
Maybe we can do that without persisting the actual beancounters...I'm
not familiar enough with the code to say.
Chris
On 10/31/06, Pavel Emelianov <[email protected]> wrote:
>
> That's functionality user may want. I agree that some users
> may want to create some kind of "persistent" beancounters, but
> this must not be the only way to control them. I like the way
> TUN devices are done. Each has TUN_PERSIST flag controlling
> whether or not to destroy device right on closing. I think that
> we may have something similar - a flag BC_PERSISTENT to keep
> beancounters with zero refcounter in memory to reuse them.
How about the cpusets approach, where once a cpuset has no children
and no processes, a usermode helper can be executed - this could
immediately remove the container/bean-counter if that's what the user
wants. My generic containers patch copies this from cpusets.
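A plain C sketch of that notify_on_release pattern; the agent path and the structure fields are invented for illustration. Once a group has no tasks and no children, a userspace agent is launched and can decide whether to remove the group, keep it for statistics, or anything else.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct group {
    const char *path;
    int task_count;
    int child_count;
    bool notify_on_release;
};

static void check_for_release(const struct group *g)
{
    char cmd[256];

    if (!g->notify_on_release || g->task_count || g->child_count)
        return;

    /* Hand the now-empty group to a userspace agent; the agent decides
     * whether to remove it, keep it around, etc. */
    snprintf(cmd, sizeof(cmd), "/sbin/release_agent %s", g->path);
    if (system(cmd) == -1)
        perror("system");
}

int main(void)
{
    struct group g = { "/containers/batch", 0, 0, true };

    check_for_release(&g);      /* would invoke /sbin/release_agent */
    return 0;
}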
>
> Moreover, I hope you agree that beancounters can't be made as
> module. If so user will have to built-in configfs, and thus
> CONFIG_CONFIGFS_FS essentially becomes "bool", not a "tristate".
How about a small custom filesystem as part of the containers support,
then? I'm not wedded to using configfs itself, but I do think that a
filesystem interface is much more debuggable and extensible than a
system call interface, and the simple filesystem is only a couple of
hundred lines.
Paul
On 10/31/06, Srivatsa Vaddagiri <[email protected]> wrote:
> For the case where resource node hierarchy is different from process
> container hierarchy, I am trying to make sense of "why do we need to
> maintain two hierarchies" - one the actual hierarchy used for resource
> control purpose, another the process container hierarchy. What purpose
> does maintaining the process container hierarchy (in addition to the
> resource controller hierarchy) solve?
The idea is that in general, people aren't going to want to have
separate hierarchies for different resources - they're going to have
the hierarchies be the same for all resources. So in general when they
move a process from one container to another, they're going to want to
move that task to use all the new resources limits/guarantees
simultaneously.
Having completely independent hierarchies makes this more difficult -
you have to manually maintain multiple different hierarchies from
userspace. Suppose a task forks while you're moving it from one
container to another? With the approach that each process is in one
container, and each container is in a set of resource nodes, at least
the child task is either entirely in the new resource limits or
entirely in the old limits - if userspace has to update several
hierarchies at once non-atomically then a freshly forked child could
end up with a mixture of resource nodes.
>
> I am thinking we can avoid maintaining these two hierarchies, by
> something on these lines:
>
> mkdir /dev/cpu
> mount -t container -ocpu container /dev/cpu
>
> -> Represents a hierarchy for cpu control purpose.
>
> tsk->cpurc = represent the node in the cpu
> controller hierarchy. Also maintains
> resource allocation information for
> this node.
>
If we were going to do something like this, hopefully it would look
more like an array of generic container subsystems, rather than a
separate named pointer for each subsystem.
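A sketch of the difference, with hypothetical types: instead of dedicated tsk->cpurc / tsk->memrc pointers, the task carries an array indexed by subsystem id, so a new controller slots in without touching the task structure.

#include <stdio.h>

enum { SUBSYS_CPU, SUBSYS_MEM, SUBSYS_MAX };    /* registered subsystems */

struct subsys_state {
    const char *node_name;      /* which resource node the task is in */
};

struct task {
    const char *comm;
    struct subsys_state *subsys[SUBSYS_MAX];    /* generic, not named */
};

int main(void)
{
    struct subsys_state cpu_node = { "cpu:/batch" };
    struct subsys_state mem_node = { "mem:/batch/db" };
    struct task t = { "worker", { &cpu_node, &mem_node } };

    for (int i = 0; i < SUBSYS_MAX; i++)
        printf("%s subsys %d -> %s\n", t.comm, i, t.subsys[i]->node_name);
    return 0;
}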
>
> mkdir /dev/mem
> mount -t container -omem container /dev/mem
>
> -> Represents a hierarchy for mem control purpose.
>
> tsk->memrc = represent the node in the mem
> controller hierarchy. Also maintains
> resource allocation information for
> this node.
>
> tsk->memrc->parent = parent node.
>
>
> mkdir /dev/containers
> mount -t container -ocontainer container /dev/container
>
> -> Represents a (mostly flat?) hierarchy for the real
> container (virtualization) purpose.
I think we have an overloading of terminology here. By "container" I
just mean "group of processes tracked for resource control and other
purposes". Can we use a term like "virtual server" if you're doing
virtualization? I.e. a virtual server would be a specialization of a
container (effectively analogous to a resource controller).
>
> I suspect this may simplify the "container" filesystem, since it doesnt
> have to track multiple hierarchies at the same time, and improve lock
> contention too (modifying the cpu controller hierarchy can take a different
> lock than the mem controller hierarchy).
Do you think that lock contention when modifying hierarchies is
generally going to be an issue - how often do tasks get moved around
in the hierarchy, compared to the other operations going on on the
system?
Paul
On Tue, Oct 31, 2006 at 08:34:52AM -0800, Paul Menage wrote:
> How about the cpusets approach, where once a cpuset has no children
> and no processes, a usermode helper can be executed - this could
> immediately remove the container/bean-counter if that's what the user
> wants. My generic containers patch copies this from cpusets.
Bingo. We crossed mails!
Kirill/Pavel,
As I mentioned at the beginning of this thread, one of the
objectives of this RFC is to seek consensus on what could be a good
compromise for the infrastructure going forward. Paul Menage's
patches, being a rework of existing code, are attractive to maintainers like
Andrew.
From that perspective, how well do you think the container
infrastructure patches meet your needs?
--
Regards,
vatsa
On 10/31/06, Pavel Emelianov <[email protected]> wrote:
>
> Paul Menage won't agree. He believes that interface must come first.
No, I'm just trying to get agreement on the generic infrastructure for
process containers and extensibility - the actual API to the memory
controller (i.e. what limits, what to track, etc) can presumably be
fitted into the generic mechanism fairly easily (or else the
infrastructure probably isn't generic enough).
Paul
On Tue, 2006-10-31 at 11:48 +0300, Pavel Emelianov wrote:
> If memory is considered to be unreclaimable then actions should be
> taken at mmap() time, not later! Rejecting mmap() is the only way to
> limit user in unreclaimable memory consumption.
I don't think this is necessarily true. Today, if a kernel exceeds its
allocation limits (runs out of memory) it gets killed. Doing the
limiting at mmap() time instead of fault time will keep sparse memory
applications from even being able to run.
Now, failing an mmap() is a wee bit more graceful than a SIGBUS, but it
certainly introduces its own set of problems.
-- Dave
Paul Menage wrote:
> On 10/30/06, Balbir Singh <[email protected]> wrote:
>> You'll also end up with per zone page cache pools for each zone. A list of
>> active/inactive pages per zone (which will split up the global LRU list).
>
> Yes, these are some of the inefficiencies that we're ironing out.
>
>> What about the hard-partitioning. If a container/cpuset is not using its full
>> 64MB of a fake node, can some other node use it?
>
> No. So the granularity at which you can divide up the system depends
> on how big your fake nodes are. For our purposes, we figure that 64MB
> granularity should be fine.
>
I am still a little concerned about how limit size changes will be implemented.
Will the cpuset "mems" field change to reflect the changed limits?
>> Also, won't you end up
>> with a big zonelist?
>
> Yes - but PaulJ's recent patch to speed up the zone selection helped
> reduce the overhead of this a lot.
Great! Let me find those patches.
>
>> Consider the other side of the story. lets say we have a shared lib shared
>> among quite a few containers. We limit the usage of the inode containing
>> the shared library to 50M. Tasks A and B use some part of the library
>> and cause the container "C" to reach the limit. Container C is charged
>> for all usage of the shared library. Now no other task, irrespective of which
>> container it belongs to, can touch any new pages of the shared library.
>
> Well, if the pages aren't mlocked then presumably some of the existing
> pages can be flushed out to disk and replaced with other pages.
>
>> What you are suggesting is to virtually group the inodes by container rather
>> than task. It might make sense in some cases, but not all.
>
> Right - I think it's an important feature to be able to support, but I
> agree that it's not suitable for all situations.
>> We could consider implementing the controllers in phases
>>
>> 1. RSS control (anon + mapped pages)
>> 2. Page Cache control
>
> Page cache control is actually more essential than RSS control, in our
> experience - it's pretty easy to track RSS values from userspace, and
> react reasonably quickly to kill things that go over their limit, but
> determining page cache usage (i.e. determining which job on the system
> is flooding the page cache with dirty buffers) is pretty much
> impossible currently.
>
Hmm... interesting. Why do you think it's impossible? What are the kinds of
issues you've run into?
> Paul
>
--
Balbir Singh,
Linux Technology Center,
IBM Software Labs
On 10/31/06, Balbir Singh <[email protected]> wrote:
>
> I am still a little concerned about how limit size changes will be implemented.
> Will the cpuset "mems" field change to reflect the changed limits?
That's how we've been doing it - increasing limits is easy, shrinking
them is harder ...
> > Page cache control is actually more essential that RSS control, in our
> > experience - it's pretty easy to track RSS values from userspace, and
> > react reasonably quickly to kill things that go over their limit, but
> > determining page cache usage (i.e. determining which job on the system
> > is flooding the page cache with dirty buffers) is pretty much
> > impossible currently.
> >
>
> Hmm... interesting. Why do you think its impossible, what are the kinds of
> issues you've run into?
>
Issues such as:
- determining from userspace how much of the page cache is really
"free" memory that can be given out to new jobs without impacting the
performance of existing jobs
- determining which job on the system is flooding the page cache with
dirty buffers
- accounting the active pagecache usage of a job as part of its memory
footprint (if a process is only 1MB large but is seeking randomly
through a 1GB file, treating it as only using/needing 1MB isn't
practical).
Paul
On Tue, 2006-10-31 at 09:22 -0800, Paul Menage wrote:
> >
> > Hmm... interesting. Why do you think its impossible, what are the kinds of
> > issues you've run into?
> >
>
> Issues such as:
>
> - determining from userspace how much of the page cache is really
> "free" memory that can be given out to new jobs without impacting the
> performance of existing jobs
>
> - determining which job on the system is flooding the page cache with
> dirty buffers
>
Interesting .. these are exactly the questions our database people
have been asking us for a few years :)
Thanks,
Badari
On Mon, 30 Oct 2006, Paul Menage wrote:
> More or less. More concretely:
>
> - there is a single hierarchy of process containers
> - each process is a member of exactly one process container
>
> - for each resource controller, there's a hierarchy of resource "nodes"
> - each process container is associated with exactly one resource node
> of each type
>
> - by default, the process container hierarchy and the resource node
> hierarchies are isomorphic, but that can be controlled by userspace.
>
This approach appears to be the most complete and extensible
implementation of containers for all practical uses. Not only can you use
these process containers in conjunction with your choice of memory
controllers, network controllers, disk I/O controllers, etc, but you can
also pick and choose your own modular controller of choice to meet your
needs.
So here are our three process containers, A, B, and C, with our tasks m-t:
-----A----- -----B----- -----C-----
| | | | | | | |
m n o p q r s t
Here are our memory controller groups D and E and our containers set within
them:
-----D----- -----E-----
| | |
A B C
[ My memory controller E is for my real-time processes so I set its
attributes appropriately so that it never breaks. ]
And our network controller groups F, G, and H:
-----F----- -----G-----
| |
-----H----- C
| |
A B
[ I'm going to leave my network controller F open for my customer's
WWW browsing, but nobody is using it right now. ]
I choose not to control disk I/O so there is no change from current behavior
for any of the processes listed above.
There are two things I notice about this approach (my use of the word
"container" refers to the process containers A, B, and C; my use of the
word "controller" refers to memory, disk I/O, network, etc controllers):
- While the process containers are only single-level, the controllers are
  _inherently_ hierarchical just like a filesystem. So it appears that
  the manipulation of these controllers would most effectively be done
  from userspace with a filesystem approach. While it may not be served
  by forcing CONFIG_CONFIGFS_FS to be enabled, I observe no objection to
  giving it its own filesystem capability, apart from configfs, through
  the kernel. The filesystem manipulation tools that everybody is
  familiar with make the implementation of controllers simple and, more
  importantly, easier to _use_.
- The process containers will need to be set up as desired following
  boot. So if the current approach of cpusets is used, where the
  functionality is enabled on mount, all processes will originally belong
  to the default container that encompasses the entire system. Since
  each process must belong to only one process container as per Paul
  Menage's proposal, a new container will need to be created and
  processes _moved_ to it for later use by controllers. So it appears
  that the manipulation of containers would most effectively be done from
  userspace by a syscall approach.
In this scenario, it is not necessary for network controller groups F and
G above to be limited (or guaranteed) to 100% of our network load. It is
quite possible that we do not assign every container to a network
controller so that they receive the remainder of the bandwidth that is not
already attributed to F and G. The same is true with any controller. Our
controllers should only seek to limit or guarantee a certain amount of
resources, not force each system process to be a member of one group or
another to receive the resources.
Two questions also arise:
- Why do I need to create (i.e. mount the filesystem) the container in
  the first place? Since the use of these containers is entirely on the
  shoulders of the optional controllers, there should be no interference
  with current behavior if I choose not to use any controller. So why
  not take the approach that NUMA did, where, on a UMA machine,
  all of memory belongs to node 0? In our case, all processes will
  inherently belong to a system-wide container similar to procfs. In
  fact, procfs is how this can be implemented apart from configfs,
  following the criticism from UBC.
- How is forking handled with the various controllers? Do child
  processes automatically inherit all the controller groups of their
  parent? If not (or if it's dependent on a user-configured attribute
  of the controller), what happens when I want forked processes to
  belong to a new network controller group from container A in the
  illustration above? Certainly that new controller cannot be
  created as a sibling of F and G; and determining the limit on
  network for a third child of H would be non-trivial because then
  the network resources allocated to A and B would be scaled back,
  perhaps in an undesired manner.
So the container abstraction looks appropriate for a syscall interface
whereas a controller abstraction looks appropriate for a filesystem
interface. If Paul Menage's proposal above is adopted, it seems like
the design and implementation of containers is the first milestone; how
far does the current patchset get us to what is described above? Does it
still support a hierarchy just like cpusets?
And following that, it seems like the next milestone would be to design
the different characteristics that the various modular controllers could
support such as notify_on_release, limits/guarantees, behavior on fork,
and privileges.
David
On Tue, 31 Oct 2006, Pavel Emelianov wrote:
> Paul Menage won't agree. He believes that interface must come first.
> I also remind you that the latest beancounter patch provides all the
> stuff we're discussing. It may move tasks, limit all three resources
> discussed, reclaim memory and so on. And configfs interface could be
> attached easily.
>
There are really two different interfaces: those to the controller and those
to the container. While configfs (or a simpler fs implementation solely
for our purposes) is the most logical because of its inherent hierarchical
nature, it seems like the only criticism of that has come from UBC. From
my understanding of beancounters, they could be implemented on top of any
such container abstraction anyway.
David
Paul Menage wrote:
> On 10/31/06, Balbir Singh <[email protected]> wrote:
>> I am still a little concerned about how limit size changes will be implemented.
>> Will the cpuset "mems" field change to reflect the changed limits?
>
> That's how we've been doing it - increasing limits is easy, shrinking
> them is harder ...
>
>>> Page cache control is actually more essential than RSS control, in our
>>> experience - it's pretty easy to track RSS values from userspace, and
>>> react reasonably quickly to kill things that go over their limit, but
>>> determining page cache usage (i.e. determining which job on the system
>>> is flooding the page cache with dirty buffers) is pretty much
>>> impossible currently.
>>>
>> Hmm... interesting. Why do you think it's impossible, what are the kinds of
>> issues you've run into?
>>
>
> Issues such as:
>
> - determining from userspace how much of the page cache is really
> "free" memory that can be given out to new jobs without impacting the
> performance of existing jobs
>
> - determining which job on the system is flooding the page cache with
> dirty buffers
>
> - accounting the active pagecache usage of a job as part of its memory
> footprint (if a process is only 1MB large but is seeking randomly
> through a 1GB file, treating it as only using/needing 1MB isn't
> practical).
>
> Paul
>
Thanks for the info!
I thought this would be hard to do in general, but with a page -->
container mapping that will come as a result of the memory controller,
will it still be that hard?
I'll dig deeper.
--
Balbir Singh,
Linux Technology Center,
IBM Software Labs
On 10/31/06, Balbir Singh <[email protected]> wrote:
>
> I thought this would be hard to do in general, but with a page -->
> container mapping that will come as a result of the memory controller,
> will it still be that hard?
I meant that it's pretty much impossible with the current APIs
provided by the kernel. That's why one of the most useful things that
a memory controller can provide is accounting and limiting of page
cache usage.
Paul
Paul Menage wrote:
> On 10/31/06, Balbir Singh <[email protected]> wrote:
>> I thought this would be hard to do in general, but with a page -->
>> container mapping that will come as a result of the memory controller,
>> will it still be that hard?
>
> I meant that it's pretty much impossible with the current APIs
> provided by the kernel. That's why one of the most useful things that
> a memory controller can provide is accounting and limiting of page
> cache usage.
>
> Paul
Thanks for clarifying that! I completely agree, page cache control is
very important!
--
Balbir Singh,
Linux Technology Center,
IBM Software Labs
Dave Hansen wrote:
> On Tue, 2006-10-31 at 11:48 +0300, Pavel Emelianov wrote:
>> If memory is considered to be unreclaimable then actions should be
>> taken at mmap() time, not later! Rejecting mmap() is the only way to
>> limit a user's unreclaimable memory consumption.
>
> I don't think this is necessarily true. Today, if a kernel exceeds its
> allocation limits (runs out of memory) it gets killed. Doing the
> limiting at mmap() time instead of fault time will keep sparse memory
> applications from even being able to run.
If we limited _every_ mapping it would, but when limiting only
"private" mappings there are no problems at all. The BC code has been
in use for more than 3 years already, with no complaints from users on
this question yet.
> Now, failing an mmap() is a wee bit more graceful than a SIGBUS, but it
> certainly introduces its own set of problems.
>
> -- Dave
>
>
Paul Menage wrote:
> On 10/31/06, Pavel Emelianov <[email protected]> wrote:
>>
>> That's functionality a user may want. I agree that some users
>> may want to create some kind of "persistent" beancounters, but
>> this must not be the only way to control them. I like the way
>> TUN devices are done. Each has a TUN_PERSIST flag controlling
>> whether or not to destroy the device right on close. I think that
>> we may have something similar - a BC_PERSISTENT flag to keep
>> beancounters with a zero refcount in memory so they can be reused.
>
> How about the cpusets approach, where once a cpuset has no children
> and no processes, a usermode helper can be executed - this could
Hmm... Sounds good. I'll think over this.
> immediately remove the container/bean-counter if that's what the user
> wants. My generic containers patch copies this from cpusets.
>
>>
>> Moreover, I hope you agree that beancounters can't be built as a
>> module. If so, the user will have to build configfs in, and thus
>> CONFIG_CONFIGFS_FS essentially becomes a "bool", not a "tristate".
>
> How about a small custom filesystem as part of the containers support,
> then? I'm not wedded to using configfs itself, but I do think that a
> filesystem interface is much more debuggable and extensible than a
> system call interface, and the simple filesystem is only a couple of
> hundred lines.
This sounds more reasonable to me than using configfs.
> Paul
>
[snip]
>> 2. Having configfs as the only interface doesn't allow
>> people to have a resource control facility w/o configfs.
>> A resource controller must not depend on any "feature".
>
> One flexibility configfs (and any fs-based interface) offers is, as Matt
> had pointed out sometime back, the ability to delegate management of a
> sub-tree to a particular user (without requiring root permission).
>
> For ex:
>
> /
> |
> -----------------
> | |
> vatsa (70%) linux (20%)
> |
> ----------------------------------
> | | |
> browser (10%) compile (50%) editor (10%)
>
> In this, group 'vatsa' has been allotted a 70% share of cpu. Also user
> 'vatsa' has been given permissions to manage this share as he wants. If
> the cpu controller supports hierarchy, user 'vatsa' can create further
> sub-groups (browser, compile ..etc) -without- requiring root access.
I can do the same using bcctl tool and sudo :)
> Also it is convenient to manipulate resource hierarchy/parameters through a
> shell-script if it is fs-based.
>
>> 3. Configfs may be easily implemented later as an additional
>> interface. I propose the following solution:
>
> Ideally we should have one interface - either syscall or configfs - and
> not both.
Agree.
> Assuming your requirement of auto-deleting objects in configfs can be
> met through something similar to cpuset's notify_on_release, what other
> killer problem do you think configfs will pose?
>
>
>>> - Should we have different groupings for different resources?
>> This breaks the idea of groups isolation.
>
> Sorry, I don't get you here. Are you saying we should support different
> grouping for different controllers?
Not me, but other people in this thread.
David Rientjes wrote:
> On Tue, 31 Oct 2006, Pavel Emelianov wrote:
>
>> Paul Menage won't agree. He believes that interface must come first.
>> I also remind you that the latest beancounter patch provides all the
>> stuff we're discussing. It may move tasks, limit all three resources
>> discussed, reclaim memory and so on. And configfs interface could be
>> attached easily.
>>
>
> There are really two different interfaces: those to the controller and those
> to the container. While the configfs (or a simpler fs implementation solely
> for our purposes) is the most logical because of its inherent hierarchical
> nature, it seems like the only criticism on that has come from UBC. From
> my understanding of beancounter, it could be implemented on top of any
> such container abstraction anyway.
beancounters may be implemented on top of any (or nearly any) userspace
interface, no question. But we're trying to come to an agreement here,
so I am just stating my point of view.
I don't mind having a filesystem-based interface, I just believe that
configfs is not so good for it. I've already answered that having
our own filesystem for it sounds better than having configfs.
Maybe we can summarize what we have come to?
> David
>
On Wed, 1 Nov 2006, Pavel Emelianov wrote:
> beancounters may be implemented on top of any (or nearly any) userspace
> interface, no question. But we're trying to come to an agreement here,
> so I am just stating my point of view.
>
> I don't mind having a filesystem-based interface, I just believe that
> configfs is not so good for it. I've already answered that having
> our own filesystem for it sounds better than having configfs.
>
> Maybe we can summarize what we have come to?
>
I've seen nothing but praise for Paul Menage's suggestion of implementing
a single-level containers abstraction for processes and attaching
these to various resource controller (disk, network, memory, cpu) nodes.
The question of whether to use configfs or not is really at the forefront
of that discussion because making any progress in implementation is
difficult without first deciding upon it, and the containers abstraction
patchset uses configfs as its interface.
The original objection against configfs was against the lifetime of the
resource controller. But this is actually a two-part question since there
are two interfaces: one for the containers, one for the controllers. At
present it seems like the only discussion taking place is that of the
container, so this objection can wait. After boot, there are two
options:
- require the user to mount the configfs filesystem with a single
system-wide container as default
i. include all processes in that container by default
ii. include no processes in that container, force the user to add them
- create the entire container abstraction upon boot and attach all
processes to it in a manner similar to procfs
[ In both scenarios, kernel behavior is unchanged if no resource
controller node is attached to any container as if the container(s)
didn't exist. ]
Another objection against configfs was the fact that you must enable
CONFIG_CONFIGFS_FS to use CONFIG_CONTAINERS. This objection does not make
much sense since it seems like we are leaning in the direction of abandoning
the syscall approach here and looking toward an fs approach in the first
place. So CONFIG_CONTAINERS will need to include its own lightweight
filesystem if we cannot use CONFIG_CONFIGFS_FS, but that seems redundant
since this is what configfs is for: a configurable filesystem to interface
to the kernel. We definitely do not want two or more interfaces to
_containers_, so we would simply be reimplementing an already existing
infrastructure.
The criticism that users can create containers and then not use them
shouldn't be an issue if it is carefully implemented. In fact, I proposed
that all processes are initially attached to a single system-wide
container at boot, regardless of whether you've loaded any controllers, just
as UMA machines work with node 0 for system-wide memory. We should
incur no overhead for having empty or _full_ containers if we haven't
loaded controllers or have configured them properly to include the right
containers.
So if we re-read Paul Menage's patchset that abstracts containers away from
cpusets and uses configfs, we can see that we are almost there, with the
exception of making it a single-layer "hierarchy" as he has already
proposed. Resource controller "nodes" that these containers can be
attached to are a separate issue at this point and shouldn't be confused.
David
> Consensus/Debated Points
> ------------------------
>
> Consensus:
>
> - Provide resource control over a group of tasks
> - Support movement of task from one resource group to another
> - Don't support hierarchy for now
> - Support limit (soft and/or hard depending on the resource
> type) in controllers. Guarantee feature could be indirectly
> met through limits.
>
> Debated:
> - syscall vs configfs interface
OK. Let's settle on the configfs interface so we can move on...
> - Interaction of resource controllers, containers and cpusets
> - Should we support, for instance, creation of resource
> groups/containers under a cpuset?
> - Should we have different groupings for different resources?
I propose to discuss this question as this is the most important
now from my point of view.
I believe this can be done, but can't imagine how to use this...
> - Support movement of all threads of a process from one group
> to another atomically?
I propose such a solution: if a user asks to move /proc/<pid>
then move the whole task with threads.
If user asks to move /proc/<pid>/task/<tid> then move just
a single thread.
What do you think?
David wrote:
> - While the process containers are only single-level, the controllers are
> _inherently_ hierarchial just like a filesystem. So it appears that
Cpusets certainly enjoys what I would call hierarchical process
containers. I can't tell if your flat container space is just
a "for instance", or you're recommending we only have a flat
container space.
If the latter, I disagree.
> So it appears
> that the manipulation of containers would most effectively be done from
> userspace by a syscall approach.
Yup - sure sounds like you're advocating a flat container space
accessed by system calls.
Sure doesn't sound right to me. I like hierarchical containers,
accessed via something like a file system.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On Wed, 1 Nov 2006, Pavel Emelianov wrote:
> > - Interaction of resource controllers, containers and cpusets
> > - Should we support, for instance, creation of resource
> > groups/containers under a cpuset?
> > - Should we have different groupings for different resources?
>
> I propose to discuss this question as this is the most important
> now from my point of view.
>
> I believe this can be done, but can't imagine how to use this...
>
I think cpusets, as abstracted away from containers by Paul Menage, simply
become a client of the container configfs. Cpusets would become more of a
NUMA-type controller by default.
Different groupings for different resources was already discussed. If we
use the approach of a single-level "hierarchy" for process containers and
then attach them each to a "node" of a controller, then the groupings have
been achieved. It's possible to change the network controller of a
container or move processes from container to container easily through the
filesystem.
> > - Support movement of all threads of a process from one group
> > to another atomically?
>
> I propose such a solution: if a user asks to move /proc/<pid>
> then move the whole task with threads.
> If user asks to move /proc/<pid>/task/<tid> then move just
> a single thread.
>
> What do you think?
This seems to use my proposal of using procfs as an abstraction of process
containers. I haven't looked at the implementation details, but it seems
like the most appropriate place given what it currently supports.
Naturally it should be an atomic move but I don't think it's the most
important detail in terms of efficiency because moving threads should not
be such a frequent occurrence anyway. This raises the question of how
forks are handled for processes with regard to the various controllers
that could be implemented and whether they should all be descendants of the
parent container by default or have the option of spawning a new
controller altogether. This would be an attribute of controllers and
not containers, however.
David
On Wed, 1 Nov 2006, Paul Jackson wrote:
> David wrote:
> > - While the process containers are only single-level, the controllers are
> > _inherently_ hierarchial just like a filesystem. So it appears that
>
> Cpusets certainly enjoys what I would call hierarchical process
> containers. I can't tell if your flat container space is just
> a "for instance", or you're recommending we only have a flat
> container space.
>
This was using the recommendation of "each process belongs to a single
container that can be attached to controller nodes later." So while it is
indeed possible for the controllers, whatever they are, to be hierarchical
(and most assuredly should be), what is the objection against grouping
processes in single-level containers? The only difference is that now
when we assign processes to specific controllers with their attributes set
as we desire, we are assigning a container (or group) of processes instead of
individual ones.
David
Balbir wrote:
> Paul Menage wrote:
> > On 10/31/06, Balbir Singh <[email protected]> wrote:
> >> I thought this would be hard to do in general, but with a page -->
> >> container mapping that will come as a result of the memory controller,
> >> will it still be that hard?
> >
> > I meant that it's pretty much impossible with the current APIs
> > provided by the kernel. That's why one of the most useful things that
> > a memory controller can provide is accounting and limiting of page
> > cache usage.
> >
> > Paul
>
> Thanks for clarifying that! I completely agree, page cache control is
> very important!
Doesn't "zone_reclaim" (added by Christoph Lameter over the last
several months) go a long way toward resolving this page cache control
problem?
Essentially, if my understanding is correct, zone reclaim has tasks
that are asking for memory first do some work towards keeping enough
memory free, such as doing some work reclaiming slab memory and pushing
swap and pushing dirty buffers to disk.
Tasks must help out as is needed to keep the per-node free memory above
watermarks.
This way, you don't actually have to account for who owns what, with
all the problems arbitrating between claims on shared resources.
Rather, you just charge the next customer who comes in the front
door (aka, mm/page_alloc.c:__alloc_pages()) a modest overhead if
they happen to show up when free memory supplies are running short.
On average, it has the same effect as a strict accounting system,
of charging the heavy users more (more CPU cycles in kernel vmscan
code and clock cycles waiting on disk heads). But it does so without
any need of accurate per-user accounting.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On Tue, Oct 31, 2006 at 08:39:27PM -0800, David Rientjes wrote:
> So here's our three process containers, A, B, and C, with our tasks m-t:
>
> -----A----- -----B----- -----C-----
> | | | | | | | |
> m n o p q r s t
>
> Here's our memory controller groups D and E and our containers set within
> them:
>
> -----D----- -----E-----
> | | |
> A B C
This would force all tasks in container A to belong to the same mem/io ctlr
groups. What if that is not desired? How would we achieve something like
this:
tasks (m) should belong to mem ctlr group D,
tasks (n, o) should belong to mem ctlr group E
tasks (m, n, o) should belong to i/o ctlr group G
(this example breaks the required condition/assumption that a task belongs to
exactly one process container).
Is this an unrealistic requirement? I suspect not, and we should provide this
flexibility, if we ever have to support task-grouping that is
unique to each resource. Fundamentally, process grouping exists because
of various resources and not otherwise.
At this point, what purpose does having/exposing-to-user the generic process
container abstraction A, B and C achieve?
IMHO what is more practical is to let res ctlr groups (like D, E, F, G)
be composed of individual tasks (rather than containers).
Note that all this is not saying that Paul Menage's patches are
pointless. In fact his generalization of cpusets to achieve process
grouping is indeed a good idea. I am only saying that his mechanism
should be used to define groups-of-tasks under each resource, rather
than to have groups-of-containers under each resource.
--
Regards,
vatsa
On Wed, 2006-11-01 at 11:01 +0300, Pavel Emelianov wrote:
> [snip]
>
> >> 2. Having configfs as the only interface doesn't allow
> >> people to have a resource control facility w/o configfs.
> >> A resource controller must not depend on any "feature".
That's not true. It's possible for a resource control system that uses
a filesystem interface to operate without its filesystem interface. In
fact, for performance reasons I think it's necessary.
Even assuming your point is true, since you agree there should be only
one interface does it matter that choosing one prevents implementing
another?
Why must a resource controller never depend on another "feature"?
> > One flexibility configfs (and any fs-based interface) offers is, as Matt
> > had pointed out sometime back, the ability to delegate management of a
> > sub-tree to a particular user (without requiring root permission).
> >
> > For ex:
> >
> > /
> > |
> > -----------------
> > | |
> > vatsa (70%) linux (20%)
> > |
> > ----------------------------------
> > | | |
> > browser (10%) compile (50%) editor (10%)
> >
> > In this, group 'vatsa' has been allotted a 70% share of cpu. Also user
> > 'vatsa' has been given permissions to manage this share as he wants. If
> > the cpu controller supports hierarchy, user 'vatsa' can create further
> > sub-groups (browser, compile ..etc) -without- requiring root access.
>
> I can do the same using bcctl tool and sudo :)
bcctl and, to a lesser extent, sudo are more esoteric.
Open, read, write, mkdir, unlink, etc. are all system calls so it seems
we all agree that system calls are the way to go. ;) Now if only we
could all agree on which system calls...
> > Also it is convenient to manipulate resource hierarchy/parameters through a
> > shell-script if it is fs-based.
> >
> >> 3. Configfs may be easily implemented later as an additional
> >> interface. I propose the following solution:
> >
> > Ideally we should have one interface - either syscall or configfs - and
> > not both.
To incorporate all feedback perhaps we should replace "configfs" with
"filesystem".
Cheers,
-Matt Helsley
On Wed, Nov 01, 2006 at 09:29:37PM +0530, Srivatsa Vaddagiri wrote:
> This would force all tasks in container A to belong to the same mem/io ctlr
> groups. What if that is not desired? How would we achieve something like
> this:
>
> tasks (m) should belong to mem ctlr group D,
> tasks (n, o) should belong to mem ctlr group E
> tasks (m, n, o) should belong to i/o ctlr group G
>
> (this example breaks the required condition/assumption that a task belongs to
> exactly one process container).
>
> Is this an unrealistic requirement? I suspect not, and we should provide this
> flexibility, if we ever have to support task-grouping that is
> unique to each resource. Fundamentally, process grouping exists because
> of various resources and not otherwise.
In this article, http://lwn.net/Articles/94573/, Linus is quoted as wanting
something close to the above example, I think.
--
Regards,
vatsa
On Tue, Oct 31, 2006 at 08:46:00AM -0800, Paul Menage wrote:
> The idea is that in general, people aren't going to want to have
> separate hierarchies for different resources - they're going to have
> the hierarchies be the same for all resources. So in general when they
> move a process from one container to another, they're going to want to
> move that task to use all the new resources limits/guarantees
> simultaneously.
Sure, a reasonable enough requirement.
> Having completely independent hierarchies makes this more difficult -
> you have to manually maintain multiple different hierarchies from
> userspace.
I suspect we can avoid maintaining separate hierarchies if not required.
mkdir /dev/res_groups
mount -t container -o cpu,mem,io none /dev/res_groups
mkdir /dev/res_groups/A
mkdir /dev/res_groups/B
Directories A and B would now contain res ctl files associated with all
resources (viz cpu, mem, io) and also a 'members' file listing the tasks
belonging to those groups.
Do you think the above mechanism is implementable? Even if it is, I don't know
how complicated the implementation will get because of this requirement.
> Suppose a task forks while you're moving it from one
> container to another? With the approach that each process is in one
> container, and each container is in a set of resource nodes, at least
This requirement that each process should be in exactly one process container
is perhaps not good, since it will not give the flexibility to define groups
unique to each resource (see my reply earlier to David Rientjes).
> the child task is either entirely in the new resource limits or
> entirely in the old limits - if userspace has to update several
> hierarchies at once non-atomically then a freshly forked child could
> end up with a mixture of resource nodes.
If the user intended to have a common grouping hierarchy for all
resources, then this movement of tasks can be "atomic" as far as the user is
concerned, as per the above example:
echo task_pid > /dev/res_groups/B/members
should cause the task to transition to the new group in one shot?
> >I am thinking we can avoid maintaining these two hierarchies, by
> >something on these lines:
> >
> > mkdir /dev/cpu
> > mount -t container -ocpu container /dev/cpu
> >
> > -> Represents a hierarchy for cpu control purpose.
> >
> > tsk->cpurc = represent the node in the cpu
> > controller hierarchy. Also maintains
> > resource allocation information for
> > this node.
> >
>
> If we were going to do something like this, hopefully it would look
> more like an array of generic container subsystems, rather than a
> separate named pointer for each subsystem.
Sounds good.
> I think we have an overloading of terminology here. By "container" I
> just mean "group of processes tracked for resource control and other
> purposes". Can we use a term like "virtual server" if you're doing
> virtualization? I.e. a virtual server would be a specialization of a
> container (effectively analogous to a resource controller)
Ok, sure.
> >I suspect this may simplify the "container" filesystem, since it doesn't
> >have to track multiple hierarchies at the same time, and improve lock
> >contention too (modifying the cpu controller hierarchy can take a different
> >lock than the mem controller hierarchy).
>
> Do you think that lock contention when modifying hierarchies is
> generally going to be an issue - how often do tasks get moved around
> in the hierarchy, compared to the other operations going on on the
> system?
I suspect the manipulation of the resource group hierarchy (and the
resulting lock contention) will be more frequent than of the cpuset
hierarchy, if we have to support scenarios like the one here:
http://lkml.org/lkml/2006/9/5/178
I will try and get a better picture of how frequent such task migration
would be in practice from a few people I know who are interested in this
feature within IBM.
--
Regards,
vatsa
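[ A rough aside on the "-o cpu,mem,io" mount idea above: the
  comma-separated option string could simply be turned into a bitmask
  of subsystems bound to that hierarchy at mount time. Standalone
  sketch with hypothetical names. ]

#include <string.h>

enum { SUBSYS_CPU, SUBSYS_MEM, SUBSYS_IO, NR_SUBSYS };

static const char *subsys_name[NR_SUBSYS] = { "cpu", "mem", "io" };

/* Turn "cpu,mem,io" into a bitmask of subsystems for this hierarchy. */
static unsigned long parse_subsys_opts(char *opts)
{
        unsigned long mask = 0;
        char *tok;
        int i;

        for (tok = strtok(opts, ","); tok; tok = strtok(NULL, ","))
                for (i = 0; i < NR_SUBSYS; i++)
                        if (!strcmp(tok, subsys_name[i]))
                                mask |= 1UL << i;
        return mask;
}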
On Mon, Oct 30, 2006 at 02:51:24AM -0800, Paul Menage wrote:
> The cpusets code which this was based on simply locked the task list,
> and traversed it to find threads in the cpuset of interest; you could
> do the same thing in any other resource controller.
Sure ..the point was about efficiency (whether you plough through
thousands of tasks to find those 10 tasks which belong to a group or
you have a list which gets to the 10 tasks immediately). But then the
cost of maintaining such a list is noted.
> Not keeping a list of tasks in the container makes fork/exit more
> efficient, and I assume is the reason that cpusets made that design
> decision. If we really wanted to keep a list of tasks in a container
> it wouldn't be hard, but should probably be conditional on at least
> one of the registered resource controllers to avoid unnecessary
> overhead when none of the controllers actually care (in a similar
> manner to the fork/exit callbacks, which only take the container
> callback mutex if some container subsystem is interested in fork/exit
> events).
Makes sense.
> How important is it for controllers/subsystems to be able to
> deregister themselves, do you think? I could add it relatively easily,
> but it seemed unnecessary in general.
Not very important perhaps.
> I've not really played with it yet, but I don't see any reason why the
> beancounter resource control concept couldn't also be built over
> generic containers. The user interface would be different, of course
> (filesystem vs syscall), but maybe even that could be emulated if
> there was a need for backwards compatibility.
Hmm ..cpusets is in mainline already and hence we do need to worry about
backward compatibility. If we were to go ahead with your patches, do we have
the same backward compatibility concern for beancounter as well? :)
> > Consensus:
> >
> > - Provide resource control over a group of tasks
> > - Support movement of task from one resource group to another
> > - Don't support hierarchy for now
>
> Both CKRM/RG and generic containers support a hierarchy.
I guess the consensus (as was reached at the OLS BoF :
http://lkml.org/lkml/2006/7/26/237) was more wrt controllers than
the infrastructure.
>
> > - Support limit (soft and/or hard depending on the resource
> > type) in controllers. Guarantee feature could be indirectly
> > met through limits.
>
> That's an issue for resource controllers, rather than the underlying
> infrastructure, I think.
Hmm ..I don't think so. If we were to support both guarantee and limit,
then the infrastructure has to provide interfaces to set both values
for a group.
--
Regards,
vatsa
On Wed, Nov 01, 2006 at 08:04:01AM -0800, Matt Helsley wrote:
> > >> 3. Configfs may be easily implemented later as an additional
> > >> interface. I propose the following solution:
> > >
> > > Ideally we should have one interface - either syscall or configfs - and
> > > not both.
>
> To incorporate all feedback perhaps we should replace "configfs" with
> "filesystem".
Yes, you are right.
--
Regards,
vatsa
On Wed, Nov 01, 2006 at 11:01:31AM +0300, Pavel Emelianov wrote:
> > Sorry, I don't get you here. Are you saying we should support different
> > grouping for different controllers?
>
> Not me, but other people in this thread.
Hmm ..I thought OpenVz folks were interested in having different
groupings for different resources, i.e. the grouping for CPU should be
independent of the grouping for memory.
http://lkml.org/lkml/2006/8/18/98
Isn't that true?
--
Regards,
vatsa
On Wed, Nov 01, 2006 at 12:30:13PM +0300, Pavel Emelianov wrote:
> > Debated:
> > - syscall vs configfs interface
>
> OK. Let's settle on the configfs interface so we can move on...
Excellent!
> > - Should we have different groupings for different resources?
>
> I propose to discuss this question as this is the most important
> now from my point of view.
>
> I believe this can be done, but can't imagine how to use this...
As I mentioned in my earlier mail, I thought openvz folks did want this
flexibility:
http://lkml.org/lkml/2006/8/18/98
Also:
http://lwn.net/Articles/94573/
But I am ok if we don't support this feature in the initial round of
development.
Having different groupings for different resources could be hairy to deal
with and could easily mess up applications (for ex: a process in an 80%
CPU class but in a 10% memory class could see underutilization of
its cpu share, because it cannot allocate memory as fast as it wants to run),
so it is assumed that the administrator will carefully manage these settings.
> > - Support movement of all threads of a process from one group
> > to another atomically?
>
> I propose such a solution: if a user asks to move /proc/<pid>
> then move the whole task with threads.
> If user asks to move /proc/<pid>/task/<tid> then move just
> a single thread.
>
> What do you think?
Isn't /proc/<pid> also listed in /proc/<pid>/task/<tid>?
For ex:
# ls /proc/2906/task
2906 2907 2908 2909
2906 is the main thread which created the remaining threads.
This would lead to an ambiguity when user does something like below:
echo 2906 > /some_res_file_system/some_new_group
Is he intending to move just the main thread, 2906, to the new group or
all the threads? It could be either.
This needs some more thought ...
--
Regards,
vatsa
On Tue, Oct 31, 2006 at 08:39:27PM -0800, David Rientjes wrote:
> - How is forking handled with the various controllers? Do child
> processes automatically inherit all the controller groups of their
> parent? If not (or if it's dependent on a user-configured attribute
I think it would be simpler to go with the assumption that a child process should
automatically inherit the same resource controller groups as its parent.
Although I think CKRM did attempt to provide the flexibility of
changing this behavior using a rule-based classification engine (Matt/Chandra,
correct me if I am wrong here).
--
Regards,
vatsa
On Wed, 1 Nov 2006, Srivatsa Vaddagiri wrote:
> This would force all tasks in container A to belong to the same mem/io ctlr
> groups. What if that is not desired? How would we achieve something like
> this:
>
> tasks (m) should belong to mem ctlr group D,
> tasks (n, o) should belong to mem ctlr group E
> tasks (m, n, o) should belong to i/o ctlr group G
>
With the example you would need to place task m in one container called
A_m and tasks n and o in another container called A_n,o. Then join A_m to
D, A_n,o to E, and both to G.
I agree that this doesn't appear to be very easy to setup by the sysadmin
or any automated means. But in terms of the kernel, each of these tasks
would have a pointer back to its container and that container would point
to its assigned resource controller. So it's still a double dereference
to access the controller from any task_struct.
So if we proposed a hierarchy of containers, we could have the following:
----------A----------
| | |
-----B----- m -----C------
| | |
n -----D----- o
| |
p q
So instead we make the requirement that only one container can be attached
to any given controller. So if container A is attached to a disk I/O
controller, for example, then it includes all processes. If D is attached
to it instead, only p and q are affected by its constraints.
This would be possible by adding a field to the struct container that
would point to its parent cpu, net, mem, etc. container or NULL if it is
itself.
The difference:
Single-level container hierarchy
struct task_struct {
        ...
        struct container *my_container;
};

struct container {
        ...
        struct controller *my_cpu_controller;
        struct controller *my_mem_controller;
};

Multi-level container hierarchy

struct task_struct {
        ...
        struct container *my_container;
};

struct container {
        ...
        /* Root containers, NULL if itself */
        struct container *my_cpu_root_container;
        struct container *my_mem_root_container;
        /* Controllers, NULL if has parent */
        struct controller *my_cpu_controller;
        struct controller *my_mem_controller;
};
This eliminates the need to put a pointer to each resource controller
within each task_struct.
> (this example breaks the required condition/assumption that a task belong to
> exactly only one process container).
>
Yes, and that was the requirement that the above example was based upon.
David
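[ Purely as an illustration of the "double dereference" point above,
  here is the multi-level lookup compressed to the memory case; the
  structure below repeats just enough of the sketch to be
  self-contained, and the helper name is made up. ]

struct controller;

struct container {
        struct container *my_mem_root_container; /* NULL if this container owns it */
        struct controller *my_mem_controller;    /* set only on the owning container */
};

struct task_struct_sketch {                       /* stand-in for task_struct */
        struct container *my_container;
};

static struct controller *mem_controller_of(struct task_struct_sketch *tsk)
{
        struct container *c = tsk->my_container;

        /* walk to the container that actually owns the memory controller */
        if (c->my_mem_root_container)
                c = c->my_mem_root_container;
        return c->my_mem_controller;
}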
Srivatsa Vaddagiri wrote:
>>> - Support limit (soft and/or hard depending on the resource
>>> type) in controllers. Guarantee feature could be indirectly
>>> met through limits.
I just thought I'd weigh in on this. As far as our usage pattern is
concerned, guarantees cannot be met via limits.
I want to give "x" cpu to container X, "y" cpu to container Y, and "z"
cpu to container Z.
If these are percentages, x+y+z must be less than 100.
However, if Y does not use its share of the cpu, I would like the
leftover cpu time to be made available to X and Z, in a ratio based on
their allocated weights.
With limits, I don't see how I can get the ability for containers to
make opportunistic use of cpu that becomes available.
I can see that with things like memory this could become tricky (How do
you free up memory that was allocated to X when Y decides that it really
wants it after all?) but for CPU I think it's a valid scenario.
Chris
On Wed, 2006-11-01 at 23:42 +0530, Srivatsa Vaddagiri wrote:
> On Wed, Nov 01, 2006 at 12:30:13PM +0300, Pavel Emelianov wrote:
<snip>
> > > - Support movement of all threads of a process from one group
> > > to another atomically?
> >
> > I propose such a solution: if a user asks to move /proc/<pid>
> > then move the whole task with threads.
> > If user asks to move /proc/<pid>/task/<tid> then move just
> > a single thread.
> >
> > What do you think?
>
> Isn't /proc/<pid> also listed in /proc/<pid>/task/<tid>?
>
> For ex:
>
> # ls /proc/2906/task
> 2906 2907 2908 2909
>
> 2906 is the main thread which created the remaining threads.
>
> This would lead to an ambiguity when user does something like below:
>
> echo 2906 > /some_res_file_system/some_new_group
>
> Is he intending to move just the main thread, 2906, to the new group or
> all the threads? It could be either.
>
> This needs some more thought ...
I thought the idea was to take in a proc path instead of a single
number. You could then distinguish between the whole thread group and
individual threads by parsing the string. You'd move a single thread if
you find both the tgid and the tid. If you only get a tgid you'd move
the whole thread group. So:
<pid> -> if it's a thread group leader move the whole
thread group, otherwise just move the thread
/proc/<tgid> -> move the whole thread group
/proc/<tgid>/task/<tid> -> move the thread
Alternatives that come to mind are:
1. Read a flag with the pid
2. Use a special file which expects only thread groups as input
Cheers,
-Matt Helsley
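[ For illustration only: the string-parsing rule sketched above might
  look roughly like this userspace-style snippet. The helper and enum
  names are hypothetical, and the "is it a thread group leader?" check
  for a bare pid is omitted since it needs information from the
  kernel. ]

#include <stdio.h>

enum move_scope { MOVE_THREAD, MOVE_THREAD_GROUP };

/* Decide what a string written to the grouping file refers to. */
static int parse_move_target(const char *buf, int *tgid, int *tid,
                             enum move_scope *scope)
{
        if (sscanf(buf, "/proc/%d/task/%d", tgid, tid) == 2) {
                *scope = MOVE_THREAD;           /* explicit tid given */
                return 0;
        }
        if (sscanf(buf, "/proc/%d", tgid) == 1) {
                *scope = MOVE_THREAD_GROUP;     /* whole thread group */
                return 0;
        }
        if (sscanf(buf, "%d", tgid) == 1) {
                /* bare pid: whole group if it is the group leader,
                   otherwise just that thread (check omitted here) */
                *scope = MOVE_THREAD_GROUP;
                return 0;
        }
        return -1;
}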
On Wed, 2006-11-01 at 01:53 -0800, David Rientjes wrote:
> On Wed, 1 Nov 2006, Pavel Emelianov wrote:
>
> > > - Interaction of resource controllers, containers and cpusets
> > > - Should we support, for instance, creation of resource
> > > groups/containers under a cpuset?
> > > - Should we have different groupings for different resources?
> >
> > I propose to discuss this question as this is the most important
> > now from my point of view.
> >
> > I believe this can be done, but can't imagine how to use this...
> >
>
> I think cpusets, as abstracted away from containers by Paul Menage, simply
> become a client of the container configfs. Cpusets would become more of a
> NUMA-type controller by default.
>
> Different groupings for different resources was already discussed. If we
> use the approach of a single-level "hierarchy" for process containers and
At least in my mental model the depth of the hierarchy has nothing to do
with different groupings for different resources. They are just separate
hierarchies and where they are mounted does not affect their behavior.
> then attach them each to a "node" of a controller, then the groupings have
> been achieved. It's possible to change the network controller of a
> container or move processes from container to container easily through the
> filesystem.
>
> > > - Support movement of all threads of a process from one group
> > > to another atomically?
> >
> > I propose such a solution: if a user asks to move /proc/<pid>
> > then move the whole task with threads.
> > If user asks to move /proc/<pid>/task/<tid> then move just
> > a single thread.
> >
> > What do you think?
>
> This seems to use my proposal of using procfs as an abstraction of process
> containers. I haven't looked at the implementation details, but it seems
> like the most appropriate place given what it currently supports.
I'm not so sure procfs is the right mechanism.
> Naturally it should be an atomic move but I don't think it's the most
> important detail in terms of efficiency because moving threads should not
> be such a frequent occurrence anyway. This raises the question of how
> forks are handled for processes with regard to the various controllers
> that could be implemented and whether they should all be descendants of the
> parent container by default or have the option of spawning a new
> controller altogether. This would be an attribute of controllers and
"spawning a new controller"?? Did you mean a new container?
> not containers, however.
>
> David
I don't follow. You seem to be mixing and separating the terms
"controller" and "container" and it doesn't fit with the uses of those
terms that I'm familiar with.
Cheers,
-Matt Helsley
Chris Friesen wrote:
> Srivatsa Vaddagiri wrote:
>
>>>> - Support limit (soft and/or hard depending on the resource
>>>> type) in controllers. Guarantee feature could be indirectly
>>>> met through limits.
>
> I just thought I'd weigh in on this. As far as our usage pattern is
> concerned, guarantees cannot be met via limits.
>
> I want to give "x" cpu to container X, "y" cpu to container Y, and "z"
> cpu to container Z.
>
> If these are percentages, x+y+z must be less than 100.
>
> However, if Y does not use its share of the cpu, I would like the
> leftover cpu time to be made available to X and Z, in a ratio based on
> their allocated weights.
>
> With limits, I don't see how I can get the ability for containers to
> make opportunistic use of cpu that becomes available.
This is basically how "cpuunits" in OpenVZ works. It is not limiting a
container in any way, just assigns some relative "units" to it, with sum
of all units across all containers equal to 100% CPU. Thus, if we have
cpuunits 10, 20, and 30 assigned to containers X, Y, and Z, and run some
CPU-intensive tasks in all the containers, X will be given
10/(10+20+30), or 20% of CPU time, Y -- 20/50, i.e. 40%, while Z gets
60%. Now, if Z is not using CPU, X will be given 33% and Y -- 66%. The
scheduler used is based on per-VE runqueues, is quite fair, and works
fine and fair for, say, uneven case of 3 containers on a 4 CPU box.
OpenVZ also has a "cpulimit" resource, which is, naturally, a hard limit
of CPU usage for a VE. Still, given the fact that cpuunits works just
fine, cpulimit is rarely needed -- it makes sense only in special scenarios
where you want to see how an app runs on a slow box, or in case of some
proprietary software licensed per CPU MHz, or something like that.
Looks like this is what you need, right?
> I can see that with things like memory this could become tricky (How
> do you free up memory that was allocated to X when Y decides that it
> really wants it after all?) but for CPU I think it's a valid scenario.
Yes, the CPU controller is quite different from other resource controllers.
Kir.
On 11/1/06, Srivatsa Vaddagiri <[email protected]> wrote:
>
> I suspect we can avoid maintaining separate hierarchies if not required.
>
> mkdir /dev/res_groups
> mount -t container -o cpu,mem,io none /dev/res_groups
> mkdir /dev/res_groups/A
> mkdir /dev/res_groups/B
>
> Directories A and B would now contain res ctl files associated with all
> resources (viz cpu, mem, io) and also a 'members' file listing the tasks
> belonging to those groups.
>
> Do you think the above mechanism is implementable? Even if it is, I don't know
> how complicated the implementation will get because of this requirement.
Yes, certainly implementable, and I don't think it would complicate
the code too much. I alluded to it as a possibility when I first sent
out my patches - I think my main issue with it was the fact that it
results in multiple container pointers per process at compile time,
which could be wasteful.
>
> This requirement that each process should be in exactly one process container
> is perhaps not good, since it will not give the flexibility to define groups
> unique to each resource (see my reply earlier to David Rientjes).
I saw your example, but can you give a concrete example of a situation
when you might want to do that?
For simplicity combined with flexibility, I think I still favour the
following model:
- all processes are a member of one container
- for each resource type, each container is either in the same
resource node as its parent or a freshly created child node of the parent
resource node (determined at container creation time)
This is a subset of my more complex model, but it's pretty easy to
understand from userspace and to implement in the kernel.
>
> > the child task is either entirely in the new resource limits or
> > entirely in the old limits - if userspace has to update several
> > hierarchies at once non-atomically then a freshly forked child could
> > end up with a mixture of resource nodes.
>
> If the user intended to have a common grouping hierarchy for all
> resources, then this movement of tasks can be "atomic" as far as the user is
> concerned, as per the above example:
>
> echo task_pid > /dev/res_groups/B/members
>
> should cause the task to transition to the new group in one shot?
>
Yes, if we took that model. But if someone does want to have
non-identical hierarchies, then in that model they're still forced
into a non-atomic update situation.
What objections do you have to David's suggestion that if you want some
processes in a container to be in one resource node and others in
another resource node, then you should just subdivide into two
containers, such that all processes in a container are in the same set
of resource nodes?
Paul
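[ A hypothetical sketch of the "array of generic container subsystems"
  idea mentioned earlier in the thread, which avoids a compile-time
  container pointer per subsystem in every task_struct; none of these
  names come from a posted patch. ]

#define MAX_CONTAINER_SUBSYS 8

struct container_subsys_state;          /* per-(container, subsystem) state */

struct container {
        struct container *parent;
        /* indexed by a subsystem id handed out at registration time */
        struct container_subsys_state *subsys[MAX_CONTAINER_SUBSYS];
};

/* each task then carries a single container pointer */
struct task_struct_sketch {             /* hypothetical stand-in */
        struct container *container;
};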
>
> So instead we make the requirement that only one container can be attached
> to any given controller. So if container A is attached to a disk I/O
> controller, for example, then it includes all processes. If D is attached
> to it instead, only p and q are affected by its constraints.
If by "controller" you mean "resource node" this looks on second
glance very similar in concept to the simplified approach I outlined
in my last email. Except that I'd still include a pointer from e.g. D
to the resource node for fast lookup.
Paul
On 11/1/06, Chris Friesen <[email protected]> wrote:
>
> I just thought I'd weigh in on this. As far as our usage pattern is
> concerned, guarantees cannot be met via limits.
>
> I want to give "x" cpu to container X, "y" cpu to container Y, and "z"
> cpu to container Z.
I agree that these are issues - but they don't really affect the
container framework directly.
The framework should be flexible enough to let controllers register
any control parameters (via the filesystem?) that they need, but it
shouldn't contain explicit concepts like guarantees and limits. Some
controllers won't even have this concept (cpusets doesn't really, for
instance, and containers don't have to be just to do with
quantitative resource control).
I sent out a patch a while ago that showed how ResGroups could be
turned into effectively a library on top of a generic container system
- so ResGroups controllers could write to the ResGroups interface, and
let the library handle setting up control parameters and parsing
limits and guarantees. I expect the same thing could be done for UBC.
Paul
On 11/1/06, Matt Helsley <[email protected]> wrote:
> On Wed, 2006-11-01 at 23:42 +0530, Srivatsa Vaddagiri wrote:
> > On Wed, Nov 01, 2006 at 12:30:13PM +0300, Pavel Emelianov wrote:
>
> <snip>
>
> > > > - Support movement of all threads of a process from one group
> > > > to another atomically?
> > >
> > > I propose such a solution: if a user asks to move /proc/<pid>
> > > then move the whole task with threads.
> > > If user asks to move /proc/<pid>/task/<tid> then move just
> > > a single thread.
> > >
> > > What do you think?
> >
> > Isn't /proc/<pid> also listed in /proc/<pid>/task/<tid>?
> >
> > For ex:
> >
> > # ls /proc/2906/task
> > 2906 2907 2908 2909
> >
> > 2906 is the main thread which created the remaining threads.
> >
> > This would lead to an ambiguity when user does something like below:
> >
> > echo 2906 > /some_res_file_system/some_new_group
> >
> > Is he intending to move just the main thread, 2906, to the new group or
> > all the threads? It could be either.
> >
> > This needs some more thought ...
>
> I thought the idea was to take in a proc path instead of a single
> number. You could then distinguish between the whole thread group and
> individual threads by parsing the string. You'd move a single thread if
> you find both the tgid and the tid. If you only get a tgid you'd move
> the whole thread group. So:
>
> <pid> -> if it's a thread group leader move the whole
> thread group, otherwise just move the thread
> /proc/<tgid> -> move the whole thread group
> /proc/<tgid>/task/<tid> -> move the thread
>
>
> Alternatives that come to mind are:
>
> 1. Read a flag with the pid
> 2. Use a special file which expects only thread groups as input
I think that having a "tasks" file and a "threads" file in each
container directory would be a clean way to handle it:
"tasks" : read/write complete process members
"threads" : read/write individual thread members
Paul
On 11/1/06, Paul Jackson <[email protected]> wrote:
>
> Essentially, if my understanding is correct, zone reclaim has tasks
> that are asking for memory first do some work towards keeping enough
> memory free, such as doing some work reclaiming slab memory and pushing
> swap and pushing dirty buffers to disk.
True, it would help with keeping the machine alive.
But when one task is allocating memory, it's still going to be pushing
out pages with random owners, rather than pushing out its own pages
when it hits its memory limit. That can negatively affect the
performance of other tasks, which is what we're trying to prevent.
You can't just say that the biggest user should get penalised. You
might want to use 75% of a machine for an important production server,
and have the remaining 25% available for random batch jobs - they
shouldn't be able to impact the production server.
Paul
On Wed, 2006-11-01 at 15:50 -0800, Paul Menage wrote:
> On 11/1/06, Matt Helsley <[email protected]> wrote:
> > On Wed, 2006-11-01 at 23:42 +0530, Srivatsa Vaddagiri wrote:
> > > On Wed, Nov 01, 2006 at 12:30:13PM +0300, Pavel Emelianov wrote:
> >
> > <snip>
> >
> > > > > - Support movement of all threads of a process from one group
> > > > > to another atomically?
> > > >
> > > > I propose such a solution: if a user asks to move /proc/<pid>
> > > > then move the whole task with threads.
> > > > If user asks to move /proc/<pid>/task/<tid> then move just
> > > > a single thread.
> > > >
> > > > What do you think?
> > >
> > > Isn't /proc/<pid> also listed in /proc/<pid>/task/<tid>?
> > >
> > > For ex:
> > >
> > > # ls /proc/2906/task
> > > 2906 2907 2908 2909
> > >
> > > 2906 is the main thread which created the remaining threads.
> > >
> > > This would lead to an ambiguity when user does something like below:
> > >
> > > echo 2906 > /some_res_file_system/some_new_group
> > >
> > > Is he intending to move just the main thread, 2906, to the new group or
> > > all the threads? It could be either.
> > >
> > > This needs some more thought ...
> >
> > I thought the idea was to take in a proc path instead of a single
> > number. You could then distinguish between the whole thread group and
> > individual threads by parsing the string. You'd move a single thread if
> > you find both the tgid and the tid. If you only get a tgid you'd move
> > the whole thread group. So:
> >
> > <pid> -> if it's a thread group leader move the whole
> > thread group, otherwise just move the thread
> > /proc/<tgid> -> move the whole thread group
> > /proc/<tgid>/task/<tid> -> move the thread
> >
> >
> > Alternatives that come to mind are:
> >
> > 1. Read a flag with the pid
> > 2. Use a special file which expects only thread groups as input
>
> I think that having a "tasks" file and a "threads" file in each
> container directory would be a clean way to handle it:
>
> "tasks" : read/write complete process members
> "threads" : read/write individual thread members
>
> Paul
Seems like a good idea to me -- that certainly avoids complex parsing.
Cheers,
-Matt Helsley
On Thu, 2006-11-02 at 02:01 +0300, Kir Kolyshkin wrote:
> Chris Friesen wrote:
> > Srivatsa Vaddagiri wrote:
> >
> >>>> - Support limit (soft and/or hard depending on the resource
> >>>> type) in controllers. Guarantee feature could be indirectly
> >>>> met through limits.
> >
> > I just thought I'd weigh in on this. As far as our usage pattern is
> > concerned, guarantees cannot be met via limits.
> >
> > I want to give "x" cpu to container X, "y" cpu to container Y, and "z"
> > cpu to container Z.
> >
> > If these are percentages, x+y+z must be less than 100.
> >
> > However, if Y does not use its share of the cpu, I would like the
> > leftover cpu time to be made available to X and Z, in a ratio based on
> > their allocated weights.
> >
> > With limits, I don't see how I can get the ability for containers to
> > make opportunistic use of cpu that becomes available.
> This is basically how "cpuunits" in OpenVZ works. It is not limiting a
> container in any way, just assigns some relative "units" to it, with sum
> of all units across all containers equal to 100% CPU. Thus, if we have
So the user doesn't really specify a percentage but values that feed into
ratios used by the underlying controller? If so then it's not terribly
different from the "shares" of a single level of Resource Groups.
Resource Groups goes one step further and defines a denominator for
child groups to use. This allows the shares to be connected vertically
so that changes don't need to propagate beyond the parent and child
groups.
> cpuunits 10, 20, and 30 assigned to containers X, Y, and Z, and run some
> CPU-intensive tasks in all the containers, X will be given
> 10/(10+20+30), or 20% of CPU time, Y -- 20/50, i.e. 40%, while Z gets
nit: I don't think this math is correct.
Shouldn't they all have the same denominator (60), or am I
misunderstanding something?
If so then it should be:
X = 10/60 16.666...%
Y = 20/60 33.333...%
Z = 30/60 50.0%
Total: 100.0%
> 60%. Now, if Z is not using CPU, X will be given 33% and Y -- 66%. The
> scheduler used is based on a per-VE runqueues, is quite fair, and works
> fine and fair for, say, uneven case of 3 containers on a 4 CPU box.
<snip>
Cheers,
-Matt Helsley
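[ The corrected arithmetic above, as a tiny standalone example using
  the cpuunits values from Kir's illustration: each busy container
  gets units_i divided by the sum of units over the busy containers. ]

#include <stdio.h>

int main(void)
{
        const char *name[] = { "X", "Y", "Z" };
        int units[] = { 10, 20, 30 };   /* cpuunits from the example */
        int i, total = 0;

        for (i = 0; i < 3; i++)
                total += units[i];
        for (i = 0; i < 3; i++)
                printf("%s: %.1f%%\n", name[i], 100.0 * units[i] / total);

        /* prints X: 16.7%, Y: 33.3%, Z: 50.0%; if Z goes idle, the
           same ratio over the remaining units (10 + 20) gives X 33.3%
           and Y 66.7%, which is the opportunistic behaviour described */
        return 0;
}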
Paul M wrote:
> That can negatively affect the
> performance of other tasks, which is what we're trying to prevent.
That sounds like a worthwhile goal. I agree that zone_reclaim
doesn't do that.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Paul Menage wrote:
> The framework should be flexible enough to let controllers register
> any control parameters (via the filesystem?) that they need, but it
> shouldn't contain explicit concepts like guarantees and limits.
If the framework was able to handle arbitrary control parameters, that
would certainly be interesting.
Presumably there would be some way for the controllers to be called from
the framework to validate those parameters?
Chris
Matt Helsley wrote:
> On Wed, 2006-11-01 at 15:50 -0800, Paul Menage wrote:
>> On 11/1/06, Matt Helsley <[email protected]> wrote:
>>> On Wed, 2006-11-01 at 23:42 +0530, Srivatsa Vaddagiri wrote:
>>>> On Wed, Nov 01, 2006 at 12:30:13PM +0300, Pavel Emelianov wrote:
>>> <snip>
>>>
>>>>>> - Support movement of all threads of a process from one group
>>>>>> to another atomically?
>>>>> I propose such a solution: if a user asks to move /proc/<pid>
>>>>> then move the whole task with threads.
>>>>> If user asks to move /proc/<pid>/task/<tid> then move just
>>>>> a single thread.
>>>>>
>>>>> What do you think?
>>>> Isn't /proc/<pid> also listed in /proc/<pid>/task/<tid>?
>>>>
>>>> For ex:
>>>>
>>>> # ls /proc/2906/task
>>>> 2906 2907 2908 2909
>>>>
>>>> 2906 is the main thread which created the remaining threads.
>>>>
>>>> This would lead to an ambiguity when user does something like below:
>>>>
>>>> echo 2906 > /some_res_file_system/some_new_group
>>>>
>>>> Is he intending to move just the main thread, 2906, to the new group or
>>>> all the threads? It could be either.
>>>>
>>>> This needs some more thought ...
>>> I thought the idea was to take in a proc path instead of a single
>>> number. You could then distinguish between the whole thread group and
>>> individual threads by parsing the string. You'd move a single thread if
>>> you find both the tgid and the tid. If you only get a tgid you'd move
>>> the whole thread group. So:
>>>
>>> <pid> -> if it's a thread group leader move the whole
>>> thread group, otherwise just move the thread
>>> /proc/<tgid> -> move the whole thread group
>>> /proc/<tgid>/task/<tid> -> move the thread
>>>
>>>
>>> Alternatives that come to mind are:
>>>
>>> 1. Read a flag with the pid
>>> 2. Use a special file which expects only thread groups as input
>> I think that having a "tasks" file and a "threads" file in each
>> container directory would be a clean way to handle it:
>>
>> "tasks" : read/write complete process members
>> "threads" : read/write individual thread members
>>
>> Paul
>
> Seems like a good idea to me -- that certainly avoids complex parsing.
>
> Cheers,
> -Matt Helsley
>
Yeah, sounds like a good idea. We need to give the controllers some control
over whether they support task movement, thread movement or both.
--
Balbir Singh,
Linux Technology Center,
IBM Software Labs
On 11/1/06, Chris Friesen <[email protected]> wrote:
> Paul Menage wrote:
>
> > The framework should be flexible enough to let controllers register
> > any control parameters (via the filesystem?) that they need, but it
> > shouldn't contain explicit concepts like guarantees and limits.
>
> If the framework was able to handle arbitrary control parameters, that
> would certainly be interesting.
>
> Presumably there would be some way for the controllers to be called from
> the framework to validate those parameters?
The approach that I had in mind was that each controller could
register whatever control files it wanted, which would appear in the
filesystem directories for each container; reads and writes on those
files would invoke handlers in the controller. The framework wouldn't
care about the semantics of those control files. See the containers
patch that I posted last month for some examples of this.
Paul
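From userspace, that would mean each controller's files simply appear
alongside the generic ones in every container directory; something
like the following, where the directory, the file names and the 512M
value are invented for illustration and not taken from the posted patch:
# ls /containers/students
cpuset.cpus  cpuset.mems  memory.limit  tasks
# echo 512M > /containers/students/memory.limit    (handled by the memory controller's write handler)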
Matt Helsley wrote:
> On Thu, 2006-11-02 at 02:01 +0300, Kir Kolyshkin wrote:
>
>> cpuunits 10, 20, and 30 assigned to containers X, Y, and Z, and run some
>> CPU-intensive tasks in all the containers, X will be given
>> 10/(10+20+30), or 20% of CPU time, Y -- 20/50, i.e. 40%, while Z gets
>>
>
> nit: I don't think this math is correct.
>
> Shouldn't they all have the same denominator (60), or am I
> misunderstanding something?
>
> If so then it should be:
> X = 10/60 16.666...%
> Y = 20/60 33.333...%
> Z = 30/60 50.0%
> Total: 100.0%
>
Ughm. You are totally correct of course; I must've been very tired
last night :-\
Srivatsa Vaddagiri wrote:
> On Wed, Nov 01, 2006 at 11:01:31AM +0300, Pavel Emelianov wrote:
>>> Sorry dont get you here. Are you saying we should support different
>>> grouping for different controllers?
>> Not me, but other people in this thread.
>
> Hmm ..I thought OpenVz folks were interested in having different
> groupings for different resources i.e grouping for CPU should be
> independent of the grouping for memory.
>
> http://lkml.org/lkml/2006/8/18/98
>
> Isnt that true?
That's true. We don't mind having different groupings for
different resources. But what I was saying in this thread is
"I didn't *propose* this thing, I just *agreed* that this
might be useful for someone."
So if we're going to have different groupings for different
resources, what's the use of a "container" grouping all "controllers"
together? I see this situation as each task_struct carrying
pointers to a kmemsize controller, private pages controller,
physical pages controller, CPU time controller, disk bandwidth
controller, etc. Right? Or did I miss something?
>> I believe this can be done, but can't imagine how to use this...
>
> As I mentioned in my earlier mail, I thought openvz folks did want this
> flexibility:
>
> http://lkml.org/lkml/2006/8/18/98
>
> Also:
>
> http://lwn.net/Articles/94573/
>
> But I am ok if we dont support this feature in the initial round of
> development.
Yes. Let's start with that - no separate groupings for a while.
BTW I think that hierarchy is a good (and easier to implement)
replacement for separate grouping. Say I want two groups to
have separate CPU shares and a common kmemsize; this is the same as
wanting one group for kmemsize with two kids - one for X% of
CPU share and the other for Y%. And this (hierarchy) provides
more flexibility than "plain", albeit separate, grouping.
Moreover configfs can provide a clean interface for it. E.g.
$ mkdir /configfs/beancounters/0
$ mkdir /configfs/beancounters/0/1
$ mkdir /configfs/beancounters/0/2
and each task_struct will have a single pointer - to its current
container - rather than ten, one for each controller.
What do you think?
> Having grouping for different resources could be a hairy to deal
> with and could easily mess up applications (for ex: a process in a 80%
That's it... One more thing against separate grouping.
[snip]
> Isnt /proc/<pid> listed also in /proc/<pid>/task/<tid>?
>
> For ex:
>
> # ls /proc/2906/task
> 2906 2907 2908 2909
>
> 2906 is the main thread which created the remaining threads.
>
> This would lead to an ambiguity when user does something like below:
>
> echo 2906 > /some_res_file_system/some_new_group
>
> Is he intending to move just the main thread, 2906, to the new group or
> all the threads? It could be either.
>
> This needs some more thought ...
I agree with Paul Menage that having
/configfs/beancounters/<id>/tasks and /.../threads is perfect.
[snip]
> I think that having a "tasks" file and a "threads" file in each
> container directory would be a clean way to handle it:
>
> "tasks" : read/write complete process members
> "threads" : read/write individual thread members
I've just thought of it.
Beancounter may have more than 409 tasks, while configfs
doesn't allow attributes to store more than PAGE_SIZE bytes
on read. So how would you fill so many tasks in one page?
I like the idea of writing pids/tids to these files, but
printing them back is not that easy.
>
> Paul
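(The 409 figure presumably comes from the single-page buffer: with a
4096-byte page and up to ten bytes for each newline-terminated decimal
pid, roughly 4096 / 10 ≈ 409 entries fit in one read.)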
On Thu, 2006-11-02 at 12:08 +0300, Pavel Emelianov wrote:
> [snip]
>
> > I think that having a "tasks" file and a "threads" file in each
> > container directory would be a clean way to handle it:
> >
> > "tasks" : read/write complete process members
> > "threads" : read/write individual thread members
>
> I've just thought of it.
>
> Beancounter may have more than 409 tasks, while configfs
> doesn't allow attributes to store more than PAGE_SIZE bytes
> on read. So how would you fill so many tasks in one page?
To be clear that's a limitation of configfs as an interface. In the
Resource Groups code, for example, there is no hard limitation on length
of the underlying list. This is why we're talking about a filesystem
interface and not necessarily a configfs interface.
> I like the idea of writing pids/tids to these files, but
> printing them back is not that easy.
That depends on how you do it. For instance, if you don't have an
explicit list of tasks in the group (rough cost: 1 list head per task)
then yes, it could be difficult.
Cheers,
-Matt Helsley
Matt Helsley wrote:
> On Thu, 2006-11-02 at 12:08 +0300, Pavel Emelianov wrote:
>> [snip]
>>
>>> I think that having a "tasks" file and a "threads" file in each
>>> container directory would be a clean way to handle it:
>>>
>>> "tasks" : read/write complete process members
>>> "threads" : read/write individual thread members
>> I've just thought of it.
>>
>> Beancounter may have more than 409 tasks, while configfs
>> doesn't allow attributes to store more than PAGE_SIZE bytes
>> on read. So how would you fill so many tasks in one page?
>
> To be clear that's a limitation of configfs as an interface. In the
> Resource Groups code, for example, there is no hard limitation on length
> of the underlying list. This is why we're talking about a filesystem
> interface and not necessarily a configfs interface.
David Rientjes persuaded me that writing our own file system would be
reimplementing an existing thing. If we've agreed on a filesystem
interface then configfs may be used. But the limitations I've
pointed out must be discussed.
Let me recap:
1. the limit on the amount of data that can be read out of configfs;
2. when configfs is built as a module, users won't be able to
use beancounters.
and one new one:
3. beancounters currently have a /proc/user_beancounters
file that shows the complete statistics on BCs. This
includes all the beancounters in the system with all the
resources' held/maxheld/failcounters/etc. This is very
handy and vivid: a simple 'cat' shows you all you
need. With configfs we lack this very handy feature.
>> I like the idea of writing pids/tids to these files, but
>> printing them back is not that easy.
>
> That depends on how you do it. For instance, if you don't have an
> explicit list of tasks in the group (rough cost: 1 list head per task)
> then yes, it could be difficult.
I propose not to keep a list of tasks associated with a beancounter
(what for?) but to extend /proc/<pid>/status with a 'bcid: <id>' field.
The /configfs/beancounters/<id>/(tasks|threads) files would then be
write-only.
What do you think?
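Under that scheme, membership would be queried per task rather than
listed per group. A purely illustrative session (the 'bcid' field, the
id 200 and the configfs paths are assumptions, not part of any posted
patch):
# grep bcid /proc/2906/status
bcid:   200
# echo 2906 > /configfs/beancounters/200/tasks    (write-only; cannot be read back)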
On Thu, 2 Nov 2006, Pavel Emelianov wrote:
> >> Beancounter may have more than 409 tasks, while configfs
> >> doesn't allow attributes to store more than PAGE_SIZE bytes
> >> on read. So how would you fill so many tasks in one page?
> >
> > To be clear that's a limitation of configfs as an interface. In the
> > Resource Groups code, for example, there is no hard limitation on length
> > of the underlying list. This is why we're talking about a filesystem
> > interface and not necessarily a configfs interface.
>
> David Rientjes persuaded me that writing our own file system is
> reimplementing the existing thing. If we've agreed with file system
> interface then configfs may be used. But the limitations I've
> pointed out must be discussed.
>
What are we really discussing here? The original issue that you raised
with the infrastructure was an fs vs. syscall interface, and I simply
argued in favor of an fs-based approach because containers are inherently
hierarchical. As Paul Jackson mentioned, this is one of the advantages
that cpusets has had since its inclusion in the kernel, and the abstraction
of cpusets from containers makes a convincing case for how beneficial it
has been and will continue to be.
Whether configfs is specifically used for this particular
purpose is irrelevant in deciding fs vs syscall. Certainly it could be
used for lightweight purposes, but it is by no means the only possibility
for containers. I have observed no further advocacy for a syscall
interface; it seems like a no-brainer that if the configfs limitations
you have pointed out would be disadvantageous to containers, another
fs implementation would suffice.
> Let me remind:
> 1. limitation of size of data written out of configfs;
> 2. when configfs is a module user won't be able to
> use beancounters.
>
> and one new
> 3. now in beancounters we have /proc/user_beancounters
> file that shows the complete statistics on BC. This
> includes all then beancounters in the system with all
> resources' held/maxheld/failcounters/etc. This is very
> handy and "vividly": a simple 'cat' shows you all you
> need. With configfs we lack this very handy feature.
>
Ok, so each of these issues is a specific criticism of configfs
for containers. So a different fs-based interface, similar to the cpuset
abstraction from containers, is certainly appropriate.
David
On Thu, 2 Nov 2006, Pavel Emelianov wrote:
> So if we're going to have different groupings for different
> resources what's the use of "container" grouping all "controllers"
> together? I see this situation like each task_struct carries
> pointers to kmemsize controller, pivate pages controller,
> physical pages controller, CPU time controller, disk bandwidth
> controller, etc. Right? Or did I miss something?
My understanding is that the only addition to the task_struct is a pointer
to the struct container it belongs to. Then, the various controllers can
register their control files through the fs-based container interface and
all the manipulation can be done at that level. Having each task_struct
contain pointers to individual resource nodes was never proposed.
David
On Wed, Nov 01, 2006 at 03:37:12PM -0800, Paul Menage wrote:
> I saw your example, but can you give a concrete example of a situation
> when you might want to do that?
Paul,
Firstly, after some more thought on this, we can use your current
proposal, if it makes the implementation simpler.
Secondly, regarding how separate grouping per resource *may be* useful,
consider this scenario.
A large university server has various users - students, professors,
system tasks etc. The resource planning for this server could be along these lines:
CPU : Top cpuset
/ \
CPUSet1 CPUSet2
| |
(Profs) (Students)
In addition (system tasks) are attached to topcpuset (so
that they can run anywhere) with a limit of 20%
Memory : Professors (50%), students (30%), system (20%)
Disk : Prof (50%), students (30%), system (20%)
Network : WWW browsing (20%), Network File System (60%), others (20%)
/ \
Prof (15%) students (5%)
Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go
into NFS network class.
At the same time firefox/lynx will share an appropriate CPU/Memory class
depending on who launched it (prof/student).
If we had the ability to write pids directly to these resource classes,
then the admin can easily set up a script which receives exec notifications
and, depending on who is launching the browser, do:
# echo browser_pid > approp_resource_class
With your proposal, he would now have to create a separate container for
every browser launched and associate it with the appropriate network and
other resource classes. This may lead to a proliferation of such containers.
Also, let's say that the administrator would like to temporarily give
enhanced network access to a student's browser (since it is night and the
user wants to do online gaming :) OR give one of the students' simulation
apps enhanced CPU power.
With the ability to write pids directly to resource classes, it's just a
matter of:
# echo pid > new_cpu/network_class
(after some time)
# echo pid > old_cpu/network_class
Without this ability, he will have to split the container into a
separate one and then associate the container with the new resource
classes.
So yes, the end result is perhaps achievable either way; the big
difference I see is the ease of use.
> For simplicity combined with flexibility, I think I still favour the
> following model:
>
> - all processes are a member of one container
> - for each resource type, each container is either in the same
> resource node as its parent or a freshly child node of the parent
> resource node (determined at container creation time)
>
> This is a subset of my more complex model, but it's pretty easy to
> understand from userspace and to implement in the kernel.
If this model makes the implementation simpler, then I am for it, until
we have gained better insight on its use.
> What objections do you have to David's suggestion hat if you want some
> processes in a container to be in one resource node and others in
> another resource node, then you should just subdivide into two
> containers, such that all processes in a container are in the same set
> of resource nodes?
One observation is the ease of use (as some of the examples above
point out). The other is that it could lead to more containers than
necessary.
--
Regards,
vatsa
On 11/6/06, Srivatsa Vaddagiri <[email protected]> wrote:
> On Wed, Nov 01, 2006 at 03:37:12PM -0800, Paul Menage wrote:
> > I saw your example, but can you give a concrete example of a situation
> > when you might want to do that?
>
> Paul,
> Firstly, after some more thought on this, we can use your current
> proposal, if it makes the implementation simpler.
It does, but I'm more in favour of getting the abstractions right the
first time if we can, rather than implementation simplicity.
>
> Secondly, regarding how separate grouping per-resource *maybe* usefull,
> consider this scenario.
>
> A large university server has various users - students, professors,
> system tasks etc. The resource planning for this server could be on these lines:
>
> CPU : Top cpuset
> / \
> CPUSet1 CPUSet2
> | |
> (Profs) (Students)
>
> In addition (system tasks) are attached to topcpuset (so
> that they can run anywhere) with a limit of 20%
>
> Memory : Professors (50%), students (30%), system (20%)
>
> Disk : Prof (50%), students (30%), system (20%)
>
> Network : WWW browsing (20%), Network File System (60%), others (20%)
> / \
> Prof (15%) students (5%)
>
> Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go
> into NFS network class.
>
> At the same time firefox/lynx will share an appropriate CPU/Memory class
> depending on who launched it (prof/student).
>
> If we had the ability to write pids directly to these resource classes,
> then admin can easily setup a script which receives exec notifications
> and depending on who is launching the browser he can
>
> # echo browser_pid > approp_resource_class
>
> With your proposal, he now would have to create a separate container for
> every browser launched and associate it with approp network and other
> resource class. This may lead to proliferation of such containers.
Or create one container per combination (so in this case four,
prof/www, prof/other, student/www, student/other) - then processes can
be moved between the containers to get the appropriate qos of each
type.
So the setup would look something like:
top-level: prof vs student vs system, with new child nodes for cpu,
memory and disk, and no new node for network
second-level, within the prof and student classes: www vs other, with
new child nodes for network, and no new child nodes for cpu.
In terms of the commands to set it up, it might look like (from the top-level)
echo network > inherit
mkdir prof student system
echo disk,cpu,memory > prof/inherit
mkdir prof/www prof/other
echo disk,cpu,memory > student/inherit
mkdir student/www student/other
> Also lets say that the administrator would like to give enhanced network
> access temporarily to a student's browser (since it is night and the user
> wants to do online gaming :) OR give one of the students simulation
> apps enhanced CPU power,
>
> With ability to write pids directly to resource classes, its just a
> matter of :
>
> # echo pid > new_cpu/network_class
> (after some time)
> # echo pid > old_cpu/network_class
>
> Without this ability, he will have to split the container into a
> separate one and then associate the container with the new resource
> classes.
In practice though, do you think the admin would really want to
have to move individual processes around by hand? Sure, it's possible,
but wouldn't it make more sense to just give the entire student/www
class more network bandwidth? Or more generically, how often are
people going to be needing to move individual processes from one QoS
class to another, rather than changing the QoS for the existing class?
Paul
On Mon, Nov 06, 2006 at 12:23:44PM -0800, Paul Menage wrote:
> In practice though, do you think the admin would really want to be
> have to move individual processes around by hand? Sure, it's possible,
> but wouldn't it make more sense to just give the entire student/www
> class more network bandwidth? Or more generically, how often are
Wouldn't that cause -all- browsers to get enhanced network access? This
is when the intention was to give one particular student's browser
enhanced network access (to do online gaming) while retaining its
existing cpu/mem/io limits, or another particular student's simulation app
enhanced CPU access while retaining its existing mem/io limits.
> people going to be needing to move individual processes from one QoS
> class to another, rather than changing the QoS for the existing class?
If we are talking of tasks moving from one QoS class to another, then it
can be pretty frequent in the case of threaded databases and webservers.
I have been told that, at least in the case of databases, depending on the
workload, tasks may migrate from one group to another on every request.
In general, the duration of requests falls within the milliseconds-to-seconds
range. So, IMO, the design should support frequent task migration.
Also, the requirement to tune individual resource availability for
specific apps/processes (e.g. boost their CPU usage but retain other existing
limits) may not be unrealistic.
--
Regards,
vatsa
Paul M. wrote:
> It does, but I'm more in favour of getting the abstractions right the
> first time if we can, rather than implementation simplicity.
Yup.
The CONFIG_CPUSETS_LEGACY_API config option is still sticking in my
craw. Binding things at mount time, as you did, seems more useful.
Srivatsa wrote:
> Secondly, regarding how separate grouping per-resource *maybe* usefull,
> consider this scenario.
Yeah - I tend to agree that we should allow for such possibilities.
I see the following usage patterns -- I wonder if we can find a way to
provide for all of them. I will speak in terms of just cpusets and
resource groups, as exemplars of the variety of controllers that might
make good use of Paul M's containers:
Could we (Paul M?) find a way to build a single kernel that supports:
1) Someone just using cpusets wants to do:
mount -t cpuset cpuset /dev/cpuset
and then see the existing cpuset API. Perhaps other files show
up in the cpuset directories, but at least all the existing
ones provided by the current cpuset API, with their existing
behaviours, are all there.
2) Someone wanting a good CKRM/ResourceGroup interface, doing
whatever those fine folks are wont to do, binding some other
resource group controller to a container hierarchy.
3) Someone, in the future, wanting to "bind" cpusets and resource
groups together, with a single container based name hierarchy
of sets of tasks, providing both the cpuset and resource group
control mechanisms. Code written for (1) or (2) should work,
though there is a little wiggle room for API 'refinements' if
need be.
4) Someone doing (1) and (2) separately and independently on the
same system at the same time, with separate and independent
partitions (aka container hierarchies) of that system's tasks.
If we found usage pattern (4) too difficult to provide cleanly, I might
be willing to drop that one. I'm not sure yet.
Intuitively, I find (3) very attractive, though I don't have any actual
customer requirements for it in hand (we are operating a little past
our customers' awareness in this present discussion.)
The initial customer needs are for (1), which preserves an existing
kernel API, and on separate systems, for (2). Providing for both on
the same system, as in (3) with a single container hierarchy or even
(4) with multiple independent hierarchies, is an enhancement.
I foresee a day when user-level software, such as batch schedulers, is
written to take advantage of (3), once the kernel supports binding
multiple controllers to a common task container hierarchy. Initially,
some systems will need cpusets, and some will need resource groups, and
the intersection of these requiring both, whether bound as in (3) or
independent as in (4), will be pretty much empty.
In general then, we will have several controllers (need a good way
for user space to list what controllers, such as cpusets and resource
groups, are available on a live system!) and user space code should be
able to create at least one, if not multiple as in (4) above, container
hierarchies, each bound to one or more of these controllers.
Likely some, if not all, controllers will be singular - at most one such
controller at a given time on a system. Though if someone has a really
big brain, and wants to generalize that constraint, that could be
amusing. I guess I could have added a (5) above - allow for multiple
instances of a given controller, each bound to different container
hierarchies. But I'm guessing that is too hard, and not worth the
effort, so I didn't list it.
The notify_on_release mechanism should be elaborated, so that when
multiple controllers (e.g. cpusets and resource groups) are bound to
a common container hierarchy, user space code can (using API's that
don't exist currently) separately control these exit hooks for each of
these bound controllers. Perhaps simply enabling 'notify_on_release'
for a container invokes the exit hooks (user space callbacks) for -all-
the controllers bound to that container, whereas some new API's enable
picking and choosing which controllers' exit hooks are active. For
example, there might be a per-cpuset boolean flag file called
'cpuset_notify_on_release', for controlling that exit hook, separately
from any other exit hooks, and a 'cpuset_notify_on_release_path' file
for setting the path of the executable to invoke on release.
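As a sketch of how that might be driven from userspace (the
/dev/cpuset/students directory and the release-agent path are made up;
the two file names are the ones suggested above):
# echo 1 > /dev/cpuset/students/cpuset_notify_on_release
# echo /sbin/cpuset_release_agent > /dev/cpuset/students/cpuset_notify_on_release_path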
I would expect one kernel build CONFIG option for each controller type.
If any one or more of these controller options was enabled, then you
get containers in your build too - no option about it. I guess that
means that we have a CONFIG option for containers, to mark that code as
conditionally compiled, but that this container CONFIG option is
automatically set iff one or more controllers are included in the build.
Perhaps the interface to binding multiple controllers to a single container
hierarchy is via multiple mount commands, each of type 'container', with
different options specifying which controller(s) to bind. Then the
command 'mount -t cpuset cpuset /dev/cpuset' gets remapped to the command
'mount -t container -o controller=cpuset /dev/cpuset'.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On 11/7/06, Paul Jackson <[email protected]> wrote:
>
> I see the following usage patterns -- I wonder if we can see a way to
> provide for all these. I will speak in terms of just cpusets and
> resource groups, as examplars of the variety of controllers that might
> make good use of Paul M's containers:
>
> Could we (Paul M?) find a way to build a single kernel that supports:
>
> 1) Someone just using cpusets wants to do:
> mount -t cpuset cpuset /dev/cpuset
> and then see the existing cpuset API. Perhaps other files show
> up in the cpuset directories, but at least all the existing
> ones provided by the current cpuset API, with their existing
> behaviours, are all there.
This will happen if you configure CONFIG_CPUSETS_LEGACY_API
>
> 2) Someone wanting a good CKRM/ResourceGroup interface, doing
> whatever those fine folks are wont to do, binding some other
> resource group controller to a container hierarchy.
This works now.
>
> 3) Someone, in the future, wanting to "bind" cpusets and resource
> groups together, with a single container based name hierarchy
> of sets of tasks, providing both the cpuset and resource group
> control mechanisms. Code written for (1) or (2) should work,
> though there is a little wiggle room for API 'refinements' if
> need be.
That works now.
>
> 4) Someone doing (1) and (2) separately and independently on the
> same system at the same time, with separate and independent
> partitions (aka container hierarchies) of that systems tasks.
Right now you can't have multiple independent hierarchies - each
subsystem either has the same hierarchy as all the other subsystems,
or has just a single node and doesn't participate in the hierarchy.
>
> The initial customer needs are for (1), which preserves an existing
> kernel API, and on separate systems, for (2). Providing for both on
> the same system, as in (3) with a single container hierarchy or even
> (4) with multiple independent hierarchies, is an enhancement.
>
> I forsee a day when user level software, such as batch schedulers, are
> written to take advantage of (3), once the kernel supports binding
> multiple controllers to a common task container hierarchy. Initially,
> some systems will need cpusets, and some will need resource groups, and
> the intersection of these requiring both, whether bound as in (3), or
> independent as in (4), will be pretty much empty.
I don't know about group (4), but we certainly have a big need for (3).
>
> In general then, we will have several controllers (need a good way
> for user space to list what controllers, such as cpusets and resource
> groups,
I think it's better to treat resource groups as a common framework for
resource controllers, rather than a resource controller itself.
Otherwise we'll have the same issues of wanting to treat separate
resources in separate hierarchies - by treating each RG controller as
a separate entity sharing a common resource metaphor and user API,
you get the multiple hierarchy support for free.
>
> Perhaps the interface to binding multiple controllers to a single container
> hierarchy is via multiple mount commands, each of type 'container', with
> different options specifying which controller(s) to bind. Then the
> command 'mount -t cpuset cpuset /dev/cpuset' gets remapped to the command
> 'mount -t container -o controller=cpuset /dev/cpuset'.
Yes, that's the approach that I'm thinking of currently. It should
require reasonably robotic changes to the existing code.
One of the issues that crops up with it is what do you put in
/proc/<pid>/container if there are multiple hierarchies?
Paul
> This will happen if you configure CONFIG_CPUSETS_LEGACY_API
So why is this CONFIG_* option separate? When would I ever not
want it?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On 11/7/06, Paul Jackson <[email protected]> wrote:
> > This will happen if you configure CONFIG_CPUSETS_LEGACY_API
>
> So why is this CONFIG_* option separate? When would I ever not
> want it?
If you weren't bothered about having the legacy semantics. The main
issue is that it adds an extra file to /proc/<pid>. I guess the other
stuff could be made unconditional without breaking anyone who didn't
try to mount cpusetfs.
Paul
> > So why is this CONFIG_* option separate? When would I ever not
> > want it?
>
> If you weren't bothered about having the legacy semantics.
You mean if I wasn't bothered about -not- having the legacy semantics?
Let me put this another way - could you drop the
CONFIG_CPUSETS_LEGACY_API option, and make whatever is needed to
preserve the current cpuset API always present (if CPUSETS themselves
are configured, of course)?
If you're reluctant to do so, why?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On 11/7/06, Paul Jackson <[email protected]> wrote:
> > > So why is this CONFIG_* option separate? When would I ever not
> > > want it?
> >
> > If you weren't bothered about having the legacy semantics.
>
> You mean if I wasn't bothered about -not- having the legacy semantics?
>
> Let me put this another way - could you drop the
> CONFIG_CPUSETS_LEGACY_API option, and make whatever is needed to
> preserve the current cpuset API always present (if CPUSETS themselves
> are configured, of course)?
Yes.
>
> If you're reluctant to do so, why?
As I said, mainly /proc pollution.
But it's not a big deal, so I can drop it unless there's a strong
argument from others in favour of keeping it.
Paul
Paul M wrote:
> I think it's better to treat resource groups as a common framework for
> resource controllers, rather than a resource controller itself.
You could well be right here - I was just using resource groups
as another good example of a controller. I'll let others decide
if that's one or several controllers.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Paul M wrote:
> One of the issues that crops up with it is what do you put in
> /proc/<pid>/container if there are multiple hierarchies?
Thanks for your rapid responses - good.
How about /proc/<pid>/containers being a directory, with each
controller having one regular file entry (so long as we haven't done
the multiple controller instances in my item (5)) containing the path,
relative to some container file system mount point (which container
mount is up to user space code to track) of the container that contains
that task?
Or how about each controller type, such as cpusets, having its own
/proc/<pid>/<controller-type> file, with no generic file
/proc/<pid>/container at all. Just extend the current model
seen in /proc/<pid>/cpuset?
Actually, I rather like that last alternative - forcing the word
'container' into these /proc/<pid>/??? pathnames strikes me as
an exercise in branding, not in technical necessity. But that
could just mean I am still missing a big fat clue somewhere ...
Feel free to keep hitting me with clue sticks, as need be.
It will take a while (as in a year or two) for me and others to train
all the user level code that 'knows' that cpusets are always mounted at
"/dev/cpuset" to find the mount point for the container handling
cpusets anywhere else.
I knew when I hardcoded the "/dev/cpuset" path in various places
in user space that I might need to revisit that, but my crystal
ball wasn't good enough to predict what form this generalization
would take. So I followed one of my favorite maxims - if you can't
get it right, at least keep it stupid, simple, so that whomever does
have to fix it up has the least amount of legacy mechanism to rip out.
However this fits in nicely with my expectation that we will have
only limited need, if any, in the short term, to run systems with
both cpusets and resource groups at the same time. Systems just
needing cpusets can jolly well continue to mount at /dev/cpuset,
in perpetuity. Systems needing other or fancier combinations of
controllers will need to handle alternative mount points, and keep
track somehow in user space of what's mounted where.
And while we're here, how about each controller naming itself with a
well known string compiled into its kernel code, and a file such
as /proc/containers listing what controllers are known to it? Not
surprisingly, I claim the word "cpuset" to name the cpuset controller ;)
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On 11/7/06, Paul Jackson <[email protected]> wrote:
> How about /proc/<pid>/containers being a directory, with each
> controller having one regular file entry (so long as we haven't done
> the multiple controller instances in my item (5)) containing the path,
> relative to some container file system mount point (which container
> mount is up to user space code to track) of the container that contains
> that task?
Hmm. Seems a bit fancier than necessary, but maybe reasonable. I'll
probably start with a single file listing all the different container
associations and we can turn it into a directory later as a finishing
touch.
>
> Or how about each controller type, such as cpusets, having its own
> /proc/<pid>/<controller-type> file, with no generic file
> /proc</pid>/container at all. Just extend the current model
> seen in /proc/<pid>/cpuset ?
Is it possible to dynamically extend the /proc/<pid>/ directory? If
not, then every container subsystem would involve a patch in
fs/proc/base.c, which seems a bit nasty.
> However this fits in nicely with my expectation that we will have
> only limited need, if any, in the short term, to run systems with
> both cpusets and resource groups at the same time.
We're currently planning on using cpusets for the memory node
isolation properties, but we have a whole bunch of other resource
controllers that we'd like to be able to hang off the same
infrastructure, so I don't think the need is that limited.
>
> And while we're here, how about each controller naming itself with a
> well known string compiled into its kernel code, and a file such
> as /proc/containers listing what controllers are known to it? Not
The naming is already in my patch. You can tell from the top-level
directory which containers are registered, since each one has an
xxx_enabled file to control whether it's in use; there's not a
separate /proc/containers file yet.
Paul
Paul M wrote:
> Is it possible to dynamically extend the /proc/<pid>/ directory?
Not that I know of -- sounds like a nice idea for a patch.
> We're currently planning on using cpusets for the memory node
> isolation properties, but we have a whole bunch of other resource
> controllers that we'd like to be able to hang off the same
> infrastructure, so I don't think the need is that limited.
So long as you can update the code in your user space stack that
knows about this, then you should have nothing stopping you.
I've got a major (albeit not well publicized) open source user space
C library for working with cpusets which I will have to fix up.
> The naming is already in my patch. You can tell from the top-level
> directory which containers are registered, since each one has an
> xxx_enabled file to control whether it's in use;
But there are other *_enabled per-cpuset flags, not naming controllers,
so that is not a robust way to list container types.
Right now, I'm rather fond of the /proc/containers (or should it
be /proc/controllers?) idea. Though since I don't have time to code
the patch today, I'll have to shut up.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On 11/7/06, Paul Menage <[email protected]> wrote:
> > Perhaps the interface to binding multiple controllers to a single container
> > hierarchy is via multiple mount commands, each of type 'container', with
> > different options specifying which controller(s) to bind. Then the
> > command 'mount -t cpuset cpuset /dev/cpuset' gets remapped to the command
> > 'mount -t container -o controller=cpuset /dev/cpuset'.
>
> Yes, that's the aproach that I'm thinking of currently. It should
> require pretty reasonably robotic changes to the existing code.
One drawback to this that I can see is the following:
- suppose you mount a containerfs with controllers cpuset and cpu, and
create some nodes, and then unmount it, what happens? do the resource
nodes stick around still?
- suppose you then try to mount a containerfs with controllers cpuset
and diskio, but the resource nodes are still around, what happens
then?
Is there any way to prevent unmounting (at the dentry level) a busy filesystem?
If we enforced a completely separate hierarchy for each resource
controller (i.e. one resource controller per mount), then it wouldn't
be too hard to hang the node structure off the controller itself, and
there would never be a problem with mounting two controllers with
existing inconsistent hierarchies on the same mount. But that would
rule out being able to hook several resource controllers together in
the same container node.
One alternative to this would be to have a fixed number of container
hierarchies; at mount time you'd mount hierarchy N, and optionally
bind a set of resource controllers to it, or else use the existing
set. Then the hierarchy can be hung off the appropriate entry in the
hierarchy array even when the fs isn't mounted.
Paul
On Tue, 2006-11-07 at 12:02 -0800, Paul Jackson wrote:
> Paul M wrote:
> > I think it's better to treat resource groups as a common framework for
> > resource controllers, rather than a resource controller itself.
>
> You could well be right here - I was just using resource groups
> as another good example of a controller. I'll let others decide
> if that's one or several controllers.
At various stages different controllers were available with the core
patches or separately. The numtasks, cpu, io, socket accept queue, and
memory controllers were available for early CKRM patches. More recently
(April 2006) numtasks, cpu, and memory controllers were available for
Resource Groups.
So I'd say "several".
Cheers,
-Matt Helsley
Paul M wrote:
> One drawback to this that I can see is the following:
>
> - suppose you mount a containerfs with controllers cpuset and cpu, and
> create some nodes, and then unmount it, what happens? do the resource
> nodes stick around still?
Sorry - I let interfaces get confused with objects and operations.
Let me back up a step. I think I have beaten on your proposal enough
to translate it into the more abstract terms that I prefer using when
determining objects, operations and semantics.
It goes like this ... grab a cup of coffee.
We have a container mechanism (the code you extracted from cpusets)
and the potential for one or more instantiations of that mechanism
as container file systems.
This container mechanism provides a pseudo file system structure
that names not disk files, but partitions of the set of tasks
in the system. As always, by partition I mean a set of subsets,
covering and non-intersecting. Each element of a partition of the
set of tasks in a system is a subset of that systems tasks.
The container mechanism gives names and permissions to the elements
(subsets of tasks) of the partition, and provides a convenient place
to attach attributes to those partition elements. The directories in
such a container file system always map one-to-one with the elements of
the partition, and the regular files in each such directory represent
the per-element (per-cpuset, for example) attributes.
Each directory in a container file system has a file called 'tasks'
listing the pids of the tasks (newline separated decimal ASCII format)
in that partition element.
Each container file system needs a name. This corresponds to the
/dev/sda1 style raw device used to name disk based file systems
independent of where or if they are mounted.
Each task should list in its /proc/<pid> directory, for each such
named container file system in the system, the container file system
relative path of whichever directory in that container (element in
the partition it defines) that task belongs to. (An earlier proposal
I made to have an entry for each -controller- in each /proc/<pid>
directory was bogus.)
Because containers define a partition of the tasks in a system, each
task will always be in exactly one of the partition elements of a
container file system. Tasks are moved from one partition element
to another by writing their pid (decimal ASCII) into the 'tasks'
file of the receiving directory.
For some set of events, to include at least the 'release' of a
container element, the user can request that a callout be made to
a user executable. This carries forth a feature previously known
as 'notify_on_release.'
We have several controllers, each of which can be instantiated and
bound to a container file system. One of these controllers provides
for NUMA processor and memory placement control, and is called cpusets.
Perhaps in the future some controllers will support multiple instances,
bound to different container file systems, at the same time.
By different here, I meant not just different mounts of the same
container file system, but different partitions that divide up the
tasks of the system in different ways.
Each controller specifies a set of attributes to be associated with
each partition element of a container. The act of associating a
controller's attributes with partition elements I will call "binding".
We need to be able to create, destroy and list container file systems,
and for each such container file system, we need to be able to bind
and unbind controller instances thereto.
We need to be able to list what controller types exist in the system
capable of being bound to containers. We need to be able to list
for each container file system what controllers are bound to it.
And we need to be able to mount and unmount container file systems
from specific mount point paths in the file system.
We definitely need to be able to bind more than one controller to a
given container file system at the same time. This was my item (3)
in an earlier post today.
We might like to support multiple container file systems at one time.
This seems like a good idea to at least anticipate doing, even if it
turns out to be more work than we can deliver immediately. This was
my item (4) in that earlier post.
We will probably have some controllers in the future that are able
to be bound to more than one container file system at the same time,
and we have now, and likely will always have, some controllers, such
as cpusets, that must be singular - at most one bound instance at a
time in the system. This relates to my (buried) item (5) from that
earlier post. The container code may or may not be able to support
controllers that bind to more than one file system at a time; I don't
know yet either how valuable or difficult this would be.
Overloading all these operations on the mount/umount commands seems
cumbersome, obscure and confusing. The essential thing a mount does
is bind a kernel object (such as one of our container instances) to
a mount point (path) in the filesystem. By the way, we should allow
for the possibility that one container instance might be mounted on
multiple points at the same time.
So it seems we need additional API's to support the creation and
destruction of containers, and binding controllers to them.
All controllers define an initial default state, and all tasks
can reference, while in that task's context in the kernel, for any
controller type built into the system (or loadable module ?!), the
per-task state of that controller, getting at least this default state
even if the controller is not bound.
If a controller is not bound to any container file system, and
immediately after such a binding, before any of its per-container
attribute files have been modified via the container file system API,
the state of a controller as seen by a task will be this default state.
When a controller is unbound, then the state it presented to each
task in the system reverts to this default state.
Container file systems can be unmounted and remounted all the
while retaining their partitioning and any binding to controllers.
Unmounting a container file system just retracts the API mechanism
required to query and manipulate the partitioning and the state per
partition element of bound controllers.
A basic scenario exemplifying these operations might go like this
(notice I've still given no hint of some of the API's involved):
1) Given a system with controllers Con1, Con2 and Con3, list them.
2) List the currently defined container file systems, finding none.
3) Define a container file system CFS1.
4) Bind controller Con2 to CFS1.
5) Mount CFS1 on /dev/container.
6) Bind controller Con3 to CFS1.
7) List the currently defined container file systems, finding CFS1.
8) List the controllers bound to CFS1, finding Con2 and Con3.
9) Mount CFS1 on a second mount point, say /foo/bar/container.
This gives us two pathnames to refer to the same thing.
10) Refine and modify the partition defined by CFS1, by making
subdirectories and moving tasks about.
11) Define a second container file system - this might fail if our
implementation doesn't support multiple container file systems
at the same time yet. Call this CFS2.
12) Bind controller Con1 to CFS2. This should work.
13) Mount CFS2 on /baz.
14) Bind controller Con2 to CFS2. This may well fail if that
controller must be singular.
15) Unbind controller Con2 from CFS2. After this, any task referencing
its Con2 controller will find the minimal default state.
16) If (14) failed, try it again. We should be able to bind Con2 to
CFS2 now, if not earlier.
17) List the mount points in the system (cat /proc/mounts). Observe
two entries of type container, for CFS1 and CFS2.
18) List the controllers bound to CFS2, finding Con1 and Con2.
19) Unmount CFS2. Its structure remains, however one lacks any API to
observe this.
20) List the controllers bound to CFS2 - still Con1 and Con2.
21) Remount CFS2 on /bornagain.
22) Observe its structure and the binding of Con1 and Con2 to it remain.
23) Unmount CFS2 again.
24) Ask to delete CFS2 - this fails because it has controllers
bound to it.
25) Unbind controllers Con1 and Con2 from CFS2.
26) Ask to delete CFS2 - this succeeds this time.
27) List the currently defined container file systems, once again
finding just CFS1.
28) List the controllers bound to CFS1, finding just Con3.
29) Examine the regular files in the directory /dev/container, where
CFS1 is currently mounted. Find the files representing the
attributes of controller Con3.
If you indulged in enough coffee to stay awake through all that,
you noticed that I invented some rules on what would or would not
work in certain situations. For example, I decreed in (24) that one
could not delete a container file system if it had any controllers
bound to it. I just made these rules up ...
I find it usually works best if I turn the objects and operations
around in my head a bit, before inventing API's to realize them.
So I don't yet have any firmly jelled views on what the additional
API's to manipulate container file systems and controller binding
should look like.
Perhaps someone else will beat me to it.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On 11/7/06, Paul Jackson <[email protected]> wrote:
> Paul M wrote:
> > One drawback to this that I can see is the following:
> >
> > - suppose you mount a containerfs with controllers cpuset and cpu, and
> > create some nodes, and then unmount it, what happens? do the resource
> > nodes stick around still?
>
> Sorry - I let interfaces get confused with objects and operations.
>
> Let me back up a step. I think I have beat on your proposal enough
> to translate it into the more abstract terms that I prefer using when
> detemining objects, operations and semantics.
>
> It goes like this ... grab a cup of coffee.
>
That's pretty much what I was envisioning, except for the fact that I
was trying to fit the controller/container bindings into the same
mount/umount interface. I still think that might be possible with
judicious use of mount options, but if not we should probably just use
configfs or something like that as a binding API.
Paul
> That's pretty much what I was envisioning,
Good.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
On Tue, Nov 07, 2006 at 07:15:18PM -0800, Paul Jackson wrote:
> It goes like this ... grab a cup of coffee.
Thanks for the nice and big writeup!
> Each directory in a container file system has a file called 'tasks'
> listing the pids of the tasks (newline separated decimal ASCII format)
> in that partition element.
As was discussed in a previous thread, having a 'threads' file also will
be good.
http://lkml.org/lkml/2006/11/1/386
> Because containers define a partition of the tasks in a system, each
> task will always be in exactly one of the partition elements of a
> container file system. Tasks are moved from one partition element
> to another by writing their pid (decimal ASCII) into the 'tasks'
> file of the receiving directory.
Writing to 'tasks' file will move that single thread to the new
container. Writing to 'threads' file will move all the threads of the
process into the new container.
--
Regards,
vatsa
Srivatsa wrote:
> As was discussed in a previous thread, having a 'threads' file also will
> be good.
>
> http://lkml.org/lkml/2006/11/1/386
>
> Writing to 'tasks' file will move that single thread to the new
> container. Writing to 'threads' file will move all the threads of the
> process into the new container.
Yup - agreed.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Paul M wrote:
> except for the fact that I
> was trying to fit the controller/container bindings into the same
> mount/umount interface.
Of course, if you come up with an API using mount for this stuff
that looks nice and intuitive, don't hesitate to propose it.
I don't have any fundamental opposition to just using mount options
here; just a pretty strong guess that it won't be very intuitive
by the time all the necessary operations are encoded.
And this sort of abstractified pseudo meta containerized code is
just the sort of thing that drives normal humans up a wall, or
should I say, into a fog of confusion.
Not only is it worth a bit of work getting the abstractions right,
as you have noted, it's also worth a bit of work to try to get the
API as transparent as we are able.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
Srivatsa Vaddagiri wrote:
> As was discussed in a previous thread, having a 'threads' file also will
> be good.
>
> http://lkml.org/lkml/2006/11/1/386
> Writing to 'tasks' file will move that single thread to the new
> container. Writing to 'threads' file will move all the threads of the
> process into the new container.
That's exactly backwards to the proposal that you linked to.
Chris
On Wed, Nov 08, 2006 at 01:25:18PM -0600, Chris Friesen wrote:
> Srivatsa Vaddagiri wrote:
>
> > As was discussed in a previous thread, having a 'threads' file also will
> > be good.
> >
> > http://lkml.org/lkml/2006/11/1/386
>
> > Writing to 'tasks' file will move that single thread to the new
> > container. Writing to 'threads' file will move all the threads of the
> > process into the new container.
>
> That's exactly backwards to the proposal that you linked to.
Oops ..yes. Thanks for correcting me!
--
Regards,
vatsa
Paul Jackson wrote:
> Srivatsa wrote:
>> As was discussed in a previous thread, having a 'threads' file also will
>> be good.
>>
>> http://lkml.org/lkml/2006/11/1/386
>>
>> Writing to 'tasks' file will move that single thread to the new
>> container. Writing to 'threads' file will move all the threads of the
>> process into the new container.
>
> Yup - agreed.
>
Referring to the discussion at
http://lkml.org/lkml/2006/10/31/210
which led to
http://lkml.org/lkml/2006/11/1/101
If OpenVZ is ok with the notify_on_release approach, we can close in on
any further objections to the containers approach of implementing resource
grouping and be open to ideas for extending and enhancing it :)
--
Balbir Singh,
Linux Technology Center,
IBM Software Labs
On Mon, Nov 06, 2006 at 12:23:44PM -0800, Paul Menage wrote:
> > Secondly, regarding how separate grouping per-resource *maybe* usefull,
> > consider this scenario.
> >
> > A large university server has various users - students, professors,
> > system tasks etc. The resource planning for this server could be on these lines:
> >
> > CPU : Top cpuset
> > / \
> > CPUSet1 CPUSet2
> > | |
> > (Profs) (Students)
> >
> > In addition (system tasks) are attached to topcpuset (so
> > that they can run anywhere) with a limit of 20%
> >
> > Memory : Professors (50%), students (30%), system (20%)
> >
> > Disk : Prof (50%), students (30%), system (20%)
> >
> > Network : WWW browsing (20%), Network File System (60%), others (20%)
> > / \
> > Prof (15%) students (5%)
Let's say that the network resource controller supports only a one-level
hierarchy, and hence you can only split it as:
Network : WWW browsing (20%), Network File System (60%), others (20%)
> > Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go
> > into NFS network class.
> >
> > At the same time firefox/lynx will share an appropriate CPU/Memory class
> > depending on who launched it (prof/student).
> >
> > If we had the ability to write pids directly to these resource classes,
> > then admin can easily setup a script which receives exec notifications
> > and depending on who is launching the browser he can
> >
> > # echo browser_pid > approp_resource_class
> >
> > With your proposal, he now would have to create a separate container for
> > every browser launched and associate it with approp network and other
> > resource class. This may lead to proliferation of such containers.
>
> Or create one container per combination (so in this case four,
> prof/www, prof/other, student/www, student/other) - then processes can
> be moved between the containers to get the appropriate qos of each
> type.
>
> So the setup would look something like:
>
> top-level: prof vs student vs system, with new child nodes for cpu,
> memory and disk, and no new node for network
>
> second-level, within the prof and student classes: www vs other, with
> new child nodes for network, and no new child nodes for cpu.
>
> In terms of the commands to set it up, it might look like (from the top-level)
>
> echo network > inherit
> mkdir prof student system
> echo disk,cpu,memory > prof/inherit
> mkdir prof/www prof/other
> echo disk,cpu,memory > student/inherit
> mkdir student/www student/other
By these commands, we would forcibly split the WWW bandwidth of 20%
between prof/www and student/www, when it was actually not needed (as
per the new requirement above). This forced split may be fine for a renewable
resource like network bandwidth, but would be inconvenient for something like
RSS, disk quota etc.
(I thought of a scheme where you can avoid this forced split by
maintaining soft/hard links to resource nodes from the container nodes.
Essentially each resource can have its own hierarchy of resource nodes.
Each resource node provides allocation information like min/max shares.
Container nodes point to one or more such resource nodes, implemented
as soft/hard links. This will avoid the forced split I mentioned above.
But I suspect we will run into atomicity issues again when modifying the
container hierarchy).
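A rough sketch of that linking scheme, with entirely made-up paths (and
ignoring the atomicity concern just noted), might look like:
# mkdir -p /res/network/www /res/cpu/profs
# mkdir /containers/prof_browsers
# ln -s /res/network/www /containers/prof_browsers/network
# ln -s /res/cpu/profs /containers/prof_browsers/cpu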
Essentially, by restricting ourselves to a single hierarchy, we lose the
flexibility of "viewing" each resource's usage differently (network by traffic,
cpu by users etc).
Coming back to reality, I believe most workload management tools would be
fine living with this restriction. AFAIK containers can also use this
model without much loss of flexibility. But if you are considering long-term
user-interface stability, then this is something I would definitely
think hard about.
--
Regards,
vatsa