2005-09-08 05:39:13

by KUROSAWA Takahiro

Subject: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

The following patchset adds resource control functionality to the
CPUSETS code by subdividing a cpuset into smaller pieces (SUBCPUSETS).
It is made up of two parts, one for the resource control framework
and another for the specific resource controller. Resource controllers
can be added by using interfaces provided by the SUBCPUSETS.

For now, it enables us to guarantee the percentage of CPU time that
tasks attached to a subcpuset consume. The memory resource controller
code is coming soon.

The patchset is for linux-2.6.13 and consists of 5 patches:
[1/5] Fixes a minor problem in the current CPUSETS.
[2/5] Implements the resource control framework functionality in the CPUSETS
(SUBCPUSETS).
[3/5] Adds documentation for SUBCPUSETS.
[4/5] Implements the CPU time resource controller.
[5/5] Puts the CPU time resource controller into SUBCPUSETS.


The following steps might be useful to see how this patchset works:

# mount -t cpuset none /dev/cpuset
(mount the cpuset filesystem)
# /bin/echo 1 > /dev/cpuset/subcpuset_top
(declare that this cpuset is the toplevel directory of subcpusets;
of course, you can declare a cpuset other than the root cpuset
as the toplevel directory of subcpusets)
# mkdir /dev/cpuset/sub-a
# mkdir /dev/cpuset/sub-b
(create subcpuset directories)
# /bin/echo 20 > /dev/cpuset/sub-a/cpu_guar
(guarantee 20% of the CPU time usage to tasks in the "sub-a" subcpuset)
# /bin/echo 80 > /dev/cpuset/sub-b/cpu_guar
(guarantee 80% of the CPU time usage to tasks in the "sub-b" subcpuset)
# echo $$ > /dev/cpuset/sub-a/tasks
(change the cpuset of the shell to the "sub-a" subcpuset)
# (while true; do true; done)&
(start a CPU consuming process in sub-a subcpuset)
# echo $$ > /dev/cpuset/sub-b/tasks
(change the cpuset of the shell to the "sub-b" subcpuset)
# (while true; do true; done)&
(start a CPU consuming process in sub-b subcpuset)
# top
(see how subcpusets work; the busy loop in sub-a should get about
20% of the CPU time and the one in sub-b about 80%)


Thanks,

KUROSAWA, Takahiro


2005-09-08 07:23:47

by Paul Jackson

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

[ Adding Dinakar to the explicit cc list, since I mention his work.
Hopefully he can correct any misrepresentations of his work I
might have presented. - pj ]


I've just started reading this - it seems well presented and I think
you have put much effort into it. Thank-you for posting it.

I have not yet taken the time to understand it properly, but a couple
of questions come to mind offhand. I hope these questions will not be
too silly.

1) What is the relation of this patch set to CKRM?

2) Would a structure similar to Dinakar's patches to connect
cpusets and dynamic sched domains (posted to linux-mm)
work here as well?

Let me describe what I see so far, and hope you will be patient with my
confusions. Then perhaps I can better explain my question (2) above.

My initial understanding is that subcpusets provide a way to parcel
out the proportion of cpu and memory used by various tasks. A leaf
node cpuset can partition the tasks attached to it into subsets, called
subcpusets, where each subcpuset gets a proportion of cpu and memory
available to the original leaf node cpuset.

These original leaf node cpusets (the parents of the 'subcpusets') form
a non-overlapping and partial covering of the cpus and memory nodes in
a system. It is a _partial_ covering because not every leaf node
cpuset need be subdivided into 'subcpusets', nor does every cpu or
memory node need to be in such a cpuset.

I'm guessing you do not want such cpusets (the parents of subcpusets)
to overlap, because if they did, it would seem to confuse the meaning
of getting a fixed proportion of available cpu and memory resources. I
was a little surprised not to see any additional checks that
cpu_exclusive and mem_exclusive must be set true in these cpusets, to
ensure non-overlapping cpusets.

Dinakar's patches used cpu_exclusive cpusets to form a complete
non-overlapping covering of the cpus in a system (the cpus not
in cpu_exclusive cpusets got their own separate covers). Then each
such cover defined a different dynamic sched domain (a separately
scheduled set of tasks).

Your cover elements (your subcpusets) seem to be a somewhat new kind of
cpuset, that can only occur as children of original leaf node cpusets.

Instead of that, I might prefer doing as Dinakar did, and use a single
boolean flag per existing cpuset, to mark a subset of the cpusets that
form the covering. Dinakar got lucky, and was able to use an existing
cpuset boolean flag - cpu_exclusive, which had not been very useful
until then. You will have to add a new flag.

On the other hand, Dinakar had more work to do than you might, because
he needed a complete covering (so had to round up cpus in non exclusive
cpusets to form more covering elements). From what I can tell, you
don't need a complete covering - it seems fine if some cpus are not
managed by this resource control function.

I hope my above remarks are not too far off the mark and do not misrepresent
your work too painfully. And I hope they make at least a little sense.
Sometimes I have a strange way of describing my thoughts.

Thank-you again for this good work. I look forward to your further
considerations. I will make the time soon to read this patch more
closely.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-08 08:18:20

by KUROSAWA Takahiro

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

On Thu, 8 Sep 2005 00:23:23 -0700
Paul Jackson <[email protected]> wrote:

> I've just started reading this - it seems well presented and I think
> you have put much effort into it. Thank-you for posting it.

Thank you for reading my patches!

> I have not yet taken the time to understand it properly, but a couple
> of questions come to mind offhand. I hope these questions will not be
> too silly.
>
> 1) What is the relation of this patch set to CKRM?

My patches consist of two parts, one for the subcpuset framework
and another for the cpu resource controller.
The subcpuset framework code is not related to CKRM.
CKRM could probably use the cpu resource controller code as
one of its resource controllers, with some modification.

> 2) Would a structure similar to Dinakar's patches to connect
> cpusets and dynamic sched domains (posted to linux-mm)
> work here as well?

Yes, subcpusets could work with the dynamic sched domains.
The cpu resource controller divides cpu resources by scaling the
time_slice of the tasks, so multiple subcpusets can be created
under a cpuset that has only one cpu. Thus even uniprocessor
systems could benefit from subcpusets-patched cpusets.

> My initial understanding is that subcpusets provide a way to parcel
> out the proportion of cpu and memory used by various tasks. A leaf
> node cpuset can partition the tasks attached to it into subsets, called
> subcpusets, where each subcpuset gets a proportion of cpu and memory
> available to the original leaf node cpuset.

That's right.

> I'm guessing you do not want such cpusets (the parents of subcpusets)
> to overlap, because if they did, it would seem to confuse the meaning
> of getting a fixed proportion of available cpu and memory resources. I
> was a little surprised not to see any additional checks that
> cpu_exclusive and mem_exclusive must be set true in these cpusets, to
> ensure non-overlapping cpusets.

Right - I don't want the parents of the subcpusets to overlap and
I should add such checks, but wanted to know first what people would
think of subcpusets.

Regards,

KUROSAWA, Takahiro

2005-09-08 12:02:49

by Paul Jackson

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

Takahiro-san wrote, in reply to Paul:
> > 2) Would a structure similar to Dinakar's patches to connect
> > cpusets and dynamic sched domains (posted to linux-mm)
> > work here as well?
>
> Yes, subcpusets could work with the dynamic sched domains.

Ah - I see I was quite unclear about what I was thinking.

I intended to ask not simply would Dinakar's patches work with
subcpusets, but something more radical.

These subcpusets, if I understand correctly, are a bit different
from ordinary cpusets. For instance, it seems one cannot make child
cpusets of them, and one cannot change most of their properties,
such as cpus, mems, cpu_exclusive, mem_exclusive, or notify_on_release.

Also the mems and cpus of each subcpuset are required to be
exactly equal to the corresponding values of its parent.

One of my passions is to avoid special cases across API boundaries.

I am proposing that you don't do subcpusets like this.

Consider the following alternative I will call 'cpuset meters'.

For each resource named 'R' (cpu and mem, for instance):
* Add a boolean flag 'meter_R' to each cpuset. If set, that R is
metered, for the tasks in that cpuset or any descendent cpuset.
* If a cpuset is metered, then files named meter_R_guar, meter_R_lim
and meter_R_cur appear in that cpuset to manage R's usage by tasks
in that cpuset and descendents.
* There are no additional rules that restrict the ability to change
various other cpuset properties such as cpus, mems, cpu_exclusive,
mem_exclusive, or notify_on_release, when a cpuset is metered.
* It might be that some (or by design all) resource controllers do
not allow nesting metered cpusets. I don't know. But one should
(if one has permission) be able to make child cpusets of a metered
cpuset, just like one can of any other cpuset.
* A metered cpuset might well have cpus or mems that are not the
same as its parent, just like an unmetered cpuset ordinarily does.
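
For instance, with R = cpu, a metered cpuset's directory might then
look something like this (a hypothetical listing that just follows
the naming above):

    /dev/cpuset/foo/
        cpus  mems  tasks  cpu_exclusive  mem_exclusive  notify_on_release
        meter_cpu          (boolean flag: '0' or '1')
        meter_cpu_guar     (guaranteed share for tasks in foo and descendents)
        meter_cpu_lim      (limit on usage)
        meter_cpu_cur      (current usage)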

If I understand correctly, one could place a job that managed its
own child cpusets into a metered cpuset, but not into a subcpuset.
Even if you wanted to allow it, it seems jobs in subcpusets cannot
make child cpusets or modify the properties of their current cpuset.
Is that correct?

The above is just a fragment of an idea. I do not know if it is
a good idea or not. And I left off critical details such as just
what the resource guarantees and limits mean.

Most likely, the way I stated it, there is some good reason that is
very obvious to you why metered cpusets don't work. Perhaps some
variation would work and be worthy of consideration as an alternative
to subcpusets.

I hope to reduce to a minimum the special limitations on these cpusets,
whether subcpusets or metered cpusets.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-08 13:01:18

by Dinakar Guniguntala

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS


Interesting implementation of resource controls. Cross posting this
to ckrm-tech as well. I am sure CKRM folks have something to say...

Any thoughts about how you want to add more resource control features
on top of, or in addition to, this setup (such as memory, etc.)?


On Thu, Sep 08, 2005 at 12:23:23AM -0700, Paul Jackson wrote:
> I'm guessing you do not want such cpusets (the parents of subcpusets)
> to overlap, because if they did, it would seem to confuse the meaning
> of getting a fixed proportion of available cpu and memory resources. I
> was a little surprised not to see any additional checks that
> cpu_exclusive and mem_exclusive must be set true in these cpusets, to
> ensure non-overlapping cpusets.

I agree with Paul here. You would want to build your controllers
on top of exclusive cpusets to keep things sane.

> On the other hand, Dinakar had more work to do than you might, because
> he needed a complete covering (so had to round up cpus in non exclusive
> cpusets to form more covering elements). From what I can tell, you
> don't need a complete covering - it seems fine if some cpus are not
> managed by this resource control function.


I think it makes more sense to add this functionality directly as part
of the existing cpusets instead of creating further leaf cpusets (or subcpusets,
as you call them) where we can specify resource control parameters. I think that
approach would be much more intuitive and simpler to work with than
what you have currently.

-Dinakar

2005-09-08 14:17:43

by Dipankar Sarma

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

On Thu, Sep 08, 2005 at 06:44:27PM +0530, Dinakar Guniguntala wrote:
>
>
> > On the other hand, Dinakar had more work to do than you might, because
> > he needed a complete covering (so had to round up cpus in non exclusive
> > cpusets to form more covering elements). From what I can tell, you
> > don't need a complete covering - it seems fine if some cpus are not
> > managed by this resource control function.
>
>
> I think it makes more sense to add this functionality directly as part
> of the existing cpusets instead of creating further leaf cpusets (or subcpusets,
> as you call them) where we can specify resource control parameters. I think that
> approach would be much more intuitive and simpler to work with than
> what you have currently.

If what subcpusets are doing is slicing cpuset resources, then wouldn't
it be more intuitive to call them slice0, slice1, etc. under the
respective cpuset?

Thanks
Dipankar

2005-09-08 14:55:31

by Paul Jackson

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

Dipankar wrote:
> If what subcpusets are doing is slicing cpuset resources, then wouldn't
> it be more intuitive to call them slice0, slice1, etc. under the
> respective cpuset?

If we continue with Takahiro-san's design, then I agree that the name
'subcpusets' doesn't have quite the right connotations, and would join
you in exploring alternative names.

First, it looks worth exploring alternatives that make resource
control more an attribute of existing cpusets than a new (sub) type of
cpuset, as Dinakar and I have both contemplated in earlier posts today.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-08 15:08:19

by Paul Jackson

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

Dinakar wrote:
> Cross posting this to ckrm-tech as well.

Good idea - thanks.

Hopefully Takahiro-san and the CKRM folks can reach an understanding
on how their two proposals relate.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-08 22:51:37

by Chandra Seetharaman

Subject: Re: [ckrm-tech] Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

On Thu, 2005-09-08 at 18:44 +0530, Dinakar Guniguntala wrote:
> Interesting implementation of resource controls. Cross posting this

I second this :)

I browsed a little through the docs/patches... this seems to fit better
into a resource management solution (hint: CKRM :) than into CPUSETS
(resource isolation).

I can see the usefulness of resource management inside CPUSET. We have
had discussions earlier (in lkml and lse-tech) about how CKRM can play
inside a CPUSET, and this plays directly into that, providing resource
management inside a CPUSET.

The parameters used, guarantee and limit, fit very well into CKRM's
shares usage model.

Takahiro-san, how much effort do you think will be needed to make this
work under CKRM? Thanks.

> to ckrm-tech as well. I am sure CKRM folks have something to say...
>
> Any thoughts about how you want to add more resource control features
> on top of, or in addition to, this setup (such as memory, etc.)?
>
>
> On Thu, Sep 08, 2005 at 12:23:23AM -0700, Paul Jackson wrote:
> > I'm guessing you do not want such cpusets (the parents of subcpusets)
> > to overlap, because if they did, it would seem to confuse the meaning
> > of getting a fixed proportion of available cpu and memory resources. I
> > was a little surprised not to see any additional checks that
> > cpu_exclusive and mem_exclusive must be set true in these cpusets, to
> > ensure non-overlapping cpusets.
>
> I agree with Paul here. You would want to build your controllers
> on top of exclusive cpusets to keep things sane.
>
> > On the other hand, Dinakar had more work to do than you might, because
> > he needed a complete covering (so had to round up cpus in non exclusive
> > cpusets to form more covering elements). From what I can tell, you
> > don't need a complete covering - it seems fine if some cpus are not
> > managed by this resource control function.
>
>
> I think it makes more sense to add this functionality directly as part
> of the existing cpusets instead of creating further leaf cpusets (or subcpusets,
> as you call them) where we can specify resource control parameters. I think that
> approach would be much more intuitive and simpler to work with than
> what you have currently.
>
> -Dinakar
--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- [email protected] | .......you may get it.
----------------------------------------------------------------------


2005-09-09 01:38:06

by KUROSAWA Takahiro

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

On Thu, 8 Sep 2005 05:02:32 -0700
Paul Jackson <[email protected]> wrote:

> These subcpusets, if I understand correctly, are a bit different
> from ordinary cpusets. For instance, it seems one cannot make child
> cpusets of them, and one cannot change most of their properties,
> such as cpus, mems, cpu_exclusive, mem_exclusive, or notify_on_release.
>
> Also the mems and cpus of each subcpuset are required to be
> exactly equal to the corresponding values of its parent.

That's right.

> One of my passions is to avoid special cases across API boundaries.
>
> I am proposing that you don't do subcpusets like this.
>
> Consider the following alternative I will call 'cpuset meters'.
>
> For each resource named 'R' (cpu and mem, for instance):
> * Add a boolean flag 'meter_R' to each cpuset. If set, that R is
> metered, for the tasks in that cpuset or any descendent cpuset.
> * If a cpuset is metered, then files named meter_R_guar, meter_R_lim
> and meter_R_cur appear in that cpuset to manage R's usage by tasks
> in that cpuset and descendents.
> * There are no additional rules that restrict the ability to change
> various other cpuset properties such as cpus, mems, cpu_exclusive,
> mem_exclusive, or notify_on_release, when a cpuset is metered.
> * It might be that some (or by design all) resource controllers do
> not allow nesting metered cpusets. I don't know. But one should
> (if one has permission) be able to make child cpusets of a metered
> cpuset, just like one can of any other cpuset.
> * A metered cpuset might well have cpus or mems that are not the
> same as its parent, just like an unmetered cpuset ordinarily does.

Jackson-san's idea looks good to me because users don't need
to create special cpusets (parents of subcpusets or subcpusets).
From the users' point of view, maybe they wouldn't like to create
special cpusets.

As for the resource controller that I've posted, it assumes
that there are groups of tasks that share the same cpumasks/nodemasks,
and that there is no hierarchy in those groups, in order to make
things easier. I'll investigate how I can attach the resource
controller to the cpuset meters.

> If I understand correctly, one could place a job that managed its
> own child cpusets into a metered cpuset, but not into a subcpuset.
> Even if you wanted to allow it, it seems jobs in subcpusets cannot
> make child cpusets or modify the properties of their current cpuset.
> Is that correct?

Correct. Subcpusets can't make child cpusets, and they don't actually
have cpumask/nodemask properties of their own. Properties like the
cpumasks/nodemasks of subcpusets are the same as their parent's, and
subcpusets can't modify those properties.

Thanks,

KUROSAWA, Takahiro

2005-09-09 04:12:13

by Magnus Damm

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

On 9/9/05, KUROSAWA Takahiro <[email protected]> wrote:
> On Thu, 8 Sep 2005 05:02:32 -0700
> Paul Jackson <[email protected]> wrote:
> > One of my passions is to avoid special cases across API boundaries.
> >
> > I am proposing that you don't do subcpusets like this.
> >
> > Consider the following alternative I will call 'cpuset meters'.
> >
> > For each resource named 'R' (cpu and mem, for instance):
> > * Add a boolean flag 'meter_R' to each cpuset. If set, that R is
> > metered, for the tasks in that cpuset or any descendent cpuset.
> > * If a cpuset is metered, then files named meter_R_guar, meter_R_lim
> > and meter_R_cur appear in that cpuset to manage R's usage by tasks
> > in that cpuset and descendents.
> > * There are no additional rules that restrict the ability to change
> > various other cpuset properties such as cpus, mems, cpu_exclusive,
> > mem_exclusive, or notify_on_release, when a cpuset is metered.
> > * It might be that some (or by design all) resource controllers do
> > not allow nesting metered cpusets. I don't know. But one should
> > (if one has permission) be able to make child cpusets of a metered
> > cpuset, just like one can of any other cpuset.
> > * A metered cpuset might well have cpus or mems that are not the
> > same as its parent, just like an unmetered cpuset ordinarily does.
>
> Jackson-san's idea looks good to me because users don't need
> to create special cpusets (parents of subcpusets or subcpusets).
> From the users' point of view, maybe they wouldn't like to create
> special cpusets.

Yes, from the user POV it must be good to keep the hierarchical model.
CKRM and cpusets both provide a tree with descendents, children and
parents. This hierarchical model is very nice IMO and provides a
powerful API for the user.

> As for the resource controller that I've posted, it assumes
> that there are groups of tasks that share the same cpumasks/nodemasks,
> and that there is no hierarchy in those groups, in order to make
> things easier. I'll investigate how I can attach the resource
> controller to the cpuset meters.

Subcpusets, compared to cpusets and CKRM, give the user a flat model.
No hierarchy. Limited functionality compared to the hierarchical model.

But what I think is important to keep in mind here is that cpusets and
subcpusets do very different things. If I understand cpusets
correctly, each cpuset may share processors or memory nodes with other
cpusets. One task running on a shared processor may starve other
cpusets using the same processor. This design works well with cpusets,
but for resource controllers that must provide some kind of guarantee,
this starvation is unsuitable.

And we already have an hierarchical alternative: CKRM. But look at the
complexity and the amount of code. I believe that the complexity in
CKRM mainly comes from the hierarchical model.

Maybe it is possible to have an hierarchical model and keep the
framework simple and easy to understand while providing guarantees,
I'm not sure. But until that happens, I'm quite happy with a simple,
limited flat model.

/ magnus

2005-09-09 05:57:51

by Paul Jackson

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

magnus wrote:
> Maybe it is possible to have an hierarchical model and keep the
> framework simple and easy to understand while providing guarantees,

Dinakar's patches to use cpu_exclusive cpusets to define dynamic
sched domains accomplish something like this.

What scheduler domains and resource control domains both need
are non-overlapping subsets of the CPUs and/or Memory Nodes.

In the case of sched domains, you normally want the subsets
to cover all the CPUs. You want every CPU to have exactly
one scheduler that is responsible for its scheduling.

In the case of resource control domains, you perhaps don't
care if some CPUs or Memory Nodes have no particular resources
constraints defined for them. In that case, every CPU and
every Memory Node maps to _either_ zero or one resource control
domain.

Either way, a 'flat model' non-overlapping partitioning of the
CPUs and/or Memory Nodes can be obtained from a hierarchical
model (nested sets of subsets) by selecting some of the subsets
that don't overlap ;). In /dev/cpuset, this selection is normally
made by specifying another boolean file (contains '0' or '1')
that controls whether that cpuset is one of the selected subsets.
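
In code terms, any two of the selected cpusets would have to be
disjoint. A sketch of that test, using the existing cpumask and
nodemask helpers (illustration only):

/* Two covers overlap if they share any CPU or any Memory Node. */
static int covers_overlap(const struct cpuset *a, const struct cpuset *b)
{
        return cpus_intersects(a->cpus_allowed, b->cpus_allowed) ||
               nodes_intersects(a->mems_allowed, b->mems_allowed);
}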

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-09 07:52:04

by Magnus Damm

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

On 9/9/05, Paul Jackson <[email protected]> wrote:
> magnus wrote:
> > Maybe it is possible to have an hierarchical model and keep the
> > framework simple and easy to understand while providing guarantees,
>
> Dinakar's patches to use cpu_exclusive cpusets to define dynamic
> sched domains accomplish something like this.

Ah. I'm not familiar with Dinakar's patches and how these domains
work. I'm actually more interested in this from the memory resource
control point of view, but these areas are of course somehow related.

> What scheduler domains and resource control domains both need
> are non-overlapping subsets of the CPUs and/or Memory Nodes.

Yes, that sounds like a good idea. But because English is not my
native language, I just want to make sure that we mean the same thing.
Non-overlapping subsets of cpu or memory nodes basically mean that
children of a cpuset only clear bits in the bitmap, never set them.
Please correct me if I'm wrong.

> In the case of sched domains, you normally want the subsets
> to cover all the CPUs. You want every CPU to have exactly
> one scheduler that is responsible for its scheduling.

Hm, I'm not sure how this relates to memory management, but you
probably need a place to store per-bit (node/cpu) guarantee count
regardless of memory or cpu resource control.

I think one major problem is how the guarantee should be divided
between all subcpusets that share one bit. And using one percent value
(or page count) per cpuset if more than 1 bit is set raises a similar
guarantee question.

One solution to the guarantee dividing problem could be to have a
percent value per bit set in the bitmap, maybe something like this:

# echo 0=50% > /dev/cpuset/foo/cpus

> In the case of resource control domains, you perhaps don't
> care if some CPUs or Memory Nodes have no particular resources
> constraints defined for them. In that case, every CPU and
> every Memory Node maps to _either_ zero or one resource control
> domain.

These "resource control domains", are they similar to your "meter"
suggestion earlier, or are they something else?

> Either way, a 'flat model' non-overlapping partitioning of the
> CPUs and/or Memory Nodes can be obtained from a hierarchical
> model (nested sets of subsets) by selecting some of the subsets
> that don't overlap ;). In /dev/cpuset, this selection is normally
> made by specifying another boolean file (contains '0' or '1')
> that controls whether that cpuset is one of the selected subsets.

This boolean file, do you mean one of your "meter" files or something else?

Thank you for your comments!

/ magnus - will have a nice weekend now... =)

2005-09-09 08:40:21

by Paul Jackson

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

Magnus wrote:
> Non-overlapping subsets of cpu or memory nodes basically mean that
> children of a cpuset only clear bits in the bitmap, never set them.

X is a subset of Y if every element of X is also in Y.

My phrase "a subset of the CPUs" really just meant "some set of CPUs"
on the system. A subset of all CPUs is a set of some CPUs, perhaps
empty, perhaps including them all, perhaps in between.

The phrase "non-overlapping" mean that the several subsets had no
overlap - no CPU was in more than one of the subsets.

Ditto ... for Memory Nodes.

To be honest, I don't think in terms of bits. The bitmaps, cpumasks
and nodemasks are just ways of representing sets. I also use various
lists and arrays to represent sets of CPUs and sets of Memory Nodes,
depending on what's convenient.

My math background, before getting into computers, was in Set Theory.


> a place to store per-bit (node/cpu) guarantee count

I do not understand this snippet.


> I think one major problem is how the guarantee should be divided
> between all subcpusets that share one bit

That's why I said 'non-overlapping'. That means no shared bits (no
shared CPUs or Memory Nodes).


> "resource control domains"

These were whatever sets of CPUs or Memory Nodes had resource
control functions. Both Takahiro-san's subcpusets and my cpusets
with meter attributes were examples of such. No doubt CKRM has
such too, though I've forgotten too much of CKRM to say what.


> This boolean file, do you mean one of your "meter" files or something
> else?

Look under /dev/cpuset (after mounting it: mount -t cpuset cpuset /dev/cpuset).
You will see that some of the files, such as the files named:
mem_exclusive
cpu_exclusive
notify_on_release

always contain either the string "0\n" or "1\n". These are boolean files.
They are file representations of boolean values (True or False).

A cpuset with resource metering attributes would need:
1) one boolean file, to mark that it has such metering attributes.
2) several meter-specific files, holding various parameter settings,
such as what percentage of the CPUs allowed in this cpuset may
be used by the tasks in this cpuset.


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-09 11:41:57

by Hirokazu Takahashi

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

Hi,

> magnus wrote:
> > Maybe it is possible to have an hierarchical model and keep the
> > framework simple and easy to understand while providing guarantees,
>
> Dinakar's patches to use cpu_exclusive cpusets to define dynamic
> sched domains accomplish something like this.
>
> What scheduler domains and resource control domains both need
> are non-overlapping subsets of the CPUs and/or Memory Nodes.
>
> In the case of sched domains, you normally want the subsets
> to cover all the CPUs. You want every CPU to have exactly
> one scheduler that is responsible for its scheduling.
>
> In the case of resource control domains, you perhaps don't
> care if some CPUs or Memory Nodes have no particular resources
> constraints defined for them. In that case, every CPU and
> every Memory Node maps to _either_ zero or one resource control
> domain.
>
> Either way, a 'flat model' non-overlapping partitioning of the
> CPUs and/or Memory Nodes can be obtained from a hierarchical
> model (nested sets of subsets) by selecting some of the subsets
> that don't overlap ;). In /dev/cpuset, this selection is normally
> made by specifying another boolean file (contains '0' or '1')
> that controls whether that cpuset is one of the selected subsets.

What do you think if you make cpusets for sched domain be able to
have their siblings, which have the same attribute and share
their resources between them.

I guess it would be simple.

Thanks,
Hirokazu Takahashi.

2005-09-09 13:31:47

by Paul Jackson

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

Takahashi-san wrote:
> What do you think if you make cpusets for sched domain be able to
> have their siblings, which have the same attribute and share
> their resources between them.

I do not understand this question. I guess "cpusets for sched
domains" means "cpusets whose 'cpu_exclusive' attribute is
marked true, but which have no child cpusets so marked."

But even that guess I am unsure of, and as for the rest of the sentence,
"which have the same ...", I don't even have a guess what it means.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-09 22:32:30

by Matt Helsley

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

On Fri, 2005-09-09 at 13:12 +0900, Magnus Damm wrote:
> On 9/9/05, KUROSAWA Takahiro <[email protected]> wrote:
> > On Thu, 8 Sep 2005 05:02:32 -0700
> > Paul Jackson <[email protected]> wrote:
> > > One of my passions is to avoid special cases across API boundaries.
> > >
> > > I am proposing that you don't do subcpusets like this.
> > >
> > > Consider the following alternative I will call 'cpuset meters'.
> > >
> > > For each resource named 'R' (cpu and mem, for instance):
> > > * Add a boolean flag 'meter_R' to each cpuset. If set, that R is
> > > metered, for the tasks in that cpuset or any descendent cpuset.
> > > * If a cpuset is metered, then files named meter_R_guar, meter_R_lim
> > > and meter_R_cur appear in that cpuset to manage R's usage by tasks
> > > in that cpuset and descendents.
> > > * There are no additional rules that restrict the ability to change
> > > various other cpuset properties such as cpus, mems, cpu_exclusive,
> > > mem_exclusive, or notify_on_release, when a cpuset is metered.
> > > * It might be that some (or by design all) resource controllers do
> > > not allow nesting metered cpusets. I don't know. But one should
> > > (if one has permission) be able to make child cpusets of a metered
> > > cpuset, just like one can of any other cpuset.
> > > * A metered cpuset might well have cpus or mems that are not the
> > > same as its parent, just like an unmetered cpuset ordinarily does.
> >
> > Jackson-san's idea looks good to me because users don't need
> > to create special cpusets (parents of subcpusets or subcpusets).
> > From the users' point of view, maybe they wouldn't like to create
> > special cpusets.
>
> Yes, from the user POV it must be good to keep the hierarchical model.
> CKRM and cpusets both provide a tree with descendents, children and
> parents. This hierarchical model is very nice IMO and provides a
> powerful API for the user.
>
> > As for the resource controller that I've posted, it assumes
> > that there are groups of tasks that share the same cpumasks/nodemasks,
> > and that there is no hierarchy in those groups, in order to make
> > things easier. I'll investigate how I can attach the resource
> > controller to the cpuset meters.
>
> Subcpusets, compared to cpusets and CKRM, give the user a flat model.
> No hierarchy. Limited functionality compared to the hierarchical model.
>
> But what I think is important to keep in mind here is that cpusets and
> subcpusets do very different things. If I understand cpusets
> correctly, each cpuset may share processors or memory nodes with other
> cpusets. One task running on a shared processor may starve other
> cpusets using the same processor. This design works well with cpusets,
> but for resource controllers that must provide some kind of guarantee,
> this starvation is unsuitable.
>
> And we already have an hierarchical alternative: CKRM. But look at the
> complexity and the amount of code. I believe that the complexity in
> CKRM mainly comes from the hierarchical model.

I've been trying to find ways to significantly reduce CKRM's kernel
footprint. I recently posted an RFC patch to CKRM-Tech describing a
5000-line reduction:
http://sourceforge.net/mailarchive/forum.php?thread_id=8132624&forum_id=35191
Feedback on the approach presented in the RFC patch would be
appreciated.

> Maybe it is possible to have an hierarchical model and keep the
> framework simple and easy to understand while providing guarantees,
> I'm not sure. But until that happens, I'm quite happy with a simple,
> limited flat model.
>
> / magnus

2005-09-10 07:14:57

by Hirokazu Takahashi

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

Hi,

> > What do you think if you make cpusets for sched domain be able to
> > have their siblings, which have the same attribute and share
> > their resources between them.
>
> I do not understand this question. I guess "cpusets for sched
> domains" means "cpusets whose 'cpu_exclusive' attribute is
> marked true, but which have no child cpusets so marked."

Yes.

> But even that guess I am unsure of, and as for the rest of the sentence,
> "which have the same ...", I don't even have a guess what it means.

Sorry for the poor explanation.
I just thought that it wouldn't be bad to allow "each cpuset whose
cpu_exclusive attribute is marked true" to have clones like the
figure below. In this case, cpu-2 and cpu-3 will be used exclusively
by the clones --- CPUSET 1, 2, and 3 ---.

I guess this is very similar to one of your ideas except for reusing
the cpu_exclusive flag. Do you think reusing the flag is a good idea?


    +----------------+----------------+----------------+
    |                |                |                |
CPUSET 0         CPUSET 1         CPUSET 2         CPUSET 3
sched domain A   sched domain B   sched domain B   sched domain B
cpus: 0, 1       cpus: 2, 3       cpus: 2, 3       cpus: 2, 3
cpu_exclusive    cpu_exclusive    cpu_exclusive    cpu_exclusive
                 meter_cpu        meter_cpu        meter_cpu
                 <---- should we call it a resource domain? ---->


Thanks,
Hirokazu Takahashi.

2005-09-10 08:52:21

by Paul Jackson

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

Well, I suspect I don't understand yet.

Nice picture though - that gives me some idea what you mean.

Do notice that the basic rule of cpu_exclusive cpusets is that their
CPUs don't overlap their siblings. Your Cpusets 1, 2, and 3 seem to be
marked cpu_exclusive in your picture, but all contain the same CPUs 2
and 3, overlapping each other.

I'm guessing what you are trying to draw is:

Tasks on CPUs 0 and 1 have no resource control limits.

Tasks on CPUs 2 and 3 have resource control limits specifying
what percentage of the CPUs 2 and 3 is available to them.

I might draw my solution to that as:

    +-----------------------------------+
    |                                   |
CPUSET 0                            CPUSET 1
sched domain A                      sched domain B
cpus: 0, 1                          cpus: 2, 3
cpu_exclusive=1                     cpu_exclusive=1
meter_cpu=0                         meter_cpu=0
                                        |
                       +----------------+----------------+
                       |                |                |
                   CPUSET 1a        CPUSET 1b        CPUSET 1c
                   cpus: 2, 3       cpus: 2, 3       cpus: 2, 3
                   cpu_exclusive=0  cpu_exclusive=0  cpu_exclusive=0
                   meter_cpu=1      meter_cpu=1      meter_cpu=1
                   meter_cpu_*      meter_cpu_*      meter_cpu_*

The meter_cpu_* files in each of Cpusets 1a, 1b, and 1c control what
proportion of the CPU resources in that Cpuset can be used by the tasks
in that Cpuset.

If meter_cpu is false (0) then the meter_cpu_* files do not appear,
which is equivalent to allowing 100% of the CPUs in that Cpuset to
be used by the tasks in that Cpuset (and descendents, of course.)

Don't forget - this all seems like it has significant mission overlap
with CKRM. I hate to repeat this, but the relation of your work to
CKRM needs to be understood before I am likely to agree to accepting
your work into the kernel (not that my acceptance is required; you
really just need Linus to agree, though he of course considers the
positions of others to some inscrutable degree.)

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-11 16:05:45

by Hirokazu Takahashi

Subject: Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS

Hi,

> Well, I suspect I don't understand yet.
>
> Nice picture though - that gives me some idea what you mean.
>
> Do notice that the basic rule of cpu_exclusive cpusets is that their
> CPUs don't overlap their siblings. Your Cpusets 1, 2, and 3 seem to be
> marked cpu_exclusive in your picture, but all contain the same CPUs 2
> and 3, overlapping each other.

Yes, I know the current design of cpu_exclusive.
I was just wondering if I could enhance cpu_exclusive to cover a group
of cpusets, to make the hierarchy as flat as possible.

> I'm guessing what you are trying to draw is:
>
> Tasks on CPUs 0 and 1 have no resource control limits.

I thought CPU 1 could have resource control limits.

> Tasks on CPUs 2 and 3 have resource control limits specifying
> what percentage of the CPUs 2 and 3 is available to them.
>
> I might draw my solution to that as:

I understand your idea clearly now; it is essentially the same as
Kurosawa's design. Your design looks very straightforward and I have
no objection to it.

My only concern is that it would become harder to control resources
between CPUSETs 1a, 1b, and 1c if some processes are assigned to
CPUSET 1 directly. But it just occurred to me that it would be OK if
CPUSET 1 could have meter_cpu=1 to share the resources.

>     +-----------------------------------+
>     |                                   |
> CPUSET 0                            CPUSET 1
> sched domain A                      sched domain B
> cpus: 0, 1                          cpus: 2, 3
> cpu_exclusive=1                     cpu_exclusive=1
> meter_cpu=0                         meter_cpu=0
>                                         |
>                        +----------------+----------------+
>                        |                |                |
>                    CPUSET 1a        CPUSET 1b        CPUSET 1c
>                    cpus: 2, 3       cpus: 2, 3       cpus: 2, 3
>                    cpu_exclusive=0  cpu_exclusive=0  cpu_exclusive=0
>                    meter_cpu=1      meter_cpu=1      meter_cpu=1
>                    meter_cpu_*      meter_cpu_*      meter_cpu_*
>
> The meter_cpu_* files in each of Cpusets 1a, 1b, and 1c control what
> proportion of the CPU resources in that Cpuset can be used by the tasks
> in that Cpuset.
>
> If meter_cpu is false (0) then the meter_cpu_* files do not appear,
> which is equivalent to allowing 100% of the CPUs in that Cpuset to
> be used by the tasks in that Cpuset (and descendents, of course.)
>
> Don't forget - this all seems like it has significant mission overlap
> with CKRM. I hate to repeat this, but the relation of your work to
> CKRM needs to be understood before I am likely to agree to accepting
> your work into the kernel (not that my acceptance is required; you
> really just need Linus to agree, though he of course considers the
> positions of others to some inscrutable degree.)

OK.

Thanks,
Hirokazu Takahashi.

2005-09-26 09:34:35

by KUROSAWA Takahiro

Subject: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

This patch adds the CPUMETER framework code to the CPUSETS.
CPUMETER is meant for subdividing cpuset resources.
A few files are added to the cpuset filesystem in order to control
the guarantee and the limit of the resource amount. It also adds
interfaces for specific resource controllers, such as CPU and memory.
Signed-off-by: KUROSAWA Takahiro <[email protected]>

--- from-0001/include/linux/cpuset.h
+++ to-work/include/linux/cpuset.h 2005-09-26 17:24:09.480931862 +0900
@@ -14,6 +14,46 @@

#ifdef CONFIG_CPUSETS

+#ifdef CONFIG_CPUMETER
+struct cpumeter_ctlr {
+ char *name; /* controller name */
+ int idx; /* used by cpumeter core */
+ void *(*create_rcdomain)(struct cpuset *cs, cpumask_t cpus,
+ nodemask_t mems);
+ void (*destroy_rcdomain)(void *rcd);
+ void *(*create)(void *rcd, struct cpuset *cs);
+ void (*destroy)(void *ctldata);
+ int (*set_lim)(void *ctldata, unsigned long val);
+ int (*set_guar)(void *ctldata, unsigned long val);
+ int (*get_cur)(void *ctldata, unsigned long *valp);
+};
+
+extern int cpumeter_register_controller(struct cpumeter_ctlr *ctlr);
+extern void *cpumeter_get_controller_data(struct cpuset *cs,
+ struct cpumeter_ctlr *ctlr);
+extern void *cpumeter_get_rcdomain(struct cpuset *cs,
+ struct cpumeter_ctlr *ctlr);
+#else /* CONFIG_CPUMETER */
+struct cpumeter_ctlr;
+
+static inline int cpumeter_register_controller(struct cpumeter_ctlr *ctlr)
+{
+ return -EINVAL;
+}
+
+static inline void *cpumeter_get_controller_data(struct cpuset *cs,
+ struct cpumeter_ctlr *ctlr)
+{
+ return NULL;
+}
+
+static inline void *cpumeter_get_rcdomain(struct cpuset *cs,
+ struct cpumeter_ctlr *ctlr)
+{
+ return NULL;
+}
+#endif /* CONFIG_CPUMETER */
+
extern int cpuset_init(void);
extern void cpuset_init_smp(void);
extern void cpuset_fork(struct task_struct *p);
--- from-0001/init/Kconfig
+++ to-work/init/Kconfig 2005-09-26 17:24:09.481931723 +0900
@@ -238,6 +238,15 @@ config CPUSETS

Say N if unsure.

+config CPUMETER
+ bool "Cpumeter support"
+ depends on CPUSETS
+ help
+ This option enables the resource control of CPUs and memory
+ via the CPUSETS interface.
+
+ Say N if unsure.
+
menuconfig EMBEDDED
bool "Configure standard kernel features (for small systems)"
help
--- from-0001/kernel/cpuset.c
+++ to-work/kernel/cpuset.c 2005-09-26 17:24:09.479932000 +0900
@@ -55,6 +55,19 @@

#define CPUSET_SUPER_MAGIC 0x27e0eb

+#ifdef CONFIG_CPUMETER
+#define CPUMETER_CTLRS_MAX 16
+
+struct cpumeter {
+ void *ctlr_data; /* resource controller data */
+ unsigned long guar; /* resource guarantee */
+ unsigned long lim; /* resource limit */
+};
+
+static struct cpumeter_ctlr *cpumeter_ctlrs[CPUMETER_CTLRS_MAX];
+static int cpumeter_numctlrs = 0;
+#endif /* CONFIG_CPUMETER */
+
struct cpuset {
unsigned long flags; /* "unsigned long" so bitops work */
cpumask_t cpus_allowed; /* CPUs allowed to tasks in cpuset */
@@ -77,6 +90,16 @@ struct cpuset {
* recent time this cpuset changed its mems_allowed.
*/
int mems_generation;
+#ifdef CONFIG_CPUMETER
+ /*
+ * rcdomains: used for the resource control domains
+ * to keep track of the total amount of resources.
+ * meters: used for metering resources assigned to
+ * the cpuset.
+ */
+ void *rcdomains[CPUMETER_CTLRS_MAX];
+ struct cpumeter meters[CPUMETER_CTLRS_MAX];
+#endif /* CONFIG_CPUMETER */
};

/* bits in struct cpuset flags field */
@@ -84,7 +107,8 @@ typedef enum {
CS_CPU_EXCLUSIVE,
CS_MEM_EXCLUSIVE,
CS_REMOVED,
- CS_NOTIFY_ON_RELEASE
+ CS_NOTIFY_ON_RELEASE,
+ CS_METER_OFFSET /* must be the last. */
} cpuset_flagbits_t;

/* convenient tests for these bits */
@@ -108,6 +132,47 @@ static inline int notify_on_release(cons
return !!test_bit(CS_NOTIFY_ON_RELEASE, &cs->flags);
}

+static inline int is_metered(const struct cpuset *cs, int idx)
+{
+ return !!test_bit(CS_METER_OFFSET + idx, &cs->flags);
+}
+
+static inline void set_meter(struct cpuset *cs, int idx)
+{
+ set_bit(CS_METER_OFFSET + idx, &cs->flags);
+}
+
+static inline void clear_meter(struct cpuset *cs, int idx)
+{
+ clear_bit(CS_METER_OFFSET + idx, &cs->flags);
+}
+
+#ifdef CONFIG_CPUMETER
+static int cpumeter_add_meter_flags(struct dentry *d);
+static int validate_meters(const struct cpuset *cur,
+ const struct cpuset *trial);
+static int inherit_meters(struct cpuset *cs, struct cpuset *parent);
+static void cpumeter_destroy_meters(struct cpuset *cs);
+
+#else /* CONFIG_CPUMETER */
+
+static inline int cpumeter_add_meter_flags(struct dentry *d) { return 0; }
+
+static inline int validate_meters(const struct cpuset *cur,
+ const struct cpuset *trial)
+{
+ return 0;
+}
+
+static inline int inherit_meters(struct cpuset *cs, struct cpuset *parent)
+{
+ return 0;
+}
+
+static inline void cpumeter_destroy_meters(struct cpuset *cs) {}
+
+#endif /* CONFIG_CPUMETER */
+
/*
* Increment this atomic integer everytime any cpuset changes its
* mems_allowed value. Users of cpusets can track this generation
@@ -217,6 +282,7 @@ static void cpuset_diput(struct dentry *
if (S_ISDIR(inode->i_mode)) {
struct cpuset *cs = dentry->d_fsdata;
BUG_ON(!(is_removed(cs)));
+ cpumeter_destroy_meters(cs);
kfree(cs);
}
iput(inode);
@@ -613,6 +679,9 @@ static int validate_change(const struct
return -EINVAL;
}

+ if (validate_meters(cur, trial))
+ return -EINVAL;
+
return 0;
}

@@ -1052,7 +1121,10 @@ static struct inode_operations cpuset_di
.rmdir = cpuset_rmdir,
};

-static int cpuset_create_file(struct dentry *dentry, int mode)
+static int cpuset_create_file(struct dentry *dentry,
+ int mode,
+ struct inode_operations *iop,
+ struct file_operations *fop)
{
struct inode *inode;

@@ -1065,15 +1137,16 @@ static int cpuset_create_file(struct den
if (!inode)
return -ENOMEM;

- if (S_ISDIR(mode)) {
- inode->i_op = &cpuset_dir_inode_operations;
- inode->i_fop = &simple_dir_operations;
+ if (iop)
+ inode->i_op = iop;
+ if (fop)
+ inode->i_fop = fop;

+ if (S_ISDIR(mode)) {
/* start off with i_nlink == 2 (for "." entry) */
inode->i_nlink++;
} else if (S_ISREG(mode)) {
inode->i_size = 0;
- inode->i_fop = &cpuset_file_operations;
}

d_instantiate(dentry, inode);
@@ -1100,7 +1173,9 @@ static int cpuset_create_dir(struct cpus
dentry = cpuset_get_dentry(parent, name);
if (IS_ERR(dentry))
return PTR_ERR(dentry);
- error = cpuset_create_file(dentry, S_IFDIR | mode);
+ error = cpuset_create_file(dentry, S_IFDIR | mode,
+ &cpuset_dir_inode_operations,
+ &simple_dir_operations);
if (!error) {
dentry->d_fsdata = cs;
parent->d_inode->i_nlink++;
@@ -1119,7 +1194,8 @@ static int cpuset_add_file(struct dentry
down(&dir->d_inode->i_sem);
dentry = cpuset_get_dentry(dir, cft->name);
if (!IS_ERR(dentry)) {
- error = cpuset_create_file(dentry, 0644 | S_IFREG);
+ error = cpuset_create_file(dentry, 0644 | S_IFREG,
+ NULL, &cpuset_file_operations);
if (!error)
dentry->d_fsdata = (void *)cft;
dput(dentry);
@@ -1321,6 +1397,8 @@ static int cpuset_populate_dir(struct de
return err;
if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0)
return err;
+ if ((err = cpumeter_add_meter_flags(cs_dentry)) < 0)
+ return err;
return 0;
}

@@ -1359,6 +1437,10 @@ static long cpuset_create(struct cpuset

list_add(&cs->sibling, &cs->parent->children);

+ err = inherit_meters(cs, parent);
+ if (err < 0)
+ goto err;
+
err = cpuset_create_dir(cs, name, mode);
if (err < 0)
goto err;
@@ -1686,3 +1768,739 @@ char *cpuset_task_status_allowed(struct
buffer += sprintf(buffer, "\n");
return buffer;
}
+
+#ifdef CONFIG_CPUMETER
+/*
+ * cpumeter support routines
+ */
+
+static ssize_t cpumeter_file_read_common(struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos,
+ unsigned long val);
+static ssize_t cpumeter_file_get_written_data(const char __user *userbuf,
+ size_t nbytes,
+ unsigned long *valp);
+static ssize_t cpumeter_meter_file_read(struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos);
+static ssize_t cpumeter_meter_file_write(struct file *file,
+ const char __user *userbuf,
+ size_t nbytes,
+ loff_t *unused_ppos);
+static ssize_t cpumeter_guar_file_read(struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos);
+static ssize_t cpumeter_guar_file_write(struct file *file,
+ const char __user *userbuf,
+ size_t nbytes,
+ loff_t *unused_ppos);
+static int cpumeter_add_meter_flag(struct dentry *d, struct cpumeter_ctlr *c);
+static int cpumeter_add_meter_file(struct dentry *dir,
+ char *name,
+ int mode,
+ struct file_operations *fop,
+ struct cpumeter_ctlr *c);
+static int cpumeter_add_meter_files(struct dentry *d, struct cpumeter_ctlr *c);
+static void cpumeter_remove_meter_files(struct dentry *d,
+ struct cpumeter_ctlr *c);
+
+static char cpumeter_guar_suffix[] = "_guar";
+static char cpumeter_lim_suffix[] = "_lim";
+static char cpumeter_cur_suffix[] = "_cur";
+static char cpumeter_meter_prefix[] = "meter_";
+#define CPUMETER_FNAME_MAX 255
+#define CPUMETER_AFFIX_MAX \
+ (sizeof(cpumeter_meter_prefix) + sizeof(cpumeter_guar_suffix) - 2)
+
+int cpumeter_register_controller(struct cpumeter_ctlr *ctlr)
+{
+ int namelen;
+
+ namelen = strlen(ctlr->name);
+ if (namelen + CPUMETER_AFFIX_MAX > CPUMETER_FNAME_MAX)
+ return -ENAMETOOLONG;
+
+ down(&cpuset_sem);
+ if (cpumeter_numctlrs >= CPUMETER_CTLRS_MAX) {
+ up(&cpuset_sem);
+ return -ENOSPC;
+ }
+
+ cpumeter_ctlrs[cpumeter_numctlrs] = ctlr;
+ ctlr->idx = cpumeter_numctlrs;
+ cpumeter_numctlrs++;
+ up(&cpuset_sem);
+
+ if (top_cpuset.dentry) {
+ down(&top_cpuset.dentry->d_inode->i_sem);
+ cpumeter_add_meter_flag(top_cpuset.dentry, ctlr);
+ up(&top_cpuset.dentry->d_inode->i_sem);
+ }
+
+ return 0;
+}
+
+void *cpumeter_get_controller_data(struct cpuset *cs,
+ struct cpumeter_ctlr *c)
+{
+ if (!cs || !is_metered(cs, c->idx))
+ return NULL;
+
+ return cs->meters[c->idx].ctlr_data;
+}
+
+void *cpumeter_get_rcdomain(struct cpuset *cs,
+ struct cpumeter_ctlr *c)
+{
+ if (!cs || !is_metered(cs, c->idx))
+ return NULL;
+
+ return cs->rcdomains[c->idx];
+}
+
+static ssize_t cpumeter_file_read_common(struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos,
+ unsigned long val)
+{
+ char *page, *s, *start;
+ ssize_t retval;
+ size_t n;
+
+ if (!(page = (char *)__get_free_page(GFP_KERNEL)))
+ return -ENOMEM;
+
+ s = page;
+ s += snprintf(s, PAGE_SIZE, "%lu", val);
+ *s++ = '\n';
+ *s = '\0';
+
+ /* Do nothing if *ppos is at the eof or beyond the eof. */
+ if (s - page <= *ppos) {
+ free_page((unsigned long)page);
+ return 0;
+ }
+
+ start = page + *ppos;
+ n = s - start;
+ n = min(n, nbytes);
+ retval = n - copy_to_user(buf, start, n);
+ *ppos += retval;
+
+ free_page((unsigned long)page);
+ return retval;
+}
+
+static ssize_t cpumeter_file_get_written_data(const char __user *userbuf,
+ size_t nbytes,
+ unsigned long *valp)
+{
+ char *buffer;
+ int retval = 0;
+
+ /* Crude upper limit on the largest legitimate value a user might write. */
+ if (nbytes > 100 + 6 * NR_CPUS)
+ return -E2BIG;
+
+ /* +1 for nul-terminator */
+ if ((buffer = kmalloc(nbytes + 1, GFP_KERNEL)) == 0)
+ return -ENOMEM;
+
+ if (copy_from_user(buffer, userbuf, nbytes)) {
+ retval = -EFAULT;
+ goto out;
+ }
+
+ buffer[nbytes] = 0; /* nul-terminate */
+ *valp = simple_strtoul(buffer, NULL, 0);
+out:
+ kfree(buffer);
+ return retval;
+}
+
+static ssize_t cpumeter_meter_file_read(struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos)
+{
+ struct cpuset *cs = __d_cs(file->f_dentry->d_parent);
+ struct cpumeter_ctlr *c = file->f_dentry->d_fsdata;
+ unsigned long val;
+
+ down(&cpuset_sem);
+ val = is_metered(cs, c->idx);
+ up(&cpuset_sem);
+
+ return cpumeter_file_read_common(file, buf, nbytes, ppos, val);
+}
+
+static ssize_t cpumeter_meter_file_write(struct file *file,
+ const char __user *userbuf,
+ size_t nbytes,
+ loff_t *unused_ppos)
+{
+ struct cpuset *cs = __d_cs(file->f_dentry->d_parent);
+ struct cpuset *parent;
+ struct cpuset trialcs;
+ struct cpumeter_ctlr *c = file->f_dentry->d_fsdata;
+ struct cpumeter *m;
+ unsigned long val;
+ void *ctlr_data;
+ int turning_on;
+ int retval;
+
+ retval = cpumeter_file_get_written_data(userbuf, nbytes, &val);
+ if (retval)
+ return retval;
+
+ turning_on = (val != 0);
+
+ /*
+ * keeping the lock order (i_sem > cpuset_sem).
+ * i_sem should be held because we are going to create/remove files.
+ */
+ down(&cs->dentry->d_inode->i_sem);
+ down(&cpuset_sem);
+ if (is_removed(cs)) {
+ retval = -ENODEV;
+ goto out;
+ }
+
+ parent = cs->parent;
+ m = &cs->meters[c->idx];
+ ctlr_data = m->ctlr_data;
+ trialcs = *cs;
+ if (turning_on)
+ set_meter(&trialcs, c->idx);
+ else
+ clear_meter(&trialcs, c->idx);
+
+ retval = validate_change(cs, &trialcs);
+ if (retval < 0)
+ goto out;
+
+ if (is_metered(cs, c->idx) == is_metered(&trialcs, c->idx)) {
+ retval = nbytes;
+ goto out;
+ }
+
+ if (!turning_on) {
+ if (c->set_guar) {
+ c->set_guar(cs->meters[c->idx].ctlr_data, 0);
+ /* Do not set cs->meters[c->idx].guar=0 here. */
+ }
+ c->destroy(cs->meters[c->idx].ctlr_data);
+ cs->meters[c->idx].ctlr_data = NULL;
+
+ if (parent && is_metered(parent, c->idx)) {
+ parent->meters[c->idx].guar += cs->meters[c->idx].guar;
+ cs->rcdomains[c->idx] = NULL;
+ } else {
+ c->destroy_rcdomain(cs->rcdomains[c->idx]);
+ cs->rcdomains[c->idx] = NULL;
+ }
+
+ cs->meters[c->idx].guar = 0;
+ clear_meter(cs, c->idx);
+ } else {
+ if (parent && is_metered(parent, c->idx)) {
+ cs->rcdomains[c->idx] = parent->rcdomains[c->idx];
+
+ /* Initial guarantee for children should be 0% */
+ cs->meters[c->idx].guar = 0;
+ } else {
+ cs->rcdomains[c->idx] =
+ c->create_rcdomain(cs,
+ cs->cpus_allowed,
+ cs->mems_allowed);
+ if (!cs->rcdomains[c->idx]) {
+ retval = -ENOMEM;
+ goto out;
+ }
+
+ cs->meters[c->idx].guar = 100;
+ }
+
+ cs->meters[c->idx].ctlr_data =
+ c->create(cs->rcdomains[c->idx], cs);
+ if (!cs->meters[c->idx].ctlr_data) {
+ /* only destroy the rcdomain if we created it above;
+  * a metered parent still owns the shared rcdomain. */
+ if (!(parent && is_metered(parent, c->idx)))
+ c->destroy_rcdomain(cs->rcdomains[c->idx]);
+ cs->rcdomains[c->idx] = NULL;
+ retval = -ENOMEM;
+ goto out;
+ }
+
+ set_meter(cs, c->idx);
+ if (c->set_guar) {
+ c->set_guar(cs->meters[c->idx].ctlr_data,
+ cs->meters[c->idx].guar);
+ }
+ }
+
+ up(&cpuset_sem);
+
+ if (turning_on)
+ cpumeter_add_meter_files(cs->dentry, c);
+ else
+ cpumeter_remove_meter_files(cs->dentry, c);
+
+ up(&cs->dentry->d_inode->i_sem);
+ return retval;
+
+out:
+ up(&cpuset_sem);
+ up(&cs->dentry->d_inode->i_sem);
+
+ return retval;
+}
+
+static struct file_operations cpumeter_meter_file_operations = {
+ .read = cpumeter_meter_file_read,
+ .write = cpumeter_meter_file_write,
+ .llseek = generic_file_llseek,
+ .open = generic_file_open,
+};
+
+static ssize_t cpumeter_guar_file_read(struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos)
+{
+ struct cpuset *cs = __d_cs(file->f_dentry->d_parent);
+ struct cpumeter_ctlr *c = file->f_dentry->d_fsdata;
+ unsigned long val;
+
+ down(&cpuset_sem);
+ val = cs->meters[c->idx].guar;
+ up(&cpuset_sem);
+
+ return cpumeter_file_read_common(file, buf, nbytes, ppos, val);
+}
+
+static ssize_t cpumeter_guar_file_write(struct file *file,
+ const char __user *userbuf,
+ size_t nbytes,
+ loff_t *unused_ppos)
+{
+ struct cpuset *cs = __d_cs(file->f_dentry->d_parent);
+ struct cpuset *parent;
+ struct cpumeter_ctlr *c = file->f_dentry->d_fsdata;
+ struct cpumeter *m, *pm;
+ unsigned long val;
+ long diff;
+ int retval;
+
+ retval = cpumeter_file_get_written_data(userbuf, nbytes, &val);
+ if (retval)
+ return retval;
+
+ m = &cs->meters[c->idx];
+ down(&cpuset_sem);
+ if (is_removed(cs)) {
+ retval = -ENODEV;
+ goto out;
+ }
+
+ if (c->set_lim && m->lim && val > m->lim) {
+ retval = -EINVAL;
+ goto out;
+ }
+
+ /* If the meter is toplevel, the guarantee can not be changed. */
+ parent = cs->parent;
+ if (!parent || !is_metered(parent, c->idx)) {
+ retval = -EINVAL;
+ goto out;
+ }
+
+ pm = &parent->meters[c->idx];
+ diff = (long)val - (long)m->guar;
+ if (diff == 0) {
+ retval = nbytes;
+ goto out;
+ } else if (diff > (long)pm->guar) {
+ retval = -ENOSPC;
+ goto out;
+ }
+
+ retval = c->set_guar(m->ctlr_data, val);
+ if (retval == 0) {
+ retval = c->set_guar(pm->ctlr_data, pm->guar - diff);
+ if (retval < 0) {
+ c->set_guar(m->ctlr_data, m->guar);
+ goto out;
+ } else {
+ pm->guar -= diff;
+ m->guar += diff;
+ retval = nbytes;
+ }
+ }
+out:
+ up(&cpuset_sem);
+ return retval;
+}
+
+static struct file_operations cpumeter_guar_file_operations = {
+ .read = cpumeter_guar_file_read,
+ .write = cpumeter_guar_file_write,
+ .llseek = generic_file_llseek,
+ .open = generic_file_open,
+};
+
+static ssize_t cpumeter_lim_file_read(struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos)
+{
+ struct cpuset *cs = __d_cs(file->f_dentry->d_parent);
+ struct cpumeter_ctlr *c = file->f_dentry->d_fsdata;
+ unsigned long val;
+
+ down(&cpuset_sem);
+ val = cs->meters[c->idx].lim;
+ up(&cpuset_sem);
+
+ return cpumeter_file_read_common(file, buf, nbytes, ppos, val);
+}
+
+static ssize_t cpumeter_lim_file_write(struct file *file,
+ const char __user *userbuf,
+ size_t nbytes,
+ loff_t *unused_ppos)
+{
+ struct cpuset *cs = __d_cs(file->f_dentry->d_parent);
+ struct cpumeter_ctlr *c = file->f_dentry->d_fsdata;
+ struct cpumeter *m;
+ unsigned long val;
+ int retval;
+
+ retval = cpumeter_file_get_written_data(userbuf, nbytes, &val);
+ if (retval)
+ return retval;
+
+ m = &cs->meters[c->idx];
+ down(&cpuset_sem);
+ if (is_removed(cs)) {
+ retval = -ENODEV;
+ goto out;
+ }
+
+ if (val && c->set_guar && val < m->guar) {
+ retval = -EINVAL;
+ goto out;
+ }
+
+ retval = c->set_lim(m->ctlr_data, val);
+ if (retval == 0) {
+ m->lim = val;
+ retval = nbytes;
+ }
+out:
+ up(&cpuset_sem);
+ return retval;
+}
+
+static struct file_operations cpumeter_lim_file_operations = {
+ .read = cpumeter_lim_file_read,
+ .write = cpumeter_lim_file_write,
+ .llseek = generic_file_llseek,
+ .open = generic_file_open,
+};
+
+static ssize_t cpumeter_cur_file_read(struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos)
+{
+ struct cpuset *cs = __d_cs(file->f_dentry->d_parent);
+ struct cpumeter_ctlr *c = file->f_dentry->d_fsdata;
+ struct cpumeter *m;
+ unsigned long val;
+ int err;
+
+ m = &cs->meters[c->idx];
+
+ down(&cpuset_sem);
+ err = c->get_cur(m->ctlr_data, &val);
+ if (err) {
+ up(&cpuset_sem);
+ return err;
+ }
+ up(&cpuset_sem);
+
+ return cpumeter_file_read_common(file, buf, nbytes, ppos, val);
+}
+
+static struct file_operations cpumeter_cur_file_operations = {
+ .read = cpumeter_cur_file_read,
+ .llseek = generic_file_llseek,
+ .open = generic_file_open,
+};
+
+
+static int cpumeter_add_meter_flag(struct dentry *d, struct cpumeter_ctlr *c)
+{
+ char name[CPUMETER_FNAME_MAX + 1];
+ struct dentry *dentry;
+ int err;
+
+ sprintf(name, "%s%s", cpumeter_meter_prefix, c->name);
+ dentry = cpuset_get_dentry(d, name);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+
+ err = cpuset_create_file(dentry, 0644 | S_IFREG, NULL,
+ &cpumeter_meter_file_operations);
+ if (err) {
+ dput(dentry);
+ return err;
+ }
+
+ dentry->d_fsdata = c;
+ dput(dentry);
+
+ if (is_metered(__d_cs(d), c->idx)) {
+ err = cpumeter_add_meter_files(d, c);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
+static int cpumeter_add_meter_flags(struct dentry *d)
+{
+ int err = 0;
+ int i;
+
+ /* cpuset_sem needed because this function references cs->flags. */
+ down(&cpuset_sem);
+ for (i = 0; i < cpumeter_numctlrs; i++) {
+ err = cpumeter_add_meter_flag(d, cpumeter_ctlrs[i]);
+ if (err)
+ break;
+ }
+
+ up(&cpuset_sem);
+
+ return err;
+}
+
+static int cpumeter_add_meter_file(struct dentry *dir,
+ char *name,
+ int mode,
+ struct file_operations *fop,
+ struct cpumeter_ctlr *c)
+{
+ struct dentry *dentry;
+ int error;
+
+ dentry = cpuset_get_dentry(dir, name);
+ if (!IS_ERR(dentry)) {
+ error = cpuset_create_file(dentry, mode, NULL, fop);
+ if (!error)
+ dentry->d_fsdata = c;
+ dput(dentry);
+ } else
+ error = PTR_ERR(dentry);
+
+ return error;
+}
+
+static int cpumeter_add_meter_files(struct dentry *d, struct cpumeter_ctlr *c)
+{
+ char name[CPUMETER_FNAME_MAX + 1];
+ int err;
+
+ if (c->set_guar) {
+ sprintf(name, "%s%s%s", cpumeter_meter_prefix, c->name,
+ cpumeter_guar_suffix);
+ err = cpumeter_add_meter_file(d, name, 0644 | S_IFREG,
+ &cpumeter_guar_file_operations,
+ c);
+ if (err < 0)
+ return err;
+ }
+
+ if (c->set_lim) {
+ sprintf(name, "%s%s%s", cpumeter_meter_prefix, c->name,
+ cpumeter_lim_suffix);
+ err = cpumeter_add_meter_file(d, name, 0644 | S_IFREG,
+ &cpumeter_lim_file_operations,
+ c);
+ if (err < 0)
+ return err;
+ }
+
+ if (c->get_cur) {
+ sprintf(name, "%s%s%s", cpumeter_meter_prefix, c->name,
+ cpumeter_cur_suffix);
+ err = cpumeter_add_meter_file(d, name, 0444 | S_IFREG,
+ &cpumeter_cur_file_operations,
+ c);
+ if (err < 0)
+ return err;
+ }
+
+ return 0;
+}
+
+static void cpumeter_remove_meter_files(struct dentry *dentry,
+ struct cpumeter_ctlr *c)
+{
+ struct list_head *node, *n;
+ struct inode *inode;
+
+ dget(dentry);
+ spin_lock(&dcache_lock);
+ node = dentry->d_subdirs.next;
+ while (node != &dentry->d_subdirs) {
+ struct dentry *d = list_entry(node, struct dentry, d_child);
+ n = node->next;
+
+ inode = d->d_inode;
+ if (!inode)
+ goto next;
+
+ if (d->d_fsdata != c)
+ goto next;
+
+ if (inode->i_fop != &cpumeter_guar_file_operations &&
+ inode->i_fop != &cpumeter_lim_file_operations &&
+ inode->i_fop != &cpumeter_cur_file_operations)
+ goto next;
+
+ list_del_init(node);
+ d = dget_locked(d);
+ spin_unlock(&dcache_lock);
+ d_delete(d);
+ simple_unlink(dentry->d_inode, d);
+ dput(d);
+ spin_lock(&dcache_lock);
+
+ /*
+ * this might be paranoid, but we have released
+ * the dcache_lock...
+ */
+ n = dentry->d_subdirs.next;
+ next:
+ node = n;
+ }
+
+ spin_unlock(&dcache_lock);
+ dput(dentry);
+}
+
+static int validate_meters(const struct cpuset *cur,
+ const struct cpuset *trial)
+{
+ struct cpuset *parent;
+ int is_anything_metered = 0;
+ int is_changed;
+ int i;
+
+ parent = cur->parent;
+
+ /* checks for flag bits */
+ for (i = 0; i < cpumeter_numctlrs; i++) {
+ is_changed = (is_metered(trial, i) != is_metered(cur, i));
+
+ /* meter flags can not be changed if the cs has any child. */
+ if (!list_empty(&cur->children) && is_changed)
+ return -EINVAL;
+
+ /* meter flags can not be changed if the parent is metered */
+ if (parent && is_metered(parent, i) && is_changed)
+ return -EINVAL;
+
+ if (is_metered(cur, i))
+ is_anything_metered++;
+ }
+
+ /* checks for cpus & mems changes */
+ if (is_anything_metered) {
+ if (!cpus_equal(cur->cpus_allowed, trial->cpus_allowed))
+ return -EINVAL;
+
+ if (!nodes_equal(cur->mems_allowed, trial->mems_allowed))
+ return -EINVAL;
+ }
+
+ /* checks for guarantee values */
+ /* XXX not yet */
+
+ return 0;
+}
+
+static int inherit_meters(struct cpuset *cs, struct cpuset *parent)
+{
+ struct cpumeter_ctlr *c;
+ int is_anything_metered = 0;
+ int i;
+
+ /* initialize meters */
+ memset(cs->meters, 0, sizeof(cs->meters));
+
+ /* parent == NULL means the root cpuset. no need to inherit. */
+ if (!parent)
+ return 0;
+
+ /* inherit meter flags and rcdomains */
+ for (i = 0; i < cpumeter_numctlrs; i++) {
+ cs->rcdomains[i] = parent->rcdomains[i];
+ if (!is_metered(parent, i))
+ continue;
+
+ set_meter(cs, i);
+ c = cpumeter_ctlrs[i];
+ cs->meters[i].ctlr_data =
+ c->create(cs->rcdomains[i], cs);
+ if (!cs->meters[i].ctlr_data)
+ goto failed;
+
+ is_anything_metered++;
+ }
+
+ if (is_anything_metered) {
+ /* inherit cpus and mems. */
+ cs->cpus_allowed = parent->cpus_allowed;
+ cs->mems_allowed = parent->mems_allowed;
+ }
+
+ return 0;
+
+failed:
+ for (i = 0; i < cpumeter_numctlrs; i++) {
+ if (!is_metered(cs, i))
+ continue;
+
+ c = cpumeter_ctlrs[i];
+ if (cs->meters[i].ctlr_data)
+ c->destroy(cs->meters[i].ctlr_data);
+ cs->meters[i].ctlr_data = NULL;
+ clear_meter(cs, i);
+ }
+
+ return -ENOMEM;
+}
+
+static void cpumeter_destroy_meters(struct cpuset *cs)
+{
+ struct cpuset *parent = cs->parent;
+ struct cpumeter_ctlr *c;
+ int i;
+
+ for (i = 0; i < cpumeter_numctlrs; i++) {
+ c = cpumeter_ctlrs[i];
+ if (cs->meters[i].ctlr_data) {
+ if (c->set_guar)
+ c->set_guar(cs->meters[i].ctlr_data, 0);
+ c->destroy(cs->meters[i].ctlr_data);
+ cs->meters[i].ctlr_data = NULL;
+ }
+
+ if (parent && is_metered(parent, i)) {
+ /* the rcdomain is inherited from the parent. */
+ parent->meters[i].guar += cs->meters[i].guar;
+ if (c->set_guar)
+ c->set_guar(parent->meters[i].ctlr_data,
+ parent->meters[i].guar);
+ cs->rcdomains[i] = NULL;
+ continue;
+ }
+
+ if (cs->rcdomains[i]) {
+ c->destroy_rcdomain(cs->rcdomains[i]);
+ cs->rcdomains[i] = NULL;
+ }
+ }
+}
+#endif /* CONFIG_CPUMETER */
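
For reference, the cpumeter_ctlr callback interface used throughout
this file can be reconstructed from the call sites roughly as follows
(a sketch only -- the actual declaration lives in a header hunk not
shown here, so the field order and the comments are guesses):

	struct cpumeter_ctlr {
		char *name;	/* controller name, e.g. "cpu" */
		void *(*create_rcdomain)(struct cpuset *cs,
					 cpumask_t cpus, nodemask_t mems);
		void (*destroy_rcdomain)(void *rcd);
		void *(*create)(void *rcd, struct cpuset *cs);
		void (*destroy)(void *ctlr_data);
		int (*set_guar)(void *ctlr_data, unsigned long val);
		int (*set_lim)(void *ctlr_data, unsigned long val);
		int (*get_cur)(void *ctlr_data, unsigned long *valp);
		int idx;	/* slot in cs->meters[] and cs->rcdomains[] */
	};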

2005-09-26 09:34:54

by KUROSAWA Takahiro

[permalink] [raw]
Subject: [PATCH 2/3] CPUMETER: CPU resource controller

This patch adds a CPU resource controller. It enables us to control
the CPU time percentage consumed by tasks grouped by the cpu_rc
structure. It adjusts the time_slice of tasks based on feedback from
the difference between the target value and the current usage, in
order to steer the CPU usage percentage toward the target value.
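
In rough terms: when some group is hungry while this group consumes
more than its target, the timeslice scaling factor is multiplied by
guarantee/load; for example, a group measured at 40% CPU usage against
a 20% guarantee has its timeslices halved. A minimal sketch of that
step (illustrative names only; the real code, cpu_rc_recalc_tsfactor(),
works in fixed-point scale units):

	/* illustrative sketch of the feedback step */
	static unsigned int rescale(unsigned int ts_factor,
				    unsigned int guar_pct, /* target, 0..100 */
				    unsigned int load_pct) /* measured, 0..100 */
	{
		unsigned int f;

		if (load_pct == 0)	/* an idle group is never throttled */
			return ts_factor;
		f = ts_factor * guar_pct / load_pct;
		return f ? f : 1;	/* never let the factor drop to zero */
	}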

Signed-off-by: KUROSAWA Takahiro <[email protected]>

--- /dev/null
+++ to-work/include/linux/cpu_rc.h 2005-09-26 17:26:19.234918633 +0900
@@ -0,0 +1,65 @@
+#ifndef _LINUX_CPU_RC_H_
+#define _LINUX_CPU_RC_H_
+/*
+ * CPU resource controller interface
+ *
+ * Copyright 2005 FUJITSU LIMITED
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/config.h>
+#include <linux/sched.h>
+#include <linux/cpuset.h>
+
+#ifdef CONFIG_CPU_RC
+
+#ifdef __KERNEL__
+void cpu_rc_init(void);
+unsigned int cpu_rc_scale_timeslice(task_t *tsk, unsigned int slice);
+void cpu_rc_account(task_t *tsk, unsigned long now);
+void cpu_rc_collect_hunger(task_t *tsk);
+
+static inline void cpu_rc_record_activated(task_t *tsk, unsigned long now)
+{
+ tsk->last_activated = now;
+}
+
+static inline void cpu_rc_record_allocation(task_t *tsk,
+ unsigned int slice,
+ unsigned long now)
+{
+ if (slice == 0) {
+ /* minimal allocated time_slice is 1 (see sched_fork()). */
+ slice = 1;
+ }
+
+ tsk->last_slice = slice;
+ tsk->ts_alloced = now;
+}
+#endif /* __KERNEL__ */
+
+#else /* CONFIG_CPU_RC */
+
+#ifdef __KERNEL__
+static inline void cpu_rc_init(void) {}
+static inline void cpu_rc_account(task_t *tsk, unsigned long now) {}
+static inline void cpu_rc_collect_hunger(task_t *tsk) {}
+static inline void cpu_rc_record_activated(task_t *tsk, unsigned long now) {}
+static inline void cpu_rc_record_allocation(task_t *tsk,
+ unsigned int slice,
+ unsigned long now) {}
+
+static inline unsigned int cpu_rc_scale_timeslice(task_t *tsk,
+ unsigned int slice)
+{
+ return slice;
+}
+#endif /* __KERNEL__ */
+
+#endif /* CONFIG_CPU_RC */
+
+#endif /* _LINUX_CPU_RC_H_ */
+
--- from-0001/include/linux/sched.h
+++ to-work/include/linux/sched.h 2005-09-26 17:26:19.236918355 +0900
@@ -769,6 +769,11 @@ struct task_struct {
nodemask_t mems_allowed;
int cpuset_mems_generation;
#endif
+#ifdef CONFIG_CPU_RC
+ unsigned int last_slice;
+ unsigned long ts_alloced;
+ unsigned long last_activated;
+#endif
atomic_t fs_excl; /* holding fs exclusive resources */
};

--- from-0002/init/Kconfig
+++ to-work/init/Kconfig 2005-09-26 17:28:43.911747746 +0900
@@ -247,6 +247,14 @@ config CPUMETER

Say N if unsure.

+config CPU_RC
+ bool "CPU resource controller"
+ help
+ This option lets you control CPU resource usage by scaling
+ the timeslice allocated to each task.
+
+ Say N if unsure.
+
menuconfig EMBEDDED
bool "Configure standard kernel features (for small systems)"
help
--- from-0001/init/main.c
+++ to-work/init/main.c 2005-09-26 17:26:19.238918078 +0900
@@ -42,6 +42,7 @@
#include <linux/writeback.h>
#include <linux/cpu.h>
#include <linux/cpuset.h>
+#include <linux/cpu_rc.h>
#include <linux/efi.h>
#include <linux/unistd.h>
#include <linux/rmap.h>
@@ -524,7 +525,7 @@ asmlinkage void __init start_kernel(void
proc_root_init();
#endif
cpuset_init();
-
+ cpu_rc_init();
check_bugs();

acpi_early_init(); /* before LAPIC and SMP init */
--- from-0001/kernel/Makefile
+++ to-work/kernel/Makefile 2005-09-26 17:26:19.233918772 +0900
@@ -20,6 +20,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CPUSETS) += cpuset.o
+obj-$(CONFIG_CPU_RC) += cpu_rc.o
obj-$(CONFIG_IKCONFIG) += configs.o
obj-$(CONFIG_IKCONFIG_PROC) += configs.o
obj-$(CONFIG_STOP_MACHINE) += stop_machine.o
--- /dev/null
+++ to-work/kernel/cpu_rc.c 2005-09-26 17:28:09.131607286 +0900
@@ -0,0 +1,233 @@
+/*
+ * kernel/cpu_rc.c
+ *
+ * CPU resource controller by scaling time_slice of the task.
+ *
+ * Copyright 2005 FUJITSU LIMITED
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/config.h>
+#include <linux/sched.h>
+#include <linux/proc_fs.h>
+#include <linux/cpu_rc.h>
+
+/* local macros */
+#define CPU_RC_SPREAD_PERIOD (5 * HZ)
+#define CPU_RC_LOAD_SCALE 1000
+#define CPU_RC_GUAR_SCALE 100
+#define CPU_RC_TSFACTOR_MAX CPU_RC_GUAR_SCALE
+#define CPU_RC_TSFACTOR_INC 5
+#define CPU_RC_RECALC_INTERVAL HZ
+
+struct cpu_rc_domain {
+ spinlock_t lock;
+ unsigned int hungry_groups;
+ cpumask_t cpus;
+ int numcpus;
+};
+
+struct cpu_rc {
+ int guarantee;
+ int is_hungry;
+ unsigned int ts_factor;
+ unsigned long last_recalc;
+ struct cpu_rc_domain *rcd;
+ struct {
+ unsigned long timestamp;
+ unsigned int load;
+ int maybe_hungry;
+ } stat[NR_CPUS]; /* XXX need alignment */
+};
+
+static struct cpu_rc *cpu_rc_get(task_t *tsk);
+
+static inline void cpu_rc_lock(struct cpu_rc *cr)
+{
+ spin_lock(&cr->rcd->lock);
+}
+
+static inline void cpu_rc_unlock(struct cpu_rc *cr)
+{
+ spin_unlock(&cr->rcd->lock);
+}
+
+static inline int cpu_rc_is_hungry(struct cpu_rc *cr)
+{
+ return cr->is_hungry;
+}
+
+static inline void cpu_rc_set_hungry(struct cpu_rc *cr)
+{
+ if (!cr->is_hungry) {
+ cr->rcd->hungry_groups++;
+ cr->is_hungry = !cr->is_hungry;
+ }
+}
+
+static inline void cpu_rc_set_satisfied(struct cpu_rc *cr)
+{
+ if (cr->is_hungry) {
+ cr->rcd->hungry_groups--;
+ cr->is_hungry = !cr->is_hungry;
+ }
+}
+
+static inline int cpu_rc_is_anyone_hungry(struct cpu_rc *cr)
+{
+ return cr->rcd->hungry_groups > 0;
+}
+
+static inline void cpu_rc_recalc_tsfactor(struct cpu_rc *cr)
+{
+ unsigned int load;
+ int maybe_hungry;
+ int i, n;
+
+ n = 0;
+ load = 0;
+ maybe_hungry = 0;
+
+ cpu_rc_lock(cr);
+ for_each_cpu_mask(i, cr->rcd->cpus) {
+ load += cr->stat[i].load;
+ maybe_hungry += cr->stat[i].maybe_hungry;
+ cr->stat[i].maybe_hungry = 0;
+ n++;
+ }
+ load = load / n;
+
+ if (load * CPU_RC_GUAR_SCALE >= cr->guarantee * CPU_RC_LOAD_SCALE) {
+ cpu_rc_set_satisfied(cr);
+ } else if (maybe_hungry > 0) {
+ cpu_rc_set_hungry(cr);
+ } else {
+ cpu_rc_set_satisfied(cr);
+ }
+
+ if (!cpu_rc_is_anyone_hungry(cr)) {
+ /* Everyone satisfied. Extend time_slice. */
+ cr->ts_factor += CPU_RC_TSFACTOR_INC;
+ } else {
+ if (cpu_rc_is_hungry(cr)) {
+ /* Extend time_slice a little. */
+ cr->ts_factor++;
+ } else {
+ /* time_slice should be scaled. */
+ cr->ts_factor = cr->ts_factor * cr->guarantee
+ * CPU_RC_LOAD_SCALE
+ / (load * CPU_RC_GUAR_SCALE);
+ }
+ }
+
+ if (cr->ts_factor == 0) {
+ cr->ts_factor = 1;
+ } else if (cr->ts_factor > CPU_RC_TSFACTOR_MAX) {
+ cr->ts_factor = CPU_RC_TSFACTOR_MAX;
+ }
+
+ cr->last_recalc = jiffies;
+
+ cpu_rc_unlock(cr);
+}
+
+unsigned int cpu_rc_scale_timeslice(task_t *tsk, unsigned int slice)
+{
+ struct cpu_rc *cr;
+ unsigned int scaled;
+
+ cr = cpu_rc_get(tsk);
+ if (cr == NULL) {
+ return slice;
+ }
+
+ if (jiffies - cr->last_recalc > CPU_RC_RECALC_INTERVAL) {
+ cpu_rc_recalc_tsfactor(cr);
+ }
+
+ scaled = slice * cr->ts_factor / CPU_RC_TSFACTOR_MAX;
+ if (scaled == 0) {
+ scaled = 1;
+ }
+
+ return scaled;
+}
+
+void cpu_rc_account(task_t *tsk, unsigned long now)
+{
+ struct cpu_rc *cr;
+ int cpu = smp_processor_id();
+ unsigned long last;
+ unsigned int load, tsk_load;
+ unsigned long base, update;
+
+ if (tsk == idle_task(task_cpu(tsk))) {
+ return;
+ }
+
+ cr = cpu_rc_get(tsk);
+ if (cr == NULL) {
+ return;
+ }
+
+ base = now - tsk->ts_alloced;
+ if (base == 0) {
+ /* duration too small. can not collect statistics. */
+ return;
+ }
+
+ tsk_load = CPU_RC_LOAD_SCALE * (tsk->last_slice - tsk->time_slice)
+ + (CPU_RC_LOAD_SCALE - 1);
+ if (base > CPU_RC_SPREAD_PERIOD) {
+ tsk_load = CPU_RC_SPREAD_PERIOD * tsk_load / base;
+ }
+
+ last = cr->stat[cpu].timestamp;
+ update = now - last;
+ if (update > CPU_RC_SPREAD_PERIOD) {
+ /* statistics data obsolete. */
+ load = 0;
+ update = CPU_RC_SPREAD_PERIOD;
+ } else {
+ load = cr->stat[cpu].load * (CPU_RC_SPREAD_PERIOD - update);
+ }
+
+ cr->stat[cpu].timestamp = now;
+ cr->stat[cpu].load = (load + tsk_load) / CPU_RC_SPREAD_PERIOD;
+}
+
+void cpu_rc_collect_hunger(task_t *tsk)
+{
+ struct cpu_rc *cr;
+ unsigned long wait;
+ int cpu = smp_processor_id();
+
+ if (tsk == idle_task(task_cpu(tsk))) {
+ return;
+ }
+
+ if (tsk->last_activated == 0) {
+ return;
+ }
+
+ cr = cpu_rc_get(tsk);
+ if (cr == NULL) {
+ tsk->last_activated = 0;
+ return;
+ }
+
+ wait = jiffies - tsk->last_activated;
+ if (CPU_RC_GUAR_SCALE * tsk->last_slice
+ / (wait + tsk->last_slice) < cr->guarantee / cr->rcd->numcpus) {
+ cr->stat[cpu].maybe_hungry++;
+ }
+
+ tsk->last_activated = 0;
+}
+
+void cpu_rc_init(void)
+{
+}
--- from-0001/kernel/sched.c
+++ to-work/kernel/sched.c 2005-09-26 17:26:19.231919049 +0900
@@ -41,6 +41,7 @@
#include <linux/rcupdate.h>
#include <linux/cpu.h>
#include <linux/cpuset.h>
+#include <linux/cpu_rc.h>
#include <linux/percpu.h>
#include <linux/kthread.h>
#include <linux/seq_file.h>
@@ -168,10 +169,17 @@

static unsigned int task_timeslice(task_t *p)
{
+ unsigned int timeslice;
+
if (p->static_prio < NICE_TO_PRIO(0))
- return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
+ timeslice = SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
else
- return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
+ timeslice = SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
+
+ if (!TASK_INTERACTIVE(p))
+ timeslice = cpu_rc_scale_timeslice(p, timeslice);
+
+ return timeslice;
}
#define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran) \
< (long long) (sd)->cache_hot_time)
@@ -660,6 +668,7 @@ static int effective_prio(task_t *p)
*/
static inline void __activate_task(task_t *p, runqueue_t *rq)
{
+ cpu_rc_record_activated(p, jiffies);
enqueue_task(p, rq->active);
rq->nr_running++;
}
@@ -1294,6 +1303,7 @@ int fastcall wake_up_state(task_t *p, un
void fastcall sched_fork(task_t *p, int clone_flags)
{
int cpu = get_cpu();
+ unsigned long now = jiffies;

#ifdef CONFIG_SMP
cpu = sched_balance_self(cpu, SD_BALANCE_FORK);
@@ -1330,9 +1340,12 @@ void fastcall sched_fork(task_t *p, int
* The remainder of the first timeslice might be recovered by
* the parent if the child exits early enough.
*/
+ cpu_rc_account(current, now);
p->first_time_slice = 1;
current->time_slice >>= 1;
p->timestamp = sched_clock();
+ cpu_rc_record_allocation(current, current->time_slice, now);
+ cpu_rc_record_allocation(p, p->time_slice, now);
if (unlikely(!current->time_slice)) {
/*
* This case is rare, it happens when the parent has only
@@ -1390,6 +1403,7 @@ void fastcall wake_up_new_task(task_t *
p->array = current->array;
p->array->nr_active++;
rq->nr_running++;
+ cpu_rc_record_activated(p, jiffies);
}
set_need_resched();
} else
@@ -1440,16 +1454,21 @@ void fastcall sched_exit(task_t * p)
{
unsigned long flags;
runqueue_t *rq;
+ unsigned long now = jiffies;

/*
* If the child was a (relative-) CPU hog then decrease
* the sleep_avg of the parent as well.
*/
rq = task_rq_lock(p->parent, &flags);
+ cpu_rc_account(p, now);
if (p->first_time_slice) {
+ cpu_rc_account(p->parent, now);
p->parent->time_slice += p->time_slice;
if (unlikely(p->parent->time_slice > task_timeslice(p)))
p->parent->time_slice = task_timeslice(p);
+ cpu_rc_record_allocation(p->parent,
+ p->parent->time_slice, now);
}
if (p->sleep_avg < p->parent->sleep_avg)
p->parent->sleep_avg = p->parent->sleep_avg /
@@ -2487,6 +2506,7 @@ void scheduler_tick(void)
runqueue_t *rq = this_rq();
task_t *p = current;
unsigned long long now = sched_clock();
+ unsigned long jnow = jiffies;

update_cpu_clock(p, rq, now);

@@ -2521,6 +2541,9 @@ void scheduler_tick(void)
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;
set_tsk_need_resched(p);
+#ifdef CONFIG_CPU_RC
+ /* XXX need accounting even for rt_task? */
+#endif

/* put it at the end of the queue: */
requeue_task(p, rq->active);
@@ -2530,9 +2553,12 @@ void scheduler_tick(void)
if (!--p->time_slice) {
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
+ cpu_rc_account(p, jnow);
p->prio = effective_prio(p);
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;
+ cpu_rc_record_allocation(p, p->time_slice, jnow);
+ cpu_rc_record_activated(p, jnow);

if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
@@ -2891,6 +2917,7 @@ switch_tasks:
rcu_qsctr_inc(task_cpu(prev));

update_cpu_clock(prev, rq, now);
+ cpu_rc_collect_hunger(next);

prev->sleep_avg -= run_time;
if ((long)prev->sleep_avg <= 0)

2005-09-26 09:34:55

by KUROSAWA Takahiro

[permalink] [raw]
Subject: [PATCH 3/3] CPUMETER: connect the CPU resource controller to CPUMETER

This patch puts the CPU resource controller into CPUMETER framework.
With this patch, we can control the CPU resource from the cpuset
filesystem.

Signed-off-by: KUROSAWA Takahiro <[email protected]>

--- from-0003/init/Kconfig
+++ to-work/init/Kconfig 2005-09-26 17:31:40.247109876 +0900
@@ -249,6 +249,7 @@ config CPUMETER

config CPU_RC
bool "CPU resource controller"
+ depends on CPUMETER
help
This option lets you control CPU resource usage by scaling
the timeslice allocated to each task.
--- from-0003/kernel/cpu_rc.c
+++ to-work/kernel/cpu_rc.c 2005-09-22 11:51:57.000000000 +0900
@@ -228,6 +228,124 @@ void cpu_rc_collect_hunger(task_t *tsk)
tsk->last_activated = 0;
}

+/*
+ * interface functions for cpumeter
+ */
+static void *cpu_rc_create_rcdomain(struct cpuset *cs,
+ cpumask_t cpus, nodemask_t mems)
+{
+ struct cpu_rc_domain *rcd;
+
+ if (cpus_empty(cpus)) {
+ return NULL;
+ }
+
+ rcd = kmalloc(sizeof(*rcd), GFP_KERNEL);
+ if (rcd == NULL) {
+ return NULL;
+ }
+
+ rcd->cpus = cpus;
+ spin_lock_init(&rcd->lock);
+ rcd->hungry_groups = 0;
+
+ rcd->numcpus = cpus_weight(cpus);
+
+ return rcd;
+}
+
+static void cpu_rc_destroy_rcdomain(void *rcd)
+{
+ kfree(rcd);
+}
+
+static void *cpu_rc_create(void *rcd, struct cpuset *cs)
+{
+ struct cpu_rc *cr;
+
+ cr = kmalloc(sizeof(*cr), GFP_KERNEL);
+ if (cr == NULL) {
+ return NULL;
+ }
+
+ memset(cr, 0, sizeof(*cr));
+ cr->rcd = rcd;
+ cr->ts_factor = CPU_RC_TSFACTOR_MAX;
+
+ return cr;
+}
+
+static void cpu_rc_destroy(void *p)
+{
+ struct cpu_rc *cr = p;
+
+ cpu_rc_lock(cr);
+ cpu_rc_set_satisfied(cr);
+ cpu_rc_unlock(cr);
+
+ kfree(p);
+}
+
+static int cpu_rc_set_guar(void *ctldata, unsigned long val)
+{
+ struct cpu_rc *cr = ctldata;
+
+ if (cr == NULL) {
+ return -EINVAL;
+ }
+
+ cpu_rc_lock(cr);
+ cr->guarantee = val;
+ cpu_rc_unlock(cr);
+
+ return 0;
+}
+
+static int cpu_rc_get_cur(void *ctldata, unsigned long *valp)
+{
+ struct cpu_rc *cr = ctldata;
+ unsigned int load;
+ int i, n;
+
+ if (cr == NULL) {
+ return -EINVAL;
+ }
+
+ load = 0;
+ n = 0;
+
+ /* Just displaying the value, so no locking... */
+ for_each_cpu_mask(i, cr->rcd->cpus) {
+ load += cr->stat[i].load;
+ n++;
+ }
+
+ *valp = load / n * CPU_RC_GUAR_SCALE / CPU_RC_LOAD_SCALE;
+
+ return 0;
+}
+
+static struct cpumeter_ctlr cpu_rc_ctlr = {
+ .name = "cpu",
+ .create_rcdomain = cpu_rc_create_rcdomain,
+ .destroy_rcdomain = cpu_rc_destroy_rcdomain,
+ .create = cpu_rc_create,
+ .destroy = cpu_rc_destroy,
+ .set_lim = NULL,
+ .set_guar = cpu_rc_set_guar,
+ .get_cur = cpu_rc_get_cur,
+};
+
void cpu_rc_init(void)
{
+ int err;
+ err = cpumeter_register_controller(&cpu_rc_ctlr);
+ if (err) {
+ printk(KERN_ERR "cpu_rc: register to cpumeter failed\n");
+ }
+}
+
+static struct cpu_rc *cpu_rc_get(task_t *tsk)
+{
+ return cpumeter_get_controller_data(tsk->cpuset, &cpu_rc_ctlr);
}

2005-09-26 09:34:34

by KUROSAWA Takahiro

[permalink] [raw]
Subject: [PATCH 0/3] CPUMETER (Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS)

Jackson-san,

Sorry for the late reply.

I've implemented the cpumeter based on your idea, with a small
modification to the handling of meter_cpu. To make grouping the metered
cpus and nodes easier, I modified your idea so that the meter_cpu file
is also used for marking the toplevel of the metered CPUSET (CPUSET 1).
The toplevel CPUSET (CPUSET 1) is itself metered, which differs from
the subcpuset_top file of SUBCPUSETS.


+-----------------------------------+
| |
CPUSET 0 CPUSET 1
sched domain A sched domain B
cpus: 0, 1 cpus: 2, 3
cpu_exclusive=1 cpu_exclusive=1
meter_cpu=0 meter_cpu=1
meter_cpu_*
|
+----------------+----------------+
| | |
CPUSET 1a CPUSET 1b CPUSET 1c
cpus: 2, 3 cpus: 2, 3 cpus: 2, 3
cpu_exclusive=0 cpu_exclusive=0 cpu_exclusive=0
meter_cpu=1 meter_cpu=1 meter_cpu=1
meter_cpu_* meter_cpu_* meter_cpu_*
|
+------------+------------+
| |
CPUSET 2a CPUSET 2b
cpus: 2, 3 cpus: 2, 3
cpu_exclusive=0 cpu_exclusive=0
meter_cpu=1 meter_cpu=1
meter_cpu_* meter_cpu_*

Here are other rules around meters:
- If meter_cpu is 1, meter_cpu_* files appear.
- The children (CPUSET 1a, 1b, 1c) inherit CPUSET 1's value of
cpus/mems/meter_cpu/... and do not have their specific values.
- The metered CPUSETS can have their children
(this is not allowed in SUBCPUSETS).
- meter_cpu in the children of metered CPUSETS can not be modified
(can not create normal CPUSETS under metered CPUSETS).

I'll send patches right after this mail.

Comments appreciated,
KUROSAWA, Takahiro

2005-09-27 08:40:38

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

[[ Added [email protected] to the cc list ]]

Welcome Takahiro-san. I see you have been very busy.

First I will read through your patch set, and comment on little
details that I happen to notice. Then if I have the energy, I
will step back and try to see a larger view of your fine work.

First minor detail ... I think that the following snippet of code:

+ ssize_t retval;
+ size_t n;
...
+ /* Do nothing if *ppos is at the eof or beyond the eof. */
+ if (s - page <= *ppos)
+ return 0;
+
+ start = page + *ppos;
+ n = s - start;
...
+ free_page((unsigned long)page);
+ return retval;


could be slightly clearer, and avoid a leak, as something like:

ssize_t retval = 0;
size_t n; /* how much not yet read */
...
start = page + *ppos;
n = s - start;
if (n <= 0)
goto done;
...
done:
free_page((unsigned long)page);
return retval;

I find that inequalities involving expressions (such as 's - page')
cause my brain to stumble momentarily. And this avoids leaking a
memory page if n <= 0.

The above reminds me of a bug fix that you provided in the previous
patch set, for the case *ppos >= eof. I wonder if we have duplicated
code here.

I see more duplicated code in cpumeter_file_get_written_data(), where
you are reading what is, I guess, a single boolean '0' or '1', but
validating that the user-provided length is sane with this check:

+ /* Crude upper limit on largest legitimate cpulist user might write. */
+ if (nbytes > 100 + 6 * NR_CPUS)
+ return -E2BIG;

This is an appropriate check when reading (write system call) an entire
cpu list, not a single number. For a single number, you could even
consider using a local "char buf[16]" array or some such size,
instead of malloc'ing a whole page.
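
For example, something along these lines (untested, and the helper
name is made up):

	static int cpumeter_get_ulong(const char __user *userbuf,
				      size_t nbytes, unsigned long *valp)
	{
		char buf[16];	/* plenty for one decimal number */

		if (nbytes >= sizeof(buf))
			return -E2BIG;
		if (copy_from_user(buf, userbuf, nbytes))
			return -EFAULT;
		buf[nbytes] = '\0';
		*valp = simple_strtoul(buf, NULL, 0);
		return 0;
	}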

No doubt this "Crude upper limit" check was taken from the existing
cpuset_common_file_write() routine, which has to handle a
greater variety of write calls.

When I see code that is copied from another routine, I ask myself
if there is a clean way to avoid the duplication of code. Though
I will confess to being too lazy at the moment to see if there is
a good way to do that here.

+#ifdef CONFIG_CPUMETER
+config CPUMETER
+ bool "Cpumeter support"
+ depends on CPUSETS

It might not be worth having CPUMETER as a separate option on top
of CPUSETS. I guess that depends in part on how much CPUMETER
grows the compiled kernel text size, and what performance impact
it has if any. You might want to look for a way to measure both
those (text size increase and performance penalty on some appropriate
benchmark.) If CPUMETER is not too big an extra burden above and
beyond CPUSETS, then I would probably prefer avoiding the extra
ifdefs and config option, and make it one with CPUSETS.

+ s += snprintf(s, PAGE_SIZE, "%lu", val);
+ *s++ = '\n';
+ *s = '\0';

Can the above be collapsed to:

s += snprintf(s, PAGE_SIZE, "%lu\n", val);

I see that the cpu controller patch, as it must, has hooks in the
critical scheduler paths. How much does this slow down a system
if CONFIG_CPUMETER is enabled, but the system is running in the
default cpuset and cpumeter configuration, such as it would be
after a reboot, before any intentional administration to setup
particular cpusets or meters is attempted?

... hmmm ... I am ready to retire for the night. I will have
to continue my review another day.

You will need to encourage someone else, with scheduler expertise,
to review that portion of the patch. The kernel/sched.c file is
too hard for me; I stick to easier files such as kernel/cpuset.c.

I continue to be quite suspicious that perhaps there should be a
tighter relation between your work and CKRM. For one thing, I suspect
that CKRM has a cpu controller that serves essentially the same purpose
as yours. If that is so, I cannot imagine that we would ever put both
cpu controllers in the kernel. They touch on code that is changing too
rapidly, and too critical for performance.

My wild guess would be that the right answer would be to take the
CKRM cpu controller instead of yours, and connect it to cpusets in the
manner that you have done here. But I have no expertise in cpu
controllers, so am quite unfit to judge which one or the other, or
perhaps some combination of the two cpu controllers, is the best one.

I hesitate to keep asking about CKRM, as I recall how much I disliked
it when Andrew kept asking me much the same questions, about the
relation between cpusets and CKRM. Such is life; one must do one's
duty.

Looking back at your nice opening picture, I see you write:
> cpus/mems/meter_cpu/... and do not have their specific values.
> - The metered CPUSETS can have their children
> (this is not allowed in SUBCPUSETS).
> - meter_cpu in the children of metered CPUSETS can not be modified
> (can not create normal CPUSETS under metered CPUSETS).

This seems more restrictive than necessary. Indeed, it reminds
me of some of the concerns I had with the previous SUBCPUSET
proposal. I think we should only need the following:

* Some cpuset C in the hierarchy is marked cpu_exclusive (hence
its ancestors are also so marked.)
* None of C's descendents are cpu_exclusive. This will make
cpuset C define a sched domain.
* Each of the -immediate- children of C are marked meter_cpu.
* But C is not marked meter_cpu, none of the ancestors of C
are marked meter_cpu, and none of the descendents of C's
children are marked meter_cpu. Just C's children as so marked.
* C's immediate children must have the same CPU's as C. Children
of these children can have any CPU's (that are a subset of C's,
of course.)
* Each of C's immediate children gets a certain portion of the
CPU time, as specified in its meter_cpu_* files, to be shared
amongst all tasks running in that cpuset, or any descendent of
that cpuset.
* This should allow for creating normal cpusets under metered
cpusets.


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-27 09:22:52

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

On Tue, 2005-09-27 at 01:37 -0700, Paul Jackson wrote:

> You will need to encourage someone else, with scheduler expertise,
> to review that portion of the patch. The kernel/sched.c file is
> too hard for me; I stick to easier files such as kernel/cpuset.c.
>
> I continue to be quite suspicious that perhaps there should be a
> tighter relation between your work and CKRM. For one thing, I suspect
> that CKRM has a cpu controller that serves essentially the same purpose
> as yours. If that is so, I cannot imagine that we would ever put both
> cpu controllers in the kernel. They touch on code that is changing too
> rapidly, and too critical for performance.
>
> My wild guess would be that the right answer would be to take the
> CKRM cpu controller instead of yours, and connect it to cpusets in the
> manner that you have done here. But I have no expertise in cpu
> controllers, so am quite unfit to judge which one or the other, or
> perhaps some combination of the two cpu controllers, is the best one.
>

Last time I looked at the CKRM cpu controller code I found
it quite horrible, with a great deal of duplication, and
very intrusive, large, and complex.

It could have come a long way since then, but this code looks
much neater than the code I reviewed.

I guess the question of the resource controller stuff is going
to come up again sooner or later. I would hope to have just a
single CPU resource controller (presumably based on cpusets),
the simpler the better ;)

Nick

--
SUSE Labs, Novell Inc.




2005-09-27 11:39:04

by KUROSAWA Takahiro

[permalink] [raw]
Subject: Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

Jackson-san,

Thank you for the review. I'll fix the bugs that you have found.

On Tue, 27 Sep 2005 01:37:51 -0700
Paul Jackson <[email protected]> wrote:

> The above reminds me of a bug fix that you provided in the previous
> patch set, for the case *ppos >= eof. I wonder if we have duplicated
> code here.

Ah, yes, we need to fix the bug in the cpuset code introduced by me...

> I see that the cpu controller patch, as it must, has hooks in the
> critical scheduler paths. How much does this slow down a system
> if CONFIG_CPUMETER is enabled, but the system is running in the
> default cpuset and cpumeter configuration, such as it would be
> after a reboot, before any intentional administration to setup
> particular cpusets or meters is attempted?

I don't have any benchmark numbers yet, but the cpu controller code
adds only a few if statements to the scheduler and does not take any
spinlocks in the default configuration.

> Looking back at your nice opening picture, I see you write:
> > cpus/mems/meter_cpu/... and do not have their specific values.
> > - The metered CPUSETS can have their children
> > (this is not allowed in SUBCPUSETS).
> > - meter_cpu in the children of metered CPUSETS can not be modified
> > (can not create normal CPUSETS under metered CPUSETS).
>
> This seems more restrictive than necessary. Indeed, it reminds
> me of some of the concerns I had with the previous SUBCPUSET
> proposal. I think we should only need the following:

I have a few questions about your idea, to make the design
in my mind clearer.

> * Some cpuset C in the hierarchy is marked cpu_exclusive (hence
> its ancestors are also so marked.)
> * None of C's descendents are cpu_exclusive. This will make
> cpuset C define a sched domain.
> * Each of the -immediate- children of C are marked meter_cpu.

Can the cpuset C have immediate children with meter_cpu=0?
Or should it be prohibited to set meter_cpu=0 for the immediate
children of C?

If it is prohibited to set meter_cpu=0 for the immediate children
of C, cpuset_create() needs a check whether the siblings are
metered or not. If the siblings are metered, the newly created
cpuset should also be metered. And probably it is not allowed
to set meter_cpu=1 if the sibling cpusets have meter_cpu=0.
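
For instance, the sibling check might look roughly like this sketch
(cpumeter_siblings_metered() is a hypothetical name; is_metered() is
from the framework patch):

	/* sketch: is any existing child of 'parent' metered for idx? */
	static int cpumeter_siblings_metered(struct cpuset *parent, int idx)
	{
		struct cpuset *child;

		list_for_each_entry(child, &parent->children, sibling)
			if (is_metered(child, idx))
				return 1;
		return 0;
	}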

> * But C is not marked meter_cpu, none of the ancestors of C
> are marked meter_cpu, and none of the descendents of C's
> children are marked meter_cpu. Just C's children as so marked.

Good idea, but I'd like to make sure...
Is it prohibited for any descendant of C's children to set meter_cpu=1?

> * C's immediate children must have the same CPU's as C. Children
> of these children can have any CPU's (that are a subset of C's,
> of course.)
> * Each of C's immediate children gets a certain portion of the
> CPU time, as specified in its meter_cpu_* files, to be shared
> amongst all tasks running in that cpuset, or any descendent of
> that cpuset.
> * This should allow for creating normal cpusets under metered
> cpusets.


Best regards,
--
KUROSAWA, Takahiro

2005-09-27 15:49:26

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

Takahiro-san asks a perceptive question:
> If it is prohibited to set meter_cpu=0 for the immediate children
> of C, cpuset_create() needs a check whether the siblings are
> metered or not.

Very good question. It exposes an impossibility in my proposal, as
stated. The rule I had in mind was that either all the children of C
had meter_cpu set, or none of them. But since one can only mark one
cpuset at a time, this is impossible to set up if there is more than
one child already.

Allow me to try to fix my proposal.

Instead of doing the impossible and trying to mark all the children
of C as meter_cpu all at same instant in time, I should just mark the
parent cpuset C, one time. Then if C is so marked, its -children- now
have the meter_cpu_* files, which can default at that instant in time to
providing (1/N) of the cpu to each of the N children of C. I would
say that C can only be marked meter_cpu if:
* C is already marked cpu_exclusive.
* All of its children have the same 'cpus' setting as C.
* Any new child of C created after C is marked meter_cpu will
automatically start with the same 'cpus' setting as C, and
with the meter_cpu_* files.
* It is prohibited to change the 'cpus' of any cpuset whose
parent is marked meter_cpu.
* Changing the 'cpus' of a cpuset such as C that is itself
marked meter_cpu will instantly change the 'cpus' of each
of its children.
* It is prohibited to turn on the cpu_exclusive flag of a cpuset
whose parent is marked meter_cpu, or to turn off the cpu_exclusive
flag on a cpuset that is itself marked meter_cpu.

Similar rules would apply for mem_exclusive and meter_mem.

The metered children of C may have their own children in turn, which
may have cpus any subset of the cpus in C, but which cannot be marked
cpu_exclusive or meter_cpu.

Borrowing your fine art work, and modifying it slightly, this looks
like:

+-----------------------------------+
| |
CPUSET 0 CPUSET 1 (aka 'C')
sched domain A sched domain B
cpus: 0, 1 cpus: 2, 3
cpu_exclusive=1 cpu_exclusive=1
meter_cpu=0 meter_cpu=1
|
+----------------+----------------+
| | |
CPUSET 1a CPUSET 1b CPUSET 1c
cpus: 2, 3 cpus: 2, 3 cpus: 2, 3
cpu_exclusive=0 cpu_exclusive=0 cpu_exclusive=0
meter_cpu=0 meter_cpu=0 meter_cpu=0
meter_cpu_* meter_cpu_* meter_cpu_*
|
+------------+------------+
| |
CPUSET 2a CPUSET 2b
cpus: 2 cpus: 3
meter_cpu=0 meter_cpu=0
cpu_exclusive=0 cpu_exclusive=0

Note here that marking C (CPUSET 1) as meter_cpu exposes the meter_cpu_*
files in the children of C.
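
In code, I imagine the precondition for marking C would look roughly
like this (a sketch only; cpumeter_may_mark() is a name I just made up,
is_cpu_exclusive() as in kernel/cpuset.c):

	/* sketch: may cpuset 'c' be marked meter_cpu under these rules? */
	static int cpumeter_may_mark(struct cpuset *c)
	{
		struct cpuset *child;

		if (!is_cpu_exclusive(c))
			return -EINVAL;
		/* every existing child must have the same 'cpus' as C */
		list_for_each_entry(child, &c->children, sibling)
			if (!cpus_equal(child->cpus_allowed, c->cpus_allowed))
				return -EINVAL;
		return 0;
	}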

> Is it prohibited for any descendant of C's children to set meter_cpu=1?

Yes, I presume so, and made up my new rules above assuming that. It is
definitely worth an effort in my opinion to allow creating nested
ordinary (not metered) cpusets below the children of C, but I am
guessing it would be too hard to try to allow nesting of metered
cpusets below metered cpusets. If you have a mind to try that however,
I am more than willing to listen to your proposal.

The above proposal makes it more obvious than ever that I am starting
to overload the meaning of cpu_exclusive and mem_exclusive perhaps a
bit too much.

One or the other of the two *_exclusive flags should be required
preconditions for some of these special properties (sched domains,
GFP_KERNEL memory allocation confinement, oom killer confinement, cpu
metering and memory metering), but perhaps actually enabling any of
these special properties should be an additional and distinct choice.

Therefore I propose some new cpuset flags:
* 'sched_domain' to mark sched domains (now done by the cpu_exclusive
flag),
* 'kernel_memory' to mark the constraints on GFP_KERNEL allocations,
* 'oom_killer' to mark the constraints on oom killing,
* your 'meter_cpu' flag to mark a set of metered cpus, and
* your 'meter_mem' flag to mark a set of metered mems.

Each of these new flags would require the appropriate cpu_exclusive or
mem_exclusive flag on the same cpuset to already be set, but just
setting the *_exclusive flags by themselves would not be enough to get
you the special behaviour. You would also have to set the appropriate
one of these new flags.

So, for example, the condition to define a sched domain would change,
from just being the lowest level cpuset marked cpu_exclusive (or the
left over CPUs not marked exclusive), to being both that -and- having
its "sched_domain" flag marked True (or being the left over CPUs,
again).

At first writing, I like the sound of this. But then I often
think my suggestions are good, when I first write them <grin>.

Without these new flags, the interface has an odd assymmetry to it.
Just setting cpu_exclusive could get you a sched domain for instance,
but you had to have both cpu_exclusive and meter_cpu to get the
cpu metering code. The only reason for this was that Dinakar got his
sched domain patch in before you got your cpu meter patch in, which is
a poor reason if I do say so.

These extra flags have an additional benefit. They make explicit
to the user level what additional semantics are switched on, rather
than hiding them as implicit side affects of the cpu_exclusive
configuration.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-27 15:53:45

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

Nick wrote:
> this code looks much neater than the code I reviewed.

This does not surprise me. Takahiro-san writes nice code.

I will abstain from picking one cpu controller over another,
as that is not an area of my expertise. But I definitely
agree with your hope:

> to have just a single CPU resource controller

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-27 21:45:47

by Chandra Seetharaman

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

On Tue, 2005-09-27 at 19:22 +1000, Nick Piggin wrote:
>
> Last time I looked at the CKRM cpu controller code I found
> it quite horrible, with a great deal of duplication, and
> very intrusive, large, and complex.

I admit it :)... and that was the reason why we did not post that to
lkml.

> It could have come a long way since then, but this code looks

Since we were not planning to use it there isn't much change in the
code :(
> much neater than the code I reviewed.
>
> I guess the question of the resource controller stuff is going
> to come up again sooner or later. I would hope to have just a
> single CPU resource controller (presumably based on cpusets),
> the simpler the better ;)

We were planning to start on a simplified CPU controller that can
provide the functionalities CKRM is expected to provide.

As I stated in an earlier email, subcpusets looks promising for CKRM, but
we are not able to spend more time on it as of now as the team is very
busy trimming down CKRM.
>
> Nick
>
--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- [email protected] | .......you may get it.
----------------------------------------------------------------------


2005-09-28 06:21:48

by KUROSAWA Takahiro

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

On Tue, 27 Sep 2005 08:49:05 -0700
Paul Jackson <[email protected]> wrote:

> Instead of doing the impossible and trying to mark all the children
> of C as meter_cpu all at the same instant in time, I should just mark the
> parent cpuset C, one time.

I agree with your idea.
Marking the parent cpuset C instead of marking each child seems to make
things quite easier.

> Then if C is so marked, its -children- now
> have the meter_cpu_* files, which can default at that instant in time to
> providing (1/N) of the cpu to each of the N children of C.

Maybe the default value of meter_cpu_guar (guarantee) could be 0
in this design, since guarantee 0 means no guarantee. Resources
can even be allocated without guarantee.

> I would say that C can only be marked meter_cpu if:
> * C is already marked cpu_exclusive.
> * All of its children have the same 'cpus' setting as C.
> * Any new child of C created after C is marked meter_cpu will
> automatically start with the same 'cpus' setting as C, and
> with the meter_cpu_* files.
> * It is prohibited to change the 'cpus' of any cpuset whose
> parent is marked meter_cpu.
> * Changing the 'cpus' of a cpuset such as C that is itself
> marked meter_cpu will instantly change the 'cpus' of each
> of its children.
> * It is prohibited to turn on the cpu_exclusive flag of a cpuset
> whose parent is marked meter_cpu, or to turn off the cpu_exclusive
> flag on a cpuset that is itself marked meter_cpu.

These seem good to me.

> The metered children of C may have their own children in turn, which
> may have cpus any subset of the cpus in C, but which cannot be marked
> cpu_exclusive or meter_cpu.
>
> Borrowing your fine art work, and modifying it slightly, this looks
> like:
>
> +-----------------------------------+
> | |
> CPUSET 0 CPUSET 1 (aka 'C')
> sched domain A sched domain B
> cpus: 0, 1 cpus: 2, 3
> cpu_exclusive=1 cpu_exclusive=1
> meter_cpu=0 meter_cpu=1
> |
> +----------------+----------------+
> | | |
> CPUSET 1a CPUSET 1b CPUSET 1c
> cpus: 2, 3 cpus: 2, 3 cpus: 2, 3
> cpu_exclusive=0 cpu_exclusive=0 cpu_exclusive=0
> meter_cpu=0 meter_cpu=0 meter_cpu=0
> meter_cpu_* meter_cpu_* meter_cpu_*
> |
> +------------+------------+
> | |
> CPUSET 2a CPUSET 2b
> cpus: 2 cpus: 3
> meter_cpu=0 meter_cpu=0
> cpu_exclusive=0 cpu_exclusive=0
>
> Note here that marking C (CPUSET 1) as meter_cpu exposes the meter_cpu_*
> files in the children of C.

If we split the cpus {2, 3} into {2} and {3} by creating CPUSET 2a
and CPUSET 2b, the guarantee assigned to CPUSET 1a might not be
satisfied. For example, the maximum cpu resource usage of tasks
in CPUSET 2a should essentially be 50% because tasks in CPUSET 2a
can only use half of the cpus. If we set meter_cpu_guar=60%
on CPUSET 1a, the problem might occur. Maybe (or maybe not), we can
leave this problem as it is.

> > Is it prohibited for any descendant of C's children to set meter_cpu=1?
>
> Yes, I presume so, and made up my new rules above assuming that. It is
> definitely worth an effort in my opinion to allow creating nested
> ordinary (not metered) cpusets below the children of C, but I am
> guessing it would be too hard to try to allow nesting of metered
> cpusets below metered cpusets. If you have a mind to try that however,
> I am more than willing to listen to your proposal.

Probably allowing nested metered cpusets is too hard both to implement
and to use. For now, I wouldn't like to implement it.

> Therefore I propose some new cpuset flags:
> * 'sched_domain' to mark sched domains (now done by the cpu_exclusive
> flag),
> * 'kernel_memory' to mark the constraints on GFP_KERNEL allocations,
> * 'oom_killer' to mark the constraints on oom killing,
> * your 'meter_cpu' flag to mark a set of metered cpus, and
> * your 'meter_mem' flag to mark a set of metered mems.

This seems very interesting. Tasks enclosed by a certain cpuset may
want to change the behavior of oom_killer.

--
KUROSAWA, Takahiro

2005-09-28 06:35:04

by KUROSAWA Takahiro

[permalink] [raw]
Subject: Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

On Tue, 27 Sep 2005 19:22:17 +1000
Nick Piggin <[email protected]> wrote:

> It could have come a long way since then, but this code looks
> much neater than the code I reviewed.

I'm glad to hear that! But the cpu resource controller has
some problems, for example:

- The controller only controls the time_slice value and doesn't
take care of balancing between cpus (it leaves balancing to
the existing balancing code in the scheduler). So far I don't
have any good idea to solve this.

- The controller holds a spinlock once per 1 second. I don't
know whether this is acceptable or not, but the current scheduler
doesn't hold any spinlocks normally, so...

--
KUROSAWA, Takahiro

2005-09-28 06:43:48

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

Takahiro-san wrote:
> Maybe the default value of meter_cpu_guar (guarantee) could be 0
> in this design, since guarantee 0 means no guarantee.

Ah yes - that is a better default. Excellent.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-28 07:08:52

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

Takahiro-san, replying to pj:
> > |
> > CPUSET 1a
> > cpus: 2, 3
> > cpu_exclusive=0
> > meter_cpu=0
> > meter_cpu_*
> > |
> > +------------+------------+
> > | |
> > CPUSET 2a CPUSET 2b
> > cpus: 2 cpus: 3
> > meter_cpu=0 meter_cpu=0
> > cpu_exclusive=0 cpu_exclusive=0
> >
> > Note here that marking C (CPUSET 1) as meter_cpu exposes the meter_cpu_*
> > files in the children of C.
>
> If we split the cpus {2, 3} into {2} and {3} by creating CPUSET 2a
> and CPUSET 2b, the guarantee assigned to CPUSET 1a might not be
> satisfied. For example, the maximum cpu resource usage of tasks
> in CPUSET 2a should essentially be 50% because tasks in CPUSET 2a
> can only use half of the cpus.

Ah, yes, this could be difficult and not worth doing.

It might help if I stated more of what I mean, which I didn't before.

I intended that all tasks in the combination of cpusets 1a, 2a, and 2b
would collectively be allowed whatever percentage of cpu cycles the
meter_cpu_* files in cpuset 1a prescribed. I did not intend to suggest
any particular balance between these tasks in 1a, 2a and 2b would be
enforced. In particular, I did not expect for anything like a 50%
split between the tasks in 2a and 2b to be enforced. For the purposes
of your cpu controller, just treat the entire set of tasks in all
three of these cpusets as one pool, governed by one meter_cpu_*
setting, just as if all these tasks were in cpuset 1a, and as if
cpusets 2a and 2b didn't exist.

This might be easier to do - I don't know. If this is still hard,
then I guess my dream of allowing metered cpusets to have child
cpusets with a "proper subset" of CPUs (strictly fewer CPUs) is too
hard to do now. In the abstract, this seems like a serious limitation
to me. But practically speaking, it might not be a big problem.
I don't know.

If it is still hard, even this way, it may just have to wait, until
someone needs it badly enough to find a way. If we have laid a good
foundation and allowed for this possibility in our architecture,
then perhaps we will have done all we can do for now.

> Probably allowing nested metered cpusets is too hard both to implement
> and to use. For now, I wouldn't like to implement it.

A wise choice.

> This seems very interesting. Tasks enclosed by a certain cpuset may
> want to change the behavior of oom_killer.

Yes - I think so too.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-28 07:53:37

by KUROSAWA Takahiro

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

On Wed, 28 Sep 2005 00:08:39 -0700
Jackson-san wrote:
> > If we split the cpus {2, 3} into {2} and {3} by creating CPUSET 2a
> > and CPUSET 2b, the guarantee assigned to CPUSET 1a might not be
> > satisfied. For example, the maximum cpu resource usage of tasks
> > in CPUSET 2a should essentially be 50% because tasks in CPUSET 2a
> > can only use half of the cpus.
>
> Ah, yes, this could be difficult and not worth doing.
>
> It might help if I stated more of what I mean, which I didn't before.
>
> I intended that all tasks in the combination of cpusets 1a, 2a, and 2b
> would collectively be allowed whatever percentage of cpu cycles the
> meter_cpu_* files in cpuset 1a prescribed. I did not intend to suggest
> any particular balance between these tasks in 1a, 2a and 2b would be
> enforced. In particular, I did not expect for anything like a 50%
> split between the tasks in 2a and 2b to be enforced. For the purposes
> of your cpu controller, just treat the entire set of tasks in all
> three of these cpusets as one pool, governed by one meter_cpu_*
> setting, just as if all these tasks were in cpuset 1a, and as if
> cpusets 2a and 2b didn't exist.

This seems good to me.
I'd like to make sure that tasks in CPUSET 2a and CPUSET 2b actually
have the cpumask of CPUSET 1a. Is this correct?

If it's correct, only a small hack to cpuset_cpus_allowed() would be
needed to do this.
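
Roughly, I imagine something like the following sketch, where
is_meter_cpu() is a hypothetical test of the flag on the parent from
your revised proposal:

	/* sketch: tasks below a metered cpuset take the cpumask of the
	 * topmost ancestor sitting directly under the meter_cpu cpuset */
	static const struct cpuset *metered_base(const struct cpuset *cs)
	{
		const struct cpuset *q;

		for (q = cs; q->parent; q = q->parent)
			if (is_meter_cpu(q->parent))
				return q;
		return cs;	/* no metered ancestor: use cs itself */
	}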

> This might be easier to do - I don't know.

This is much easier. Thanks for clarifying the idea.

--
KUROSAWA, Takahiro

2005-09-28 09:25:59

by KUROSAWA Takahiro

[permalink] [raw]
Subject: [PATCH][BUG] fix memory leak on reading cpuset files after seeking beyond eof

On Tue, 27 Sep 2005 20:39:02 +0900
KUROSAWA Takahiro <[email protected]> wrote:

> On Tue, 27 Sep 2005 01:37:51 -0700
> Paul Jackson <[email protected]> wrote:
>
> > The above reminds me of a bug fix that you provided in the previous
> > patch set, for the case *ppos >= eof. I wonder if we have duplicated
> > code here.
> Ah, yes, we need to fix the bug in the cpuset code introduced by me...

I fixed the bug. Sorry for my previous incomplete patch.
The following patch is against linux-2.6.14-rc2.

Signed-off-by: KUROSAWA Takahiro <[email protected]>

--- from-0001/kernel/cpuset.c
+++ to-work/kernel/cpuset.c 2005-09-28 17:42:00.759401736 +0900
@@ -969,7 +969,7 @@ static ssize_t cpuset_common_file_read(s
ssize_t retval = 0;
char *s;
char *start;
- size_t n;
+ ssize_t n;

if (!(page = (char *)__get_free_page(GFP_KERNEL)))
return -ENOMEM;
@@ -999,12 +999,13 @@ static ssize_t cpuset_common_file_read(s
*s++ = '\n';
*s = '\0';

- /* Do nothing if *ppos is at the eof or beyond the eof. */
- if (s - page <= *ppos)
- return 0;
-
start = page + *ppos;
n = s - start;
+
+ /* Do nothing if *ppos is at the eof or beyond the eof. */
+ if (n <= 0)
+ goto out;
+
retval = n - copy_to_user(buf, start, min(n, nbytes));
*ppos += retval;
out:
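
To spell out the leak this fixes: when *ppos was at or beyond eof, the
old code returned 0 directly, skipping the out: label that frees the
page obtained from __get_free_page(); and with n declared size_t, a
past-eof *ppos could never make n negative. The fixed tail of the
function behaves like this condensed sketch (a simplification for
illustration, not the verbatim kernel source):

        /* inside cpuset_common_file_read(file, buf, nbytes, ppos) */
        ssize_t retval = 0;
        ssize_t n;      /* signed, so "n <= 0" catches a past-eof *ppos */
        char *start;

        if (!(page = (char *)__get_free_page(GFP_KERNEL)))
                return -ENOMEM;

        /* ... format the cpuset data into page; s points past the last byte ... */

        start = page + *ppos;
        n = s - start;          /* negative when *ppos is beyond eof */

        if (n <= 0)
                goto out;       /* the old "return 0" here leaked the page */

        if (n > nbytes)
                n = nbytes;     /* copy at most what the caller asked for */
        retval = n - copy_to_user(buf, start, n);
        *ppos += retval;
out:
        free_page((unsigned long)page); /* now reached on the eof path too */
        return retval;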

2005-09-28 10:13:39

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

Hello,

> > It could have come a long way since then, but this code looks
> > much neater than the code I reviewed.
>
> I'm glad to hear that! But the cpu resource controller has
> some problems, for example:

These are issues to be improved.

> - The controller only adjusts the time_slice value and doesn't
> take care of balancing across cpus (the controller leaves balancing
> to the existing balancing code in the scheduler). So far I don't
> have any good idea for solving this.

Is there any problem with enhancing load_balance()? If some weight
were applied to each process, it could be used to calculate the load
of each runqueue.

> - The controller takes a spinlock once per second. I don't
> know whether this is permissible or not, but the current scheduler
> doesn't normally hold any spinlocks, so...

I think each runqueue has its own spinlock, which is held quite often.


Thanks,
Hirokazu Takahashi.

2005-09-28 10:32:43

by KUROSAWA Takahiro

[permalink] [raw]
Subject: Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

On Wed, 28 Sep 2005 19:08:22 +0900 (JST)
Hirokazu Takahashi <[email protected]> wrote:

> > - The controller only adjusts the time_slice value and doesn't
> > take care of balancing across cpus (the controller leaves balancing
> > to the existing balancing code in the scheduler). So far I don't
> > have any good idea for solving this.
> Is there any problem with enhancing load_balance()? If some weight
> were applied to each process, it could be used to calculate the load
> of each runqueue.

We could probably solve this issue by applying some weight to each
task and calculating the load of the runqueues based on that weight.
But so far I cannot see how to turn your idea into an implementation.
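
If I guess at what you mean, it might be something like this very rough
sketch (every name in it is hypothetical; nothing like it is in the
posted patches):

        /*
         * Hypothetical sketch: derive a load weight for each task from
         * its cpuset's cpu guarantee, keep a per-runqueue sum of those
         * weights at enqueue/dequeue time, and have load_balance()
         * compare weighted_load rather than nr_running.
         */
        static inline unsigned int task_weight(struct task_struct *p)
        {
                return cpuset_cpu_guarantee_pct(p);     /* made-up helper */
        }

        static inline void enqueue_weight(struct runqueue *rq,
                                          struct task_struct *p)
        {
                rq->weighted_load += task_weight(p);    /* on enqueue */
        }

        static inline void dequeue_weight(struct runqueue *rq,
                                          struct task_struct *p)
        {
                rq->weighted_load -= task_weight(p);    /* on dequeue */
        }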

> > - The controller takes a spinlock once per second. I don't
> > know whether this is permissible or not, but the current scheduler
> > doesn't normally hold any spinlocks, so...
> I think each runqueue has its own spinlock, which is held quite often.

Ah, thanks for pointing out my mistake. Maybe I confused that with
something else.
Then it should be more permissible than I had thought, at least :)

--
KUROSAWA, Takahiro

2005-09-28 13:42:21

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH][BUG] fix memory leak on reading cpuset files after seeking beyond eof

Takahiro-san wrote:
> I fixed the bug. Sorry for my previous incomplete patch.
> The following patch is against linux-2.6.14-rc2.

The patch looks perfect. But it needs a good opening comment and
Subject.

The header comment of the patch needs to be just the words that should
go in the source control (git) log comment, for someone to read to
understand this patch, years from now, when they have long since
forgotten our little email thread here. Andrew Morton's and Linus
Torvalds's automatic patch handling tools just pull that comment out,
unedited, and put it into the log of changes for the kernel/cpuset.c
file, when they accept a patch.

The Subject needs to be a good name for the patch. I always use a
short phrase starting with the word "cpuset" if it is a cpuset patch.
If it is a fix, Andrew seems to enjoy having the word 'fix' as the
last word of the Subject. The tools of Andrew and Linus will take that
exact Subject line, trim off whatever is inside the '[...]', replace
spaces with hyphens '-', and use that for the name of the patch file.

Directly include Andrew or Linus in the Cc list, if you want them
to apply the patch. They might notice the patch even if you don't Cc
them, but it is less certain in that case.

Ah - one cute trick - I hand edited the pathname of the file that shows
in the first two lines of the actual patch, to show the release of Linux
"linux-2.6.14-rc2" you worked against. I hope I didn't break the patch
by hand editing it. If I were good, I would send the patch just to
myself first, and make sure it still applied. But I am lazy and
overconfident.

See also the following three documents, on how to submit patches
to the Linux kernel:

1) Documentation/SubmittingPatches, a file in the kernel source
2) Andrew Morton's "The Perfect Patch", available at:
http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt
3) Jeff Garzik's "Linux kernel patch submission format", at:
http://linux.yyz.us/patch-format.html

I will resend this patch with a suitable header comment and Subject
in a few minutes. You will easily see what I mean when you see my next
message.

The header comment on what I will resend now will be quite short,
as this is an easy patch to explain. The header comments on most of
the patches I send are much longer -- probably too long, but Andrew
seems reluctant to complain about comments that are too long. He
is much more likely to complain if they are too short.

Excellent work. Thank-you.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-28 13:43:14

by Paul Jackson

[permalink] [raw]
Subject: [PATCH] cpuset read past eof memory leak fix

Don't leak a page of memory if a user reads a cpuset file past eof.

Signed-off-by: KUROSAWA Takahiro <[email protected]>
Signed-off-by: Paul Jackson <[email protected]>

--- linux-2.6.14-rc2.orig/kernel/cpuset.c
+++ linux-2.6.14-rc2/kernel/cpuset.c 2005-09-28 17:42:00.759401736 +0900
@@ -969,7 +969,7 @@ static ssize_t cpuset_common_file_read(s
ssize_t retval = 0;
char *s;
char *start;
- size_t n;
+ ssize_t n;

if (!(page = (char *)__get_free_page(GFP_KERNEL)))
return -ENOMEM;
@@ -999,12 +999,13 @@ static ssize_t cpuset_common_file_read(s
*s++ = '\n';
*s = '\0';

- /* Do nothing if *ppos is at the eof or beyond the eof. */
- if (s - page <= *ppos)
- return 0;
-
start = page + *ppos;
n = s - start;
+
+ /* Do nothing if *ppos is at the eof or beyond the eof. */
+ if (n <= 0)
+ goto out;
+
retval = n - copy_to_user(buf, start, min(n, nbytes));
*ppos += retval;
out:


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-28 14:30:36

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

Takahiro-san wrote:
> but the cpu controller code will add some if statements

From what I read, I suspect that the more interesting question
will be how many additional cache lines and memory pages are
touched, on the most common paths through the scheduler.

However, guessing about performance is a risky business.
Benchmarks will be most helpful for a significant scheduler
patch such as this one.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-28 15:02:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] cpuset read past eof memory leak fix


Applied.

Just a non-technical note: the sign-offs make me suspect the patch (or an
earlier version of it) may have been written by Kurosawa-san. However, it
wasn't clear, so I didn't make that change, and so it now ends up being in
git with Paul as the author.

If that was wrong, please for the future keep authorship intact by having
a "From: A U Thor <[email protected]>" at the head of the description,
which will make authorship clear and allow the tools to pick that up.

Linus

On Wed, 28 Sep 2005, Paul Jackson wrote:
>
> Don't leak a page of memory if a user reads a cpuset file past eof.
>
> Signed-off-by: KUROSAWA Takahiro <[email protected]>
> Signed-off-by: Paul Jackson <[email protected]>
>
> --- linux-2.6.14-rc2.orig/kernel/cpuset.c
> +++ linux-2.6.14-rc2/kernel/cpuset.c 2005-09-28 17:42:00.759401736 +0900
> @@ -969,7 +969,7 @@ static ssize_t cpuset_common_file_read(s
...

2005-09-28 16:50:41

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

Takahiro-san wrote:
> This sounds good to me.
> I'd like to confirm that tasks in CPUSET 2a and CPUSET 2b would actually
> have the cpumask of CPUSET 1a. Is that correct?

Oh I think not. My primary motivation in pushing on this point
of the design was to allow CPUSET 2a and 2b to have a smaller
cpumask than CPUSET 1a. This is used for example to allow a job
that is running in 1a to set up two child cpusets, 2a and 2b,
to run two subtasks that are constrained to smaller portions of
the CPUs allowed to the job in 1a.

How would hacking cpuset_cpus_allowed() help here?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-28 17:54:10

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] cpuset read past eof memory leak fix

Linus wrote:
> please for the future keep authorship intact by having
> a "From: ...

You guessed right. Your non-technical note was applicable.

Andrew - why doesn't your "The perfect patch" mention this?
http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt

Linus - would you like a patch to Documentation/SubmittingPatches
that mentions this?

Hmmm ... I don't even see the "Acked-by" in these patch guides either.
Probably I should include that in SubmittingPatches as well.

Too bad that first line doesn't start "Author:" instead of "From:".
Oh well - I see Andrew already suggested that, and you declined.
(You should'a listened to him ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-28 18:03:57

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH] cpuset read past eof memory leak fix

On Wed, 28 Sep 2005, Paul Jackson wrote:

> Linus wrote:
> > please for the future keep authorship intact by having
> > a "From: ...
>
> You guessed right. Your non-technical note was applicable.
>
> Andrew - why doesn't your "The perfect patch" mention this?
> http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt
>
> Linus - would you like a patch to Documentation/SubmittingPatches
> that mentions this?
>
> Hmmm ... I don't even see the "Acked-by" in these patch guides either.
> Probably I should include that in SubmittingPatches as well.
>
> Too bad that first line doesn't start "Author:" instead of "From:".
> Oh well - I see Andrew already suggested that, and you declined.
> (You should'a listened to him ;).

Paul, did you see Linus's email on the canonical patch format?
(I can understand if you missed it. :)

http://www.ussg.iu.edu/hypermail/linux/kernel/0504.0/1865.html

--
~Randy

2005-09-28 18:05:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] cpuset read past eof memory leak fix



On Wed, 28 Sep 2005, Paul Jackson wrote:
>
> Too bad that first line doesn't start "Author:" instead of "From:".
> Oh well - I see Andrew already suggested that, and you declined.
> (You should'a listened to him ;).

The "From:" rule has been implicit in my tools for a _loong_ time, and
switching to "Author:" would just break the tools for no actual technical
gain. Not just the tools, either, since Andrew isn't the only one who
follows that From: rule - it would break "people".

So you'd have to make the tools accept _both_ "From:" and "Author:", and I
personally prefer an _unambiguously_ slightly misnamed thing over some
"either X or Y" where X is slightly misnamed but accepted because it's the
one more commonly used.

It's like the unix "creat()" system call. Sure, it would make more sense
to add the "e", but it wouldn't actually really _help_ anybody.

As to documenting the "From:" thing - yes, we probably should. It's quite
commonly used.

Linus

2005-09-28 19:05:19

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [PATCH] cpuset read past eof memory leak fix

> Paul, did you see Linus's email on the canonical patch format?

I had just found it in my archives, a few minutes earlier. Thanks!

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-29 02:53:31

by KUROSAWA Takahiro

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

On Wed, 28 Sep 2005 09:49:32 -0700
Jackson-san wrote:

> > This sounds good to me.
> > I'd like to confirm that tasks in CPUSET 2a and CPUSET 2b would actually
> > have the cpumask of CPUSET 1a. Is that correct?
>
> Oh I think not. My primary motivation in pushing on this point
> of the design was to allow CPUSET 2a and 2b to have a smaller
> cpumask than CPUSET 1a. This is used for example to allow a job
> that is running in 1a to setup two child cpusets, 2a and 2b,
> to run two subtasks that are constrained to smaller portions of
> the CPUs allowed to the job in 1a.

Maybe I still misunderstand your idea.
The guarantee assigned to CPUSET 1a might not be satisfied if
tasks are attached to CPUSET 2a only and no tasks are attached to
CPUSET 1a or CPUSET 2b. Does your idea leave that situation as it
is, on the grounds that the user set up the CPUSETs that way?

Leaving that situation as it is seems feasible enough as one option.
Another option is to prohibit CPUSET 2a/2b from having a smaller
portion of the CPUs than is assigned to CPUSET 1a.

> How would hacking cpuset_cpus_allowed() help here?

I was going to modify cpuset_cpus_allowed() so that, for tasks in
CPUSET 2a or CPUSET 2b, it returns the cpumask of CPUSET 1a; that is
how those tasks would come to have the cpumask of CPUSET 1a.
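
Roughly like this minimal sketch (ignoring the locking and online-cpu
handling of the real cpuset_cpus_allowed(); is_metered_child() is a
made-up predicate):

        cpumask_t cpuset_cpus_allowed(const struct task_struct *tsk)
        {
                struct cpuset *cs = tsk->cpuset;

                /*
                 * If cs sits below a metered cpuset such as CPUSET 1a,
                 * report the metered ancestor's cpus_allowed, so that
                 * all tasks under 1a draw from one pool of cpus.
                 */
                if (is_metered_child(cs))
                        cs = cs->parent;
                return cs->cpus_allowed;
        }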


BTW, thank you for fixing the [PATCH] mail that I had sent in the wrong format.

--
KUROSAWA, Takahiro

2005-09-29 02:59:47

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

Takahiro-san wrote:
> Maybe I still misunderstand your idea.

I will have to wait a day before responding - sorry.
My boss has me doing something more urgent.

> thank you for fixing the [PATCH]

You are very welcome.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-09-30 09:39:40

by Simon Derr

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS



On Thu, 29 Sep 2005, KUROSAWA Takahiro wrote:

> > Oh I think not. My primary motivation in pushing on this point
> > of the design was to allow CPUSET 2a and 2b to have a smaller
> > cpumask than CPUSET 1a. This is used for example to allow a job
> > that is running in 1a to set up two child cpusets, 2a and 2b,
> > to run two subtasks that are constrained to smaller portions of
> > the CPUs allowed to the job in 1a.
>
> Maybe I still misunderstand your idea.
> The guarantee assigned to CPUSET 1a might not be satisfied if
> tasks are attached to CPUSET 2a only and no tasks are attached to
> CPUSET 1a or CPUSET 2b. Does your idea leave that situation as it
> is, on the grounds that the user set up the CPUSETs that way?

Hi Takahiro-san

It seems to me that this "guarantee" can only be guaranteed if the owner
of cpuset 1a:
- runs enough tasks to use all the cpus of cpuset 1a, and
- does not tamper with the scheduler's decisions by creating other
  cpusets inside cpuset 1a, or by using sched_setaffinity().

Outside of this, he accepts getting less cpu than "guaranteed".
Sounds acceptable to me.

Simon.

2005-09-30 14:22:18

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: [PATCH 1/3] CPUMETER: add cpumeter framework to the CPUSETS

Good reply, Simon. Thank-you.

My apologies for the delay in responding, Takahiro-san.

As I think Simon was saying, we can only guarantee to make available
to an application a certain number of CPU cycles. We cannot guarantee
that the application will use them. Essentially, we can only guarantee
that no -other- application will use them.

If a job of several tasks is running on a cpuset containing multiple
CPUs, and is guaranteed a certain percentage of those CPUs and
limited to some other percentage, then you've got a set of tasks
that altogether cannot use more than so much of those CPUs (the
meter_cpu_limit) and that will always have at least a certain amount
of those CPUs free and available for its use (the meter_cpu_guarantee).

If that job uses child cpusets to force some of its tasks on one of
those CPUs, and some other of its tasks on another of those CPUs,
that makes no difference whatsoever to the guarantees and limits in
the previous paragraph. Those child cpusets should not have any
settings for m->guar or m->lim. Your code must not just look in the
current task's cpuset for guarantees and limits, because there might
not be any meter settings there. Rather it must look up the cpuset
hierarchy, to find the nearest ancestor cpuset that has meter data.
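
That upward search might look roughly like this sketch (is_metered()
is a made-up predicate, and the locking such a walk would need is
exactly the problem discussed next):

        /*
         * Walk up from a task's cpuset to the nearest ancestor that
         * carries meter settings; returns NULL if none is metered.
         */
        static struct cpuset *nearest_metered_cpuset(struct cpuset *cs)
        {
                while (cs && !is_metered(cs))
                        cs = cs->parent;
                return cs;
        }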

Indeed, this might be the best reason to NOT do what I am suggesting.

With the current simple locking of cpusets (one global cpuset_sem),
and even with the improvements suggested by Roman that I hope soon
to write up (two global locks - a semaphore and a spinlock), it might
be too expensive to do this now. I am no scheduler guru, but I am
pretty sure that we do not want to take multiple global semaphores
for each task we examine on the hot path of the scheduler.

We probably need someone with more locking expertise than I have
(so far at least) to refine the cpuset locking enough to make this
practical.

So I think I have just talked myself out of asking you to allow
a metered cpuset, such as 1a in our example, to be split into
child cpusets. Even if we could reach a shared understanding of
what that meant, it would be too slow.

So - metered cpusets may not have child cpusets, until someone
wants it enough to figure out how to make the locking efficient.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-10-02 04:20:43

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 0/3] CPUMETER (Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS)

Takahiro-san,

I spent a little more time reading the cpuset side of your cpumeter patches.

I am hopeful that some substantial restructuring of the code would
integrate it better with the existing cpuset structure, reducing the
size of new code substantially.

I noticed not one, but two, nested CONFIG options added for CPUMETER:
CONFIG_CPUMETER and CONFIG_CPU_RC. If this CPUMETER patch only
affected the cpuset code, I'd be tempted to have no additional CONFIG_*
options; however, since the effect on the scheduler might be more
interesting (I am not a scheduler guru), it might make sense to have
one new such option - say the CONFIG_CPUMETER one - that covers it all.
Two seems at least one too many.

I am surprised to see fixed-length arrays of size CPUMETER_CTLRS_MAX,
rather than having the cpumeter struct be dynamically allocated,
reference counted and locked. I hope we do not have to have a small
fixed upper limit on the number of distinct cpu controllers.

I suspect that we should have a single additional pointer to a cpu
resource controller (rc) domain structure, referenced from both
cpusets and tasks, where that structure is reference counted and
has its own lock. All tasks and all cpusets in a given cpu rc domain
would point to the same structure. This would remove any need in the
critical scheduler code to reference a task's cpuset or the parent
of that cpuset, and it would allow the cpusets in such a domain to
have child cpusets, which would inherit another reference to the same
cpumeter rc domain, as I have been hoping would be possible.

So the code now added to struct cpuset:

#ifdef CONFIG_CPUMETER
	/*
	 * rcdomains: used for the resource control domains
	 *            to keep track of the total amount of resources.
	 * meters:    used for metering resources assigned to
	 *            the cpuset.
	 */
	void *rcdomains[CPUMETER_CTLRS_MAX];
	struct cpumeter meters[CPUMETER_CTLRS_MAX];
#endif /* CONFIG_CPUMETER */

might collapse to simply:

struct cpumeter *cpumeter;

The use of the last CPUMETER_CTLRS_MAX (16) bits, starting at offset
CS_METER_OFFSET, of the typedef enum cpuset_flagbits_t for one bit per
controller instance is creative, but hopefully unnecessary. Rather,
if the struct cpumeters are dynamically allocated, then whether the
cpuset->cpumeter pointer is non-NULL will determine if that cpuset
is metered, and the pointer will reference the information about that
metered domain, shared by all cpusets in that domain, and hence by all
tasks in those cpusets. Be sure to update which cpumeter rc domain a
task is in, if that task is moved to another cpuset, by attach_task().

The cpumeter_destroy_meters() routine should not be called from
cpuset_diput(). Rather, the reference count on the meter structure
should be incremented and decremented each time a cpuset or task
attaches to or detaches from it, and it should be freed when the
count goes to zero.
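
A minimal sketch of that lifetime rule, with hypothetical names:

        struct cpumeter {
                atomic_t count;         /* one per referencing cpuset or task */
                spinlock_t lock;
                /* ... guarantee and limit settings ... */
        };

        static inline void cpumeter_get(struct cpumeter *cm)
        {
                atomic_inc(&cm->count);
        }

        static inline void cpumeter_put(struct cpumeter *cm)
        {
                if (atomic_dec_and_test(&cm->count))
                        kfree(cm);      /* last cpuset or task detached */
        }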

There are 736 lines of code added to the end of kernel/cpuset.c,
with many snippets closely resembling existing code from the rest
of kernel/cpuset.c. This is too much duplicated, parallel code,
and too many duplicate, parallel data structures. I am hoping
that this can be collapsed into the existing cpuset.c code, adding
perhaps 100 or 200 lines instead of 736 lines. But I have not
thought about it too hard - this hope might be wishful thinking.
Seeing routines such as "cpumeter_file_read_common()", alongside
the existing "cpuset_common_file_read()", worries me. Way too much
duplication of code. And I hope we don't need a whole different
set of file_operations for the guar and lim files.

This code provides a basis for other forms of metering besides just
cpu metering. I wonder if perhaps it would be better to just do
cpu metering here, and remove a bit of the generality that supports
other types of controllers. If the addition of this cpuset-based
controller were sufficiently integrated with the existing cpuset
code, and kept simple enough, then adding another controller type,
such as for memory, could just do the same sort of thing easily enough.
It would be easier to read the code if lines such as the following:

sprintf(name, "%s%s%s", cpumeter_meter_prefix, c->name,
	cpumeter_guar_suffix);

were instead:

name = "meter_cpu_guar";

You have been most gracious in your consideration of my suggestions
so far. I hope the above thoughts will benefit your work.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-10-02 07:02:19

by Paul Jackson

[permalink] [raw]
Subject: Ok to change cpuset flags for sched domains? (was [PATCH 1/3] CPUMETER ...)

Dinakar,

How much grief will it cause you if I make the following incompatible
change to the special boolean files in each cpuset directory?

I think I goofed in encouraging you to overload "cpu_exclusive"
with defining dynamic scheduler domains. I should have asked for a
separate flag to be added for that, say "sched_domain", which would
require "cpu_exclusive=1" as a precondition. Other attributes that
require cpu_exclusive or mem_exclusive are showing up, and it makes
more sense for each of them to get their own boolean, and leave the
"*_exclusive" flags to specify just the exclusive (no overlap with
sibling) attribute.

I don't have much code built on top of this yet, so it won't matter
much to me. But I worry about the effect that such an incompatible
change would have on your user code.

Here is the proposed change, as I first described it in Takahiro-san's
thread on cpumeter:

=====================================================================

The above proposal makes it more obvious than ever that I am starting
to overload the meaning of cpu_exclusive and mem_exclusive perhaps a
bit too much.

One or the other of the two *_exclusive flags should be a required
precondition for some of these special properties (sched domains,
GFP_KERNEL memory allocation confinement, oom killer confinement, cpu
metering and memory metering), but perhaps actually enabling any of
these special properties should be an additional and distinct choice.

Therefore I propose some new cpuset flags:
* 'sched_domain' to mark sched domains (now done by the cpu_exclusive
flag),
* 'kernel_memory' to mark the constraints on GFP_KERNEL allocations,
* 'oom_killer' to mark the constraints on oom killing,
* your 'meter_cpu' flag to mark a set of metered cpus, and
* your 'meter_mem' flag to mark a set of metered mems.

Each of these new flags would require the appropriate cpu_exclusive or
mem_exclusive flag on the same cpuset to already be set, but just
setting the *_exclusive flags by themselves would not be enough to get
you the special behaviour. You would also have to set the appropriate
one of these new flags.
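
For example, the check on the flag-write path might look something like
this sketch, in the spirit of the existing update_flag() code
(CS_SCHED_DOMAIN is a made-up bit; is_cpu_exclusive() already exists in
kernel/cpuset.c):

        struct cpuset trialcs = *cs;    /* cs: the cpuset being changed */

        /* ... apply the requested flag change to trialcs ... */

        if (test_bit(CS_SCHED_DOMAIN, &trialcs.flags) &&
            !is_cpu_exclusive(&trialcs))
                return -EINVAL;         /* sched_domain needs cpu_exclusive */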

So, for example, the condition to define a sched domain would change,
from just being the lowest level cpuset marked cpu_exclusive (or the
left over CPUs not marked exclusive), to being both that -and- having
its "sched_domain" flag marked True (or being the left over CPUs,
again).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-10-03 13:52:28

by Dinakar Guniguntala

[permalink] [raw]
Subject: Re: Ok to change cpuset flags for sched domains? (was [PATCH 1/3] CPUMETER ...)

I have been wanting to follow the cpumeter discussion more closely,
but am currently tied up. I hope to have more time towards the end of
this week.

I had a few queries below, though.

On Sun, Oct 02, 2005 at 12:01:59AM -0700, Paul Jackson wrote:
> Dinakar,
>
> How much grief will it cause you if I make the following incompatible
> change to the special boolean files in each cpuset directory?
>
> I think I goofed in encouraging you to overload "cpu_exclusive"
> with defining dynamic scheduler domains. I should have asked for a
> separate flag to be added for that, say "sched_domain", which would
> require "cpu_exclusive=1" as a precondition. Other attributes that
> require cpu_exclusive or mem_exclusive are showing up, and it makes
> more sense for each of them to get its own boolean, and to leave the
> "*_exclusive" flags to specify just the exclusive (no overlap with
> sibling) attribute.


One of the reasons for overloading the cpu_exclusive flag was to
ensure that the rebalance code does not try to pull tasks unnecessarily.

With the scheme that you are proposing, that is a possibility if,
for example, you turn on cpu_exclusive and meter_cpu but do not
turn on sched_domain. Is there a reason why we would want to have
exclusive cpusets not attached to sched domains at all?

I am not entirely convinced that we can compare sched_domains and
meter_cpus.

However, I am still open to it if there is a convincing reason to have
exclusive cpusets that don't form sched domains.

-Dinakar

2005-10-03 23:36:53

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] Re: Ok to change cpuset flags for sched domains? (was [PATCH 1/3] CPUMETER ...)

Dinakar, responding to pj:
> > I think I goofed in encouraging you to overload "cpu_exclusive"
> > with defining dynamic scheduler domains.
>
> I am not entirely convinced ...

You're right. It's correct the way it is.

The placement of dynamic sched domains does not cause a
user-visible change in behaviour that it would make sense for the
user to turn on for some cpusets and off for others.

Rather it is a performance optimization for scheduler scalability,
and the resulting domains must cover all the CPUs in the system.

So the placement of sched domains should not be a user option,
and should be done for all cpu_exclusive domains as it is
done now.

Good.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-10-04 02:50:03

by KUROSAWA Takahiro

[permalink] [raw]
Subject: Re: [PATCH 0/3] CPUMETER (Re: [PATCH 0/5] SUBCPUSETS: a resource control functionality using CPUSETS)

On Sat, 1 Oct 2005 21:20:26 -0700
Jackson-san wrote:

> I spent a little more reading the cpuset side of your cpumeter patches.
>
> I am hopeful that some substantial restructuring of the code would
> integrate it better with the existing cpuset structure, reducing the
> size of new code substantially.

Thanks for the suggestions.
I'll update my patch and send it out when the update is done.