2008-01-29 09:54:55

by Peter Zijlstra

Subject: scheduler scalability - cgroups, cpusets and load-balancing

Hi All,

Some of the fancy new scheduler features, such as the cgroup load
balancer (load_balance_monitor) and the real-time load balancer, are a
bit of a scalability issue. They all seem to want a rather strong
global bound to keep global fairness (which is quite understandable).

[ my own interest is currently real-time group scheduling on multiple
cpus, and that seems to require _very_ strong bounds ]

I think the current stuff would scale up to 8, maybe 16 cpus, but after
that I'd be really worried.

Now we want distributions to enable most of these features. Distros seem
to want containers, but distros also need to support 128+ cpu machines,
so how are we going to solve this?

My thoughts were to make stronger use of disjoint cpu-sets. cgroups and
cpusets are related, in that cpusets provide a property to a cgroup.
However, load_balance_monitor()'s interaction with sched domains
confuses me - it might DTRT, but I can't tell.

[ It looks to me it balances a group over the largest SD the current cpu
has access to, even though that might be larger than the SD associated
with the cpuset of that particular cgroup. ]

Also the RT load-balancer needs to become aware of these sets; I
think Paul J and Steven once talked about it, but I can't quite remember
where that ended. From my POV there should be sched-domain based balance
information, not global.

By cutting the problem into smaller pieces, and adding tunables to
weaken the global fairness, I think we can give administrators enough
freedom to make use of these features, even on the largest of machines.

[ so I'd move the load_balance_monitor() tunables into cpusets as well;
I can imagine a smaller cpuset wanting stronger fairness than a much
larger cpuset. ]

I understand it's a somewhat hand-wavey email, but I wanted to start
discussion on the issue, or have someone show me I'm wrong so I can stop
worrying :-).

Peter


2008-01-29 10:01:53

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Peter wrote:
> Also the RT load-balancer needs to become aware of these sets; I
> think Paul J and Steven once talked about it, but I can't quite remember
> where that ended

See further the thread:

http://lkml.org/lkml/2007/10/22/400

(I don't remember where it ended up either; probably nowhere.
I'm just passing on the link, before doing any reading or thinking.)

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 10:51:27

by Peter Zijlstra

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing


On Tue, 2008-01-29 at 04:01 -0600, Paul Jackson wrote:
> Peter wrote:
> > Also the RT load-balancer needs to become aware of these sets; I
> > think Paul J and Steven once talked about it, but I can't quite remember
> > where that ended
>
> See further the thread:
>
> http://lkml.org/lkml/2007/10/22/400
>
> (I don't remember where it ended up either; probably nowhere.
> I'm just passing on the link, before doing any reading or thinking.)

Thanks for the link. Yes I think your last suggestion of creating
rt-domains ( http://lkml.org/lkml/2007/10/23/419 ) is a good one.

Upon cpuset changes we could then look for the largest disjoint set and
hang the rt balance code from that.

2008-01-29 10:58:43

by Peter Zijlstra

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing


Here I go, talking to myself..

On Tue, 2008-01-29 at 10:53 +0100, Peter Zijlstra wrote:

> My thoughts were to make stronger use of disjoint cpu-sets. cgroups and
> cpusets are related, in that cpusets provide a property to a cgroup.
> However, load_balance_monitor()'s interaction with sched domains
> confuses me - it might DTRT, but I can't tell.
>
> [ It looks to me it balances a group over the largest SD the current cpu
> has access to, even though that might be larger than the SD associated
> with the cpuset of that particular cgroup. ]

Hmm, with a bit more thought I think that does indeed DTRT. Because, if
the cpu belongs to a disjoint cpuset, the highest sd (with
load-balancing enabled) would be that. Right?

[ Just a bit of a shame we have all cgroups represented on each cpu. ]

Also, might be a nice idea to split the daemon up if there are indeed
disjoint sets - currently there is only a single daemon which touches
the whole system.

2008-01-29 11:14:19

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Peter wrote:
> Thanks for the link. Yes I think your last suggestion of creating
> rt-domains ( http://lkml.org/lkml/2007/10/23/419 ) is a good one.

We now have a per-cpuset Boolean flag file called 'sched_load_balance'.

In the default case, this flag is set on, and the kernel does its
usual load balancing across all CPUs in that cpuset. This means, under
the covers, that there exists some sched domain such that all CPUs in
that cpuset are in that same sched domain. That sched domain might
contain additional CPUs from outside that cpuset as well. Indeed,
in the default vanilla configuration, that sched domain contains all
CPUs in the system.

If we turn the sched_load_balance flag off for some cpuset, we are
telling the kernel it's ok not to load balance on the CPUs in that
cpuset (unless those CPUs are in some other cpuset that needed load
balancing anyway.)

This 'sched_load_balance' flag is, thus far, "the" cpuset hook
supporting realtime. One can use it to configure a system so that
the kernel does not do normal load balancing on select CPUs, such
as those CPUs dedicated to realtime use.

It sounds like Peter is reminding us that we really have three choices
for handling a given CPU's load balancing:
1) normal kernel scheduler load balancing,
2) RT load balancing, or
3) no load balancing whatsoever.

If that's the case (if we really need choice 3) then a single Boolean
flag, such as sched_load_balance, is not sufficient to select from
the three choices, and it might make sense to add a second per-cpuset
Boolean flag, say "sched_rt_balance", default off, which, if turned on,
enables choice 2.

If that's not the case (we only need choices 1 and 2) then -logically-
we could overload the meaning of the current sched_load_balance,
to mean, if turned off, not only to stop doing normal balancing, but
to further mean that we should commence RT balancing. However bits
aren't -that- precious here, and this sounds unnecessarily confusing.

So ... would a new per-cpuset Boolean flag such as sched_rt_balance be
appropriate and sufficient to mark those cpusets whose set of CPUs
required RT balancing?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 11:30:26

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Peter wrote, in reply to Peter ;):
> > [ It looks to me it balances a group over the largest SD the current cpu
> > has access to, even though that might be larger than the SD associated
> > with the cpuset of that particular cgroup. ]
>
> Hmm, with a bit more thought I think that does indeed DTRT. Because, if
> the cpu belongs to a disjoint cpuset, the highest sd (with
> load-balancing enabled) would be that. Right?

The code that defines sched domains, kernel/sched.c partition_sched_domains(),
as called from the cpuset code in kernel/cpuset.c rebuild_sched_domains(),
does not make use of the full range of sched_domain possibilities.

In particular, it only sets up some non-overlapping set of sched domains.
Every CPU ends up in at most a single sched domain.

The original reason that one can't define overlapping sched domains via
this cpuset interface (based off the cpuset 'sched_load_balance' flag)
is that I didn't realize it was even possible to overlap sched domains
when I wrote the cpuset code defining sched domains. And then when I
later realized one could overlap sched domains, I (a) didn't see a need
to do so, and (b) couldn't see how to do so via the cpuset interface
without causing my brain to explode.

Now, back to Peter's question, being a bit pedantic, CPUs don't belong
to disjoint cpusets, except in the most minimal situation that there is
only one cpuset covering all CPUs.

Rather what happens, when you have need for some realtime CPUs, is that:
1) you turn off sched_load_balance on the top cpuset,
2) you set up your realtime cpuset as a child cpuset of the top cpuset
such that its CPUs don't overlap any of its siblings, and
3) you turn off sched_load_balance in that realtime cpuset.

At that point, sched domains are rebuilt, including providing a
sched domain that just contains the CPUs in that realtime cpuset, and
normal scheduler load balancing ceases on the CPUs in that realtime
cpuset.
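
In cpuset-filesystem terms, the three steps come down to something like
the following (a minimal sketch; the /dev/cpuset mount point follows
Documentation/cpusets.txt, and the cpu/mem numbers are illustrative):

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* write a small string to a cpuset control file */
static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, val, strlen(val));
	close(fd);
	return n < 0 ? -1 : 0;
}

int main(void)
{
	/* 1) turn off sched_load_balance on the top cpuset */
	write_str("/dev/cpuset/sched_load_balance", "0");

	/* 2) create a realtime child cpuset whose CPUs overlap no sibling */
	mkdir("/dev/cpuset/rt", 0755);
	write_str("/dev/cpuset/rt/cpus", "4-7");
	write_str("/dev/cpuset/rt/mems", "0");

	/* 3) turn off sched_load_balance in the realtime cpuset as well */
	write_str("/dev/cpuset/rt/sched_load_balance", "0");
	return 0;
}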

> [ Just a bit of a shame we have all cgroups represented on each cpu. ]

Could you restate this -- I suspect it's obvious, but I'm oblivious ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 11:32:51

by Peter Zijlstra

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing


On Tue, 2008-01-29 at 05:13 -0600, Paul Jackson wrote:
> Peter wrote:
> > Thanks for the link. Yes I think your last suggestion of creating
> > rt-domains ( http://lkml.org/lkml/2007/10/23/419 ) is a good one.
>
> We now have a per-cpuset Boolean flag file called 'sched_load_balance'.

SD_LOAD_BALANCE, right?

> In the default case, this flag is set on, and the kernel does its
> usual load balancing across all CPUs in that cpuset. This means, under
> the covers, that there exists some sched domain such that all CPUs in
> that cpuset are in that same sched domain. That sched domain might
> contain additional CPUs from outside that cpuset as well. Indeed,
> in the default vanilla configuration, that sched domain contains all
> CPUs in the system.
>
> If we turn the sched_load_balance flag off for some cpuset, we are
> telling the kernel it's ok not to load balance on the CPUs in that
> cpuset (unless those CPUs are in some other cpuset that needed load
> balancing anyway.)
>
> This 'sched_load_balance' flag is, thus far, "the" cpuset hook
> supporting realtime. One can use it to configure a system so that
> the kernel does not do normal load balancing on select CPUs, such
> as those CPUs dedicated to realtime use.

Ah, here I disagree, it is possible to do (hard) realtime scheduling
over multiple cpus, the only drawback is that it requires a very strong
load-balancer, making it unsuitable for large numbers of cpus.

( of course, having a strong rt load balancer on a large cpuset doesn't
harm, as long as there are no rt tasks to balance )

So if we have a system like so:

      __A__
     /  |  \
    B1  B2  B3
    /\
   /  \
  C1   C2

A comprises cpus 0-127, !SD_LOAD_BALANCE

B1 comprises cpus 0-63, !SD_LOAD_BALANCE
B2 comprises cpus 64-119
B3 120-127

C1 0-3
C2 5-63

We end up with 4 disjoint load-balanced sets.

I would then attach the rt balance information to: C1, C2, B2, B3.

If, for example, B1 were load-balanced, we'd only have 3 disjoint
sets left: B1, B2 and B3, and the rt balance data would be there.

> It sounds like Peter is reminding us that we really have three choices
> for handling a given CPU's load balancing:
> 1) normal kernel scheduler load balancing,
> 2) RT load balancing, or
> 3) no load balancing whatsoever.
>
> If that's the case (if we really need choice 3) then a single Boolean
> flag, such as sched_load_balance, is not sufficient to select from
> the three choices, and it might make sense to add a second per-cpuset
> Boolean flag, say "sched_rt_balance", default off, which, if turned on,
> enables choice 2.
>
> If that's not the case (we only need choices 1 and 2) then -logically-
> we could overload the meaning of the current sched_load_balance,
> to mean, if turned off, not only to stop doing normal balancing, but
> to further mean that we should commence RT balancing. However bits
> aren't -that- precious here, and this sounds unnecessarily confusing.
>
> So ... would a new per-cpuset Boolean flag such as sched_rt_balance be
> appropriate and sufficient to mark those cpusets whose set of CPUs
> required RT balancing?

So, I don't think we need that, I think we can do with the single flag,
we just need to find these disjoint sets and stick our rt-domain there.
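
Very roughly, the idea looks like this (a sketch only: rt_domain,
alloc_rt_domain() and attach_rt_domain() are made-up names here; the
doms[] array is the same partition the cpuset code already hands to
partition_sched_domains()):

struct rt_domain {			/* hypothetical */
	cpumask_t span;			/* cpus this rt balancer covers */
	/* rt overload bookkeeping would live here */
};

/* rebuilt whenever the cpuset code recomputes the balanced partitions */
static void rebuild_rt_domains(int ndoms, cpumask_t *doms)
{
	int i;

	for (i = 0; i < ndoms; i++) {
		struct rt_domain *rd = alloc_rt_domain();	/* hypothetical */

		rd->span = doms[i];
		attach_rt_domain(rd);	/* hypothetical: hook it to each rq in the span */
	}
}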

2008-01-29 11:34:53

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Paul, talking to himself:
> At that point, sched domains are rebuilt, including providing a
> sched domain that just contains the CPUs in that realtime cpuset, and
> normal scheduler load balancing ceases on the CPUs in that realtime
> cpuset.

Oops - correction - at that point sched domains are rebuilt, and
the CPUs in that realtime cpuset are not included in any sched
domain at all.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 11:53:38

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Peter wrote:
> So, I don't think we need that, I think we can do with the single flag,
> we just need to find these disjoint sets and stick our rt-domain there.

Ah - perhaps you don't need that flag - but my other cpuset users do ;).

You see, there are two very different ways that 'sched_load_balance' is
used in practice. One is the realtime usage we've been discussing.

The other way is by big batch schedulers. They may be placed in charge
of managing a few hundred CPUs on a system, and might be running a mix
of many small jobs, each covering only a few CPUs. They routinely set up
one cpuset for each job, to contain that job to the CPUs and memory
nodes assigned to it. This is actually the original motivating use for
cpusets.

As a bit of optimization, batch schedulers desire to tell the normal
kernel scheduler -not- to bother load balancing across the big set of
CPUs controlled by the batch scheduler, but only to load balance within
each of the smaller per-job cpusets. Load balancing across hundreds
of CPUs when the batch scheduler knows such efforts would be fruitless
is a waste of good CPU cycles in kernel/sched.c.

I really doubt we'd want to have such systems triggering the hard RT
scheduler on whatever CPUs were in the batch scheduler's big cpuset
that didn't happen to have an active job currently assigned to them.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 12:04:25

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Paul, responding to Peter:
> > We now have a per-cpuset Boolean flag file called 'sched_load_balance'.
>
> SD_LOAD_BALANCE, right?

No. SD_LOAD_BALANCE is some attribute of sched domains.

The 'sched_load_balance' flag is an attribute of cpusets.

The mapping of cpusets to sched domains required several pages of 'fun
to write' code, which had to go through a couple of years of fixing and
one major rewrite before it (knock on wood) worked correctly. It's not
a one-to-one relation, in other words. See my earlier messages for
further explanation of how this works.

I'm not sure what SD_LOAD_BALANCE does ... I guess from a quick
read that it just optimizes the recognition of singleton sched
domains for which load balancing would be a wasted effort.
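
(For reference, a simplified sketch of how a per-sched_domain flag like
SD_LOAD_BALANCE is consulted on the rebalance path; the function name
and elided body below are illustrative, not the actual kernel code:)

static void rebalance_cpu(int cpu)
{
	struct sched_domain *sd;

	for_each_domain(cpu, sd) {
		if (!(sd->flags & SD_LOAD_BALANCE))
			continue;	/* balancing disabled at this level */
		/* ... run load_balance() for this domain ... */
	}
}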


> > This 'sched_load_balance' flag is, thus far, "the" cpuset hook
> > supporting realtime. One can use it to configure a system so that
> > the kernel does not do normal load balancing on select CPUs, such
> > as those CPUs dedicated to realtime use.
>
> Ah, here I disagree, it is possible to do (hard) realtime scheduling
> over multiple cpus, the only drawback is that it requires a very strong
> load-balancer, making it unsuitable for large numbers of cpus.

I don't think we are disagreeing. I was speaking of "normal"
load balancing (what the mainline kernel/sched.c code normally
does). You're speaking of hard realtime load balancing.

I think we agree that these both exist, and require different
load balancing code, the latter 'very strong.'

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 12:04:48

by Peter Zijlstra

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing


On Tue, 2008-01-29 at 05:30 -0600, Paul Jackson wrote:
> Peter wrote, in reply to Peter ;):
> > > [ It looks to me it balances a group over the largest SD the current cpu
> > > has access to, even though that might be larger than the SD associated
> > > with the cpuset of that particular cgroup. ]
> >
> > Hmm, with a bit more thought I think that does indeed DTRT. Because, if
> > the cpu belongs to a disjoint cpuset, the highest sd (with
> > load-balancing enabled) would be that. Right?
>
> The code that defines sched domains, kernel/sched.c partition_sched_domains(),
> as called from the cpuset code in kernel/cpuset.c rebuild_sched_domains(),
> does not make use of the full range of sched_domain possibilities.
>
> In particular, it only sets up some non-overlapping set of sched domains.
> Every CPU ends up in at most a single sched domain.

Ah, good to know. I thought it would reflect the hierarchy of the sets
themselves.

> The original reason that one can't define overlapping sched domains via
> this cpuset interface (based off the cpuset 'sched_load_balance' flag)
> is that I didn't realize it was even possible to overlap sched domains
> when I wrote the cpuset code defining sched domains. And then when I
> later realized one could overlap sched domains, I (a) didn't see a need
> to do so, and (b) couldn't see how to do so via the cpuset interface
> without causing my brain to explode.

Good reason :-); this code needs all the reasons it can get not to
grow more complexity.

> Now, back to Peter's question, being a bit pedantic, CPUs don't belong
> to disjoint cpusets, except in the most minimal situation that there is
> only one cpuset covering all CPUs.
>
> Rather what happens, when you have need for some realtime CPUs, is that:
> 1) you turn off sched_load_balance on the top cpuset,
> 2) you setup your realtime cpuset as a child cpuset of the top cpuset
> such that its CPUs don't overlap any of its siblings, and
> 3) you turn off sched_load_balance in that realtime cpuset.

Ah, I don't think 3 is needed. Quite to the contrary, there is quite a
large body of research work covering the scheduling of (hard and soft)
realtime tasks on multiple cpus.

> At that point, sched domains are rebuilt, including providing a
> sched domain that just contains the CPUs in that realtime cpuset, and
> normal scheduler load balancing ceases on the CPUs in that realtime
> cpuset.

Right, which would also disable the realtime load-balancing we do want.
Hence my suggestion to stick the rt balance data in this sched domain.

> > [ Just a bit of a shame we have all cgroups represented on each cpu. ]
>
> Could you restate this -- I suspect it's obvious, but I'm oblivious ;).

Ah, sure. struct task_group creates cfs_rq/rt_rq entities for each cpu's
runqueue. So an iteration like for_each_leaf_{cfs,rt}_rq() will touch
all task_groups/cgroups, not only those that are actually schedulable on
that cpu.

Now, I think that could be easily solved by adding/removing
{cfs,rt}_rq->leaf_{cfs,rt}_rq_list to/from rq->leaf_{cfs,rt}_rq_list on
enqueue of the first/dequeue of the last entity of its tg on that rq.
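
Something like this, reusing the existing leaf_cfs_rq_list naming (a
minimal sketch; the exact call sites are an assumption, and the rt_rq
side would mirror it):

/* called after the first entity of this task_group is enqueued on rq */
static void add_leaf_cfs_rq(struct rq *rq, struct cfs_rq *cfs_rq)
{
	if (cfs_rq->nr_running == 1)
		list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
			     &rq->leaf_cfs_rq_list);
}

/* called after the last entity of this task_group is dequeued from rq */
static void del_leaf_cfs_rq(struct rq *rq, struct cfs_rq *cfs_rq)
{
	if (!cfs_rq->nr_running)
		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
}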


2008-01-29 12:12:26

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Peter, replying to Paul:
> > 3) you turn off sched_load_balance in that realtime cpuset.
>
> Ah, I don't think 3 is needed. Quite to the contrary, there is quite a
> large body of research work covering the scheduling of (hard and soft)
> realtime tasks on multiple cpus.

Well, the way it's coded now, the user space code needs to do (3),
because that's the only way they get the system to have anything
other than one big fat sched domain covering all the CPUs in
the system.

Actually ... I need a picture of a bunny with a pancake hat here,
as I have no idea what you just said ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 12:19:00

by Srivatsa Vaddagiri

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

On Tue, Jan 29, 2008 at 11:57:22AM +0100, Peter Zijlstra wrote:
> On Tue, 2008-01-29 at 10:53 +0100, Peter Zijlstra wrote:
>
> > My thoughts were to make stronger use of disjoint cpu-sets. cgroups and
> > cpusets are related, in that cpusets provide a property to a cgroup.
> > However, load_balance_monitor()'s interaction with sched domains
> > confuses me - it might DTRT, but I can't tell.
> >
> > [ It looks to me it balances a group over the largest SD the current cpu
> > has access to, even though that might be larger than the SD associated
> > with the cpuset of that particular cgroup. ]
>
> Hmm, with a bit more thought I think that does indeed DTRT. Because, if
> the cpu belongs to a disjoint cpuset, the highest sd (with
> load-balancing enabled) would be that. Right?

Hi Peter,
Yes, I had this in mind when I wrote the load_balance_monitor()
function - to only balance across cpus that form a disjoint cpuset in the
system.

> [ Just a bit of a shame we have all cgroups represented on each cpu. ]

After reading your explanation in the other mail about what you mean here,
I agree. Your suggestion to remove/add cfs_rq from/to the leaf_cfs_rq_list
upon dequeue_of_last_task/enqueue_of_first_task AND

> Also, might be a nice idea to split the daemon up if there are indeed
> disjoint sets - currently there is only a single daemon which touches
> the whole system.

the above suggestions seem like good ideas. I can also look at reducing
the frequency at which the thread runs ..

--
Regards,
vatsa

2008-01-29 12:21:19

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

vatsa wrote to Peter:
> After reading your explanation in the other mail about what you mean here,
> I agree.

Ah good - glad someone understood that.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 12:21:59

by Peter Zijlstra

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing


On Tue, 2008-01-29 at 05:53 -0600, Paul Jackson wrote:
> Peter wrote;
> > So, I don't think we need that, I think we can do with the single flag,
> > we just need to find these disjoint sets and stick our rt-domain there.
>
> Ah - perhaps you don't need that flag - but my other cpuset users do ;).
>
> You see, there are two very different ways that 'sched_load_balance' is
> used in practice.
>
> The other way is by big batch schedulers. They may be placed in charge
> of managing a few hundred CPUs on a system, and might be running a mix
> of many small jobs each covering only a few CPUs. They routinely setup
> one cpuset for each job, to contain that job to the CPUs and memory
> nodes assigned to it. This is actually the original motivating use for
> cpusets.
>
> As a bit of optimization, batch schedulers desire to tell the normal
> kernel scheduler -not- to bother load balancing across the big set of
> CPUs controlled by the batch scheduler, but only to load balance within
> each of the smaller per-job cpusets. Load balancing across hundreds
> of CPUs when the batch scheduler knows such efforts would be fruitless
> is a waste of good CPU cycles in kernel/sched.c.
>
> I really doubt we'd want to have such systems triggering the hard RT
> scheduler on whatever CPUs were in the batch scheduler's big cpuset
> that didn't happen to have an active job currently assigned to them.

My turn to be confused..

If SD_LOAD_BALANCE is only set on the smaller, per-job, sets, how will
the RT balancer trigger on the large set?

2008-01-29 12:36:56

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Peter, responding to Paul:
> > I really doubt we'd want to have such systems triggering the hard RT
> > scheduler on whatever CPUs were in the batch scheduler's big cpuset
> > that didn't happen to have an active job currently assigned to them.
>
> My turn to be confused..
>
> If SD_LOAD_BALANCE is only set on the smaller, per-job, sets, how will
> the RT balancer trigger on the large set?

What 'sched_load_balance' does now is help you set up a -partial-
covering of non-overlapping sched domains. In the batch scheduler
example, those CPUs that were:
1) being managed by the batch scheduler, but
2) were not assigned to any active job at the moment
would -not- be in any sched domain.

It's not a question of the SD_LOAD_BALANCE flag. It's a question
of whether a given CPU is even included in any sched domain.
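
For illustration (the masks below are made up): if a batch scheduler
owns cpus 0-63 and currently runs jobs on cpus 0-31 and 32-47, the
cpuset code hands the scheduler only those two masks, and cpus 48-63,
appearing in neither, land in no sched domain and are never balanced:

static void setup_job_domains(void)
{
	static cpumask_t doms[2];
	int i;

	cpus_clear(doms[0]);
	cpus_clear(doms[1]);
	for (i = 0; i <= 31; i++)
		cpu_set(i, doms[0]);	/* job A */
	for (i = 32; i <= 47; i++)
		cpu_set(i, doms[1]);	/* job B */

	/* cpus 48-63 are in neither mask: no sched domain for them */
	partition_sched_domains(2, doms);
}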

If we did as you are suggesting (if I understand) then instead of
leaving these CPUs out of any sched domain, we'd set up a new
kind of sched domain for these CPUs, marked for hard real time load
balancing, rather than the somewhat more scalable, but softer, normal
load balancing.

We want no load balancing on those CPUs, not realtime load balancing.
Indeed, I suspect, we especially do not want realtime load balancing
on those CPUs as that kind of load balancing is (I'm suspecting) more
expensive and less scalable than normal load balancing.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 12:45:07

by Peter Zijlstra

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing


On Tue, 2008-01-29 at 06:03 -0600, Paul Jackson wrote:
> Paul, responding to Peter:
> > > We now have a per-cpuset Boolean flag file called 'sched_load_balance'.
> >
> > SD_LOAD_BALANCE, right?
>
> No. SD_LOAD_BALANCE is some attribute of sched domains.
>
> The 'sched_load_balance' flag is an attribute of cpusets.
>
> The mapping of cpusets to sched domains required several pages of 'fun
> to write' code, which had to go through a couple of years of fixing and
> one major rewrite before it (knock on wood) worked correctly. It's not
> a one-to-one relation, in other words. See my earlier messages for
> further explanation of how this works.

Ok, I'll take a stab at understanding that code. It seems to me
a lot of confusion could be solved by getting a more level playing
field :-)

> > > This 'sched_load_balance' flag is, thus far, "the" cpuset hook
> > > supporting realtime. One can use it to configure a system so that
> > > the kernel does not do normal load balancing on select CPUs, such
> > > as those CPUs dedicated to realtime use.
> >
> > Ah, here I disagree, it is possible to do (hard) realtime scheduling
> > over multiple cpus, the only drawback is that it requires a very strong
> > load-balancer, making it unsuitable for large numbers of cpus.
>
> I don't think we are disagreeing. I was speaking of "normal"
> load balancing (what the mainline kernel/sched.c code normally
> does). You're speaking of hard realtime load balancing.
>
> I think we agree that these both exist, and require different
> load balancing code, the latter 'very strong.'

Great :-)


2008-01-29 12:53:08

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

> Ok, I'll take a stab at understanding that code.

See also the section:

1.7 What is sched_load_balance ?

in Documentation/cpusets.txt.

Good luck ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 13:40:19

by Peter Zijlstra

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing


On Tue, 2008-01-29 at 06:52 -0600, Paul Jackson wrote:
> > Ok, I'll take a stab at understanding that code.
>
> See also the section:
>
> 1.7 What is sched_load_balance ?
>
> in Documentation/cpusets.txt.
>
> Good luck ;).

It seems Gregory tricked us both:

57d885fea0da0e9541d7730a9e1dcf734981a173

2008-01-29 15:57:29

by Gregory Haskins

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

>>> On Tue, Jan 29, 2008 at 6:50 AM, in message
<1201607401.28547.124.camel@lappy>, Peter Zijlstra <[email protected]>
wrote:

> On Tue, 2008-01-29 at 05:30 -0600, Paul Jackson wrote:
>> Peter wrote, in reply to Peter ;):
>> > > [ It looks to me it balances a group over the largest SD the current cpu
>> > > has access to, even though that might be larger than the SD associated
>> > > with the cpuset of that particular cgroup. ]
>> >
>> > Hmm, with a bit more thought I think that does indeed DTRT. Because, if
>> > the cpu belongs to a disjoint cpuset, the highest sd (with
>> > load-balancing enabled) would be that. Right?
>>
>> The code that defines sched domains, kernel/sched.c partition_sched_domains(),
>> as called from the cpuset code in kernel/cpuset.c rebuild_sched_domains(),
>> does not make use of the full range of sched_domain possibilities.
>>
>> In particular, it only sets up some non-overlapping set of sched domains.
>> Every CPU ends up in at most a single sched domain.
>
> Ah, good to know. I thought it would reflect the hierarchy of the sets
> themselves.
>
>> The original reason that one can't define overlapping sched domains via
>> this cpuset interface (based off the cpuset 'sched_load_balance' flag)
>> is that I didn't realize it was even possible to overlap sched domains
>> when I wrote the cpuset code defining sched domains. And then when I
>> later realized one could overlap sched domains, I (a) didn't see a need
>> to do so, and (b) couldn't see how to do so via the cpuset interface
>> without causing my brain to explode.
>
> Good reason :-), this code needs all the reasons it can grasp to not
> grow more complexity.
>
>> Now, back to Peter's question, being a bit pedantic, CPUs don't belong
>> to disjoint cpusets, except in the most minimal situation that there is
>> only one cpuset covering all CPUs.
>>
>> Rather what happens, when you have need for some realtime CPUs, is that:
>> 1) you turn off sched_load_balance on the top cpuset,
>> 2) you setup your realtime cpuset as a child cpuset of the top cpuset
>> such that its CPUs don't overlap any of its siblings, and
>> 3) you turn off sched_load_balance in that realtime cpuset.
>
> Ah, I don't think 3 is needed. Quite to the contrary, there is quite a
> large body of research work covering the scheduling of (hard and soft)
> realtime tasks on multiple cpus.

This is correct. We have the balance policy polymorphically associated with each sched_class, and the CFS load-balancer and RT "load" (really, priority) balancer can coexist together at the same time and across arbitrary #s of cores. From an RT perspective, this works great. It's a little trickier (and I don't think we have this quite right yet) for the CFS side, since that interface deals strictly in terms of load. As such, it gets a little perturbed by these "rude" RT tasks that arbitrarily preempt its tasks. :) I think Steven may have done some work in that area by playing with the associated weight of RT tasks, etc., so that the CFS balancer can more accurately account for the externally managed RT load on the system. But AFAIK, it's not in the tree yet.
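
(A much-simplified sketch of the shape being described; the real hooks
in kernel/sched.c carry different names and signatures, so treat the
balance callback below as illustrative only:)

struct rq;

struct sched_class {
	const struct sched_class *next;	/* classes form a priority-ordered list */

	/* each class brings its own balancing policy */
	void (*balance)(struct rq *this_rq, int this_cpu);
};

extern const struct sched_class rt_sched_class;	/* assumed highest class */
#define sched_class_highest (&rt_sched_class)

/* the core just walks the list: the RT class balances on priority,
 * the CFS class on load, and both coexist on the same runqueues */
static void balance_all_classes(struct rq *this_rq, int this_cpu)
{
	const struct sched_class *class;

	for (class = sched_class_highest; class; class = class->next)
		class->balance(this_rq, this_cpu);
}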


>
>> At that point, sched domains are rebuilt, including providing a
>> sched domain that just contains the CPUs in that realtime cpuset, and
>> normal scheduler load balancing ceases on the CPUs in that realtime
>> cpuset.
>
> Right, which would also disable the realtime load-balancing we do want.
> Hence my suggestion to stick the rt balance data in this sched domain.
>
>> > [ Just a bit of a shame we have all cgroups represented on each cpu. ]
>>
>> Could you restate this -- I suspect it's obvious, but I'm oblivious ;).
>
> Ah, sure. struct task_group creates cfs_rq/rt_rq entities for each cpu's
> runqueue. So an iteration like for_each_leaf_{cfs,rt}_rq() will touch
> all task_groups/cgroups, not only those that are actually schedulable on
> that cpu.
>
> Now, I think that could be easily solved by adding/removing
> {cfs,rt}_rq->leaf_{cfs,rt}_rq_list to/from rq->leaf_{cfs,rt}_rq_list on
> enqueue of the first/dequeue of the last entity of its tg on that rq.

2008-01-29 16:03:23

by Gregory Haskins

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

>>> On Tue, Jan 29, 2008 at 6:30 AM, in message
<[email protected]>, Paul Jackson <[email protected]> wrote:
> Peter wrote, in reply to Peter ;):
>> > [ It looks to me it balances a group over the largest SD the current cpu
>> > has access to, even though that might be larger than the SD associated
>> > with the cpuset of that particular cgroup. ]
>>
>> Hmm, with a bit more thought I think that does indeed DTRT. Because, if
>> the cpu belongs to a disjoint cpuset, the highest sd (with
>> load-balancing enabled) would be that. Right?
>
> The code that defines sched domains, kernel/sched.c partition_sched_domains(),
> as called from the cpuset code in kernel/cpuset.c rebuild_sched_domains(),
> does not make use of the full range of sched_domain possibilities.
>
> In particular, it only sets up some non-overlapping set of sched domains.
> Every CPU ends up in at most a single sched domain.
>
> The original reason that one can't define overlapping sched domains via
> this cpuset interface (based off the cpuset 'sched_load_balance' flag)
> is that I didn't realize it was even possible to overlap sched domains
> when I wrote the cpuset code defining sched domains. And then when I
> later realized one could overlap sched domains, I (a) didn't see a need
> to do so, and (b) couldn't see how to do so via the cpuset interface
> without causing my brain to explode.
>
> Now, back to Peter's question, being a bit pedantic, CPUs don't belong
> to disjoint cpusets, except in the most minimal situation that there is
> only one cpuset covering all CPUs.
>
> Rather what happens, when you have need for some realtime CPUs, is that:
> 1) you turn off sched_load_balance on the top cpuset,
> 2) you setup your realtime cpuset as a child cpuset of the top cpuset
> such that its CPUs don't overlap any of its siblings, and
> 3) you turn off sched_load_balance in that realtime cpuset.
>
> At that point, sched domains are rebuilt, including providing a
> sched domain that just contains the CPUs in that realtime cpuset, and
> normal scheduler load balancing ceases on the CPUs in that realtime
> cpuset.

Hi Paul,
I am a bit confused as to why you disable load-balancing in the RT cpuset. It shouldn't be strictly necessary in order for the RT scheduler to do its job (unless I am misunderstanding what you are trying to accomplish?). Do you do this because you *have* to in order to make real-time deadlines, or because it's just a further optimization?

-Greg


>
>> [ Just a bit of a shame we have all cgroups represented on each cpu. ]
>
> Could you restate this -- I suspect it's obvious, but I'm oblivious ;).


2008-01-29 16:04:17

by Gregory Haskins

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

>>> On Tue, Jan 29, 2008 at 7:12 AM, in message
<[email protected]>, Paul Jackson <[email protected]> wrote:
> Peter, replying to Paul:
>> > 3) you turn off sched_load_balance in that realtime cpuset.
>>
>> Ah, I don't think 3 is needed. Quite to the contrary, there is quite a
>> large body of research work covering the scheduling of (hard and soft)
>> realtime tasks on multiple cpus.
>
> Well, the way it's coded now, the user space code needs to do (3),
> because that's the only way they get the system to have anything
> other than one big fat sched domain covering all the CPUs in
> the system.

What about exclusive cpusets? Don't they create a new sched-domain or did I misunderstand there?

-Greg

>
> Actually ... I need a picture of a bunny with a pancake hat here,
> as I have no idea what you just said ;).



2008-01-29 16:29:24

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Gregory wrote:
> I am a bit confused as to why you disable load-balancing in the
> RT cpuset? It shouldn't be strictly necessary in order for the
> RT scheduler to do its job (unless I am misunderstanding what you
> are trying to accomplish?). Do you do this because you *have*
> to in order to make real-time deadlines, or because it's just a
> further optimization?

My primary motivation for cpusets originally, and for the
sched_load_balance flag now, was not realtime, but "soft partitioning"
of big NUMA systems, especially for batch schedulers. They sometimes
have large cpusets which are only being used to hold smaller, per-job,
cpusets. It is a waste of time (CPU cycles in the kernel sched code)
to load balance those large cpusets. Load balancing doesn't scale
easily to high CPU counts, and it's nice to avoid doing that where
not needed.

See the following lkml message for a fuller explanation:

http://lkml.org/lkml/2008/1/29/85

As a secondary motivation, I thought that disabling load balancing on
the RT cpuset was the right thing to do for RT needs, but I make no
claim to knowing much about RT.

I just now realized that you added a 'root_domain' in a patch in
late Nov and early Dec. I was on the road then, moving from
California to Texas, and not paying much attention to Linux.

A couple of questions on that patch, both involving a comment it adds
to kernel/sched.c:

/*
 * We add the notion of a root-domain which will be used to define per-domain
 * variables. Each exclusive cpuset essentially defines an island domain by
 * fully partitioning the member cpus from any other cpuset. Whenever a new
 * exclusive cpuset is created, we also create and attach a new root-domain
 * object.
 */

1) What are 'per-domain' variables?

2) The mention of 'exclusive cpuset' is no longer correct.

With the patch 'remove sched domain hooks from cpusets' cpusets
no longer defines sched domains using the cpu_exclusive flag.

With the subsequent sched_load_balance patch (see
http://lkml.org/lkml/2007/10/6/19) cpusets uses a new per-cpuset
flag 'sched_load_balance' to define sched domains.

The following revised comment might be more accurate:

/*
 * We add the notion of a root-domain which will be used to define per-domain
 * variables. Each non-overlapping sched domain defines an island domain by
 * fully partitioning the member cpus from any other cpuset. Whenever a new
 * such sched domain is created, we also create and attach a new root-domain
 * object. These non-overlapping sched domains are determined by the cpuset
 * configuration, via a call to partition_sched_domains().
 */

It sounds like you (Gregory, others) want your RT CPUs to be in a sched
domain, unlike the current way things are, where my cpuset code
carefully avoids setting up a sched domain for those CPUs. However I
still have need, in the batch scheduler case explained above, to have
some CPUs not in any sched domain.

If you require these RT sched domains to be setup differently somehow,
in some way that is visible to partition_sched_domains, then that
apparently means we need a per-cpuset flag to mark those RT cpusets.

If you just want an ordinary sched domain setup (just so long as it
contains only the intended RT CPUs, not others) then I guess we don't
technically need any more per-cpuset flags, but I'm worried, because
the API we're presenting to users for this has just gone from subtle to
bizarre. I suspect I'll want to add a flag anyway, if by doing so, I
can make the kernel-user API, via cpusets, easier to understand.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 16:33:52

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Gregory wrote:
> What about exclusive cpusets? Don't they create a
> new sched-domain or did I misunderstand there?

cpu_exclusive cpusets no longer determine sched domains.
I just said more on this in an earlier reply.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 16:48:44

by Gregory Haskins

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

>>> On Tue, Jan 29, 2008 at 11:28 AM, in message
<[email protected]>, Paul Jackson <[email protected]> wrote:
> Gregory wrote:
>> I am a bit confused as to why you disable load-balancing in the
>> RT cpuset? It shouldn't be strictly necessary in order for the
>> RT scheduler to do its job (unless I am misunderstanding what you
>> are trying to accomplish?). Do you do this because you *have*
> to in order to make real-time deadlines, or because it's just a
>> further optimization?
>
> My primary motivation for cpusets originally, and for the
> sched_load_balance flag now, was not realtime, but "soft partitioning"
> of big NUMA systems, especially for batch schedulers. They sometimes
> have large cpusets which are only being used to hold smaller, per-job,
> cpusets. It is a waste of time (CPU cycles in the kernel sched code)
> to load balance those large cpusets. Load balancing doesn't scale
> easily to high CPU counts, and it's nice to avoid doing that where
> not needed.

Understood, and that makes tons of sense.

>
> See the following lkml message for a fuller explanation:
>
> http://lkml.org/lkml/2008/1/29/85
>
> As a secondary motivation, I thought that disabling load balancing on
> the RT cpuset was the right thing to do for RT needs, but I make no
> claim to knowing much about RT.

Well, I make no claim to understand the large batch systems you work on either ;) Everything you said made a ton of sense other than the RT/load-balance thing, but I think we are on the same page now.

>
> I just now realized that you added a 'root_domain' in a patch in
> late Nov and early Dec. I was on the road then, moving from
> California to Texas, and not paying much attention to Linux.

np (though I was wondering why you had no comment before ;)

>
> A couple of questions on that patch, both involving a comment it adds
> to kernel/sched.c:
>
> /*
> * We add the notion of a root-domain which will be used to define per-domain
> * variables. Each exclusive cpuset essentially defines an island domain by
> * fully partitioning the member cpus from any other cpuset. Whenever a new
> * exclusive cpuset is created, we also create and attach a new root-domain
> * object.
> */
>
> 1) What are 'per-domain' variables?

s/per-domain/per-root-domain

>
> 2) The mention of 'exclusive cpuset' is no longer correct.
>
> With the patch 'remove sched domain hooks from cpusets' cpusets
> no longer defines sched domains using the cpu_exclusive flag.
>
> With the subsequent sched_load_balance patch (see
> http://lkml.org/lkml/2007/10/6/19) cpusets uses a new per-cpuset
> flag 'sched_load_balance' to define sched domains.

Doh! Thanks for the heads up.

>
> The following revised comment might be more accurate:
>
> /*
> * We add the notion of a root-domain which will be used to define per-domain
> * variables. Each non-overlapping sched domain defines an island domain by
> * fully partitioning the member cpus from any other cpuset. Whenever a new
> * such a sched domain is created, we also create and attach a new root-domain
> * object. These non-overlapping sched domains are determined by the cpuset
> * configuration, via a call to partition_sched_domains().
> */
>
> It sounds like you (Gregory, others) want your RT CPUs to be in a sched
> domain, unlike the current way things are, where my cpuset code
> carefully avoids setting up a sched domain for those CPUs. However I
> still have need, in the batch scheduler case explained above, to have
> some CPUs not in any sched domain.
>
> If you require these RT sched domains to be setup differently somehow,
> in some way that is visible to partition_sched_domains, then that
> apparently means we need a per-cpuset flag to mark those RT cpusets.

I think we only need a plain-vanilla partition, so no flags should be necessary.

-Greg

>
> If you just want an ordinary sched domain setup (just so long as it
> contains only the intended RT CPUs, not others) then I guess we don't
> technically need any more per-cpuset flags, but I'm worried, because
> the API we're presenting to users for this has just gone from subtle to
> bizarre. I suspect I'll want to add a flag anyway, if by doing so, I
> can make the kernel-user API, via cpusets, easier to understand.


2008-01-29 16:51:27

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Gregory wrote:
> This is correct. We have the balance policy polymorphically associated
> with each sched_class, and the CFS load-balancer and RT "load" (really,
> priority) balancer can coexist together at the same time and across
> arbitrary #s of cores

So ... we have the option of having all sched_classes coexist polymorphically.

That I didn't realize until this thread.

Now ... do we -want- to?

That is, what is the easiest kernel-user API to work with and understand?

Is it one where we essentially expose sched_class to user space, and let
them pick their sched_class, or pick none of the above (don't balance)?

Or is it one where, other than the special case my batch schedulers need
to not balance at all, we expose nothing more to user space, and provide
all sched_class load balancers to all sched_domains (other than those
not balanced at all)?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 17:28:18

by Gregory Haskins

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

>>> On Tue, Jan 29, 2008 at 11:51 AM, in message
<[email protected]>, Paul Jackson <[email protected]> wrote:
> Gregory wrote:
>> This is correct. We have the balance policy polymorphically associated
>> with each sched_class, and the CFS load-balancer and RT "load" (really,
>> priority) balancer can coexist together at the same time and across
>> arbitrary #s of cores
>
> So ... we have the option of having all sched_classes coexist
> polymorphically.
>
> That I didn't realize until this thread.

It's on a per-task basis when the task elects SCHED_FIFO/RR/BATCH/OTHER, etc. If the task is on a particular RQ, the RQ operates under the policy of that class. There are some cases where the RQ consults the policy of all classes, but they are still influenced by whether there are actual tasks running within the scope of the current cpuset (or root-domain).

>
> Now ... do we -want- to?

I think so, yes. But I will give the disclaimer that I don't fully understand your world ;)

You could certainly create a group of cpus with homogeneous policy by creating a cpuset with only tasks of a single class as members. But likewise, if you populate a cpuset with tasks from mixed classes, you have mixed balance policy affecting those cpus.


>
> That is, what is the easiest kernel-user API to work with and understand?
>
> Is it one where we essentially expose sched_class to user space, and let
> them pick their sched_class, or pick none of the above (don't balance)?

IMHO it works well the way it is: The user selects the class for a particular task using sched_setscheduler(), and they select the cpuset (or inherit it) that defines its execution scope. If that scope has balancing enabled, the policy for the member classes is in effect.
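
For example, electing the RT class from within the task itself (the
priority value here is arbitrary):

#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param param = { .sched_priority = 50 };

	/* elect SCHED_FIFO for this task; the cpuset it inherited
	 * (or was placed in) still defines where it may run */
	if (sched_setscheduler(0, SCHED_FIFO, &param) != 0)
		perror("sched_setscheduler");
	return 0;
}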

(on this topic, note that I do not know if the RT-balancer will respect the cpuset concept of "balance-enabled" anyway. That might have to be fixed)

Again, the disclaimer that I do not have expertise in your area, so perhaps this is naive.

>
> Or is it one where, other than the special case my batch schedulers need
> to not balance at all, we expose nothing more to user space, and provide
> all sched_class load balancers to all sched_domains (other than those
> not balanced at all)?


2008-01-29 19:04:22

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Gregory wrote:
> IMHO it works well the way it is: The user selects the class for a
> particular task using sched_setscheduler(), and they select the cpuset
> (or inherit it) that defines its execution scope. If that scope has
> balancing enabled, the policy for the member classes is in effect.

Ok.

For the various classes of schedulers (sched_class's), it's fine by me
if sched domains are polymorphic, supporting all classes, and it is
left to each task to self-select the scheduling class of its preference.

For the batch scheduler case, this -must- be imposable from outside
the task, by the batch scheduler that is overseeing the job, and it
must support the batch scheduler being able to disable all the
balancers in selected cpusets (selected sched_domains).

We have that now. Each of us only knew of part of the solution,
but we managed to arrive at the desired answer even so ... amazing.

The batch scheduler just has to arrange to get 'sched_load_balance'
turned off in a cpuset and all overlapping cpusets, and then the
CPUs in that cpuset will not belong to -any- sched_domain, and hence
(could you verify I'm right in this detail?) won't be balanced by any
sched_class.

I should update the documentation for sched_load_balance, changing it
from saying that you get realtime by turning off sched_load_balance in
the RT cpuset, to saying that you get realtime by (1) turning off
sched_load_balance in any overlapping cpusets, including all
encompassing parent cpusets, (2) leaving sched_load_balance on in the
RT cpuset itself, and (3) having those realtime tasks each self-select
(elect) the desired SCHED_* using sched_setscheduler().

Condition (1) above is a tad difficult to understand, but serviceable,
I guess. The combination of (1) and (2) results in a separate
sched_domain just for the CPUs in the RT cpuset.

> (on this topic, note that I do not know if the RT-balancer will
> respect the cpuset concept of "balance-enabled" anyway. That might
> have to be fixed)

Er eh ... it has no choice. If the user space code has configured a
cpuset with 'sched_load_balance' turned off in that cpuset and all
overlapping cpusets, then there will not even be a sched_domain
covering those CPUs, and hence no balancer, RT or other class, will
even see those CPUs.

Unless I really don't understand the kernel/sched.c sched_domain code
(a distinct possibility), if some CPU is not in any sched_domain, then
it won't get balanced, RT or otherwise.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 19:37:27

by Paul Jackson

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Gregory wrote:
> > 1) What are 'per-domain' variables?
>
> s/per-domain/per-root-domain

Oh dear - now I've got more questions, not fewer.

1) "variables" ... what variables?

2) Is a 'root-domain' just the RT specific portion
of a sched_domain, or is it something else?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 20:35:30

by Gregory Haskins

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

>>> On Tue, Jan 29, 2008 at 2:37 PM, in message
<[email protected]>, Paul Jackson <[email protected]> wrote:
> Gregory wrote:
>> > 1) What are 'per-domain' variables?
>>
>> s/per-domain/per-root-domain
>
> Oh dear - now I've got more questions, not fewer.
>
> 1) "variables" ... what variables?

Well, anything that is declared in "struct root_domain" in kernel/sched.c.

For instance, today in mainline we have:

struct root_domain {
	atomic_t refcount;	/* how many runqueues reference this domain */
	cpumask_t span;		/* all cpus covered by this root-domain */
	cpumask_t online;	/* cpus in span that are currently online */

	/*
	 * The "RT overload" flag: it gets set if a CPU has more than
	 * one runnable RT task.
	 */
	cpumask_t rto_mask;
	atomic_t rto_count;
};

The first three are just related to general root-domain infrastructure code. The last two in this case are related specifically to the rt-overload feature. In earlier versions of rt-balance, the rt-overload bitmap was a global variable. By moving it into the root_domain structure, there is now an instance per (um, for lack of a better, more up to date word) "exclusive" cpuset. That way, disparate cpusets will not bother each other with overload notifications, etc.
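
(Roughly how those last two fields get used, simplified from the
rt-balance code in that patch:)

/* a cpu that acquires a second runnable RT task flags itself in its
 * root_domain, so only cpus in the same disjoint set see the overload */
static void rt_set_overload(struct rq *rq)
{
	cpu_set(rq->cpu, rq->rd->rto_mask);
	wmb();			/* publish the mask before the count */
	atomic_inc(&rq->rd->rto_count);
}

static void rt_clear_overload(struct rq *rq)
{
	atomic_dec(&rq->rd->rto_count);
	cpu_clear(rq->cpu, rq->rd->rto_mask);
}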

Note that in -rt, we have more variables in this structure (RQ priority info) but that patch hasn't been pulled into sched-devel/linux-2.6 yet.

>
> 2) Is a 'root-domain' just the RT specific portion
> of a sched_domain, or is it something else?

It's meant to be general, but the only current client is the RT sched_class. Reading back through the links you guys have been sending, it's very similar in concept to the "rt-domain" stuff that you, Peter, and Steven were discussing a while back.

When I was originally putting this stuff together, I wanted to piggyback this data in the sched-domain code. But I soon realized that the sched-domain trees are per-cpu structures. What I needed was an "umbrella" structure that would allow cpus in a common cpuset to share arbitrary state data, but yet were associated with the sched-domains that the cpuset code set up. The first pass had the structures associated with the sched-domain hierarchy, but I soon realized that it was really a per-rq association so I could simplify the design. I.e., rather than have the code walk the sched-domain to find the common "root", I just hung the root directly on the rq itself.

But anyway, to answer the question: The concept is meant to be generic. For instance, if it made sense for Peter's cgroup work to sit here as well, we could just add new fields to the struct root_domain and Peter could access them via rq->rd.

I realize that it could possibly have been designed to abstract away the type of objects that the root-domain manages, but I want to keep the initial code as simple as possible. We can always complicate^h^h^h^h^hcleanup the code later ;)

Regards,
-Greg

2008-01-29 20:43:25

by Gregory Haskins

Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

>>> On Tue, Jan 29, 2008 at 2:04 PM, in message
<[email protected]>, Paul Jackson <[email protected]> wrote:
> Gregory wrote:
>> IMHO it works well the way it is: The user selects the class for a
>> particular task using sched_setscheduler(), and they select the cpuset
>> (or inherit it) that defines its execution scope. If that scope has
>> balancing enabled, the policy for the member classes is in effect.
>
> Ok.
>
> For the various classes of schedulers (sched_class's), it's fine by me
> if sched domains are polymorphic, supporting all classes, and it is
> left to each task to self-select the scheduling class of its preference.
>
> For the batch scheduler case, this -must- be imposable from outside
> the task, by the batch scheduler that is overseeing the job, and it
> must support the batch scheduler being able to disable all the
> balancers in selected cpusets (selected sched_domains).
>
> We have that now. Each of us only knew of part of the solution,
> but we managed to arrive at the desired answer even so ... amazing.
>
> The batch scheduler just has to arrange to get 'sched_load_balance'
> turned off in a cpuset and all overlapping cpusets, and then the
> CPUs in that cpuset will not belong to -any- sched_domain, and hence
> (could you verify I'm right in this detail?) won't be balanced by any
> sched_class.

I am a little fuzzy on how this would work, so I can't say for certain. :) But it seems like that is accurate.


>
> I should update the documentation for sched_load_balance, changing it
> from saying that you get realtime by turning off sched_load_balance in
> the RT cpuset, to saying that you get realtime by (1) turning off
> sched_load_balance in any overlapping cpusets, including all
> encompassing parent cpusets, (2) leaving sched_load_balance on in the
> RT cpuset itself, and (3) having those realtime tasks each self-select
> (elect) the desired SCHED_* using sched_setscheduler().
>
> Condition (1) above is a tad difficult to understand, but serviceable,
> I guess. The combination of (1) and (2) results in a separate
> sched_domain just for the CPUs in the RT cpuset.

Technically you only need (2). I normally run my 4-8 core development systems in the single default global cpuset. Customers typically do use multiple sets, but we only use the vanilla balanced variety.

>
>> (on this topic, note that I do not know if the RT-balancer will
>> respect the cpuset concept of "balance-enabled" anyway. That might
>> have to be fixed)
>
> Er eh ... it has no choice. If the user space code has configured a
> cpuset with 'sched_load_balance' turned off in that cpuset and all
> overlapping cpusets, then there will not even be a sched_domain
> covering those CPUs, and hence no balancer, RT or other class, will
> even see those CPUs.
>
> Unless I really don't understand the kernel/sched.c sched_domain code
> (a distinct possibility), if some CPU is not in any sched_domain, then
> it won't get balanced, RT or otherwise.

Heh... I can't quite wrap my head around that, but it sounds like you are correct. The only thing I was really pointing out is that the RT code doesn't necessarily look at sched-domain flags before making balancing decisions. So as long as that is not a requirement, I think we are all set.



2008-01-29 20:57:05

by Paul Jackson

[permalink] [raw]
Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Gregory wrote:
> By moving it into the root_domain structure, there is now an instance
> per (um, for lack of a better, more up to date word) "exclusive"
> cpuset. That way, disparate cpusets will not bother each other with
> overload notifications, etc.

So the root_domain structure is meant to be the portions of the
sched_domains that are shared across all CPUs in that sched_domain ?

And the word 'cpuset', occurring in the above quote twice, should
be 'sched_domain', right ? Surely these aren't cpuset's ;).

And 'exclusive cpuset' really means 'non-overlapping sched_domain' ?

Or am I still confused ?

I would like to get our concepts clear, and terms consistent. That's
important for those others who would try to understand this.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 21:02:50

by Paul Jackson

[permalink] [raw]
Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

Gregory wrote:
> > ... (1) turning off
> > sched_load_balance in any overlapping cpusets, including all
> > encompassing parent cpusets, (2) leaving sched_load_balance on in the
> > RT cpuset itself, and ...
>
> Technically you only need (2). I run my 4-8 core development systems
> in the single default global cpuset, normally.

Well, if you're running in the default cpuset, then you automatically get (1),
because sched_load_balance is turned off in all overlapping cpusets (there
aren't any overlapping cpusets!)

So, yes, you -do- need both (1) and (2). In your normal system, you
just happen to get (1) effortlessly.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-29 21:09:27

by Gregory Haskins

[permalink] [raw]
Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

>>> On Tue, Jan 29, 2008 at 3:56 PM, in message
<[email protected]>, Paul Jackson <[email protected]> wrote:
> Gregory wrote:
>> By moving it into the root_domain structure, there is now an instance
>> per (um, for lack of a better, more up to date word) "exclusive"
>> cpuset. That way, disparate cpusets will not bother each other with
>> overload notifications, etc.
>
> So the root_domain structure is meant to be the portions of the
> sched_domains that are shared across all CPUs in that sched_domain ?

That's exactly right.

>
> And the word 'cpuset', occurring in the above quote twice, should
> be 'sched_domain', right ? Surely these aren't cpuset's ;).

Yeah, I think I am taking shortcuts in the language ;). I wanted the root_domain to be an object of shared data that sits at the "root sched_domain", or, in other terms, the terminating parent in the hierarchy. And there is one of these suckers created every time a non-overlapping cpuset is created (which was called "exclusive" at the time I wrote it, I believe, but I keep forgetting what you said they are called now ;). So because the non-overlapping cpuset configuration begat the sched_domain hierarchy, I started using the terms interchangeably. Sorry for the confusion :)
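
To make the lifecycle concrete: each time the cpuset code rebuilds the disjoint sched-domain trees, the runqueues in a given span all get attached to one shared root_domain. A simplified sketch of the attach step (approximate names; locking and the full detach path omitted):

static void attach_root_sketch(struct rq *rq, struct root_domain *rd)
{
	if (rq->rd)
		atomic_dec(&rq->rd->refcount);	/* drop the old root, if any */

	rq->rd = rd;
	atomic_inc(&rd->refcount);
	cpu_set(rq->cpu, rd->span);	/* this CPU now shares rd's state */
}

So "one root_domain per non-overlapping cpuset" falls out naturally: the cpuset partition defines the spans, and every rq in a span ends up holding the same rd.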

>
> And 'exclusive cpuset' really means 'non-overlapping sched_domain' ?
>
> Or am I still confused ?

No, I think you nailed it.

>
> I would like to get our concepts clear, and terms consistent. That's
> important for those others who would try to understand this.

Very good idea. Thanks for doing this!

-Greg


2008-01-29 21:14:29

by Gregory Haskins

[permalink] [raw]
Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing

>>> On Tue, Jan 29, 2008 at 4:02 PM, in message
<[email protected]>, Paul Jackson <[email protected]> wrote:
> Gregory wrote:
>> > ... (1) turning off
>> > sched_load_balance in any overlapping cpusets, including all
>> > encompassing parent cpusets, (2) leaving sched_load_balance on in the
>> > RT cpuset itself, and ...
>>
>> Technically you only need (2). I run my 4-8 core development systems
>> in the single default global cpuset, normally.
>
> Well, if you're running in the default cpuset, then you automatically get
> (1),
> because sched_load_balance is turned off in all overlapping cpusets (there
> aren't any overlapping cpusets!)
>
> So, yes, you -do- need both (1) and (2). In your normal system, you
> just happen to get (1) effortlessly.


Ah. Well see, I am just showing my ignorance of this area of the cpuset code then. I stand corrected, and sorry for the noise. :)

-Greg

2008-01-29 22:24:47

by Steven Rostedt

[permalink] [raw]
Subject: Re: scheduler scalability - cgroups, cpusets and load-balancing


On Tue, 29 Jan 2008, Gregory Haskins wrote:
> >
> > I would like to get our concepts clear, and terms consistent. That's
> > important for those others who would try to understand this.
>
> Very good idea. Thanks for doing this!
>

Sorry for coming in so late, I've been banging my head on different bugs
all day.

Just to clear up what our goal for the RT balancer was, and how simple it
is ;-) Basically, any task that has an RT priority needs to run ASAP from
the moment it wakes up, provided there's a CPU available that it can run
on and it has a higher priority than whatever is currently running on
that CPU.

If an RT task wakes up, and there's a CPU available somewhere for it to
run on, we want the RT task to jump to that CPU and run. RT tasks should
not be waiting around for nice load balancing that optimizes the cache
usage. But we also have a problem: we don't want to kill the cache on
large NUMA architectures while looking for a place for an RT task to run.
With domains, we first look for a CPU within the local node that the RT
task can run on; if we find one, we place it there, otherwise we look at
other nodes.

Note that the RT balancing is aggressive and not passive. That means that
the balancing takes place at the time the RT task is awoken (perhaps done
by the task that is waking it) or at the time a task changes priority. It
does not passively wait for something else (i.e. the migration thread) to
migrate it.
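
In rough terms, the push-side decision looks something like the toy
sketch below. The helper state and names here are made up purely for
illustration; this is not the actual kernel code:

#define NR_CPUS 8

/* prio of whatever is running on each CPU; lower is higher prio */
static int curr_prio[NR_CPUS];
/* NUMA node each CPU belongs to */
static int cpu_node[NR_CPUS];

/*
 * Pick a CPU the waking RT task should preempt, preferring the
 * local node before spilling over to the rest of the domain.
 */
static int pick_push_target(int waking_prio, int pref_node)
{
	int best_cpu = -1;
	int best_prio = waking_prio;	/* must strictly beat this */
	int pass, cpu;

	for (pass = 0; pass < 2; pass++) {
		for (cpu = 0; cpu < NR_CPUS; cpu++) {
			int local = (cpu_node[cpu] == pref_node);

			if ((pass == 0) != local)
				continue;	/* pass 0: local node only */
			if (curr_prio[cpu] > best_prio) {
				best_prio = curr_prio[cpu];
				best_cpu = cpu;
			}
		}
		if (best_cpu >= 0)
			break;		/* found a node-local target */
	}
	return best_cpu;	/* -1 means nowhere better to run */
}

The real code obviously also has to deal with locking, cpus_allowed
masks, and racing wakeups, but the gist is: we only move to a CPU
running strictly lower-priority work, and we stay node-local when we
can.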

Paul, I think you now understand that we don't have some scheduler domain
that is specific for RT. The scheduling class is specific to the priority
of the process and not to what domain it is in. But if you keep a domain
invisible to an RT task, that domain never needs to worry about RT tasks
migrating to it.

The code that Gregory and I have been adding tries to migrate RT tasks
to CPUs they can run on as quickly as possible, without algorithms that
cause cacheline bouncing.
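
The flip side is the pull: when a CPU is about to drop to lower-priority
work, it checks the shared overload state Gregory mentioned and grabs a
waiting RT task if one beats what it was going to run. Again, a sketch
with made-up helper names, not the real functions:

static void pull_rt_tasks_sketch(struct rq *this_rq)
{
	int cpu;

	if (atomic_read(&this_rq->rd->rto_count) == 0)
		return;		/* fast path: nobody is overloaded */

	for_each_cpu_mask(cpu, this_rq->rd->rto_mask) {
		struct rq *src = cpu_rq(cpu);
		/* made-up helper: highest-prio RT task queued on src
		 * but not currently running there */
		struct task_struct *p = pick_next_highest_rt(src);

		if (p && p->prio < this_rq->curr->prio)
			migrate_task_here(this_rq, p);	/* made-up helper */
	}
}

Because the common case is a single atomic_read through rq->rd, disjoint
cpusets never so much as glance at each other's runqueues.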

-- Steve