2004-10-07 19:12:52

by Rick Lindsley

Subject: Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> > Once you have the exclusive set in your example, wouldn't the existing
> > functionality of CKRM provide you all the functionality the other
> > non-exclusive sets require?
> >
> > Seems to me, we need a way to *restrict use* of certain resources
> > (exclusive) and a way to *share use* of certain resources (non-exclusive.)
> > CKRM does the latter right now, I believe, but not the former.


> I'm losing you right at the top here, Rick. Sorry.
>
> I'm no CKRM wizard, so tell me if I'm wrong.
>
> But doesn't CKRM provide a way to control what percentage of the
> compute cycles are available from a pool of cycles?
>
> And don't cpusets provide a way to control which physical CPUs a
> task can or cannot use?

Right.

And what I'm hearing is that if you're a job running in a set of shared
resources (i.e., non-exclusive) then by definition you are *not* a job
who cares about which processor you run on. I can't think of a situation
where I'd care about the physical locality, and the proximity of memory
and other nodes, but NOT care that other tasks might steal my cycles.

> For parallel threaded apps with rapid synchronization between the
> threads, as one gets with say OpenMP or MPI, there's a world of
> difference. Giving both threads in a 2-way application of this kind
> 50% of the cycles on each of 2 processors can be an order of magnitude
> slower than giving each thread 100% of one processor. Similarly, the
> variability of runtimes for such threads pinned on distinct processors
> can be an order of magnitude less than for floating threads.

Ah, so you want processor affinity for the tasks, then, not cpusets.

> For shared resource environments where one is purchasing time
> on your own computer, there's also a world of difference. In many
> cases one has paid (whether in real money to another company, or in
> inter-departmental funny money - doesn't matter a whole lot here)
> for certain processor power, and darn well expects those
> processors to sit idle if you don't use them.

One does? No, in my world, there's constant auditing going on and if
you can get away with having a machine idle, power to ya, but chances
are somebody's going to come and take away at least the cycles and maybe
the whole machine for somebody yammering louder than you about their
budget cuts. You get first cut, but if you're not using it, you don't
get to sit fat and happy.

> And the vendor (whether your ISP or your MIS department) of these
> resources can't hide the difference. Your work runs faster and with
> dramatically more consistent runtimes if the entire processor/memory
> units are yours, all yours, whether you use them or not.

When I'm not using them, my work doesn't run faster. It just doesn't run.

> There is a fundamental difference between controlling which physical
> processors on an SMP or NUMA system one may use, and adding delays
> to the tasks of select users to ensure they don't use too much.
>
> In the experience of SGI, and I hear tell of other companies,
> workload management by fair share techniques (add delays to tasks
> exceeding their allotment) has been found to be dramatically less
> useful to customers,

Less useful than ... what? As a substitute for exclusive access to
one or more cpus, which currently is not possible? I can believe that.
But you're saying these companies didn't size their tasks properly to
the cpus they had allocated and yet didn't require exclusivity? How
would non-exclusive sets address this human failing? You have 30 cpus'
worth of tasks to run on 24 cpus. Somebody will take a hit, right,
whether CKRM or cpusets are managing those 24 cpus?

> > * There is no clear policy on how to amiably create an exclusive set.
> > The main problem is what to do with the tasks already there.

> There is a policy that works well, and that those of us in this
> business have been using for years. When the system boots,
> you put everything that doesn't need to be pinned elsewhere into
> a bootcpuset, and leave the rest of the system dark. You then,
> whether by manual administrative techniques or a batch scheduler,
> hand out dedicated sets of CPU and Memory to jobs, which get exclusive
> use of those compute resources (or controlled sharing with only what
> you intentionally let share).

This presumes you know, at boot time, how you want things divided.
All of your examples so far have seemed to indicate that policy changes
may well be made *after* boot time. So I'll rephrase: any time you
create an exclusive set after boot time, you may find tasks already
running there. I suggested one policy for dealing with them.

> The difference between cpusets and CKRM is not about restricting
> versus sharing. Rather, cpusets is about controlled allocation of big,
> named chunks of a computer - certain CPUs and Memory Nodes
> allocated by number. CKRM is about enforcing the rate of usage of
> anonymous, fungible resources such as cpu cycles and memory pages.
>
> Unfortunately for CKRM, on modern system architectures of two or more
> CPUs, cycles are not interchangeable and fungible, due to caching.
> On NUMA systems, which are the norm for all vendors above 10 or 20 CPUs
> (due to our inability to make a backplane fast enough to handle more),
> memory pages are not interchangeable and fungible either.

CKRM is not going to merrily move tasks around just because it can,
either, and it will still adhere to common scheduling principles regarding
cache warmth and processor affinity.

You use the example of a two car family, and preferring one over the other.
I'd turn that around and say it's really two exclusive sets of one
car each, rather than a shared set of two cars. In that example, do you
ask your wife before you take "her" car, or do you just take it because it's
a shared resource? I know how it works in *my* family :)

You've given a convincing argument for the exclusive side of things.
But my point is that on the non-exclusive side the features you claim
to need seem in conflict: if the cpu/memory linkage is important to job
predictability, how can you then claim it's ok to share it with anybody,
even a "friendly" task? If it's ok to share, then you've just thrown
predictability out the window. The cpu/memory linkage is interesting,
but it won't drive the job performance anymore.

I'm trying to nail down requirements. I think we've nailed down the
exclusive one. It's real, and it's currently unmet. The code you've
written looks to provide a good base upon which to meet that requirement.
On the non-exclusive side, I keep hearing conflicting information
about how layout is important for performance but it's ok to share with
arbitrary jobs -- like sharing won't affect performance?

Rick


2004-10-10 02:18:29

by Paul Jackson

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Rick replying to Paul:
> > But doesn't CKRM provide a way to control what percentage of the
> > compute cycles are available from a pool of cycles?
> >
> > And don't cpusets provide a way to control which physical CPUs a
> > task can or cannot use?
>
> Right.

I am learning (see other messages of the last couple days on this
thread) that CKRM is supposed to be a general purpose workload manager
framework, and that fair share scheduling (managing percentage of
compute cycles) just happens to be the first instance of such a manager.

> And what I'm hearing is that if you're a job running in a set of shared
> resources (i.e., non-exclusive) then by definition you are *not* a job
> who cares about which processor you run on. I can't think of a situation
> where I'd care about the physical locality, and the proximity of memory
> and other nodes, but NOT care that other tasks might steal my cycles.

There are at least these situations:
1) proximity to special hardware (graphics, networking, storage, ...)
2) non-dedicated tightly coupled multi-threaded apps (OpenMP, MPI)
3) batch managers switching resources between jobs

On (2), if say you want to run eight copies of an application, on a
system that only has eight CPUs, where each copy of the app is an
eight-way tightly coupled app, they will go much faster if each app is
placed across all 8 CPUs, one thread per CPU, than if they are placed
willy-nilly. Or a bit more realistically, if you have a random input
queue of such tightly coupled apps, each with a predetermined number of
threads between one and eight, you will get more work done by pinning
the threads of any given app on different CPUs. The users submitting
the jobs may well not care which CPUs are used for their job, but an
intermediate batch manager probably will care, as it may be solving the
knapsack problem of how to fit a stream of varying sized jobs onto a
given size of hardware.
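
To make (2) concrete, here is a minimal userspace sketch of that kind of
per-thread pinning, using the stock sched_setaffinity(2) interface; the
thread count and the one-thread-per-CPU mapping are only an example, not
part of the cpuset patch itself:

/* Sketch: pin each worker thread of one tightly coupled job onto its
 * own CPU, via the stock sched_setaffinity(2) call.  NTHREADS and the
 * 1:1 thread-to-CPU mapping are illustrative only.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS 8

static void *worker(void *arg)
{
        long cpu = (long)arg;
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        /* pid 0 means "this thread": each worker confines itself to one CPU */
        if (sched_setaffinity(0, sizeof(mask), &mask) < 0)
                perror("sched_setaffinity");

        /* ... tightly coupled compute/synchronize loop goes here ... */
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        long i;

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, (void *)i);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}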

On (3), a batch manager might, say, have two small cpusets, and also one
larger cpuset that is the two small ones combined. It might run one job
in each of the two small cpusets for a while, then suspend these two
jobs, in order to run a third job in the larger cpuset. The two small
cpusets don't go away while the third job runs -- you don't want to lose
or have to tear down and rebuild the detailed inter-cpuset placement of
the two small jobs while they are suspended.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-10 02:30:49

by Paul Jackson

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Rick wrote:
> One does? No, in my world, there's constant auditing going on and if
> you can get away with having a machine idle, power to ya, but chances
> are somebody's going to come and take away at least the cycles and maybe

I don't doubt that such worlds as yours exist, nor that you live in one.

In some of the worlds my customers live in, they have been hit so many
times with the pains of performance degradation and variation due to
unwanted interaction between applications that they get nervous if a
supposedly unused CPU or Memory looks to be in use. Just the common use
by Linux of unused memory to keep old pages in cache upsets them.

And, perhaps more to the point, while indeed some other department may
soon show up to make use of those lost cycles, the computer had jolly
well better leave those cycles lost _until_ the customer decides to use
them.

Unlike the computer in my dentist's office, which should "just do it",
maximizing throughput as best it can, not asking any questions, the
computers in some of my customers' high end shops are managed more tightly
(sometimes very tightly) and they expect to control load placement.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-11 22:14:22

by Matthew Dobson

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Sat, 2004-10-09 at 19:15, Paul Jackson wrote:
> Rick replying to Paul:
> > And what I'm hearing is that if you're a job running in a set of shared
> > resources (i.e., non-exclusive) then by definition you are *not* a job
> > who cares about which processor you run on. I can't think of a situation
> > where I'd care about the physical locality, and the proximity of memory
> > and other nodes, but NOT care that other tasks might steal my cycles.
>
> There are at least these situations:
> 1) proximity to special hardware (graphics, networking, storage, ...)
> 2) non-dedicated tightly coupled multi-threaded apps (OpenMP, MPI)
> 3) batch managers switching resources between jobs
>
> On (2), if say you want to run eight copies of an application, on a
> system that only has eight CPUs, where each copy of the app is an
> eight-way tightly coupled app, they will go much faster if each app is
> placed across all 8 CPUs, one thread per CPU, than if they are placed
> willy-nilly. Or a bit more realistically, if you have a random input
> queue of such tightly coupled apps, each with a predetermined number of
> threads between one and eight, you will get more work done by pinning
> the threads of any given app on different CPUs. The users submitting
> the jobs may well not care which CPUs are used for their job, but an
> intermediate batch manager probably will care, as it may be solving the
> knapsack problem of how to fit a stream of varying sized jobs onto a
> given size of hardware.
>
> On (3), a batch manager might, say, have two small cpusets, and also one
> larger cpuset that is the two small ones combined. It might run one job
> in each of the two small cpusets for a while, then suspend these two
> jobs, in order to run a third job in the larger cpuset. The two small
> cpusets don't go away while the third job runs -- you don't want to lose
> or have to tear down and rebuild the detailed inter-cpuset placement of
> the two small jobs while they are suspended.

I think these situations, particularly the first two, are the times you
*want* to use the cpus_allowed mechanism. Pinning a specific thread to
a specific processor (cases (1) & (2)) is *exactly* why the cpus_allowed
mechanism was put into the kernel.

And (3) can pretty easily be achieved by using a combination of
sched_domains and cpus_allowed. In your example of one 4 CPU cpuset and
two 2 CPU sub cpusets (cpu-subsets? :), one could easily create a 4 CPU
domain for the larger job and two 2 CPU domains for the smaller jobs.
Those two 2 CPU subdomains can be created & destroyed at will, or they
could be simply tagged as "exclusive" when you don't want tasks moving
back and forth between them, and tagged as "non-exclusive" when you want
tasks to be freely balanced across all 4 CPUs in the larger parent
domain.

One of the cool things about using sched_domains as your partitioning
element is that in reality, tasks run on *CPUs*, not *domains*. So if
you have threads 'a1' & 'a2' running on CPUs 0 & 1 (small job 'a') and
threads 'b1' & 'b2' running on CPUs 2 & 3 (small job 'b'), you can
suspend threads a1, a2, b1 & b2 and remove the domains they were running
in to allow job A (big job with threads A1, A2, A3, & A4) to run on the
larger 4 CPU domain. When you then suspend A1-A4 again to allow the
smaller jobs to proceed, you can pretty trivially create the 2 CPU
domains underneath the 4 CPU domain and resume the jobs. Those jobs (a
& b) have been suspended on the CPUs they were originally running on,
and thus will resume on the same CPUs without any extra effort. They
will simply run on those CPUs, and at load balance time, the domains
attached to those CPUs will be consulted to determine where the tasks
can be relocated to if there is a heavy load. The domains will tell the
scheduler that the tasks cannot be relocated outside the 2 CPUs in each
respective domain. Viola! (sorta ;)
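
For the parts that exist today, here is a rough sketch of the batch-manager
side of that switch, using only cpus_allowed (via sched_setaffinity(2)) and
SIGSTOP/SIGCONT. The sched_domains create/destroy step is the piece still
being designed, so it is not shown, and the pids and CPU numbers are made up
purely for illustration:

/* Sketch: suspend small jobs 'a' and 'b', let big job A use all four
 * CPUs, then park A and resume a and b where they were.  Only stock
 * userspace interfaces are used; no proposed APIs appear here.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/types.h>

static void pin(pid_t pid, int first_cpu, int last_cpu)
{
        cpu_set_t mask;
        int cpu;

        CPU_ZERO(&mask);
        for (cpu = first_cpu; cpu <= last_cpu; cpu++)
                CPU_SET(cpu, &mask);
        sched_setaffinity(pid, sizeof(mask), &mask);
}

int main(void)
{
        pid_t a[2] = { 1001, 1002 };             /* job 'a': a1, a2 */
        pid_t b[2] = { 1003, 1004 };             /* job 'b': b1, b2 */
        pid_t A[4] = { 2001, 2002, 2003, 2004 }; /* big job A */
        int i;

        for (i = 0; i < 2; i++) {                /* confine the small jobs */
                pin(a[i], 0, 1);
                pin(b[i], 2, 3);
        }
        for (i = 0; i < 2; i++) {                /* suspend them in place */
                kill(a[i], SIGSTOP);
                kill(b[i], SIGSTOP);
        }
        for (i = 0; i < 4; i++)                  /* big job gets all 4 CPUs */
                pin(A[i], 0, 3);

        /* ... later, swap back ... */
        for (i = 0; i < 4; i++)
                kill(A[i], SIGSTOP);
        for (i = 0; i < 2; i++) {                /* a and b resume on their
                                                    original CPUs, no re-pin */
                kill(a[i], SIGCONT);
                kill(b[i], SIGCONT);
        }
        return 0;
}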

-Matt

2004-10-11 23:07:04

by Paul Jackson

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

Matthew wrote:
> One of the cool things about using sched_domains as your partitioning
> element is that in reality, tasks run on *CPUs*, not *domains*.

Unfortunately, my manager has reminded me of an essential deliverable
that I have for another project, due in two weeks. I'm going to need
every one of those days. So I will have to take a two week sabbatical
from this design work.

It might make sense to reconvene this work on a new thread, with a last
message on this monster thread inviting all interested parties to come
on over. I suspect a few folks will be happy to see this thread wind
down.

I'd guess lse-tech (my preference) or ckrm-tech would be a suitable
forum for this new thread.

Carry on.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-12 08:55:23

by Simon Derr

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

> One of the cool things about using sched_domains as your partitioning
> element is that in reality, tasks run on *CPUs*, not *domains*. So if
> you have threads 'a1' & 'a2' running on CPUs 0 & 1 (small job 'a') and
> threads 'b1' & 'b2' running on CPUs 2 & 3 (small job 'b'), you can
> suspend threads a1, a2, b1 & b2 and remove the domains they were running
> in to allow job A (big job with threads A1, A2, A3, & A4) to run on the
> larger 4 CPU domain. When you then suspend A1-A4 again to allow the
> smaller jobs to proceed, you can pretty trivially create the 2 CPU
> domains underneath the 4 CPU domain and resume the jobs. Those jobs (a
> & b) have been suspended on the CPUs they were originally running on,
> and thus will resume on the same CPUs without any extra effort. They
> will simply run on those CPUs, and at load balance time, the domains
> attached to those CPUs will be consulted to determine where the tasks
> can be relocated to if there is a heavy load. The domains will tell the
> scheduler that the tasks cannot be relocated outside the 2 CPUs in each
> respective domain. Viola! (sorta ;)
Voilà ;-)

I agree that this looks really smooth from a scheduler point of view.

From a user point of view, there remains the issue of suspending the tasks:
- find which tasks to suspend: how do you know that job 'a' consists
exactly of 'a1' and 'a2'?
- suspend them (btw, how do you achieve this? kill -STOP?)


I've been away from my mail and still trying to catch up, so never mind if the
above does not make sense to you.

Simon.

2004-10-12 21:23:57

by Matthew Dobson

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Mon, 2004-10-11 at 15:58, Paul Jackson wrote:
> Matthew wrote:
> > One of the cool things about using sched_domains as your partitioning
> > element is that in reality, tasks run on *CPUs*, not *domains*.
>
> Unfortunately, my manager has reminded me of an essential deliverable
> that I have for another project, due in two weeks. I'm going to need
> every one of those days. So I will have to take a two week sabbatical
> from this design work.
>
> It might make sense to reconvene this work on a new thread, with a last
> message on this monster thread inviting all interested parties to come
> on over. I suspect a few folks will be happy to see this thread wind
> down.
>
> I'd guess lse-tech (my preference) or ckrm-tech would be a suitable
> forum for this new thread.
>
> Carry on.

Sounds good, Paul. I think the discussion on this thread was kind of
winding down anyway. In two weeks I'll have some more work done on my
code, particularly trying to get the cpusets/CKRM filesystem interface
to play with my sched_domains code. We'll be able to digest all the
information, requirements, requests, etc. on this thread and start a
fresh discussion on (or at least closer to) the same page.

-Matt

2004-10-12 21:27:30

by Matthew Dobson

Subject: Re: [ckrm-tech] Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

On Tue, 2004-10-12 at 01:50, Simon Derr wrote:
> > One of the cool things about using sched_domains as your partitioning
> > element is that in reality, tasks run on *CPUs*, not *domains*. So if
> > you have threads 'a1' & 'a2' running on CPUs 0 & 1 (small job 'a') and
> > threads 'b1' & 'b2' running on CPUs 2 & 3 (small job 'b'), you can
> > suspend threads a1, a2, b1 & b2 and remove the domains they were running
> > in to allow job A (big job with threads A1, A2, A3, & A4) to run on the
> > larger 4 CPU domain. When you then suspend A1-A4 again to allow the
> > smaller jobs to proceed, you can pretty trivially create the 2 CPU
> > domains underneath the 4 CPU domain and resume the jobs. Those jobs (a
> > & b) have been suspended on the CPUs they were originally running on,
> > and thus will resume on the same CPUs without any extra effort. They
> > will simply run on those CPUs, and at load balance time, the domains
> > attached to those CPUs will be consulted to determine where the tasks
> > can be relocated to if there is a heavy load. The domains will tell the
> > scheduler that the tasks cannot be relocated outside the 2 CPUs in each
> > respective domain. Viola! (sorta ;)
> Voilà ;-)

hehe... My French spelling obviously isn't quite up to par. ;)


> I agree that this looks really smooth from a scheduler point of view.
>
> From a user point of view, there remains the issue of suspending the tasks:
> - find which tasks to suspend: how do you know that job 'a' consists
> exactly of 'a1' and 'a2'?
> - suspend them (btw, how do you achieve this? kill -STOP?)
>
>
> I've been away from my mail and still trying to catch up, so never mind if the
> above does not make sense to you.
>
> Simon.

Paul didn't go into specifics about how to suspend the job, so neither
did I. Sending SIGSTOP & SIGCONT should work, as you mention... Those
are implementation details which really aren't *that* important to the
discussion. We're still trying to figure out the overall framework and
API to work with, so which method of suspending a thread we'll
eventually use can be tackled down the road. :)
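
If it helps, one conventional answer to the "which tasks make up job 'a'"
part of Simon's question is to start each job in its own process group, then
stop and continue the whole group as a unit. A rough sketch (the job binary
name is made up):

/* Sketch: launch a job as its own process group, then suspend and
 * resume every task in that group with one signal each.
 */
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        pid_t job = fork();

        if (job == 0) {
                setpgid(0, 0);      /* child becomes its own group: job 'a' */
                execlp("./job_a", "job_a", (char *)NULL);
                _exit(127);
        }
        setpgid(job, job);          /* parent sets it too, avoiding a race */

        /* ... later: stop every task in the group, however many threads
         * or children the job has started within it ... */
        kill(-job, SIGSTOP);

        /* ... and let them all continue ... */
        kill(-job, SIGCONT);
        return 0;
}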

-Matt