2001-07-12 23:41:23

by Mike Kravetz

Subject: CPU affinity & IPI latency

This discussion was started on '[email protected]'.
I'm widening the distribution in the hope of getting more input.

It started when Andi Kleen noticed that a single 'CPU Hog' task
was being bounced back and forth between the 2 CPUs on his 2-way
system. I had seen similar behavior when running the context
switching test of LMbench. When running lat_ctx with only two
threads on an 8 CPU system, one would 'expect' the two threads
to be confined to two of the 8 CPUs in the system. However, what
I have observed is that the threads are effectively 'round
robined' among all the CPUs and they all end up bearing
an equivalent amount of the CPU load. To more easily observe
this, increase the number of 'TRIPS' in the benchmark to a really
large number.

After a little investigation, I believe this 'situation' is caused
by the latency of the reschedule IPI used by the scheduler. Recall
that in lat_ctx all threads are in a tight loop consisting of:

pipe_read()
pipe_write()

Both threads 'start' on the same CPU and are sitting in pipe_read
waiting for data. A token is written to the pipe and one thread
is awakened. The awakened thread, then immediately writes the token
back to the pipe which ultimately results in a call to reschedule_idle()
that will 'initiate' the scheduling of the other thread. In
reschedule_idle() we can not take the 'fast path' because WE are
currently executing on the other thread's preferred CPU. Therefore,
reschedule_idle() chooses the oldest idle CPU and sends the IPI.
However, before the IPI is received (and schedule() run) on the
remote CPU, the currently running thread calls pipe_read which
blocks and calls schedule(). Since the other task has yet to be
scheduled on the other CPU, it is scheduled to run on the current
CPU. Both tasks continue to execute on the one CPU until such time
that an IPI induced schedule() on the other CPU hits a window where
it finds one of the tasks to schedule. We continue in this way,
migrating the tasks to the oldest idle CPU and eventually cycling our
way through all the CPUs.
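
(For readers who don't have the benchmark handy, here is a minimal sketch of
the kind of token-passing pair described above. It is only an illustration of
the pipe_read()/pipe_write() loop, not the actual lat_ctx source, and the trip
count is arbitrary.)

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int ab[2], ba[2];           /* parent->child and child->parent pipes */
        char token = 't';
        long i, trips = 500000;     /* crank this up to see the migration */

        if (pipe(ab) == -1 || pipe(ba) == -1) {
            perror("pipe");
            return 1;
        }
        if (fork() == 0) {
            for (i = 0; i < trips; i++) {    /* child: wait, then pass back */
                read(ab[0], &token, 1);
                write(ba[1], &token, 1);
            }
            _exit(0);
        }
        for (i = 0; i < trips; i++) {        /* parent: pass, then wait */
            write(ab[1], &token, 1);
            read(ba[0], &token, 1);
        }
        return 0;
    }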

Does this explanation sound reasonable?

If so, it would then follow that booting with 'idle=poll' would
help alleviate this situation. However, that is not the case. With
idle=poll the CPU load is not as evenly distributed among the CPUs,
but is still distributed among all of them.

Does the behavior of the 'benchmark' mean anything? Should one
expect tasks to stay on their preferred CPUs if possible?

Thoughts/comments?
--
Mike Kravetz [email protected]
IBM Linux Technology Center


2001-07-13 00:20:04

by Davide Libenzi

Subject: RE: CPU affinity & IPI latency


On 12-Jul-2001 Mike Kravetz wrote:
> This discussion was started on '[email protected]'.
> I'm widening the distribution in the hope of getting more input.
>
> It started when Andi Kleen noticed that a single 'CPU Hog' task
> was being bounced back and forth between the 2 CPUs on his 2-way
> system. I had seen similar behavior when running the context
> switching test of LMbench. When running lat_ctx with only two
> threads on an 8 CPU system, one would ?expect? the two threads
> to be confined to two of the 8 CPUs in the system. However, what
> I have observed is that the threads are effectively 'round
> robined' among all the CPUs and they all end up bearing
> an equivalent amount of the CPU load. To more easily observe
> this, increase the number of 'TRIPS' in the benchmark to a really
> large number.
>
> After a little investigation, I believe this 'situation' is caused
> by the latency of the reschedule IPI used by the scheduler. Recall
> that in lat_ctx all threads are in a tight loop consisting of:
>
> pipe_read()
> pipe_write()
>
> Both threads 'start' on the same CPU and are sitting in pipe_read
> waiting for data. A token is written to the pipe and one thread
> is awakened. The awakened thread, then immediately writes the token
> back to the pipe which ultimately results in a call to reschedule_idle()
> that will 'initiate' the scheduling of the other thread. In
> reschedule_idle() we can not take the 'fast path' because WE are
> currently executing on the other thread's preferred CPU. Therefore,
> reschedule_idle() chooses the oldest idle CPU and sends the IPI.
> However, before the IPI is received (and schedule() run) on the
> remote CPU, the currently running thread calls pipe_read which
> blocks and calls schedule(). Since the other task has yet to be
> scheduled on the other CPU, it is scheduled to run on the current
> CPU. Both tasks continue to execute on the one CPU until such time
> that an IPI induced schedule() on the other CPU hits a window where
> it finds one of the tasks to schedule. We continue in this way,
> migrating the tasks to the oldest idle CPU and eventually cycling our
> way through all the CPUs.
>
> Does this explanation sound reasonable?

I would say yes.


>
> If so, it would then follow that booting with 'idle=poll' would
> help alleviate this situation. However, that is not the case. With
> idle=poll the CPU load is not as evenly distributed among the CPUs,
> but is still distributed among all of them.
>
> Does the behavior of the 'benchmark' mean anything? Should one
> expect tasks to stay their preferred CPUs if possible?

Maybe we could have a per-CPU wake list onto which the rescheduled task is
moved, to be woken up by the IPI target.




- Davide

2001-07-13 00:36:57

by Larry McVoy

Subject: Re: CPU affinity & IPI latency

Be careful tuning for LMbench (says the author :-)

Especially this benchmark. It's certainly possible to get dramatically better
SMP numbers by pinning all the lat_ctx processes to a single CPU, because
the benchmark is single threaded. In other words, if we have 5 processes,
call them A, B, C, D, and E, then the benchmark is passing a token from
A to B to C to D to E and around again.

If the amount of data/instructions needed by all 5 processes fits in the
cache and you pin all the processes to the same CPU you'll get much
better performance than simply letting them float.

But making the system do that naively is a bad idea.

This is a really hard area to get right but you can take a page from all
the failed process migration efforts. In general, moving stuff is a bad
idea, it's much better to leave it where it is. Everything scales better
if there is a process queue per CPU and the default is that you leave the
processes on the queue on which they last ran. However, if the load average
for a queue starts going up and there is another queue with a substantially
lower load average, then and ONLY then, should you move the process.
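
(A minimal sketch of the policy Larry describes; nr_running_on(),
pull_one_task() and the threshold are invented placeholders for the example,
not existing kernel interfaces.)

    #define IMBALANCE_THRESHOLD 2    /* "substantially lower" - illustrative */

    extern int nr_running_on(int cpu);            /* placeholder helpers */
    extern void pull_one_task(int from, int to);

    /* Run occasionally on a lightly loaded CPU; does nothing in the common
     * case, so tasks stay on the queue where they last ran. */
    static void maybe_balance(int this_cpu, int ncpus)
    {
        int cpu, busiest = -1, max_load = nr_running_on(this_cpu);

        for (cpu = 0; cpu < ncpus; cpu++) {
            if (nr_running_on(cpu) > max_load) {
                max_load = nr_running_on(cpu);
                busiest = cpu;
            }
        }

        /* Only move a task when the imbalance is substantial. */
        if (busiest >= 0 &&
            max_load - nr_running_on(this_cpu) >= IMBALANCE_THRESHOLD)
            pull_one_task(busiest, this_cpu);
    }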

I think if you experiment with that you'll see that lat_ctx does well and
so do a lot of other things.

An optimization on that requires hardware support. If you knew the number
of cache misses associated with each time slice, you could factor that in
and start moving processes that have a "too high" cache miss rate, with the
idea being that we want to keep all processes on the same CPU if we can
but if that is causing an excessive cache miss rate, it's time to move.

Another optimization is to always schedule an exec-ed process (as opposed
to a forked process) on a different CPU than its parent. In general, when
you exec you have a clear boundary and it's good to spread those out.

All of this is based on my somewhat dated performance efforts that led to
LMbench. I don't know of any fundamental changes that invalidate these
opinions, but I could be wrong.

This is an area in which I've done a pile of work and I'd be interested
in keeping a finger in any efforts to fix up the scheduler.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2001-07-13 02:06:28

by Mark Hahn

Subject: Re: CPU affinity & IPI latency

> If the amount of data/instructions needed by all 5 processes fits in the
> cache and you pin all the processes to the same CPU you'll get much
> better performance than simply letting them float.

interesting. I remember that around 2.3.40, the scheduler handled
"frequent-schedulers" (like lat_ctx) differently, based on how long
a timeslice the proc used, relative to the estimated time to flush cache.

as I recall, it let them stay on their current CPU. like letting
someone with 1 item go ahead of you in a grocery store checkout ;)

that code is mostly gone (only cacheflush_time remains); I think it
morphed into the current migrate-to-longest-idle heuristic.

does anyone remember why the frequent-schedulers code was killed?
just because it conflated cache-affinity with timeslice?

regards, mark hahn.

2001-07-13 16:38:33

by Davide Libenzi

Subject: Re: CPU affinity & IPI latency


On 13-Jul-2001 Larry McVoy wrote:
> Be careful tuning for LMbench (says the author :-)
>
> Especially this benchmark. It's certainly possible to get dramatically
> better
> SMP numbers by pinning all the lat_ctx processes to a single CPU, because
> the benchmark is single threaded. In other words, if we have 5 processes,
> call them A, B, C, D, and E, then the benchmark is passing a token from
> A to B to C to D to E and around again.
>
> If the amount of data/instructions needed by all 5 processes fits in the
> cache and you pin all the processes to the same CPU you'll get much
> better performance than simply letting them float.
>
> But making the system do that naively is a bad idea.

Agree.


>
> This is a really hard area to get right but you can take a page from all
> the failed process migration efforts. In general, moving stuff is a bad
> idea, it's much better to leave it where it is. Everything scales better
> if there is a process queue per CPU and the default is that you leave the
> processes on the queue on which they last run. However, if the load average
> for a queue starts going up and there is another queue with a substantially
> lower load average, then and ONLY then, should you move the process.

I personally think that a standard scheduler per CPU is the way to go for SMP.
I saw the one the IBM guys did, and I think the wrong catch there is always
trying to grab the best task to run across all CPUs.
I think that this concept could be relaxed by introducing fewer ties between
the per-CPU schedulers.
A cheap load balancer should run, "time to time"(tm), to move tasks when a
certain level of imbalance has been reached.
This will give each scheduler more independence and will make it more scalable,
IMVHO.


> This is an area in which I've done a pile of work and I'd be interested
> in keeping a finger in any efforts to fix up the scheduler.

We've, somehow, understood it :)



- Davide

2001-07-13 17:06:32

by Mike Kravetz

Subject: Re: CPU affinity & IPI latency

On Thu, Jul 12, 2001 at 05:36:41PM -0700, Larry McVoy wrote:
> Be careful tuning for LMbench (says the author :-)
>
> Especially this benchmark. It's certainly possible to get dramatically better
> SMP numbers by pinning all the lat_ctx processes to a single CPU, because
> the benchmark is single threaded. In other words, if we have 5 processes,
> call them A, B, C, D, and E, then the benchmark is passing a token from
> A to B to C to D to E and around again.
>
> If the amount of data/instructions needed by all 5 processes fits in the
> cache and you pin all the processes to the same CPU you'll get much
> better performance than simply letting them float.
>
> But making the system do that naively is a bad idea.

I agree, and can't imagine the system ever attempting to take this
into account and leave these 5 tasks on the same CPU.

At the other extreme is my observation that 2 tasks on an 8 CPU
system are 'round robined' among all 8 CPUs. I think having the
2 tasks stay on 2 of the 8 CPUs would be an improvement with respect
to CPU affinity. Actually, the scheduler does 'try' to do this.

It is clear that the behavior of lat_ctx bypasses almost all of
the scheduler's attempts at CPU affinity. The real question is,
"How often in running 'real workloads' are the schduler's attempts
at CPU affinity bypassed?".

--
Mike Kravetz [email protected]
IBM Linux Technology Center

2001-07-13 17:32:25

by Mike Kravetz

Subject: Re: CPU affinity & IPI latency

On Fri, Jul 13, 2001 at 09:41:44AM -0700, Davide Libenzi wrote:
>
> I personally think that a standard scheduler/cpu is the way to go for SMP.
> I saw the one IBM guys did and I think that the wrong catch there is trying
> always to grab the best task to run over all CPUs.

That was me/us. Most of the reason for making 'global scheduling'
decisions was an attempt to maintain the same behavior as the existing
scheduler. We are trying to see how well we can make this scheduler
scale, while maintaining this global behavior. The thought is that if
there was ever any hope of someone adopting this scheduler, they would
be more likely to do so if it attempted to maintain existing behavior.


> I think that this concept could be relaxed introducing less chains between each
> CPU scheduler.
> A cheap load balancer should run, "time to time"(tm), to move tasks when a
> certain level of unbalancing has been reached.
> This will give each scheduler more independence and will make it more scalable,
> IMVHO.

Take a look at the 'CPU pooling & Load balancing' extensions to our
scheduler (lse.sourceforge.net/scheduling). It allows you to divide
the system into CPU pools and keep scheduling decisions limited to
these pools. Periodically, load balancing is performed among
the pools. This has shown some promise on NUMA architectures (as
one would expect). You could define pool sizes to be 1 CPU and
get the scheduling behavior you desire on SMP.

But, none of this has to do with CPU affinity issues with the
current scheduler.

--
Mike Kravetz [email protected]
IBM Linux Technology Center

2001-07-13 19:14:40

by Davide Libenzi

Subject: Re: CPU affinity & IPI latency


On 13-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 09:41:44AM -0700, Davide Libenzi wrote:
>>
>> I personally think that a standard scheduler/cpu is the way to go for SMP.
>> I saw the one IBM guys did and I think that the wrong catch there is trying
>> always to grab the best task to run over all CPUs.
>
> That was me/us. Most of the reason for making 'global scheduling'
> decisions was an attempt to maintain the same behavior as the existing
> scheduler. We are trying to see how well we can make this scheduler
> scale, while maintaining this global behavior. Thought is that if
> there was ever any hope of someone adopting this scheduler, they would
> be more likely to do so if it attempted to maintain existing behavior.

The problem, IMHO, is that we're trying to extend what is correct behaviour for
the UP scheduler ( pick up the best task to run ) to SMP machines.
Global scheduling decisions should be triggered in response to load imbalance
and not at each schedule() run, otherwise we're going to introduce a common lock
that will limit the overall scalability.
My idea about the future of the scheduler is to have config options users can
choose depending upon the machine's use.
Trying to keep a single scheduler for both UP and MP is like giving the same
answer to different problems, and the scaling factor ( of the scheduler
itself ) on SMP will never improve.
The code inside kernel/sched.c should be reorganized ( it even contains
non-scheduler code ) so that the various CONFIG_SCHED* options don't introduce
any mess into the code ( possibly by having the code in separate files
kernel/sched*.c ).




- Davide

2001-07-13 19:41:23

by Gerrit Huizenga

Subject: Re: [Lse-tech] Re: CPU affinity & IPI latency

> On Fri, 13 Jul 2001 12:17:37 PDT, Davide Libenzi wrote:
>
> The problem, IMHO, is that we're trying to extend what is a correct
> behaviour on the UP scheduler ( pickup the best task to run ) to SMP
> machines. Global scheduling decisions should be triggered in response
> of load unbalancing and not at each schedule() run otherwise we're
> going to introduce a common lock that will limit the overall
> scalability. My idea about the future of the scheduler is to have a
> config options users can chose depending upon the machine use.
>
> By trying to keep a unique scheduler for both UP and MP is like going
> to give the same answer to different problems and the scaling factor
> (of the scheduler itself) on SMP will never improve. The code inside
> kernel/sched.c should be reorganized ( it contains even not scheduler
> code ) so that the various CONFIG_SCHED* will not introduce any messy
> inside the code ( possibly by having the code in separate files
> kernel/sched*.c ).
>
> - Davide

In a lot of cases, UP is just a simplified, degenerate case of SMP (which
is itself often a degenerate case of NUMA). Wouldn't it make a lot of
sense to have a single scheduler which 1) was relatively simple, 2) was as
good as the current scheduler (or better) on UP, and 3) scaled well on SMP (and
NUMA)? I think the current lse scheduler meets all of those goals pretty
well.

Config options mean the user has to choose; I already have too many important
choices to make when building a kernel.

Others have proposed loadable scheduler modules, but the overhead doesn't
seem to justify the separation. Config options mean more testing, more
stable APIs for low level scheduling (or more times when one or the other
is broken).

gerrit

2001-07-13 19:54:47

by Chris Wedgwood

Subject: Re: CPU affinity & IPI latency

On Fri, Jul 13, 2001 at 10:05:21AM -0700, Mike Kravetz wrote:

It is clear that the behavior of lat_ctx bypasses almost all of
the scheduler's attempts at CPU affinity. The real question is,
"How often in running 'real workloads' are the schduler's attempts
at CPU affinity bypassed?".

When encoding mp3s on a dual processor system, naturally I try to
encode two at once.

Most of the time this works as expected: one process more or less
sticks to each CPU. However, I have noticed that if for some reason
a third process has to be scheduled (which is inevitable if you
actually want to do anything), then these two processes seem to bounce
back and forth for a few seconds, _even_ after this CPU spike has
gone.

Now, I'm not sure if I'm imagining this or not. As I said, I have two
CPUs and two CPU-bound tasks, so to instrument this at all I really have
to affect what I am looking at (rather like the uncertainty principle),
so I assumed that perhaps my program to read and process
/proc/<n>/cpu was simply eating several cycles and each process was
yielding more or less at random, causing what I was seeing.



--cw

2001-07-13 20:02:37

by Davide Libenzi

Subject: Re: [Lse-tech] Re: CPU affinity & IPI latency


On 13-Jul-2001 Gerrit Huizenga wrote:
> In a lot of cases, UP is just a simplified, degenerate case of SMP (which
> is itself often a degenerate case of NUMA). Wouldn't it make a lot of
> sense to have a single scheduler which 1) was relively simple, 2) was as
> good as the current scheduler (or better) on UP, and 3) scaled well on SMP
> (and
> NUMA)? I think the current lse scheduler meets all of those goals pretty
> well.

It's the concept of 'good enough' that seems to have different meanings for
different people.
Personally I could even concede that the behaviour for the UP case is 'almost'
the same but, as you can see by looking at the l-k threads of the last years,
it's pretty hard to get people to agree on the concept of 'good enough'.
Config options would leave the current scheduler for UP ( or even for light
MP ) while introducing an option for users that suffer from scheduler
problems.


> Config options means the user has to choose, I have too many important
> choices to make already when building a kernel.

Even when I go to a restaurant I have to choose, but I still prefer that to
getting the 'menu du jour'.




- Davide

2001-07-13 21:07:09

by David Lang

Subject: Re: CPU affinity & IPI latency

A real-world example of this issue.

I was gzipping a large (~800MB) file on a dual Athlon box. The gzip process
was bouncing back and forth between the two CPUs. I actually was able to
gzip faster by starting up setiathome to keep one CPU busy and force the
scheduler to keep the gzip on a single CPU (I ran things several times to
verify it was actually faster).

David Lang


On Fri, 13 Jul 2001, Mike Kravetz wrote:

> Date: Fri, 13 Jul 2001 10:05:21 -0700
> From: Mike Kravetz <[email protected]>
> To: Larry McVoy <[email protected]>
> Cc: Davide Libenzi <[email protected]>, [email protected],
> Andi Kleen <[email protected]>, [email protected]
> Subject: Re: CPU affinity & IPI latency
>
> On Thu, Jul 12, 2001 at 05:36:41PM -0700, Larry McVoy wrote:
> > Be careful tuning for LMbench (says the author :-)
> >
> > Especially this benchmark. It's certainly possible to get dramatically better
> > SMP numbers by pinning all the lat_ctx processes to a single CPU, because
> > the benchmark is single threaded. In other words, if we have 5 processes,
> > call them A, B, C, D, and E, then the benchmark is passing a token from
> > A to B to C to D to E and around again.
> >
> > If the amount of data/instructions needed by all 5 processes fits in the
> > cache and you pin all the processes to the same CPU you'll get much
> > better performance than simply letting them float.
> >
> > But making the system do that naively is a bad idea.
>
> I agree, and can't imagine the system ever attempting to take this
> into account and leave these 5 tasks on the same CPU.
>
> At the other extreme is my observation that 2 tasks on an 8 CPU
> system are 'round robined' among all 8 CPUs. I think having the
> 2 tasks stay on 2 of the 8 CPUs would be an improvement with respect
> to CPU affinity. Actually, the scheduler does 'try' to do this.
>
> It is clear that the behavior of lat_ctx bypasses almost all of
> the scheduler's attempts at CPU affinity. The real question is,
> "How often in running 'real workloads' are the schduler's attempts
> at CPU affinity bypassed?".
>
> --
> Mike Kravetz [email protected]
> IBM Linux Technology Center

2001-07-13 22:43:41

by Mike Kravetz

Subject: Re: CPU affinity & IPI latency

On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
> A real-world example of this issue.
>
> I was gzipping a large (~800MB) file on a dual athlon box. the gzip prcess
> was bouncing back and forth between the two CPUs. I actually was able to
> gzip faster by starting up setiathome to keep one CPU busy and force the
> scheduler to keep the gzip on a single CPU (I ran things several times to
> verify it was actually faster)
>
> David Lang

That does sound like the same behavior I was seeing with lat_ctx. Like
I said in my previous note, the scheduler does try to take CPU affinity
into account. reschedule_idle() does a pretty good job of determining
what CPU a task should run on. In the case of lat_ctx (and I believe
your use of gzip), the 'fast path' in reschedule_idle() is taken because
the CPU associated with the awakened task is idle. However, before
schedule() is run on the 'target' CPU, schedule() is run on another
CPU and the task is scheduled there.

The root cause of this situation is the delay between the time
reschedule_idle() determines what is the best CPU a task should run
on, and the time schedule() is actually run on that CPU.

I have toyed with the idea of 'temporarily binding' a task to a CPU
for the duration of the delay between reschedule_idle() and schedule().
It would go something like this,

- Define a new field in the task structure 'saved_cpus_allowed'.
With a little collapsing of existing fields, there is room to put
this on the same cache line as 'cpus_allowed'.
- In reschedule_idle() if we determine that the best CPU for a task
is the CPU it is associated with (p->processor), then temporarily
bind the task to that CPU. The task is temporarily bound to the
CPU by overwriting the 'cpus_allowed' field such that the task can
only be scheduled on the target CPU. Of course, the original
value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
- In schedule(), the loop which examines all tasks on the runqueue
will restore the value of 'cpus_allowed'.

This would preserve the 'best CPU' decision made by reschedule_idle().
On the down side, 'temporarily bound' tasks could not be scheduled
until schedule() is run on their associated CPUs. This could potentially
waste CPU cycles, and delay context switches. In addition, it is
quite possible that while a task is 'temporarily bound' the state of
the system could change in such a way that the best CPU is no longer
best.
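
(To make that concrete, here is a rough sketch in 2.4-style C of the two
pieces described above; saved_cpus_allowed is the proposed new field, the
helper name temp_bind() is invented, and this is an illustration of the idea
rather than a tested patch.)

    /* reschedule_idle(): the chosen CPU is p->processor, so pin p there
     * until that CPU actually runs schedule(). */
    static inline void temp_bind(struct task_struct *p, int cpu)
    {
        p->saved_cpus_allowed = p->cpus_allowed;   /* proposed new field */
        p->cpus_allowed = 1UL << cpu;              /* only runnable there */
    }

    /* schedule(): inside the loop over the runqueue, undo the binding. */
    if (p->saved_cpus_allowed) {
        p->cpus_allowed = p->saved_cpus_allowed;
        p->saved_cpus_allowed = 0;
    }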

There appears to be a classic tradeoff between CPU affinity and
context switch time.

Comments?

--
Mike Kravetz [email protected]
IBM Linux Technology Center

2001-07-14 03:24:28

by Hubertus Franke

Subject: Re: CPU affinity & IPI latency



Mike, could we utilize an existing mechanism such as has_cpu?

Here is my approach/suggestion.
We introduce in sched_data[cpu] a resched_task slot:

    struct task_struct *resched_task;

When, in reschedule_idle(), a target cpu is to be decided for task <p>,
we check the resched_task slot. If the slot is pointing to some task,
then this task should be considered running and we should not compute
the preemption goodness against the currently running task, as we know
that CPU already got IPI'ed.
See also the schedule() function described later on.

reschedule_idle(struct task_struct *p)
{
    :
    :
    struct task_struct *rst = sched_data[target_cpu].resched_task;
    struct task_struct *rmt = sched_data[target_cpu].curr;  /* running there */

    if (rst != NULL) {
        if (preemption_goodness(p, rst, ...) > 1) {
            p->processor = target_cpu;
            p->has_cpu = 1;
            rst->has_cpu = 0;  /* reset as we overwrite */
            sched_data[target_cpu].resched_task = p;
            /* This makes the old <rst> available for scheduling again
             * and temp-binds <p> to target_cpu.  We don't have to send
             * another IPI, as this handles the race condition and we
             * are holding the scheduling lock.
             */
        } else {
            continue;
            /* We know that the current priority won't be larger than
             * the one of <rst>, otherwise this would have been picked
             * up in schedule().
             */
        }
    } else {
        /* standard stuff that we always do */
    }
}

In schedule() we need to check whether a reschedule reservation is held.
First we go through the standard <still_running> check to compute the initial
<c> goodness value. Then under

still_running_back:

    <loop through the runqueue>
    /* note that <rst> will be ignored there due to its has_cpu=1 flag */

    /* check whether a reservation <rst> exists */
    rst = sched_data[this_cpu].resched_task;
    if (rst != NULL) {
        c = goodness(rst, ..);
        if (c > best_prio) {
            best_prio = c;
            next = rst;
            sched_data[this_cpu].resched_task = NULL;
        } else {
            /* need to return <rst> to a schedulable state */
            rst->has_cpu = 0;
        }
    }


This approach would eliminate the need to check each task's
saved_cpus_allowed during the runqueue scan and would also make sure that
only one task reserves running on a particular cpu. Reservations are
preempted through the existing mechanism, namely goodness comparison,
and such "preempted" tasks are returned to general schedulability.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability),
OS-PIC (Chair)
email: [email protected]
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003



Mike Kravetz <[email protected]>@vger.kernel.org on 07/13/2001 06:43:05
PM

Sent by: [email protected]


To: David Lang <[email protected]>
cc: Larry McVoy <[email protected]>, Davide Libenzi
<[email protected]>, [email protected], Andi Kleen
<[email protected]>, [email protected]
Subject: Re: CPU affinity & IPI latency



On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
> A real-world example of this issue.
>
> I was gzipping a large (~800MB) file on a dual athlon box. the gzip
prcess
> was bouncing back and forth between the two CPUs. I actually was able to
> gzip faster by starting up setiathome to keep one CPU busy and force the
> scheduler to keep the gzip on a single CPU (I ran things several times to
> verify it was actually faster)
>
> David Lang

That does sound like the same behavior I was seeing with lat_ctx. Like
I said in my previous note, the scheduler does try to take CPU affinity
into account. reschedule_idle() does a pretty good job of determining
what CPU a task should run on. In the case of lat_ctx (and I believe
your use of gzip), the 'fast path' in reschedule_idle() is taken because
the CPU associated with the awakened task is idle. However, before
schedule() is run on the 'target' CPU, schedule() is run on another
CPU and the task is scheduled there.

The root cause of this situation is the delay between the time
reschedule_idle() determines what is the best CPU a task should run
on, and the time schedule() is actually ran on that CPU.

I have toyed with the idea of 'temporarily binding' a task to a CPU
for the duration of the delay between reschedule_idle() and schedule().
It would go something like this,

- Define a new field in the task structure 'saved_cpus_allowed'.
With a little collapsing of existing fields, there is room to put
this on the same cache line as 'cpus_allowed'.
- In reschedule_idle() if we determine that the best CPU for a task
is the CPU it is associated with (p->processor), then temporarily
bind the task to that CPU. The task is temporarily bound to the
CPU by overwriting the 'cpus_allowed' field such that the task can
only be scheduled on the target CPU. Of course, the original
value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
- In schedule(), the loop which examines all tasks on the runqueue
will restore the value of 'cpus_allowed'.

This would preserve the 'best CPU' decision made by reschedule_idle().
On the down side, 'temporarily bound' tasks could not be scheduled
until schedule() is run on their associated CPUs. This could potentially
waste CPU cycles, and delay context switches. In addition, it is
quite possible that while a task is 'temporarily bound' the state of
the system could change in such a way that the best CPU is no longer
best.

There appears to be a classic tradeoff between CPU affinity and
context switch time.

Comments?

--
Mike Kravetz [email protected]
IBM Linux Technology Center

2001-07-15 07:44:16

by Troy Benjegerdes

Subject: Re: CPU affinity & IPI latency

On Thu, Jul 12, 2001 at 04:40:17PM -0700, Mike Kravetz wrote:
> This discussion was started on '[email protected]'.
> I'm widening the distribution in the hope of getting more input.
>
> It started when Andi Kleen noticed that a single 'CPU Hog' task
> was being bounced back and forth between the 2 CPUs on his 2-way
> system. I had seen similar behavior when running the context
> switching test of LMbench. When running lat_ctx with only two
> threads on an 8 CPU system, one would ?expect? the two threads
> to be confined to two of the 8 CPUs in the system. However, what
> I have observed is that the threads are effectively 'round
> robined' among all the CPUs and they all end up bearing
> an equivalent amount of the CPU load. To more easily observe
> this, increase the number of 'TRIPS' in the benchmark to a really
> large number.

Did this 'CPU Hog' task do any I/O that might have caused an interrupt?

I'm wondering if the interrupt distribution has anything to do with this..
are you using any CPU affinity for interrupts? If not, this might explain
why the processes wind up doing a 'round robin'.

I'm trying to reproduce this with gzip on a dual Mac G4, and I'm wondering
if this will be skewed any because all interrupts are directed at cpu0, so
anything that generates input or output on the network or serial is going
to tend to wind up on cpu0.

I'd also like to figure out what the IPI latency actually is. Does anyone
have any suggestions on how to measure this? I would hope it's lower on the
dual 7410 machine I have, since it has 4-stage pipelines as opposed to the
20-odd stages that Pentium III class machines have.

> After a little investigation, I believe this 'situation' is caused
> by the latency of the reschedule IPI used by the scheduler. Recall
> that in lat_ctx all threads are in a tight loop consisting of:
>
> pipe_read()
> pipe_write()
>
> Both threads 'start' on the same CPU and are sitting in pipe_read
> waiting for data. A token is written to the pipe and one thread
> is awakened. The awakened thread, then immediately writes the token
> back to the pipe which ultimately results in a call to reschedule_idle()
> that will 'initiate' the scheduling of the other thread. In
> reschedule_idle() we can not take the 'fast path' because WE are
> currently executing on the other thread's preferred CPU. Therefore,
> reschedule_idle() chooses the oldest idle CPU and sends the IPI.
> However, before the IPI is received (and schedule() run) on the
> remote CPU, the currently running thread calls pipe_read which
> blocks and calls schedule(). Since the other task has yet to be
> scheduled on the other CPU, it is scheduled to run on the current
> CPU. Both tasks continue to execute on the one CPU until such time
> that an IPI induced schedule() on the other CPU hits a window where
> it finds one of the tasks to schedule. We continue in this way,
> migrating the tasks to the oldest idle CPU and eventually cycling our
> way through all the CPUs.
>
> Does this explanation sound reasonable?
>
> If so, it would then follow that booting with 'idle=poll' would
> help alleviate this situation. However, that is not the case. With
> idle=poll the CPU load is not as evenly distributed among the CPUs,
> but is still distributed among all of them.
>
> Does the behavior of the 'benchmark' mean anything? Should one
> expect tasks to stay their preferred CPUs if possible?
>
> Thoughts/comments
> --
> Mike Kravetz [email protected]
> IBM Linux Technology Center

--
Troy Benjegerdes | master of mispeeling | 'da hozer' | [email protected]
-----"If this message isn't misspelled, I didn't write it" -- Me -----
"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's
why I draw cartoons. It's my life." -- Charles Shulz

2001-07-15 09:06:02

by Andi Kleen

Subject: Re: [Lse-tech] Re: CPU affinity & IPI latency

On Sun, Jul 15, 2001 at 02:42:55AM -0500, Troy Benjegerdes wrote:
> On Thu, Jul 12, 2001 at 04:40:17PM -0700, Mike Kravetz wrote:
> > This discussion was started on '[email protected]'.
> > I'm widening the distribution in the hope of getting more input.
> >
> > It started when Andi Kleen noticed that a single 'CPU Hog' task
> > was being bounced back and forth between the 2 CPUs on his 2-way
> > system. I had seen similar behavior when running the context
> > switching test of LMbench. When running lat_ctx with only two
> > threads on an 8 CPU system, one would ?expect? the two threads
> > to be confined to two of the 8 CPUs in the system. However, what
> > I have observed is that the threads are effectively 'round
> > robined' among all the CPUs and they all end up bearing
> > an equivalent amount of the CPU load. To more easily observe
> > this, increase the number of 'TRIPS' in the benchmark to a really
> > large number.
>
> Did this 'CPU Hog' task do any I/O that might have caused an interrupt?

Yes; it did some I/O, but it spent most of its time doing CPU work
(it was CPU bound).

>
> I'm wondering if the interrupt distribution has anything to do with this..
> are you using any CPU affinity for interrupts? If not, this might explain
> why the processes wind up doing a 'round robin'.

It was a normal Intel SMP box with no special settings and the interrupts
should have been evenly distributed (I did not check it at this time, but
normally they are about evenly distributed over the two cpus)


>
> I'm trying to reproduce this with gzip on a dual Mac G4, and I'm wondering
> if this will be scewed any because all interrupts are directed at cpu0, so
> anything that generates input or output on the network or serial is going
> to tend to wind up on cpu0.
>
> I'd also like to figure out what the IPI latency actually is.. Does anyone
> have any suggestions how to measure this? I would hope it's lower on the
> dual 7410 machine I have since I have 4-stage pipelines as opposed to 20
> odd stages that Pentium III class machines do..

I'm not sure about the Mac, but on x86 linux boxes the SMP bootup synchronizes
the time stamp counters of the CPUs. If they are synchronized on your box
as well, you could store the TSC value in the IPI sender and compute the
average latency in the receiver.
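
(A rough sketch of that measurement for 32-bit x86, assuming synchronized
TSCs; the ipi_sent_tsc variable, the read_tsc() helper and the hook points are
invented for illustration.)

    static unsigned long long ipi_sent_tsc;

    static inline unsigned long long read_tsc(void)
    {
        unsigned long long t;

        __asm__ __volatile__("rdtsc" : "=A" (t));   /* EDX:EAX on i386 */
        return t;
    }

    /* sender, just before sending the reschedule IPI: */
        ipi_sent_tsc = read_tsc();

    /* receiver, at the top of the reschedule IPI handler: */
        printk("resched IPI latency: %lu cycles\n",
               (unsigned long)(read_tsc() - ipi_sent_tsc));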


-Andi

2001-07-15 17:02:11

by Troy Benjegerdes

Subject: Re: [Lse-tech] Re: CPU affinity & IPI latency

On Sun, Jul 15, 2001 at 11:05:43AM +0200, Andi Kleen wrote:
> On Sun, Jul 15, 2001 at 02:42:55AM -0500, Troy Benjegerdes wrote:
> > On Thu, Jul 12, 2001 at 04:40:17PM -0700, Mike Kravetz wrote:
> > > This discussion was started on '[email protected]'.
> > > I'm widening the distribution in the hope of getting more input.
> > >
> > > It started when Andi Kleen noticed that a single 'CPU Hog' task
> > > was being bounced back and forth between the 2 CPUs on his 2-way
> > > system. I had seen similar behavior when running the context
> > > switching test of LMbench. When running lat_ctx with only two
> > > threads on an 8 CPU system, one would ?expect? the two threads
> > > to be confined to two of the 8 CPUs in the system. However, what
> > > I have observed is that the threads are effectively 'round
> > > robined' among all the CPUs and they all end up bearing
> > > an equivalent amount of the CPU load. To more easily observe
> > > this, increase the number of 'TRIPS' in the benchmark to a really
> > > large number.
> >
> > Did this 'CPU Hog' task do any I/O that might have caused an interrupt?
>
> Yes; it did some IO, but most of its time was doing CPU work
> (it was CPU bound)

I shouldn't have sent that first email out so quickly. About 5 minutes
after I sent it, I saw gzip switching CPUs a lot in top.

The next very interesting thing I found was that if I run something like
'while true; do true; done' to load the other CPU while gzip is running,
gzip runs faster. The time change isn't very large, and appears to be just
barely detectable above the noise, but it definitely appears that gzip is
bouncing back and forth between cpus if both are idle.

I'm tempted to try the somewhat brute-force approach of increasing
PROC_CHANGE_PENALTY, which is currently set to 20 (on ppc), to something
like 100. Would this be an adequate 'short-term' solution until there is
some sort of multi-queue scheduler that everyone (and Linus) likes? What are the
drawbacks of increasing PROC_CHANGE_PENALTY?

> I'm not sure about the Mac, but on x86 linux boxes the SMP bootup synchronizes
> the time stamp counters of the CPUs. If they are synchronized on your box
> also you could store TSC value in the IPI sender and compute the average
> latency in the receiver.

The PPC has a 'timebase register' which is effectively the same thing as
the TSC, except it counts processor bus ticks/4, not cpu cycles.
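
(For reference, a minimal sketch of reading the lower half of the timebase on
32-bit PPC, which is enough for short intervals; the helper name is invented.)

    static inline unsigned long read_tb_lower(void)
    {
        unsigned long tbl;

        __asm__ __volatile__("mftb %0" : "=r" (tbl));
        return tbl;
    }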

--
Troy Benjegerdes | master of mispeeling | 'da hozer' | [email protected]
-----"If this message isn't misspelled, I didn't write it" -- Me -----
"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's
why I draw cartoons. It's my life." -- Charles Shulz

2001-07-15 19:59:19

by Davide Libenzi

Subject: Re: CPU affinity & IPI latency


On 13-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
>> A real-world example of this issue.
>>
>> I was gzipping a large (~800MB) file on a dual athlon box. the gzip prcess
>> was bouncing back and forth between the two CPUs. I actually was able to
>> gzip faster by starting up setiathome to keep one CPU busy and force the
>> scheduler to keep the gzip on a single CPU (I ran things several times to
>> verify it was actually faster)
>>
>> David Lang
>
> That does sound like the same behavior I was seeing with lat_ctx. Like
> I said in my previous note, the scheduler does try to take CPU affinity
> into account. reschedule_idle() does a pretty good job of determining
> what CPU a task should run on. In the case of lat_ctx (and I believe
> your use of gzip), the 'fast path' in reschedule_idle() is taken because
> the CPU associated with the awakened task is idle. However, before
> schedule() is run on the 'target' CPU, schedule() is run on another
> CPU and the task is scheduled there.
>
> The root cause of this situation is the delay between the time
> reschedule_idle() determines what is the best CPU a task should run
> on, and the time schedule() is actually ran on that CPU.
>
> I have toyed with the idea of 'temporarily binding' a task to a CPU
> for the duration of the delay between reschedule_idle() and schedule().
> It would go something like this,
>
> - Define a new field in the task structure 'saved_cpus_allowed'.
> With a little collapsing of existing fields, there is room to put
> this on the same cache line as 'cpus_allowed'.
> - In reschedule_idle() if we determine that the best CPU for a task
> is the CPU it is associated with (p->processor), then temporarily
> bind the task to that CPU. The task is temporarily bound to the
> CPU by overwriting the 'cpus_allowed' field such that the task can
> only be scheduled on the target CPU. Of course, the original
> value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
> - In schedule(), the loop which examines all tasks on the runqueue
> will restore the value of 'cpus_allowed'.
>
> This would preserve the 'best CPU' decision made by reschedule_idle().
> On the down side, 'temporarily bound' tasks could not be scheduled
> until schedule() is run on their associated CPUs. This could potentially
> waste CPU cycles, and delay context switches. In addition, it is
> quite possible that while a task is 'temporarily bound' the state of
> the system could change in such a way that the best CPU is no longer
> best.
>
> There appears to be a classic tradeoff between CPU affinity and
> context switch time.

The problem with the current scheduler is that it acts like an infinite
feedback system.
When we're going to decide whether we have to move a task, we analyze the
status at the current time without taking into account the system status at
previous time values.
But the feedback we send ( the IPI to move the task ) takes a finite time to
hit the target CPU and, meanwhile, the system status changes.
So we're going to apply feedback calculated at time T0 to a time Tn, and this
will result in system auto-oscillation that we perceive as tasks bouncing
between CPUs.
This is kind of an electronics example, but it applies to all feedback systems.
The solution to this problem, given that we can't have a zero feedback delivery
time, is:

1) lower the feedback amount, that is, try to minimize task movements

2) a low pass filter, that is, when we're going to decide the fate ( move )
of a task, we have to weight the system status with the one that it had
at previous time values
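
(In code terms, item 2 boils down to something like an exponentially weighted
load average per CPU; an illustrative fragment, with cpu_load[] and
nr_running_on() invented for the example.)

    /* updated periodically, e.g. from the timer tick */
    cpu_load[cpu] = (3 * cpu_load[cpu] + nr_running_on(cpu)) / 4;

    /* Migration decisions then compare the filtered cpu_load[] values
     * instead of the instantaneous runqueue lengths, so a short spike on
     * one CPU does not immediately bounce tasks around. */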





- Davide

2001-07-15 20:11:00

by Andi Kleen

Subject: Re: [Lse-tech] Re: CPU affinity & IPI latency

On Sun, Jul 15, 2001 at 01:02:21PM -0700, Davide Libenzi wrote:
> The problem of the current scheduler is that it acts like an infinite feedback
> system.
> When we're going to decide if we've to move a task we analyze the status at the
> current time without taking in account the system status at previous time
> values.
> But the feedback we send ( IPI to move the task ) take a finite time to hit the
> target CPU and, meanwhile, the system status changes.
> So we're going to apply a feedback calculated in time T0 to a time Tn, and this
> will result in system auto-oscillation that we perceive as tasks bouncing
> between CPUs.
> This is kind of electronic example but it applies to all feedback systems.
> The solution to this problem, given that we can't have a zero feedback delivery
> time, is :
>
> 1) lower the feedback amount, that means, try to minimize task movements
>
> 2) a low pass filter, that means, when we're going to decide the sort ( move )
> of a task, we've to weight the system status with the one that it had
> at previous time values

Nice analysis. I think Mike's proposal of the 'saved_cpus_allowed' field
to temporarily bind the task to the target would act as just such a low
pass filter.

-Andi

2001-07-15 20:16:00

by Andi Kleen

Subject: Re: CPU affinity & IPI latency

On Fri, Jul 13, 2001 at 03:43:05PM -0700, Mike Kravetz wrote:
> - Define a new field in the task structure 'saved_cpus_allowed'.
> With a little collapsing of existing fields, there is room to put
> this on the same cache line as 'cpus_allowed'.
> - In reschedule_idle() if we determine that the best CPU for a task
> is the CPU it is associated with (p->processor), then temporarily
> bind the task to that CPU. The task is temporarily bound to the
> CPU by overwriting the 'cpus_allowed' field such that the task can
> only be scheduled on the target CPU. Of course, the original
> value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
> - In schedule(), the loop which examines all tasks on the runqueue
> will restore the value of 'cpus_allowed'.

This sounds racy, at least with your simple description. Other CPUs
could walk their run queues while the IPI is being processed and reset the
cpus_allowed field too early. I guess the check in the schedule loop that
restores cpus_allowed would need to verify that the current CPU is the only
one set in that task's cpus_allowed. This unfortunately would hang the
process if the CPU for some reason cannot process schedules in a timely
manner, so it may be necessary to also add a timestamp to the task_struct
that allows the restoration of cpus_allowed even from non-target CPUs after
some time.
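
(Something along these lines, where tmp_bind_stamp and TMP_BIND_TIMEOUT are
invented names for the timestamp and timeout; illustrative only.)

    /* in the runqueue scan, on any CPU: */
    if (p->saved_cpus_allowed &&
        ((p->cpus_allowed & (1UL << this_cpu)) ||
         time_after(jiffies, p->tmp_bind_stamp + TMP_BIND_TIMEOUT))) {
        p->cpus_allowed = p->saved_cpus_allowed;
        p->saved_cpus_allowed = 0;
    }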


-Andi

2001-07-15 20:28:02

by Davide Libenzi

Subject: Re: CPU affinity & IPI latency


On 15-Jul-2001 Andi Kleen wrote:
> On Fri, Jul 13, 2001 at 03:43:05PM -0700, Mike Kravetz wrote:
>> - Define a new field in the task structure 'saved_cpus_allowed'.
>> With a little collapsing of existing fields, there is room to put
>> this on the same cache line as 'cpus_allowed'.
>> - In reschedule_idle() if we determine that the best CPU for a task
>> is the CPU it is associated with (p->processor), then temporarily
>> bind the task to that CPU. The task is temporarily bound to the
>> CPU by overwriting the 'cpus_allowed' field such that the task can
>> only be scheduled on the target CPU. Of course, the original
>> value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
>> - In schedule(), the loop which examines all tasks on the runqueue
>> will restore the value of 'cpus_allowed'.
>
> This sounds racy, at least with your simple description. Even other CPUs
> could walk their run queue while the IPI is being processed and reset the
> cpus_allowed flag too early. I guess it would need a check in the
> schedule loop that restores cpus_allowed that the current CPU is the only
> on set in that task's cpus_allowed. This unfortunately would hang the
> process if the CPU for some reason cannot process schedules timely, so
> it may be needed to add a timestamp also to the task_struct that allows
> the restoration of cpus_allowed even from non target CPUs after some
> time.

I previously expressed ( maybe in this thread, I don't remember ) my idea of
not touching ( hacking ) the current scheduler to make it fit SMP, which IMHO
has different problems and needs different answers.
Anyway, as I said a couple of messages ago, the solution to this problem could
be to move the task to a private ( per CPU ) wakeup list that will be run when
the idle CPU comes up.





- Davide

2001-07-16 01:00:43

by Mike Kravetz

Subject: Re: [Lse-tech] Re: CPU affinity & IPI latency

On Sun, Jul 15, 2001 at 12:00:38PM -0500, Troy Benjegerdes wrote:
>
> The next very interesting thing I found was that if I run something like
> 'while true; do true; done' to load the other CPU while gzip is running,
> gzip runs faster. The time change isn't very large, and appears to be just
> barely detectable above the noise, but it definetly appears that gzip is
> bouncing back and forth between cpus if both are idle.
>
> I'm tempted to try the somewhat brute-force approach of increasing
> PROC_CHANGE_PENALTY, which is currently set to 20 (on ppc) to something
> like 100. Would this be an adequate 'short-term' solution util there is
> some sort of multi-queue scheduler that everyone Linus likes? What are the
> drawbacks of increasing PROC_CHANGE_PENALTY?

I'm pretty sure changing the value of PROC_CHANGE_PENALTY will not help
in this case. PROC_CHANGE_PENALTY becomes increasingly important as
the number of running tasks is increased. In the case of simply running
one task (like gzip) on your 2 CPU system, I don't think it will make
any difference.
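
(Roughly speaking, the penalty is just a bonus inside the goodness()
comparison between runnable tasks, so with a single runnable task there is
nothing for it to tip; a simplified sketch from memory of the 2.4 calculation,
not the exact kernel source.)

    weight = p->counter;                     /* remaining timeslice */
    if (p->processor == this_cpu)
        weight += PROC_CHANGE_PENALTY;       /* affinity bonus (20 on ppc) */
    if (p->mm == this_mm)
        weight += 1;                         /* small bonus for same mm */
    weight += 20 - p->nice;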

--
Mike Kravetz [email protected]
IBM Linux Technology Center

2001-07-16 10:13:21

by Hubertus Franke

Subject: Re: CPU affinity & IPI latency


David, you are preaching to the choir.

One cannot have it both ways, at least without "#ifdef"s.
As Mike stated, we made the decision to adhere to current scheduling
semantics purely for the purpose of comparison and increased adaptation
chances. As shown with the LoadBalancing addition to MQ, there are simple
ways to relax and completely eliminate the feedback between the queues, if
one so desires.

As for the solution you proposed for the "switching problem", namely the
wakeup list: I don't think you want a list here. A list would basically
mean that you would need to walk it and come up with a single decision
again. I think what I proposed, namely a per-CPU reschedule reservation
that can be overwritten taking PROC_CHANGE_PENALTY or some form of it into
account, is a better solution.
Open to discussion...

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability)

email: [email protected]



Davide Libenzi <[email protected]>@vger.kernel.org on 07/15/2001
04:02:21 PM

Sent by: [email protected]


To: Mike Kravetz <[email protected]>
cc: [email protected], Andi Kleen <[email protected]>,
[email protected], Larry McVoy <[email protected]>, David
Lang <[email protected]>
Subject: Re: CPU affinity & IPI latency




On 13-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
>> A real-world example of this issue.
>>
>> I was gzipping a large (~800MB) file on a dual athlon box. the gzip
prcess
>> was bouncing back and forth between the two CPUs. I actually was able to
>> gzip faster by starting up setiathome to keep one CPU busy and force the
>> scheduler to keep the gzip on a single CPU (I ran things several times
to
>> verify it was actually faster)
>>
>> David Lang
>
> That does sound like the same behavior I was seeing with lat_ctx. Like
> I said in my previous note, the scheduler does try to take CPU affinity
> into account. reschedule_idle() does a pretty good job of determining
> what CPU a task should run on. In the case of lat_ctx (and I believe
> your use of gzip), the 'fast path' in reschedule_idle() is taken because
> the CPU associated with the awakened task is idle. However, before
> schedule() is run on the 'target' CPU, schedule() is run on another
> CPU and the task is scheduled there.
>
> The root cause of this situation is the delay between the time
> reschedule_idle() determines what is the best CPU a task should run
> on, and the time schedule() is actually ran on that CPU.
>
> I have toyed with the idea of 'temporarily binding' a task to a CPU
> for the duration of the delay between reschedule_idle() and schedule().
> It would go something like this,
>
> - Define a new field in the task structure 'saved_cpus_allowed'.
> With a little collapsing of existing fields, there is room to put
> this on the same cache line as 'cpus_allowed'.
> - In reschedule_idle() if we determine that the best CPU for a task
> is the CPU it is associated with (p->processor), then temporarily
> bind the task to that CPU. The task is temporarily bound to the
> CPU by overwriting the 'cpus_allowed' field such that the task can
> only be scheduled on the target CPU. Of course, the original
> value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
> - In schedule(), the loop which examines all tasks on the runqueue
> will restore the value of 'cpus_allowed'.
>
> This would preserve the 'best CPU' decision made by reschedule_idle().
> On the down side, 'temporarily bound' tasks could not be scheduled
> until schedule() is run on their associated CPUs. This could potentially
> waste CPU cycles, and delay context switches. In addition, it is
> quite possible that while a task is 'temporarily bound' the state of
> the system could change in such a way that the best CPU is no longer
> best.
>
> There appears to be a classic tradeoff between CPU affinity and
> context switch time.

The problem of the current scheduler is that it acts like an infinite
feedback
system.
When we're going to decide if we've to move a task we analyze the status at
the
current time without taking in account the system status at previous time
values.
But the feedback we send ( IPI to move the task ) take a finite time to hit
the
target CPU and, meanwhile, the system status changes.
So we're going to apply a feedback calculated in time T0 to a time Tn, and
this
will result in system auto-oscillation that we perceive as tasks bouncing
between CPUs.
This is kind of electronic example but it applies to all feedback systems.
The solution to this problem, given that we can't have a zero feedback
delivery
time, is :

1) lower the feedback amount, that means, try to minimize task movements

2) a low pass filter, that means, when we're going to decide the sort (
move )
of a task, we've to weight the system status with the one that it
had
at previous time values





- Davide

2001-07-16 15:49:00

by Mike Kravetz

Subject: Re: [Lse-tech] Re: CPU affinity & IPI latency

On Sun, Jul 15, 2001 at 10:15:42PM +0200, Andi Kleen wrote:
> On Fri, Jul 13, 2001 at 03:43:05PM -0700, Mike Kravetz wrote:
> > - Define a new field in the task structure 'saved_cpus_allowed'.
> > With a little collapsing of existing fields, there is room to put
> > this on the same cache line as 'cpus_allowed'.
> > - In reschedule_idle() if we determine that the best CPU for a task
> > is the CPU it is associated with (p->processor), then temporarily
> > bind the task to that CPU. The task is temporarily bound to the
> > CPU by overwriting the 'cpus_allowed' field such that the task can
> > only be scheduled on the target CPU. Of course, the original
> > value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
> > - In schedule(), the loop which examines all tasks on the runqueue
> > will restore the value of 'cpus_allowed'.
>
> This sounds racy, at least with your simple description. Even other CPUs
> could walk their run queue while the IPI is being processed and reset the
> cpus_allowed flag too early. I guess it would need a check in the
> schedule loop that restores cpus_allowed that the current CPU is the only
> on set in that task's cpus_allowed.

You are correct. It is trivial to allow only the 'target' CPU to reset
the cpus_allowed field. This was my first thought. However, as you
state below we would get into trouble if there was an extremely long
delay in that CPU running schedule.

The timestamp sounds reasonable, but I was trying to keep it as simple
as possible.

> This unfortunately would hang the
> process if the CPU for some reason cannot process schedules timely, so
> it may be needed to add a timestamp also to the task_struct that allows
> the restoration of cpus_allowed even from non target CPUs after some
> time.

--
Mike Kravetz [email protected]
IBM Linux Technology Center

2001-07-16 16:13:45

by Davide Libenzi

Subject: Re: CPU affinity & IPI latency


On 16-Jul-2001 Hubertus Franke wrote:
>
> David, you are preaching to the choir.
>
> One cannot have it both ways, at least without "#ifdef"s.
> As Mike stated, we made the decision to adhere to the current scheduling
> semantics purely for the purpose of comparison and to improve the chances
> of adoption.
> As shown with the LoadBalancing addition to MQ, there are simple ways to
> relax and completely eliminate the feedback between the queues, if one so
> desires.
>
> As for the solution you proposed for the "switching problem", namely the
> wakeup list: I don't think you want a list here. A list would basically
> mean that you would need to walk it and come up with a single decision
> again. I think what I proposed, namely a per-CPU reschedule reservation
> that can be overwritten taking PROC_CHANGE_PENALTY ( or some form of it )
> into account, is a better solution.
> Open to discussion...

No: when you're deciding ( in reschedule_idle() ) which idle CPU to kick,
you can inspect the wake lists and, based on their contents, make a better
decision about which idle CPU to wake; a rough sketch is below.
I think that a list, instead of a single task pointer, is a more open solution
that could lead to a more sophisticated choice of the CPU on which to park the
task.
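
As a very rough sketch ( not posted code; wake_pending[] is a placeholder
for whatever per-CPU reservation we end up with, and cpu_logical_map()
handling is ignored for brevity ):

/* Per-CPU "a wakeup is already reserved here" slot ( placeholder ). */
static struct task_struct * wake_pending[NR_CPUS];

/* In reschedule_idle(): prefer an idle CPU with nothing reserved for it,
 * falling back to the oldest idle one, as today. */
static int pick_idle_cpu(unsigned long idle_mask)
{
        int cpu, best = -1;
        cycles_t oldest = ~((cycles_t) 0);

        for (cpu = 0; cpu < smp_num_cpus; cpu++) {
                if (!(idle_mask & (1UL << cpu)))
                        continue;
                if (wake_pending[cpu])          /* IPI/wakeup already queued */
                        continue;
                if (last_schedule(cpu) < oldest) {
                        oldest = last_schedule(cpu);
                        best = cpu;
                }
        }
        return best;    /* -1: every idle CPU already has a pending wakeup */
}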




- Davide

2001-07-16 16:16:05

by Mike Kravetz

[permalink] [raw]
Subject: Re: CPU affinity & IPI latency

On Fri, Jul 13, 2001 at 11:25:21PM -0400, Hubertus Franke wrote:
>
> Mike, could we utilize the existing mechanism such as has_cpu.
>

I like it. Especially the way you eliminated the situation where
we would have multiple tasks waiting for schedule. Hope this is
not a frequent situation!!! The only thing I don't like is the
use of has_cpu to prevent the task from being scheduled. Right
now, I can't think of any problems with it. However, in the past
I have been bitten by using fields for purposes other than what they
were designed for.

--
Mike Kravetz [email protected]
IBM Linux Technology Center

2001-07-16 18:26:51

by Hubertus Franke

[permalink] [raw]
Subject: Re: CPU affinity & IPI latency


Well, actually the sole purpose of <has_cpu> is
to lock a task out of a scheduling decision. Remember,
during transient states there are two tasks that have
the <has_cpu> flag set for a particular cpu.
So I think using <has_cpu> is kosher, and preferred in
this situation, IMHO.

As you saw from Davide Libenzi's reply, he thinks that
a list is more appropriate for more generic decision
support.
But I don't like the fact that you lock down several
tasks at that point. In my solution, you return the
task back to general schedulability, roughly as sketched below.
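
In code, very roughly ( a sketch of the idea only, not my actual patch;
resched_slot[] and the helper names are placeholders ):

/* One reservation slot per CPU; the reserved task stays on the runqueue,
 * but <has_cpu> keeps other CPUs' schedule() from grabbing it. */
static struct task_struct * resched_slot[NR_CPUS];

/* reschedule_idle(), runqueue_lock held: reserve p for 'cpu' unless a
 * reservation ( i.e. a pending IPI ) is already outstanding there. */
static inline int reserve_for_cpu(struct task_struct * p, int cpu)
{
        if (resched_slot[cpu])
                return 0;               /* pending IPI, don't send another */
        p->has_cpu = 1;                 /* hide p from other CPUs' scans   */
        resched_slot[cpu] = p;
        return 1;
}

/* Top of schedule() on the target CPU: drop the reservation so the normal
 * goodness loop can ( and normally will ) pick p; this is the "return the
 * task back to general schedulability" step. */
static inline void release_reservation(int this_cpu)
{
        struct task_struct * p = resched_slot[this_cpu];

        if (p) {
                resched_slot[this_cpu] = NULL;
                p->has_cpu = 0;
        }
}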


Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability)

email: [email protected]






2001-07-16 21:22:48

by Davide Libenzi

[permalink] [raw]
Subject: Re: CPU affinity & IPI latency


On 16-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 11:25:21PM -0400, Hubertus Franke wrote:
>>
>> Mike, could we utilize the existing mechanism such as has_cpu.
>>
>
> I like it. Especially the way you eliminated the situation where
> we would have multiple tasks waiting for schedule. Hope this is
> not a frequent situation!!! The only thing I don't like is the
> use of has_cpu to prevent the task from being scheduled. Right
> now, I can't think of any problems with it. However, in the past
> I have been bitten by using fields for purposes other than what they
> were designed for.

How about this ( draft ) :


struct task_struct {
        ...
        struct task_struct * wlist_next;        /* next task reserved for a CPU */
        ...
};


static union {
        struct schedule_data {
                struct task_struct * curr;
                struct task_struct * wlist;     /* tasks reserved by reschedule_idle() */
                cycles_t last_schedule;
        } schedule_data;
        char __pad [SMP_CACHE_BYTES];
} aligned_data [NR_CPUS] __cacheline_aligned = { {{&init_task, NULL, 0}} };


/* Pop the reserved task, if any, and put it back on the runqueue. */
static inline struct task_struct * wpick(struct schedule_data * sd)
{
        struct task_struct * tsk = sd->wlist;
        if (tsk) {
                sd->wlist = tsk->wlist_next;
                add_to_runqueue(tsk);
        }
        return tsk;
}

/* Hide the task from the other CPUs' schedule() and reserve it for sd's CPU. */
static inline void wpush(struct schedule_data * sd, struct task_struct * tsk)
{
        if (task_on_runqueue(tsk))
                del_from_runqueue(tsk);
        tsk->wlist_next = sd->wlist;
        sd->wlist = tsk;
}


asmlinkage void schedule(void)
{
        ...
        /* a reserved task is taken without rescanning the runqueue */
        if ((next = wpick(sched_data)))
                goto ...;
        ...
}

In reschedule_idle(), just before sending the IPI, we do a wpush() ( see the
sketch below ).
We modify aligned_data[].schedule_data.wlist and tsk->wlist_next under
runqueue_lock, so we don't need another lock.
A slight change is needed in reschedule_idle() to handle the new field.
The pros of this solution are: 1) we are not giving other fields a different
meaning; 2) when the idle CPU calls schedule() it picks the task without
rescanning the runqueue.
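
Something like this on the reschedule_idle() side ( sketch only, not a patch;
reserve_and_kick() and 'target_cpu' are placeholders for whatever the existing
oldest-idle logic already computes ):

/* Called from reschedule_idle() with runqueue_lock held, which is what
 * makes the wlist update safe. */
static void reserve_and_kick(struct task_struct * p, int target_cpu, int this_cpu)
{
        struct schedule_data * sd = &aligned_data[target_cpu].schedule_data;

        wpush(sd, p);                   /* park p in the idle CPU's slot */
        sd->curr->need_resched = 1;     /* poke the idle task over there */
        if (target_cpu != this_cpu)
                smp_send_reschedule(target_cpu);
}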




- Davide

2001-07-16 21:45:42

by Hubertus Franke

[permalink] [raw]
Subject: Re: CPU affinity & IPI latency


Clean, but with this solution you can lock tasks out for several
cycles, if indeed several of them were added in the time before
the target CPU can service the IPI.
You can end up with strange priority inversion problems,
because these tasks are not seen by the general scheduler and
are not even on the runqueue.
That's why I opted for a single slot in my design.

One could opt for the following implementation to ensure only
a single waiting task, namely putting the already reserved one back
on the runqueue if its goodness is worse.


static inline void wpush(struct schedule_data * sd, struct task_struct * tsk)
{
        /* Replace the reserved task only if tsk would preempt it on the
         * target CPU ( sd->curr is the idle task running there ). */
        if (sd->wlist &&
            preemption_goodness(sd->wlist, tsk, sd->curr->processor) > 0) {
                add_to_runqueue(sd->wlist);     /* put the old one back */
                if (task_on_runqueue(tsk))
                        del_from_runqueue(tsk);
                /* obsolete now: tsk->wlist_next = sd->wlist; */
                sd->wlist = tsk;
        }
        /* NB: an empty slot still needs the original wpush handling */
}

I also don't like the fact that with reschedule_idle() you
just put the task onto the runqueue, then take it out again,
only to put it back after the IPI, and that, as it seems,
on every reschedule_idle() path.

In my design one simply wouldn't send the IPI to a target CPU that
already has a pending IPI waiting, and preemption wouldn't happen
based on the goodness values.

I grant that my design seems a bit more intrusive on the code,
but you were arguing just yesterday not to get hung up on
closeness to the current code and semantics.

Comments?

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability)

email: [email protected]




2001-07-16 22:53:28

by Davide Libenzi

[permalink] [raw]
Subject: Re: CPU affinity & IPI latency


On 16-Jul-2001 Hubertus Franke wrote:
>
> Clean, but with this solution you can lock tasks out for several
> cycles, if indeed several of them were added in the time before
> the target CPU can service the IPI.
> You can end up with strange priority inversion problems,
> because these tasks are not seen by the general scheduler and
> are not even on the runqueue.
> That's why I opted for a single slot in my design.

The idea of a list was kind of an extension; to solve this particular problem
we just need a single task pointer ( with no extra member in task_struct ).


> One could opt for the following implementation to ensure only
> a single waiting task, namely putting the already reserved one back
> on the runqueue if its goodness is worse.
>
>
> static inline void wpush(struct schedule_data * sd, struct task_struct * tsk)
> {
>         /* Replace the reserved task only if tsk would preempt it on the
>          * target CPU ( sd->curr is the idle task running there ). */
>         if (sd->wlist &&
>             preemption_goodness(sd->wlist, tsk, sd->curr->processor) > 0) {
>                 add_to_runqueue(sd->wlist);     /* put the old one back */
>                 if (task_on_runqueue(tsk))
>                         del_from_runqueue(tsk);
>                 /* obsolete now: tsk->wlist_next = sd->wlist; */
>                 sd->wlist = tsk;
>         }
>         /* NB: an empty slot still needs the original wpush handling */
> }

wpush() must be called only if *) the target is idle <and> *) the list ( slot )
is empty.
I wouldn't add the goodness checking in there.
Yes, it could happen that another task with greater goodness needs a CPU to
run and, without the goodness check, could not be rescheduled on the idle CPU.
But think about what would happen if the IPI had a zero response time:
the idle CPU would already be running the old ( listed or slotted ) task and
would not be idle anymore.



>
> I also don't like the fact that with reschedule_idle() you
> just put the task onto the runqueue, then take it out again,
> only to put it back after the IPI, and that, as it seems,
> on every reschedule_idle() path.

You could also "suspend" the task by adding an extra field inside the task
struct and checking that field all around the code.
Removing the task from the runqueue does not require spreading extra checks
through the code: it is a simple way to hide a task without modifying a bunch
of code.


> In my design one simply wouldn't send the IPI to a target CPU that
> already has a pending IPI waiting, and preemption wouldn't happen
> based on the goodness values.

You do exactly that with this solution: you reserve the task for
the target idle CPU.
What do you mean by "preemption wouldn't happen based on the goodness values"?



> I grant that my design seems a bit more intrusive on the code,
> but you were arguing just yesterday not to get hung up on
> closeness to the current code and semantics.

Please, don't put words in my mouth.
My point was not to make the code more intrusive but, on the contrary, to make
it lighter by relaxing, in some respects, compatibility with the current SMP
scheduler behaviour.
I already expressed my opinion about that in previous emails.





- Davide