2007-08-02 08:58:36

by Martin Röhricht

[permalink] [raw]
Subject: Scheduling the highest priority task

Hi,

perhaps someone can give me a hint what I should consider to look for in
order to change the ("old" 2.6.21) scheduler such that it schedules the
highest priority task of a given runqueue.
Given a multiprocessor system I currently observe that whenever there
are two tasks on one CPU, the lower priority one is migrated to another
CPU. But I don't realize why this happens. From looking at the source
code I thought it should be the highest priority one (lowest bit set in
the runqueue's bitmap) according to
idx = sched_find_first_bit(array->bitmap);
within move_tasks(). The idx value is then used as an index (surprise)
to the linked list of tasks of this particular priority and one task is
picked:
head = array->queue + idx;
curr = head->prev;
tmp = list_entry(curr, struct task_struct, run_list);

Is my assumption wrong? Using printk()s within this code section makes
the system just hang completely quite soon. The schedstats do not notify
me immediately. So I am a bit lost on how to track down or trace the
problem.

Can anybody confirm that my observations are correct that the scheduler
picks the lowest priority job of a runqueue for migration?
What needs to be changed in order to pick the highest priority one?

Martin


2007-08-02 11:40:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: Scheduling the highest priority task


* Martin Roehricht <[email protected]> wrote:

> perhaps someone can give me a hint what I should consider to look for in
> order to change the ("old" 2.6.21) scheduler such that it schedules the
> highest priority task of a given runqueue.
> Given a multiprocessor system I currently observe that whenever there
> are two tasks on one CPU, the lower priority one is migrated to another
> CPU. But I don't realize why this happens. From looking at the source
> code I thought it should be the highest priority one (lowest bit set in
> the runqueue's bitmap) according to
> idx = sched_find_first_bit(array->bitmap);
> within move_tasks(). The idx value is then used as an index (surprise)
> to the linked list of tasks of this particular priority and one task is
> picked:
> head = array->queue + idx;
> curr = head->prev;
> tmp = list_entry(curr, struct task_struct, run_list);
>
> Can anybody confirm that my observations are correct that the
> scheduler picks the lowest priority job of a runqueue for migration?
> What needs to be changed in order to pick the highest priority one?

in the SMP migration code, the 'old scheduler' indeed picks the lowest
priority one, _except_ if that task is running on another CPU or is too
'cache hot':

if (skip_for_load ||
!can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) {

also, from the priority-queue at 'idx', we pick head->prev, i.e. we
process the list in the opposite order as schedule(). (This got changed
in CFS to process in the same direction - which is more logical and also
yield the most cache-cold tasks for migration.)

i hope this helps.

> Is my assumption wrong? Using printk()s within this code section makes
> the system just hang completely quite soon. The schedstats do not
> notify me immediately. So I am a bit lost on how to track down or
> trace the problem.

yep, printk locks up. You can use my static tracer:

http://people.redhat.com/mingo/latency-tracing-patches/

add explicit static tracepoints to the scheduler code you want to
instrument via trace_special(x,y,z) calls [with parameters that interest
you most], and you can read out the trace via:

http://people.redhat.com/mingo/latency-tracing-patches/trace-it.c

Ingo

2007-08-02 15:00:29

by Martin Röhricht

[permalink] [raw]
Subject: Re: Scheduling the highest priority task

On 08/02/2007 01:40 PM, Ingo Molnar wrote:
> in the SMP migration code, the 'old scheduler' indeed picks the lowest
> priority one, _except_ if that task is running on another CPU or is too
> 'cache hot':

But why is it, that the scheduler picks the lowest priority one? I
thought sched_find_first_bit() picks the index of the lowest order bit
in the bitmap and thus the highest priority job. Is that wrong?
What needs to be changed to let the scheduler pick the highest priority
task from a given runqueue?
I am very confused ...

Martin

2007-08-02 15:04:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: Scheduling the highest priority task


* Martin Roehricht <[email protected]> wrote:

> On 08/02/2007 01:40 PM, Ingo Molnar wrote:
> >in the SMP migration code, the 'old scheduler' indeed picks the lowest
> >priority one, _except_ if that task is running on another CPU or is too
> >'cache hot':
>
> But why is it, that the scheduler picks the lowest priority one? I
> thought sched_find_first_bit() picks the index of the lowest order bit
> in the bitmap and thus the highest priority job. Is that wrong? What
> needs to be changed to let the scheduler pick the highest priority
> task from a given runqueue? I am very confused ...

it first picks the lowest index (i.e. the highest priority active
priority-queue), but within those tasks (each task in that priority
queue has equal priority) the load-balancer has freedom to pick any.
Based on performance data we went for picking from the tail of the
queue.

Ingo

2007-08-02 15:14:18

by Martin Röhricht

[permalink] [raw]
Subject: Re: Scheduling the highest priority task

On 08/02/2007 05:03 PM, Ingo Molnar wrote:
> * Martin Roehricht <[email protected]> wrote:
>
>> On 08/02/2007 01:40 PM, Ingo Molnar wrote:
>> >in the SMP migration code, the 'old scheduler' indeed picks the lowest
>> >priority one, _except_ if that task is running on another CPU or is too
>> >'cache hot':
>>
>> But why is it, that the scheduler picks the lowest priority one? I
>> thought sched_find_first_bit() picks the index of the lowest order bit
>> in the bitmap and thus the highest priority job. Is that wrong? What
>> needs to be changed to let the scheduler pick the highest priority
>> task from a given runqueue? I am very confused ...
>
> it first picks the lowest index (i.e. the highest priority active
> priority-queue), but within those tasks (each task in that priority
> queue has equal priority) the load-balancer has freedom to pick any.
> Based on performance data we went for picking from the tail of the
> queue.

That's fine with me, that within the same priority-queue any task can be
chosen. But assume two tasks with highly different priorities, such as
105 and 135 are scheduled on the same processor and one of them is now
to be migrated -- shouldn't be the queue with task P=105 considered
first for migration by this code?
Both tasks would use different queues with their own linked lists, right?

Martin

2007-08-02 15:19:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: Scheduling the highest priority task


* Martin Roehricht <[email protected]> wrote:

> That's fine with me, that within the same priority-queue any task can
> be chosen. But assume two tasks with highly different priorities, such
> as 105 and 135 are scheduled on the same processor and one of them is
> now to be migrated -- shouldn't be the queue with task P=105
> considered first for migration by this code? Both tasks would use
> different queues with their own linked lists, right?

yes. What makes you believe that the lower priority one (prio 135) is
chosen? [ as i said before, that will only be chosen if all tasks in the
higher-priority queue (prio 105) are either already running on a CPU or
have recently run so that the cache-hot logic skips them. ]

Ingo

2007-08-02 15:47:12

by Martin Röhricht

[permalink] [raw]
Subject: Re: Scheduling the highest priority task

On 08/02/2007 05:19 PM, Ingo Molnar wrote:
> * Martin Roehricht <[email protected]> wrote:
>
>> That's fine with me, that within the same priority-queue any task can
>> be chosen. But assume two tasks with highly different priorities, such
>> as 105 and 135 are scheduled on the same processor and one of them is
>> now to be migrated -- shouldn't be the queue with task P=105
>> considered first for migration by this code? Both tasks would use
>> different queues with their own linked lists, right?
>
> yes. What makes you believe that the lower priority one (prio 135) is
> chosen? [ as i said before, that will only be chosen if all tasks in the
> higher-priority queue (prio 105) are either already running on a CPU or
> have recently run so that the cache-hot logic skips them. ]

This believe is primarily based on my observations of multiple benchmark
runs and also on your statement earlier: ?in the SMP migration code, the
'old scheduler' indeed picks the lowest priority one?.

Perhaps it is just an unfortunate coincidence that at ~90% of the time a
migration decision is made, the higher priority process is currently
cache hot whereas the lower priority process is not. That would be
unlucky for me as I would like to decide upon specific runtime
circumstances whether the highest or the lowest priority job of a
runqueue should be migrated to another CPU. :-/

Martin

2007-08-02 19:48:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: Scheduling the highest priority task


* Martin Roehricht <[email protected]> wrote:

> On 08/02/2007 05:19 PM, Ingo Molnar wrote:
> >* Martin Roehricht <[email protected]> wrote:
> >
> >>That's fine with me, that within the same priority-queue any task can
> >>be chosen. But assume two tasks with highly different priorities, such
> >>as 105 and 135 are scheduled on the same processor and one of them is
> >>now to be migrated -- shouldn't be the queue with task P=105
> >>considered first for migration by this code? Both tasks would use
> >>different queues with their own linked lists, right?
> >
> >yes. What makes you believe that the lower priority one (prio 135) is
> >chosen? [ as i said before, that will only be chosen if all tasks in the
> >higher-priority queue (prio 105) are either already running on a CPU or
> >have recently run so that the cache-hot logic skips them. ]
>
> This believe is primarily based on my observations of multiple
> benchmark runs and also on your statement earlier: ?in the SMP
> migration code, the 'old scheduler' indeed picks the lowest priority
> one?.

oh, sorry, that was meant to be the 'highest priority one' :-/

so i think you got it all right, i just typoed that first sentence.

Ingo

2007-08-02 21:08:16

by Martin Röhricht

[permalink] [raw]
Subject: Re: Scheduling the highest priority task

On 02.08.2007 21:48, Ingo Molnar wrote:
> * Martin Roehricht <[email protected]> wrote:
>
>> On 08/02/2007 05:19 PM, Ingo Molnar wrote:
>> >* Martin Roehricht <[email protected]> wrote:
>> >
>> >>That's fine with me, that within the same priority-queue any task can
>> >>be chosen. But assume two tasks with highly different priorities, such
>> >>as 105 and 135 are scheduled on the same processor and one of them is
>> >>now to be migrated -- shouldn't be the queue with task P=105
>> >>considered first for migration by this code? Both tasks would use
>> >>different queues with their own linked lists, right?
>> >
>> >yes. What makes you believe that the lower priority one (prio 135) is
>> >chosen? [ as i said before, that will only be chosen if all tasks in the
>> >higher-priority queue (prio 105) are either already running on a CPU or
>> >have recently run so that the cache-hot logic skips them. ]
>>
>> This believe is primarily based on my observations of multiple
>> benchmark runs and also on your statement earlier: ?in the SMP
>> migration code, the 'old scheduler' indeed picks the lowest priority
>> one?.
>
> oh, sorry, that was meant to be the 'highest priority one' :-/
>
> so i think you got it all right, i just typoed that first sentence.

Okay, now I think I understood this part of the code correctly. The
reason why I observe a continous migration of the _lower_ priority tasks
is most probably due to the fact that the higher priority one is
currently running, according to:
can_migrate_task() in move_tasks(), and therein:

if (task_running(rq, p))
return 0;

I tracked down via an extended /proc/schedstats that my tasks fall
frequently into this pitfall. I basically solved it by making use of the
more active push-strategy which is called later by load_balance() once
the move_tasks() function did not succeed. So in case I need the higher
priority tasks, I return immediately from move_tasks().

Thanks for your help,
Martin