2009-11-29 13:23:45

by Mike Galbraith

Subject: [patch] f83f9ac causes tasks running at MAX_PRIO


top - 11:33:19 up 1 min, 21 users, load average: 4.47, 1.44, 0.51
Tasks: 288 total, 1 running, 287 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.2%us, 0.5%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
  568 root      20   0     0    0    0 S    1  0.0  0:00.02  3 scsi_eh_1
 4724 root      40   0  358m  22m 4008 S    1  0.7  0:01.69  0 Xorg
 6064 root      40   0  8872 1316  852 R    1  0.0  0:00.07  3 top
    1 root      40   0  1064  388  324 S    0  0.0  0:01.30  2 init
    2 root      40   0     0    0    0 S    0  0.0  0:00.00  1 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0  0:00.00  0 migration/0

Note init, Xorg and top sitting at PR 40: top shows PR 20 for a nice-0
task (prio 120), so PR 40 is prio 140, i.e. MAX_PRIO. The trace below
comes from this debug check placed in sched_fork():

WARN_ON(current->normal_prio == MAX_PRIO);

[ 0.092016] ------------[ cut here ]------------
[ 0.096008] WARNING: at kernel/sched.c:2591 sched_fork+0xe4/0x1ae()
[ 0.100002] Hardware name: MS-7502
[ 0.104002] Modules linked in:
[ 0.108191] Pid: 0, comm: swapper Not tainted 2.6.32-tip-smpx #956
[ 0.112002] Call Trace:
[ 0.116004] [<ffffffff810365ab>] ? sched_fork+0xe4/0x1ae
[ 0.120004] [<ffffffff81038eb8>] warn_slowpath_common+0x77/0xa4
[ 0.124004] [<ffffffff81038ef4>] warn_slowpath_null+0xf/0x11
[ 0.128003] [<ffffffff810365ab>] sched_fork+0xe4/0x1ae
[ 0.132005] [<ffffffff81059a9f>] ? monotonic_to_bootbased+0x26/0x34
[ 0.136004] [<ffffffff8103783b>] copy_process+0x4b9/0x10e4
[ 0.140006] [<ffffffff810bd38b>] ? __get_vm_area_node+0x175/0x1bf
[ 0.144004] [<ffffffff810385b2>] do_fork+0x14c/0x304
[ 0.148005] [<ffffffff810200dd>] ? __ioremap_caller+0x292/0x2fb
[ 0.152005] [<ffffffff812838b0>] ? acpi_os_map_memory+0x12/0x1b
[ 0.156006] [<ffffffff810038d2>] kernel_thread+0x82/0xe0
[ 0.160006] [<ffffffff8146c510>] ? kernel_init+0x0/0x1af
[ 0.164004] [<ffffffff81003930>] ? child_rip+0x0/0x20
[ 0.168004] [<ffffffff81281a48>] ? rest_init+0x1c/0x76
[ 0.172003] [<ffffffff8146cc54>] start_kernel+0x351/0x35c
[ 0.176004] [<ffffffff8146c29a>] x86_64_start_reservations+0xaa/0xae
[ 0.180004] [<ffffffff8146c37f>] x86_64_start_kernel+0xe1/0xe8
[ 0.184007] ---[ end trace 4eaa2a86a8e2da22 ]---

sched: fix task priority bug.

f83f9ac removed a call to effective_prio() in wake_up_new_task(), which
leads to tasks running at MAX_PRIO. That call set both the child's prio
and normal_prio fields to normal_prio(child). Do the same at fork time by
setting both to normal_prio(parent).

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Peter Williams <[email protected]>
LKML-Reference: <new-submission>

---
kernel/sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2609,7 +2609,7 @@ void sched_fork(struct task_struct *p, i
 	/*
 	 * Make sure we do not leak PI boosting priority to the child.
 	 */
-	p->prio = current->normal_prio;
+	p->prio = p->normal_prio = normal_prio(current);
 
 	if (!rt_prio(p->prio))
 		p->sched_class = &fair_sched_class;


2009-12-02 11:46:35

by Peter Zijlstra

Subject: Re: [patch] f83f9ac causes tasks running at MAX_PRIO

On Sun, 2009-11-29 at 14:23 +0100, Mike Galbraith wrote:

> sched: fix task priority bug.
>
> f83f9ac removed a call to effective_prio() in wake_up_new_task(), which
> leads to tasks running at MAX_PRIO. That call set both the child's prio
> and normal_prio fields to normal_prio(child). Do the same at fork time by
> setting both to normal_prio(parent).
>
> Signed-off-by: Mike Galbraith <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Peter Williams <[email protected]>
> LKML-Reference: <new-submission>
>
> ---
> kernel/sched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -2609,7 +2609,7 @@ void sched_fork(struct task_struct *p, i
>  	/*
>  	 * Make sure we do not leak PI boosting priority to the child.
>  	 */
> -	p->prio = current->normal_prio;
> +	p->prio = p->normal_prio = normal_prio(current);
>  
>  	if (!rt_prio(p->prio))
>  		p->sched_class = &fair_sched_class;
>

Damn PI stuff makes my head hurt ;-)

So we've got:

->prio        - the actual effective priority  [ prio scale ]
->normal_prio - the task's normal priority     [ prio scale ]
->static_prio - SCHED_OTHER's nice value       [ prio scale ]
->rt_priority - SCHED_FIFO/RR prio value       [ sched_param scale ]

[ with prio scale being:

  [0, MAX_RT_PRIO-1]       [MAX_RT_PRIO, MAX_PRIO-1]
  RT-100, RT-99 .. RT-1    NICE-20 .. NICE+19
]
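
For reference, the scale above in macro form; a sketch of the
2.6.32-era definitions (the MAX_* constants from include/linux/sched.h,
the conversion helpers from kernel/sched.c):

#define MAX_USER_RT_PRIO	100
#define MAX_RT_PRIO		MAX_USER_RT_PRIO	/* 100 */
#define MAX_PRIO		(MAX_RT_PRIO + 40)	/* 140 */
#define DEFAULT_PRIO		(MAX_RT_PRIO + 20)	/* 120 == nice 0 */

#define NICE_TO_PRIO(nice)	(MAX_RT_PRIO + (nice) + 20)
#define PRIO_TO_NICE(prio)	((prio) - MAX_RT_PRIO - 20)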

So at sched_fork() we do the

p->prio = p->normal_prio;

thing, to unboost.

If that results in MAX_PRIO, then our parent's ->normal_prio is stuffed.

Looking at the code I can see that happening because we've got:

init_idle() doing:
idle->prio = idle->normal_prio = MAX_PRIO;

Which will propagate... like reported.

Now, since the idle-threads usually run on &idle_sched_class, nobody
will actually look at their ->prio, so having that out-of-range might
make sense.

Just needs to get fixed up when we fork a normal thread, which would be
in sched_fork(), now your call to normal_prio() fixes this by setting
everything to ->static_prio for SCHED_OTHER tasks, however

migration_call()
CPU_DEAD:
rq->idle->static_prio = MAX_PRIO;

spoils that too, then again, at that point nothing will fork from that
idle thread.
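
For reference, normal_prio() in 2.6.32's kernel/sched.c reads roughly
as below (a sketch; __normal_prio() simply returns ->static_prio):

static inline int normal_prio(struct task_struct *p)
{
	int prio;

	if (task_has_rt_policy(p))
		prio = MAX_RT_PRIO-1 - p->rt_priority;
	else
		prio = __normal_prio(p);	/* p->static_prio */
	return prio;
}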

Funny thing though, INIT_TASK() sets everything at MAX_PRIO-20.

Ingo, any particular reason we set idle threads at MAX_PRIO? Can't we
simply do something like below and be done with it?

---
kernel/sched.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c0e4e9d..5ad5a66 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6963,7 +6963,6 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
 	__sched_fork(idle);
 	idle->se.exec_start = sched_clock();
 
-	idle->prio = idle->normal_prio = MAX_PRIO;
 	cpumask_copy(&idle->cpus_allowed, cpumask_of(cpu));
 	__set_task_cpu(idle, cpu);
 
@@ -7667,7 +7666,6 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 	spin_lock_irq(&rq->lock);
 	update_rq_clock(rq);
 	deactivate_task(rq, rq->idle, 0);
-	rq->idle->static_prio = MAX_PRIO;
 	__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
 	rq->idle->sched_class = &idle_sched_class;
 	migrate_dead_tasks(cpu);

2009-12-02 12:49:13

by Mike Galbraith

Subject: Re: [patch] f83f9ac causes tasks running at MAX_PRIO

On Wed, 2009-12-02 at 12:46 +0100, Peter Zijlstra wrote:

> init_idle() doing:
> idle->prio = idle->normal_prio = MAX_PRIO;
>
> Which will propagate... like reported.
>
> Now, since the idle-threads usually run on &idle_sched_class, nobody
> will actually look at their ->prio, so having that out-of-range might
> make sense.
>
> Just needs to get fixed up when we fork a normal thread, which would be
> in sched_fork(), now your call to normal_prio() fixes this by setting
> everything to ->static_prio for SCHED_OTHER tasks, however
>
> migration_call()
> CPU_DEAD:
> rq->idle->static_prio = MAX_PRIO;
>
> spoils that too..

Darn.

> Ingo, any particular reason we set idle threads at MAX_PRIO? Can't we
> simply do something like below and be done with it?

Hysterical reasons? That might have been a doorstop conversion kit for
O(1), but boots fine with CFS, and prio 40 tasks are history.

> ---
> kernel/sched.c | 2 --
> 1 files changed, 0 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index c0e4e9d..5ad5a66 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -6963,7 +6963,6 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
>  	__sched_fork(idle);
>  	idle->se.exec_start = sched_clock();
>  
> -	idle->prio = idle->normal_prio = MAX_PRIO;
>  	cpumask_copy(&idle->cpus_allowed, cpumask_of(cpu));
>  	__set_task_cpu(idle, cpu);
>  
> @@ -7667,7 +7666,6 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
>  	spin_lock_irq(&rq->lock);
>  	update_rq_clock(rq);
>  	deactivate_task(rq, rq->idle, 0);
> -	rq->idle->static_prio = MAX_PRIO;
>  	__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
>  	rq->idle->sched_class = &idle_sched_class;
>  	migrate_dead_tasks(cpu);
>

2009-12-02 14:03:22

by Steven Rostedt

Subject: Re: [patch] f83f9ac causes tasks running at MAX_PRIO

On Wed, 2009-12-02 at 12:46 +0100, Peter Zijlstra wrote:
> On Sun, 2009-11-29 at 14:23 +0100, Mike Galbraith wrote:
>
> > sched: fix task priority bug.
> >
> > f83f9ac removed a call to effective_prio() in wake_up_new_task(), which
> > leads to tasks running at MAX_PRIO. That call set both the child's prio
> > and normal_prio fields to normal_prio(child). Do the same at fork time by
> > setting both to normal_prio(parent).
> >
> > Signed-off-by: Mike Galbraith <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Peter Williams <[email protected]>
> > LKML-Reference: <new-submission>
> >
> > ---
> > kernel/sched.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > Index: linux-2.6/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched.c
> > +++ linux-2.6/kernel/sched.c
> > @@ -2609,7 +2609,7 @@ void sched_fork(struct task_struct *p, i
> >  	/*
> >  	 * Make sure we do not leak PI boosting priority to the child.
> >  	 */
> > -	p->prio = current->normal_prio;
> > +	p->prio = p->normal_prio = normal_prio(current);
> >  
> >  	if (!rt_prio(p->prio))
> >  		p->sched_class = &fair_sched_class;
> >
>
> Damn PI stuff makes my head hurt ;-)

I recommend Advil

>
> So we've got:
>
> ->prio - the actual effective priority [ prio scale ]
> ->normal_prio - the task's normal priority [ prio scale ]
> ->static_prio - SCHED_OTHER's nice value [ prio scale ]
> ->rt_priority - SCHED_FIFO/RR prio value [ sched_param scale ]
>
> [ with prio scale being:
>
> [0, MAX_RT_PRIO-1] [MAX_RT_PRIO, MAX_PRIO-1]
> RT-100, RT-99..RT-1 NICE-20, NICE+19
> ]
>
> So at sched_fork() we do the
>
> p->prio = p->normal_prio;
>
> thing, to unboost.
>
> If that results in MAX_PRIO, then our parent's ->normal_prio is stuffed.
>
> Looking at the code I can see that happening because we've got:
>
> init_idle() doing:
> idle->prio = idle->normal_prio = MAX_PRIO;

But this is only called on the idle task, which should never fork.


>
> Which will propagate... like reported.
>
> Now, since the idle-threads usually run on &idle_sched_class, nobody
> will actually look at their ->prio, so having that out-of-range might
> make sense.
>
> Just needs to get fixed up when we fork a normal thread, which would be
> in sched_fork(), now your call to normal_prio() fixes this by setting
> everything to ->static_prio for SCHED_OTHER tasks, however
>
> migration_call()
> CPU_DEAD:
> rq->idle->static_prio = MAX_PRIO;
>
> spoils that too, then again, at that point nothing will fork from that
> idle thread.

Well, that is when the CPU is dead, right?

>
> Funny thing though, INIT_TASK() sets everything at MAX_PRIO-20.

Right, because the init task will fork. But once a task becomes idle, it
should never do anything (but service interrupts).

>
> Ingo, any particular reason we set idle threads at MAX_PRIO? Can't we
> simply do something like below and be done with it?

There probably isn't any reason this can't be done, but I'm thinking we
may be papering over a bug instead of solving one.

-- Steve

>
> ---
> kernel/sched.c | 2 --
> 1 files changed, 0 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index c0e4e9d..5ad5a66 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -6963,7 +6963,6 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
>  	__sched_fork(idle);
>  	idle->se.exec_start = sched_clock();
>  
> -	idle->prio = idle->normal_prio = MAX_PRIO;
>  	cpumask_copy(&idle->cpus_allowed, cpumask_of(cpu));
>  	__set_task_cpu(idle, cpu);
>  
> @@ -7667,7 +7666,6 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
>  	spin_lock_irq(&rq->lock);
>  	update_rq_clock(rq);
>  	deactivate_task(rq, rq->idle, 0);
> -	rq->idle->static_prio = MAX_PRIO;
>  	__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
>  	rq->idle->sched_class = &idle_sched_class;
>  	migrate_dead_tasks(cpu);
>
>

2009-12-02 16:51:58

by Mike Galbraith

Subject: Re: [patch] f83f9ac causes tasks running at MAX_PRIO

On Wed, 2009-12-02 at 09:03 -0500, Steven Rostedt wrote:
> On Wed, 2009-12-02 at 12:46 +0100, Peter Zijlstra wrote:

> > Funny thing though, INIT_TASK() sets everything at MAX_PRIO-20.
>
> Right, because the init task will fork. But once a task becomes idle, it
> should never do anything (but service interrupts).


Seems it forks _as_ the idle thread..

	/*
	 * Make us the idle thread. Technically, schedule() should not be
	 * called from this thread, however somewhere below it might be,
	 * but because we are the idle thread, we just pick up running again
	 * when this runqueue becomes "idle".
	 */
	init_idle(current, smp_processor_id());

	calc_load_update = jiffies + LOAD_FREQ;

	/*
	 * During early bootup we pretend to be a normal task:
	 */
	current->sched_class = &fair_sched_class;

..init task becomes the idle thread in sched_init().. gets to
rest_init(), forks with prio and normal_prio now at MAX_PRIO, but
static_prio still at NICE_0.
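
To make the arithmetic concrete, a tiny userspace model of the fork
path (a sketch only: the struct and helper mirror the task_struct
fields discussed in this thread, none of this is kernel code):

#include <stdio.h>

#define MAX_RT_PRIO	100
#define MAX_PRIO	(MAX_RT_PRIO + 40)	/* 140 */
#define DEFAULT_PRIO	(MAX_RT_PRIO + 20)	/* 120 == nice 0 */

struct task { int prio, normal_prio, static_prio; };

/* SCHED_OTHER branch of normal_prio(): just the static priority. */
static int normal_prio(const struct task *p)
{
	return p->static_prio;
}

int main(void)
{
	/* init_idle() stamps the boot thread before it forks init: */
	struct task idle = { MAX_PRIO, MAX_PRIO, DEFAULT_PRIO };

	struct task child = idle;	/* fork copies the task fields */

	/* Buggy sched_fork(): child inherits the parent's normal_prio. */
	child.prio = idle.normal_prio;
	printf("buggy: child prio %d\n", child.prio);	/* prints 140 */

	/* Proposed fix: recompute both fields from static_prio. */
	child.prio = child.normal_prio = normal_prio(&idle);
	printf("fixed: child prio %d\n", child.prio);	/* prints 120 */
	return 0;
}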

Peter's change looks OK unless I'm missing something. Though...

static void pull_task(struct rq *src_rq, struct task_struct *p,
		      struct rq *this_rq, int this_cpu)
{
	deactivate_task(src_rq, p, 0);
	set_task_cpu(p, this_cpu);
	activate_task(this_rq, p, 0);
	/*
	 * Note that idle threads have a prio of MAX_PRIO, for this test
	 * to be always true for them.
	 */
	check_preempt_curr(this_rq, p, 0);
}

..we have some O(1) comments for future generations to ponder.
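
The comment is stale because post-CFS this test is not a prio
comparison at all; check_preempt_curr() dispatches through the running
task's scheduling class. Roughly, from 2.6.32-era kernel/sched.c:

static inline void check_preempt_curr(struct rq *rq, struct task_struct *p,
				      int flags)
{
	rq->curr->sched_class->check_preempt_curr(rq, p, flags);
}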

-Mike

2009-12-03 03:52:26

by Peter Williams

Subject: Re: [patch] f83f9ac causes tasks running at MAX_PRIO

On 02/12/09 22:49, Mike Galbraith wrote:
> Hysterical reasons? That might have been a doorstop conversion kit for
> O(1), but boots fine with CFS, and prio 40 tasks are history.

In my (not so) humble opinion, there is still a lot of unnecessary cruft
in sched.c that should have been removed as the final (clean up) stage
of implementing CFS (i.e. stuff that was needed for O(1) but no longer
serves a useful purpose). I know that the extra overhead of this code
is probably inconsequential (and the compiler may even optimize some of
it away) but it looks untidy and makes the code harder to understand.

Recent patches that I've submitted were intended as a start to removing
some of this cruft and I had intended to send more patches after they
were accepted. I figured that it was better to do it as a number of
small changes rather than one big one. Should I continue?

Peter

2009-12-03 01:44:24

by Mike Galbraith

Subject: Re: [patch] f83f9ac causes tasks running at MAX_PRIO

On Thu, 2009-12-03 at 11:25 +1000, Peter Williams wrote:
> On 02/12/09 22:49, Mike Galbraith wrote:
> > Hysterical reasons? That might have been a doorstop conversion kit for
> > O(1), but boots fine with CFS, and prio 40 tasks are history.
>
> In my (not so) humble opinion, there is still a lot of unnecessary cruft
> in sched.c that should have been removed as the final (clean up) stage
> of implementing CFS (i.e. stuff that was needed for O(1) but no longer
> serves a useful purpose). I know that the extra overhead of this code
> is probably inconsequential (and the compiler may even optimize some of
> it away) but it looks untidy and makes the code harder to understand.
>
> Recent patches that I've submitted were intended as a start to removing
> some of this cruft and I had intended to send more patches after they
> were accepted. I figured that it was better to do it as a number of
> small changes rather than one big one. Should I continue?

Certainly.

-Mike

2009-12-03 10:02:21

by Peter Zijlstra

Subject: Re: [patch] f83f9ac causes tasks running at MAX_PRIO

On Thu, 2009-12-03 at 11:25 +1000, Peter Williams wrote:

> Recent patches that I've submitted were intended as a start to removing
> some of this cruft and I had intended to send more patches after they
> were accepted. I figured that it was better to do it as a number of
> small changes rather than one big one. Should I continue?

Please, feel free to clean that up.


2009-12-09 09:57:55

by Peter Zijlstra

Subject: [tip:sched/urgent] sched: Fix task priority bug

Commit-ID: 57785df5ac53c70da9fb53696130f3c551bfe1f9
Gitweb: http://git.kernel.org/tip/57785df5ac53c70da9fb53696130f3c551bfe1f9
Author: Peter Zijlstra <[email protected]>
AuthorDate: Fri, 4 Dec 2009 09:59:02 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 9 Dec 2009 10:03:10 +0100

sched: Fix task priority bug

f83f9ac removed a call to effective_prio() in wake_up_new_task(), which
leads to tasks running at MAX_PRIO.

This is caused by the idle thread being set to MAX_PRIO before forking
off init. O(1) used that to make sure idle was always preempted, CFS
uses check_preempt_curr_idle() for that, so we can safely remove this
bit of legacy code.

Reported-by: Mike Galbraith <[email protected]>
Tested-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1259754383.4003.610.camel@laptop>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 6 ------
1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 71eb062..3878f50 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3158,10 +3158,6 @@ static void pull_task(struct rq *src_rq, struct task_struct *p,
 	deactivate_task(src_rq, p, 0);
 	set_task_cpu(p, this_cpu);
 	activate_task(this_rq, p, 0);
-	/*
-	 * Note that idle threads have a prio of MAX_PRIO, for this test
-	 * to be always true for them.
-	 */
 	check_preempt_curr(this_rq, p, 0);
 }
 
@@ -6992,7 +6988,6 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
 	__sched_fork(idle);
 	idle->se.exec_start = sched_clock();
 
-	idle->prio = idle->normal_prio = MAX_PRIO;
 	cpumask_copy(&idle->cpus_allowed, cpumask_of(cpu));
 	__set_task_cpu(idle, cpu);
 
@@ -7696,7 +7691,6 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 	spin_lock_irq(&rq->lock);
 	update_rq_clock(rq);
 	deactivate_task(rq, rq->idle, 0);
-	rq->idle->static_prio = MAX_PRIO;
 	__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
 	rq->idle->sched_class = &idle_sched_class;
 	migrate_dead_tasks(cpu);
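
For reference, the CFS-side mechanism the changelog leans on; a sketch
from 2.6.32-era kernel/sched_idletask.c. Anything activated on a
runqueue preempts the idle task unconditionally, so idle no longer
needs an out-of-range ->prio to lose that comparison:

static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p,
				    int flags)
{
	resched_task(rq->idle);
}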