2007-10-15 14:17:40

by Ingo Molnar

Subject: [git pull] scheduler updates for v2.6.24


Linus, please pull the latest scheduler git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

It contains lots of scheduler updates from lots of people - hopefully
the last big one for quite some time. Most of the focus was on
performance (both micro-performance and scalability/balancing), but
there's the fair-scheduling feature now Kconfig selectable too. Find the
shortlog below.

Code that is touched outside of the scheduler: the KVM bits were acked
by Avi, the net/unix change is trivial and only affects sync wakeups,
ditto the fs/pipe.c changes - but i can push those separately if it
needs an ack from David first.

ABI/API changes:

- new CONFIG_FAIR_USER_SCHED and /sys/kernel/uids/ + uevent API.
- /proc/stat and /proc/<pid>/stat changes for guest-CPU usage [KVM]
- /proc/sched_debug formats changed/enhanced

Testing status: the changes are chronological and all the
interactivity-impacting changes are near the head of the queue and most
of them were done weeks ago, and were thus part of the CFS-v22 backport
series - which was tested by many people. There are no known regressions
at the moment. It's all fully bisectable.

Thanks,

Ingo

------------------>
Alexey Dobriyan (1):
sched: uninline scheduler

Andi Kleen (4):
sched: cleanup: remove unnecessary gotos
sched: cleanup: refactor common code of sleep_on / wait_for_completion
sched: cleanup: refactor normalize_rt_tasks
sched: remove stale comment from sched_group_set_shares()

Arjan van de Ven (1):
Make scheduler debug file operations const

Dhaval Giani (1):
sched: group scheduling, sysfs tunables

Dmitry Adamushko (14):
sched: clean up struct load_stat
sched: clean up schedstat block in dequeue_entity()
sched: sched_setscheduler() fix
sched: add set_curr_task() calls
sched: do not keep current in the tree and get rid of sched_entity::fair_key
sched: optimize task_new_fair()
sched: simplify sched_class::yield_task()
sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()
sched: yield fix
sched: fix __pick_next_entity()
sched: tidy up SCHED_RR
sched: cleanup, remove calc_weighted()
sched: cleanup, make dequeue_entity() and update_stats_wait_end() similar
sched: fix group scheduling for SCHED_BATCH

Gautham R Shenoy (1):
sched: fix rt ptracer monopolizing CPU

Hiroshi Shimamoto (1):
sched: clean up sched_fork()

Ingo Molnar (71):
sched: fix sysctl_sched_child_runs_first flag
sched: resched task in task_new_fair()
sched: small sched_debug cleanup
sched: debug: track maximum 'slice'
sched: uniform tunings
sched: use constants if !CONFIG_SCHED_DEBUG
sched: remove stat_gran
sched: remove precise CPU load
sched: remove precise CPU load calculations #2
sched: track cfs_rq->curr on !group-scheduling too
sched: cleanup: simplify cfs_rq_curr() methods
sched: uninline __enqueue_entity()/__dequeue_entity()
sched: speed up update_load_add/_sub()
sched: clean up calc_weighted()
sched: introduce se->vruntime
sched: move sched_feat() definitions
sched: optimize vruntime based scheduling
sched: simplify check_preempt() methods
sched: wakeup granularity increase
sched: add se->vruntime debugging
sched: remove SCHED_FEAT_SKIP_INITIAL
sched: add more vruntime statistics
sched: debug: update exec_clock only when SCHED_DEBUG
sched: remove wait_runtime limit
sched: remove wait_runtime fields and features
sched: x86: allow single-depth wchan output
sched: fix delay accounting performance regression
sched: prettify /proc/sched_debug output
sched: enhance debug output
sched: kernel/sched_fair.c whitespace cleanups
sched: fair-group sched, cleanups
sched: enable CONFIG_FAIR_GROUP_SCHED=y by default
sched debug: BKL usage statistics
sched: remove unneeded tunables
sched debug: print settings
sched debug: more width for parameter printouts
sched: entity_key() fix
sched: remove condition from set_task_cpu()
sched: remove last_min_vruntime effect
sched: undo some of the recent changes
sched: fix sign check error in place_entity()
sched: fix sched_fork()
sched: remove set_leftmost()
sched: clean up schedstats, cnt -> count
sched: cleanup, remove stale comment
sched: mark scheduling classes as const
sched: whitespace cleanups
sched: vslice fixups for non-0 nice levels
sched: optimize schedule() a bit on SMP
sched: tweak wakeup granularity
sched: run sched_domain_debug() if CONFIG_SCHED_DEBUG=y
sched: break out if printing a warning in sched_domain_debug()
sched: style cleanup
sched: kfree(NULL) is valid
sched: cleanup: rename SCHED_FEAT_USE_TREE_AVG to SCHED_FEAT_TREE_AVG
sched: cleanup: rename task_grp to task_group
sched: cleanup: function prototype cleanups
sched: fix: move the CPU check into ->task_new_fair()
sched: update comment
sched: clean up is_migration_thread()
sched: do not normalize kernel threads via SysRq-N
sched: do not wakeup-preempt with SCHED_BATCH tasks
sched: speed up context-switches a bit
sched: reintroduce cache-hot affinity
sched: debug: increase width of debug line
sched: debug, improve migration statistics
sched: allow the immediate migration of cache-cold tasks
sched: reintroduce topology.h tunings
sched: enable wake-idle on CONFIG_SCHED_MC=y
sched: affine sync wakeups
sched: sync wakeups preempt too

Laurent Vivier (4):
sched: guest CPU accounting: add guest-CPU /proc/stat field
sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fields
sched: guest CPU accounting: maintain stats in account_system_time()
sched: guest CPU accounting: maintain guest state in KVM

Matthias Kaehlcke (1):
sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (4):
sched: fix SMP migration latencies
sched: fix formatting of /proc/sched_debug
sched: cleanup, remove the TASK_NONINTERACTIVE flag
sched: prevent wakeup over-scheduling

Milton Miller (5):
sched: domain sysctl fixes: use kcalloc()
sched: domain sysctl fixes: use for_each_online_cpu()
sched: domain sysctl fixes: unregister the sysctl table before domains
sched: domain sysctl fixes: do not crash on allocation failure
sched: domain sysctl fixes: add terminator comment

Paul E. McKenney (1):
sched: export cpu_clock()

Peter Williams (2):
sched: reduce balance-tasks overhead
sched: isolate SMP balancing code a bit more

Peter Zijlstra (16):
sched: simplify SCHED_FEAT_* code
sched: new task placement for vruntime
sched: simplify adaptive latency
sched: clean up new task placement
sched: add tree based averages
sched: handle vruntime 64-bit overflow
sched: better min_vruntime tracking
sched: add vslice
sched debug: check spread
sched: max_vruntime() simplification
sched: clean up min_vruntime use
sched: speed up and simplify vslice calculations
sched: another wakeup_granularity fix
sched: disable sleeper_fairness on SCHED_BATCH
sched: disable forced preemption by default
sched: activate task_hot() only on fair-scheduled tasks

S.Caglar Onur (1):
sched debug: BKL usage statistics, fix

Srivatsa Vaddagiri (13):
sched: group-scheduler core
sched: revert recent removal of set_curr_task()
sched: fix minor bug in yield
sched: print nr_running and load in /proc/sched_debug
sched: print &rq->cfs stats
sched: clean up code under CONFIG_FAIR_GROUP_SCHED
sched: add fair-user scheduler
sched: group scheduler wakeup latency fix
sched: group scheduler SMP migration fix
sched: group scheduler, fix coding style issues
sched: group scheduler, fix bloat
sched: group scheduler, fix latency
sched: generate uevents for user creation/destruction

Zou Nan hai (1):
sched: some proc entries are missed in sched_domain sys_ctl debug code

Documentation/sched-design-CFS.txt | 67 +
arch/i386/Kconfig | 11
drivers/kvm/kvm.h | 10
drivers/kvm/kvm_main.c | 2
fs/pipe.c | 9
fs/proc/array.c | 17
fs/proc/base.c | 2
fs/proc/proc_misc.c | 15
include/linux/kernel_stat.h | 1
include/linux/sched.h | 108 +-
include/linux/topology.h | 5
init/Kconfig | 21
kernel/delayacct.c | 2
kernel/exit.c | 6
kernel/fork.c | 3
kernel/ksysfs.c | 8
kernel/sched.c | 1526 +++++++++++++++++++++----------------
kernel/sched_debug.c | 282 ++++--
kernel/sched_fair.c | 859 ++++++++------------
kernel/sched_idletask.c | 26
kernel/sched_rt.c | 51 -
kernel/sched_stats.h | 28
kernel/sysctl.c | 37
kernel/user.c | 249 +++++-
net/unix/af_unix.c | 4
25 files changed, 1998 insertions(+), 1351 deletions(-)


2007-10-15 15:05:24

by Ingo Molnar

Subject: Re: [git pull] scheduler updates for v2.6.24


* Ingo Molnar <[email protected]> wrote:

> Linus, please pull the latest scheduler git tree from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

oops, these two cleanups caused build failures in some config variants:

> sched: reduce balance-tasks overhead
> sched: isolate SMP balancing code a bit more

so i dropped them and re-pushed. New shortlog below.

Ingo

------------------>
Alexey Dobriyan (1):
sched: uninline scheduler

Andi Kleen (4):
sched: cleanup: remove unnecessary gotos
sched: cleanup: refactor common code of sleep_on / wait_for_completion
sched: cleanup: refactor normalize_rt_tasks
sched: remove stale comment from sched_group_set_shares()

Arjan van de Ven (1):
Make scheduler debug file operations const

Dhaval Giani (1):
sched: group scheduling, sysfs tunables

Dmitry Adamushko (14):
sched: clean up struct load_stat
sched: clean up schedstat block in dequeue_entity()
sched: sched_setscheduler() fix
sched: add set_curr_task() calls
sched: do not keep current in the tree and get rid of sched_entity::fair_key
sched: optimize task_new_fair()
sched: simplify sched_class::yield_task()
sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()
sched: yield fix
sched: fix __pick_next_entity()
sched: tidy up SCHED_RR
sched: cleanup, remove calc_weighted()
sched: cleanup, make dequeue_entity() and update_stats_wait_end() similar
sched: fix group scheduling for SCHED_BATCH

Gautham R Shenoy (1):
sched: fix rt ptracer monopolizing CPU

Hiroshi Shimamoto (1):
sched: clean up sched_fork()

Ingo Molnar (71):
sched: fix sysctl_sched_child_runs_first flag
sched: resched task in task_new_fair()
sched: small sched_debug cleanup
sched: debug: track maximum 'slice'
sched: uniform tunings
sched: use constants if !CONFIG_SCHED_DEBUG
sched: remove stat_gran
sched: remove precise CPU load
sched: remove precise CPU load calculations #2
sched: track cfs_rq->curr on !group-scheduling too
sched: cleanup: simplify cfs_rq_curr() methods
sched: uninline __enqueue_entity()/__dequeue_entity()
sched: speed up update_load_add/_sub()
sched: clean up calc_weighted()
sched: introduce se->vruntime
sched: move sched_feat() definitions
sched: optimize vruntime based scheduling
sched: simplify check_preempt() methods
sched: wakeup granularity increase
sched: add se->vruntime debugging
sched: remove SCHED_FEAT_SKIP_INITIAL
sched: add more vruntime statistics
sched: debug: update exec_clock only when SCHED_DEBUG
sched: remove wait_runtime limit
sched: remove wait_runtime fields and features
sched: x86: allow single-depth wchan output
sched: fix delay accounting performance regression
sched: prettify /proc/sched_debug output
sched: enhance debug output
sched: kernel/sched_fair.c whitespace cleanups
sched: fair-group sched, cleanups
sched: enable CONFIG_FAIR_GROUP_SCHED=y by default
sched debug: BKL usage statistics
sched: remove unneeded tunables
sched debug: print settings
sched debug: more width for parameter printouts
sched: entity_key() fix
sched: remove condition from set_task_cpu()
sched: remove last_min_vruntime effect
sched: undo some of the recent changes
sched: fix sign check error in place_entity()
sched: fix sched_fork()
sched: remove set_leftmost()
sched: clean up schedstats, cnt -> count
sched: cleanup, remove stale comment
sched: mark scheduling classes as const
sched: whitespace cleanups
sched: vslice fixups for non-0 nice levels
sched: optimize schedule() a bit on SMP
sched: tweak wakeup granularity
sched: run sched_domain_debug() if CONFIG_SCHED_DEBUG=y
sched: break out if printing a warning in sched_domain_debug()
sched: style cleanup
sched: kfree(NULL) is valid
sched: cleanup: rename SCHED_FEAT_USE_TREE_AVG to SCHED_FEAT_TREE_AVG
sched: cleanup: rename task_grp to task_group
sched: cleanup: function prototype cleanups
sched: fix: move the CPU check into ->task_new_fair()
sched: update comment
sched: clean up is_migration_thread()
sched: do not normalize kernel threads via SysRq-N
sched: do not wakeup-preempt with SCHED_BATCH tasks
sched: speed up context-switches a bit
sched: reintroduce cache-hot affinity
sched: debug: increase width of debug line
sched: debug, improve migration statistics
sched: allow the immediate migration of cache-cold tasks
sched: reintroduce topology.h tunings
sched: enable wake-idle on CONFIG_SCHED_MC=y
sched: affine sync wakeups
sched: sync wakeups preempt too

Laurent Vivier (4):
sched: guest CPU accounting: add guest-CPU /proc/stat field
sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fields
sched: guest CPU accounting: maintain stats in account_system_time()
sched: guest CPU accounting: maintain guest state in KVM

Matthias Kaehlcke (1):
sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (4):
sched: fix SMP migration latencies
sched: fix formatting of /proc/sched_debug
sched: cleanup, remove the TASK_NONINTERACTIVE flag
sched: prevent wakeup over-scheduling

Milton Miller (5):
sched: domain sysctl fixes: use kcalloc()
sched: domain sysctl fixes: use for_each_online_cpu()
sched: domain sysctl fixes: unregister the sysctl table before domains
sched: domain sysctl fixes: do not crash on allocation failure
sched: domain sysctl fixes: add terminator comment

Paul E. McKenney (1):
sched: export cpu_clock()

Peter Zijlstra (16):
sched: simplify SCHED_FEAT_* code
sched: new task placement for vruntime
sched: simplify adaptive latency
sched: clean up new task placement
sched: add tree based averages
sched: handle vruntime 64-bit overflow
sched: better min_vruntime tracking
sched: add vslice
sched debug: check spread
sched: max_vruntime() simplification
sched: clean up min_vruntime use
sched: speed up and simplify vslice calculations
sched: another wakeup_granularity fix
sched: disable sleeper_fairness on SCHED_BATCH
sched: disable forced preemption by default
sched: activate task_hot() only on fair-scheduled tasks

S.Caglar Onur (1):
sched debug: BKL usage statistics, fix

Srivatsa Vaddagiri (13):
sched: group-scheduler core
sched: revert recent removal of set_curr_task()
sched: fix minor bug in yield
sched: print nr_running and load in /proc/sched_debug
sched: print &rq->cfs stats
sched: clean up code under CONFIG_FAIR_GROUP_SCHED
sched: add fair-user scheduler
sched: group scheduler wakeup latency fix
sched: group scheduler SMP migration fix
sched: group scheduler, fix coding style issues
sched: group scheduler, fix bloat
sched: group scheduler, fix latency
sched: generate uevents for user creation/destruction

Zou Nan hai (1):
sched: some proc entries are missed in sched_domain sys_ctl debug code

Documentation/sched-design-CFS.txt | 67 +
arch/i386/Kconfig | 11
drivers/kvm/kvm.h | 10
drivers/kvm/kvm_main.c | 2
fs/pipe.c | 9
fs/proc/array.c | 17
fs/proc/base.c | 2
fs/proc/proc_misc.c | 15
include/linux/kernel_stat.h | 1
include/linux/sched.h | 99 +-
include/linux/topology.h | 5
init/Kconfig | 21
kernel/delayacct.c | 2
kernel/exit.c | 6
kernel/fork.c | 3
kernel/ksysfs.c | 8
kernel/sched.c | 1444 +++++++++++++++++++++----------------
kernel/sched_debug.c | 282 ++++---
kernel/sched_fair.c | 811 ++++++++------------
kernel/sched_idletask.c | 8
kernel/sched_rt.c | 19
kernel/sched_stats.h | 28
kernel/sysctl.c | 37
kernel/user.c | 249 ++++++
net/unix/af_unix.c | 4
25 files changed, 1872 insertions(+), 1288 deletions(-)

2007-10-15 18:46:07

by Andrew Morton

Subject: Re: [git pull] scheduler updates for v2.6.24

On Mon, 15 Oct 2007 16:17:23 +0200
Ingo Molnar <[email protected]> wrote:

> Linus, please pull the latest scheduler git tree from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

Did Paul Jackson's crash get fixed?

2007-10-15 18:53:37

by Ingo Molnar

Subject: Re: [git pull] scheduler updates for v2.6.24


* Andrew Morton <[email protected]> wrote:

> On Mon, 15 Oct 2007 16:17:23 +0200
> Ingo Molnar <[email protected]> wrote:
>
> > Linus, please pull the latest scheduler git tree from:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
>
> Did Paul Jackson's crash get fixed?

yes - that crash was a showstopper that was holding up the pull request
for 2 days. Paul bisected it down to the culprit and the fix was to do
this in wake_up_new_task():

- if (!p->sched_class->task_new || !current->se.on_rq) {
+ if (!p->sched_class->task_new || !current->se.on_rq || !rq->cfs.curr) {

(during early bootup the cfs_rq has no curr pointer yet.) It's not clear
why this race did not trigger earlier. (and the two checks can probably
be consolidated into a single "!rq->cfs.curr" condition.)
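
The extra condition can be sketched in isolation. This is a minimal
userspace model, not the kernel code: the structs are cut-down stand-ins
that keep only the fields the test reads, and must_skip_task_new() and
demo_task_new() are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; NOT the real kernel definitions. */
struct sched_entity { int on_rq; };
struct cfs_rq { struct sched_entity *curr; };
struct rq { struct cfs_rq cfs; };

struct task_struct;
struct sched_class {
	void (*task_new)(struct rq *rq, struct task_struct *p);
};
struct task_struct {
	const struct sched_class *sched_class;
	struct sched_entity se;
};

/* Hypothetical no-op ->task_new() hook, only there so a class can have one. */
void demo_task_new(struct rq *rq, struct task_struct *p)
{
	(void)rq;
	(void)p;
}

/*
 * Hypothetical helper: true when the ->task_new() path must be skipped
 * for new task 'p', given the task that is forking it ('cur').
 */
bool must_skip_task_new(const struct rq *rq, const struct task_struct *p,
			const struct task_struct *cur)
{
	assert(rq && p && cur && p->sched_class);

	return !p->sched_class->task_new ||	/* class has no hook      */
	       !cur->se.on_rq ||		/* forking task not queued */
	       !rq->cfs.curr;			/* added: no curr entity   */
}
```

With a runqueue whose cfs.curr is still NULL the helper returns true even
when the first two conditions would have let the call through.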

Ingo

2007-10-16 00:11:22

by Nick Piggin

Subject: Re: [git pull] scheduler updates for v2.6.24

On Tuesday 16 October 2007 00:17, Ingo Molnar wrote:
> Linus, please pull the latest scheduler git tree from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
>
> It contains lots of scheduler updates from lots of people - hopefully
> the last big one for quite some time. Most of the focus was on
> performance (both micro-performance and scalability/balancing), but
> there's the fair-scheduling feature now Kconfig selectable too. Find the
> shortlog below.

Nice work...

However it's a pity all the balancing stuff got wildly changed
in 2.6.23 and then somewhat changed back again now.

Despite appearances, a lot of those things weren't actually
*completely* arbitrary values. I fear that it will make finding
performance regressions harder than it should have...

Anyway.

2007-10-16 10:09:06

by Ingo Molnar

Subject: Re: [git pull] scheduler updates for v2.6.24


* Thomas Backlund <[email protected]> wrote:

> How does this one compare to the v22 you released earlier ?

v22 has most of it included.

> I'm thinking of backporting any fixes/optimizations to 2.6.22 (and
> possibly 2.6.23)

i have already backported it as v22.1 - will release it within a few
days. (once the currently open regressions have been fixed)

Ingo

2007-10-16 10:12:52

by Ingo Molnar

Subject: Re: [git pull] scheduler updates for v2.6.24


* Ingo Molnar <[email protected]> wrote:

> * Thomas Backlund <[email protected]> wrote:
>
> > How does this one compare to the v22 you released earlier ?
>
> v22 has most of it included.
>
> > I'm thinking of backporting any fixes/optimizations to 2.6.22 (and
> > possibly 2.6.23)
>
> i have already backported it as v22.1 - will release it within a few
> days. (once the currently open regressions have been fixed)

i've uploaded what i have at the moment, to:

http://people.redhat.com/mingo/cfs-scheduler/devel/sched-cfs-v2.6.23.1-v22.1-rc0.patch

Ingo

2007-10-16 10:13:33

by Thomas Backlund

Subject: Re: [git pull] scheduler updates for v2.6.24

Ingo Molnar wrote:
> Linus, please pull the latest scheduler git tree from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
>
> It contains lots of scheduler updates from lots of people - hopefully
> the last big one for quite some time. Most of the focus was on
> performance (both micro-performance and scalability/balancing), but
> there's the fair-scheduling feature now Kconfig selectable too. Find the
> shortlog below.
>


How does this one compare to the v22 you released earlier ?

I'm thinking of backporting any fixes/optimizations to 2.6.22
(and possibly 2.6.23)

--
Thomas

2007-10-16 11:01:33

by Thomas Backlund

Subject: Re: [git pull] scheduler updates for v2.6.24

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>> * Thomas Backlund <[email protected]> wrote:
>>
>>> How does this one compare to the v22 you released earlier ?
>> v22 has most of it included.
>>

OK, that's what I thought

>>> I'm thinking of backporting any fixes/optimizations to 2.6.22 (and
>>> possibly 2.6.23)
>> i have already backported it as v22.1 - will release it within a few
>> days. (once the currently open regressions have been fixed)
>

OK

> i've uploaded what i have at the moment, to:
>
> http://people.redhat.com/mingo/cfs-scheduler/devel/sched-cfs-v2.6.23.1-v22.1-rc0.patch
>
> Ingo

Big thanks for your work...

Now I just have to see if I can get it to work with the -hrt series and
then I'll be really happy ;-)

--
Thomas

2007-10-16 22:17:48

by Gabriel C

Subject: Re: [git pull] scheduler updates for v2.6.24

Ingo Molnar wrote:
> * Andrew Morton <[email protected]> wrote:
>
>> On Mon, 15 Oct 2007 16:17:23 +0200
>> Ingo Molnar <[email protected]> wrote:
>>
>>> Linus, please pull the latest scheduler git tree from:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
>> Did Paul Jackson's crash get fixed?
>
> yes - that crash was a showstopper that was holding up the pull request
> for 2 days. Paul bisected it down to the culprit and the fix was to do
> this in wake_up_new_task():
>
> - if (!p->sched_class->task_new || !current->se.on_rq) {
> + if (!p->sched_class->task_new || !current->se.on_rq || !rq->cfs.curr) {
>
> (during early bootup the cfs_rq has no curr pointer yet.) It's not clear
> why this race did not trigger earlier. (and the two checks can probably
> be consolidated into a single "!rq->cfs.curr" condition.)

This may be unrelated, but my box is now killed after this merge.

When I do not do much on the box I get maybe 6h of uptime; when doing some work (compiling etc.) it freezes at random.

I was finally able to capture the Oops:

...

[15692.917111] BUG: unable to handle kernel NULL pointer dereference at virtual address 00000044
[15692.917159] printing eip:
[15692.917174] c0111f90
[15692.917185] *pde = 00000000
[15692.917200] Oops: 0000 [#1]
[15692.917208] PREEMPT SMP
[15692.917240] Modules linked in: fuse netconsole configfs pc87360 hwmon_vid eeprom adm1021 uhci_hcd sr_mod shpchp pci_hotplug ohci_hcd iTCO_wdt iTCO_vendor_support intel_agp i82860_edac i2c_i801 ehci_hcd usbcore edac_core cdrom agpgart 3c59x mii ext4dev jbd2 capability commoncap loop lp parport_pc parport evdev
[15692.917623] CPU: 0
[15692.917625] EIP: 0060:[<c0111f90>] Not tainted VLI
[15692.917629] EFLAGS: 00010046 (2.6.23-g65a6ec0d #330)
[15692.917661] EIP is at pick_next_task_fair+0x1f/0x2d
[15692.917672] eax: c150a7b8 ebx: 00000000 ecx: 00000000 edx: 00000000
[15692.917689] esi: c1507a48 edi: 00000000 ebp: 00eaaf7a esp: cb1fdf14
[15692.917701] ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
[15692.917715] Process sed (pid: 28999, ti=cb1fc000 task=cfdc3500 task.ti=cb1fc000)
[15692.917725] Stack: c02f8268 c02ef7b5 00000002 cb1fdf58 cb1fdf50 00000000 c0400f38 c0403780
[15692.917833] cfdc3500 cfdc3634 c150a780 00000000 c011a8e7 00000000 c1077aa0 000000ff
[15692.917942] 00000000 00000000 00000000 cb1fdf8c 00000010 cfdc3500 cb1fdf8c c011ace5
[15692.918048] Call Trace:
[15692.918072] [<c02ef7b5>] schedule+0x321/0x58f
[15692.918109] [<c011a8e7>] do_exit+0x293/0x6c6
[15692.918143] [<c011ace5>] do_exit+0x691/0x6c6
[15692.918169] [<c011ad87>] sys_exit_group+0x0/0xd
[15692.918195] [<c01026e6>] sysenter_past_esp+0x5f/0x85
[15692.918232] =======================
[15692.918244] Code: 8b 53 28 89 43 34 89 53 38 5b 5e c3 53 31 d2 83 78 40 00 74 20 83 c0 38 8b 50 20 31 db 85 d2 74 0a 8d 5a f8 89 da e8 a9 ff ff ff <8b> 43 44 85 c0 75 e6 8d 53 d0 89 d0 5b c3 57 56 53 89 c6 89 d7
[15692.918981] EIP: [<c0111f90>] pick_next_task_fair+0x1f/0x2d SS:ESP 0068:cb1fdf14

...

After that the box is dead and needs a hard reset.

Interestingly, when I compile the kernel with debugging enabled I don't get the Oops (or maybe it just needs longer to trigger?).

Config, lspci, dmesg, hardware specs, the Oops message, and the top output from when it Oopsed are here:


http://194.231.229.228/lara/

>
> Ingo

Regards,

Gabriel

2007-10-16 22:38:20

by Dmitry Adamushko

Subject: Re: [git pull] scheduler updates for v2.6.24

On 15/10/2007, Ingo Molnar <[email protected]> wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > On Mon, 15 Oct 2007 16:17:23 +0200
> > Ingo Molnar <[email protected]> wrote:
> >
> > > Linus, please pull the latest scheduler git tree from:
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
> >
> > Did Paul Jackson's crash get fixed?
>
> yes - that crash was a showstopper that was holding up the pull request
> for 2 days. Paul bisected it down to the culprit and the fix was to do
> this in wake_up_new_task():
>
> - if (!p->sched_class->task_new || !current->se.on_rq) {
> + if (!p->sched_class->task_new || !current->se.on_rq || !rq->cfs.curr) {
>
> (during early bootup the cfs_rq has no curr pointer yet.) It's not clear
> why this race did not trigger earlier.

an update on this issue:

in short, SD_BALANCE_FORK is required to trigger this problem and
hence, only NUMA machines could have been affected by it (and only
ia64 and x86 have SD_BALANCE_FORK in SD_NODE_INIT).

more details:

it's perfectly legitimate for 'rq->cfs.curr' to be NULL in
task_new_fair() in the case when this_cpu != task_cpu(p) (p -- is a
newly created task).

why this_cpu != task_cpu(p) :

do_fork() --> copy_process() --> sched_fork() -->
cpu = sched_balance_self(this_cpu, SD_BALANCE_FORK)

chooses a different cpu for the new task and there is _no_
'class_sched_fair' task running on this cpu at the moment (that's why
rq->cfs.curr == NULL).

[ thanks a lot to Paul for providing debugging information ]

btw., it's not the 'curr->vruntime < se->vruntime' part in
task_new_fair() that gave us the oops (it's only executed in the case
of this_cpu == task_cpu(p)) _but_ it's rather:

[*] check_spread(cfs_rq, curr) which also accesses 'curr->vruntime'.

> (and the two checks can probably
> be consolidated into a single "!rq->cfs.curr" condition.)

2 checks are required as 'current' and rq->cfs.curr are not the same :-)
It should also work if we just get rid of [*] or add an additional
(curr != NULL) check there.
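
The second option (an extra NULL check in front of [*]) could look roughly
like this. It is a userspace sketch with assumed, simplified types;
check_spread_guarded() and SPREAD_LIMIT_NS are a name and a value made up
for the example and are not taken from kernel/sched_fair.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; NOT the real kernel definitions. */
struct sched_entity { uint64_t vruntime; };
struct cfs_rq {
	uint64_t min_vruntime;
	struct sched_entity *curr;
	unsigned long nr_spread_over;	/* debug counter */
};

/* Hypothetical spread threshold in nanoseconds. */
#define SPREAD_LIMIT_NS 60000000LL

/*
 * Count entities whose vruntime is far away from min_vruntime, but
 * tolerate a NULL entity instead of dereferencing it.
 */
void check_spread_guarded(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	int64_t d;

	assert(cfs_rq != NULL);

	if (!se)	/* the additional (curr != NULL) check */
		return;

	d = (int64_t)(se->vruntime - cfs_rq->min_vruntime);
	if (d < 0)
		d = -d;
	if (d > SPREAD_LIMIT_NS)
		cfs_rq->nr_spread_over++;
}
```

Calling check_spread_guarded(cfs_rq, cfs_rq->curr) on a runqueue without a
running fair task then becomes a no-op rather than a NULL dereference.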

just as an additional observation:

there are lots of per-cpu threads (like events/cpu, ksoftirqd/cpu,
etc.) being created on start-up (x NUMBER_OF_CPUS) and SD_BALANCE_FORK
(actually, sched_balance_self() from sched_fork()) is just an overhead
in this case...
although, sched_balance_self() is likely to be responsible for only a
minor % of the time taken to create a new context so optimizing it away
(esp. for some corner cases) won't improve the start-up time
noticeably.


>
> Ingo
> -

--
Best regards,
Dmitry Adamushko

2007-10-16 23:31:32

by Dmitry Adamushko

Subject: Re: [git pull] scheduler updates for v2.6.24

[ cc'ed Srivatsa ]

On 17/10/2007, Gabriel C <[email protected]> wrote:
> Ingo Molnar wrote:
> [15692.917111] BUG: unable to handle kernel NULL pointer dereference at virtual address 00000044
> ...
> [15692.917629] EFLAGS: 00010046 (2.6.23-g65a6ec0d #330)
> [15692.917661] EIP is at pick_next_task_fair+0x1f/0x2d

Gabriel, could you please post the disassembled code for pick_next_task_fair()?
(objdump -d kernel/sched.o and then search for pick_next_task_fair --
copy and paste)

anyway, my guess is that it's :

se = pick_next_entity(cfs_rq);
cfs_rq = group_cfs_rq(se);

'se' _happens_ to be NULL and group_cfs_rq(se) does se->my_q and
(according to my calculations) offset(my_q) == 68 (0x44) for an x86 32-bit
system with CONFIG_SCHEDSTATS=n and CONFIG_FAIR_GROUP_SCHED=y
(according to the config).

that might take place provided put_prev_task_fair() failed for some
reason to insert 'current' (or its corresponding group element) back
into the tree in schedule()... say, due to some inconsistency in
cfs_rq's data.
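
The arithmetic behind that guess can be sketched in plain C. The layout
below is purely hypothetical: it only assumes that 68 bytes of other
members precede my_q and that a pointer is 32 bits wide, as on the x86
32-bit box in question. fake_sched_entity and my_q_fault_address() are
made-up names, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout: 0x44 bytes of leading members (contents not
 * modelled), then my_q. uint32_t stands in for a 32-bit pointer.
 */
struct fake_sched_entity {
	uint8_t  leading_members[0x44];
	uint32_t my_q;
};

static_assert(offsetof(struct fake_sched_entity, my_q) == 0x44,
	      "layout assumption: my_q sits at byte offset 68");

/* Address touched by a load of se->my_q, with se given as an integer. */
uint32_t my_q_fault_address(uint32_t se_addr)
{
	return se_addr + (uint32_t)offsetof(struct fake_sched_entity, my_q);
}
```

For se == NULL this gives my_q_fault_address(0) == 0x44, the same value as
the faulting virtual address 00000044 reported in the Oops.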

Srivatsa, that's somewhat similar to another issue that has been
posted earlier today (crash in put_prev_task_fair() -->
__enqueue_task() --> rb_insert_color()) that you are already aware of
... (/me will continue tomorrow).


--
Best regards,
Dmitry Adamushko

2007-10-16 23:55:04

by Gabriel C

Subject: Re: [git pull] scheduler updates for v2.6.24

Dmitry Adamushko wrote:
> [ cc'ed Srivatsa ]
>
> On 17/10/2007, Gabriel C <[email protected]> wrote:
>> Ingo Molnar wrote:
>> [15692.917111] BUG: unable to handle kernel NULL pointer dereference at virtual address 00000044
>> ...
>> [15692.917629] EFLAGS: 00010046 (2.6.23-g65a6ec0d #330)
>> [15692.917661] EIP is at pick_next_task_fair+0x1f/0x2d
>
> Gabriel, could you please post the disassembled code for pick_next_task_fair()?
> (objdump -d kernel/sched.o and then search for pick_next_task_fair --
> copy and paste)

Sure here it is :

00000e49 <pick_next_task_fair>:
e49:   53                      push   %ebx
e4a:   31 d2                   xor    %edx,%edx
e4c:   83 78 40 00             cmpl   $0x0,0x40(%eax)
e50:   74 20                   je     e72 <pick_next_task_fair+0x29>
e52:   83 c0 38                add    $0x38,%eax
e55:   8b 50 20                mov    0x20(%eax),%edx
e58:   31 db                   xor    %ebx,%ebx
e5a:   85 d2                   test   %edx,%edx
e5c:   74 0a                   je     e68 <pick_next_task_fair+0x1f>
e5e:   8d 5a f8                lea    -0x8(%edx),%ebx
e61:   89 da                   mov    %ebx,%edx
e63:   e8 a9 ff ff ff          call   e11 <set_next_entity>
e68:   8b 43 44                mov    0x44(%ebx),%eax
e6b:   85 c0                   test   %eax,%eax
e6d:   75 e6                   jne    e55 <pick_next_task_fair+0xc>
e6f:   8d 53 d0                lea    -0x30(%ebx),%edx
e72:   89 d0                   mov    %edx,%eax
e74:   5b                      pop    %ebx
e75:   c3                      ret
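
Lining the listing up with the Oops: the reported EIP is
pick_next_task_fair+0x1f and the function starts at e49, so the faulting
instruction is the one at e68, "mov 0x44(%ebx),%eax". Its bytes (8b 43 44)
match the "<8b> 43 44" marked in the Code: line, and the register dump
shows ebx: 00000000, so the load targets address 0x44. That is consistent
with the NULL 'se' guess quoted below. The same arithmetic as a small C
sketch (faulting_insn() and effective_address() are made-up helper names):

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the Oops and from the objdump listing. */
enum {
	FUNC_START   = 0xe49,	/* <pick_next_task_fair> in sched.o  */
	EIP_OFFSET   = 0x1f,	/* "pick_next_task_fair+0x1f"        */
	DISPLACEMENT = 0x44	/* displacement in mov 0x44(%ebx)    */
};

static_assert(FUNC_START + EIP_OFFSET == 0xe68,
	      "EIP points at the mov at e68");

/* Offset of the faulting instruction inside sched.o. */
uint32_t faulting_insn(void)
{
	return FUNC_START + EIP_OFFSET;
}

/* Effective address of "mov 0x44(%ebx),%eax" for a given %ebx value. */
uint32_t effective_address(uint32_t ebx)
{
	return ebx + DISPLACEMENT;
}
```

effective_address(0) evaluates to 0x44, matching the virtual address
00000044 in the first line of the Oops.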


>
> anyway, my guess is that it's :
>
> se = pick_next_entity(cfs_rq);
> cfs_rq = group_cfs_rq(se);
>
> 'se' _happens_ to be NULL and group_cfs_rq(se) does se->my_q and
> (according to my calculations) offset(my_q) == 68 (0x44) for an x86 32-bit
> system with CONFIG_SCHEDSTATS=n and CONFIG_FAIR_GROUP_SCHED=y
> (according to the config).
>
> that might take place provided put_prev_task_fair() failed for some
> reason to insert 'current' (or its corresponding group element) back
> into the tree in schedule()... say, due to some inconsistency in
> cfs_rq's data.
>
> Srivatsa, that's somewhat similar to another issue that has been
> posted earlier today (crash in put_prev_task_fair() -->
> __enqueue_task() --> rb_insert_color()) that you are already aware of
> ... (/me will continue tomorrow).
>
>