2003-03-26 09:27:04

by Andrew Morton

Subject: 2.5.66-mm1


ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.66/2.5.66-mm1/


. The anticipatory scheduler is in wrapup mode now. It is pretty much in
its final form.

. The ext2 locking changes have been significantly redone.

The per-blockgroup data structures had to go. For a 4TB filesystem we
cannot even kmalloc that many pointers, let alone data structures.

So the per-blockgroup spinlocking has been replaced with hashed
spinlocking and the per-blockgroup accounting has been removed. A "per-cpu
counter" thing has been invented to amortise the locking cost of the
filesystem-wide counters.

. ext3 is now using spinlocking in its block allocator rather than a
filesystem-wide semaphore.

It is stability-tested but I have not yet performance tested this
closely. It does appear to have improved the context switch problem (and
the file fragmentation problem which the context switch problem causes).
But there's a way to go here.
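To put rough numbers on the ext2 item above: one bitmap block covers block_size * 8 blocks, so a 4TB filesystem has 524288 block groups with 1KB blocks, or 32768 with 4KB blocks. One 4-byte pointer per group is then 2MB or 128KB respectively, which is at or beyond what a single kmalloc() could be expected to return (assuming a ceiling of roughly 128KB, which is my recollection for kernels of this era, not something stated in this mail). The sketch below works through that arithmetic and shows, in userspace C, the general shape of hashed locking and of a batched per-cpu counter. Every name and constant in it is hypothetical; it is not the code from the -ng patches.

```c
#include <stdint.h>

/* All constants below are hypothetical values chosen for this sketch. */
#define NR_HASHED_LOCKS 32      /* power of two, so masking works as a hash */
#define NR_CPUS_SKETCH  4
#define COUNTER_BATCH   16      /* fold a per-CPU delta once it reaches this */

/* One bitmap block holds block_size * 8 bits, so that many blocks per group. */
static uint64_t group_count(uint64_t fs_bytes, uint64_t block_size)
{
	uint64_t blocks = fs_bytes / block_size;
	uint64_t per_group = block_size * 8;

	return (blocks + per_group - 1) / per_group;
}

/* Hashed locking: many block groups share a small, fixed array of locks. */
static unsigned int lock_index(uint64_t group)
{
	return (unsigned int)(group & (NR_HASHED_LOCKS - 1));
}

struct counter_sketch {
	long global;                      /* shared total; needs a lock in real code */
	long local[NR_CPUS_SKETCH];       /* per-CPU deltas, updated without the lock */
	unsigned long lock_acquisitions;  /* times the shared lock would be taken */
};

/* Add amount on behalf of one CPU; only touch the shared total per batch. */
static void counter_mod(struct counter_sketch *c, int cpu, long amount)
{
	long delta = c->local[cpu] + amount;

	if (delta >= COUNTER_BATCH || delta <= -COUNTER_BATCH) {
		c->lock_acquisitions++;   /* stands in for spin_lock()/spin_unlock() */
		c->global += delta;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = delta;
	}
}

/* Exact value: the shared total plus every CPU's unfolded delta. */
static long counter_read_exact(const struct counter_sketch *c)
{
	long sum = c->global;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		sum += c->local[cpu];
	return sum;
}
```

With a batch of 16, one hundred single increments from the same CPU reach the shared total only six times. The price is that a reader who looks only at the shared total can be off by up to the batch size per CPU.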




Changes since 2.5.65-mm4:


linus.patch

Latest -bk

-nfsd-32-bit-dev_t-fixes.patch
-i2c-fix.patch

Merged

+kgdb-ga.patch

George Anzinger's gdb stub

+ppa-null-pointer-fix.patch

Might fix the parport scsi driver

+initcall-debug.patch

Debugging support for misbehaving initcalls

+posix-timers-64-bit-fix.patch

Timer fix for 64-bit machines

+slab-off-by-one-fix.patch

Slab was using too much memory.

+install_page-flush_cache_page.patch

Cache coherency bug in remap_file_pages()

+as-minor-tweaks.patch
+as-remove-stats.patch

Anticipatory scheduler tuning and cleanups.

+posix-timer-double-expiration-fix.patch

Posix timers were sending timer expiry info twice.

+hugh-01-no-SWAP_ERROR.patch
+hugh-02-try_to_unmap-CONFIG_SWAP.patch
+hugh-03-add_to_swap_cache.patch
+hugh-04-page_convert_anon-ENOMEM.patch
+hugh-05-page_convert_anon-unlocking.patch
+hugh-06-wrap-below-vm_start.patch
+hugh-07-objrmap-page_table_lock.patch
+hugh-08-rmap-comments.patch
+hugh-09-tmpfs-truncation.patch
+hugh-10-tmpfs-atomics.patch
+hugh-11-fix-unuse_pmd-fixme.patch
+hugh-12-vm_enough_memory-double-counts.patch

Various vm/mm fixes and cleanups

+ext3-max-file-size-fix.patch

Allow ext3 to create files larger than 32GB (should be nearly 2TB)

-ext2-no-lock_super.patch
-ext2-ialloc-no-lock_super.patch
+ext2-no-lock_super-ng.patch
+ext2-ialloc-no-lock_super-ng.patch

Rework the ext2 block and inode allocator locking changes.

+dev_t-remove-B_FREE.patch

Remove B_FREE.

+tty_io-cleanup.patch
+page_to_pfn-in-blk_queue_bounce.patch
+init_inode_once-bloat-fix.patch

Cleanups and fixlets

+compound-page-warning-fix.patch

Fix a warning

+slab-cache-sizes-cleanup.patch

Unduplicate some tables in slab.

+stat_t-larger-dev_t.patch

Large dev_t fix.

+acpi-build-fix.patch

make acpi compile.

+sync_blockdev-on-final-close.patch

Only write out blockdev mappings on the final close.

+ext3-concurrent-block-inode-allocation.patch
+ext3-concurrent-block-allocation-fix-1.patch

Use spinlocking in the ext3 block allocator, not an fs-wide semaphore.



All 104 patches:

linus.patch

mm.patch
add -mmN to EXTRAVERSION

kgdb-ga.patch
kgdb stub for ia32 (George Anzinger's one)

ppa-null-pointer-fix.patch

initcall-debug.patch
initcall debugging support

posix-timers-64-bit-fix.patch
POSIX timers interface long/int cleanup

slab-off-by-one-fix.patch
slab: fix off-by-one in size calculation

config_spinline.patch
uninline spinlocks for profiling accuracy.

ppc64-reloc_hide.patch

ppc64-pci-patch.patch
Subject: pci patch

ppc64-aio-32bit-emulation.patch
32/64bit emulation for aio

ppc64-scruffiness.patch
Fix some PPC64 compile warnings

sym-do-160.patch
make the SYM driver do 160 MB/sec

install_page-flush_cache_page.patch
add flush_cache_page() to install_page()

config-PAGE_OFFSET.patch
Configurable kernel/user memory split

ptrace-flush.patch
cache flushing in the ptrace code

buffer-debug.patch
buffer.c debugging

warn-null-wakeup.patch

ext3-truncate-ordered-pages.patch
ext3: explicitly free truncated pages

reiserfs_file_write-5.patch

rcu-stats.patch
RCU statistics reporting

ext3-journalled-data-assertion-fix.patch
Remove incorrect assertion from ext3

nfs-speedup.patch

nfs-oom-fix.patch
nfs oom fix

sk-allocation.patch
Subject: Re: nfs oom

nfs-more-oom-fix.patch

rpciod-atomic-allocations.patch
Make rpciod use atomic allocations

linux-isp.patch

isp-update-1.patch

kblockd.patch
Create `kblockd' workqueue

as-iosched.patch
anticipatory I/O scheduler

as-np-reads-1.patch
AS: read-vs-read fixes

as-np-reads-2.patch
AS: more read-vs-read fixes

as-predict-data-direction.patch
as: predict direction of next IO

as-remove-frontmerge.patch
AS: remove frontmerge tunable

as-misc-cleanups.patch
AS: misc cleanups

as-minor-tweaks.patch
AS: tuning and tweaks

as-remove-stats.patch
AS: remove statistics

cfq-2.patch
CFQ scheduler, #2

unplug-use-kblockd.patch
Use kblockd for running request queues

fremap-all-mappings.patch
Make all executable mappings be nonlinear

objrmap-2.5.62-5.patch
object-based rmap

sched-2.5.64-D3.patch
sched-2.5.64-D3, more interactivity changes

scheduler-tunables.patch
scheduler tunables

show_task-free-stack-fix.patch
show_task() fix and cleanup

yellowfin-set_bit-fix.patch
yellowfin driver set_bit fix

htree-nfs-fix.patch
Fix ext3 htree / NFS compatibility problems

task_prio-fix.patch
simple task_prio() fix

slab_store_user-large-objects.patch
slab debug: perform redzoning against larger objects

pcmcia-2.patch

pcmcia-3b.patch

pcmcia-3.patch

pcmcia-4.patch

pcmcia-5.patch

pcmcia-6.patch

pcmcia-7b.patch

pcmcia-7.patch

pcmcia-8.patch

pcmcia-9.patch

pcmcia-10.patch

htree-nfs-fix-2.patch
htree nfs fix

posix-timer-double-expiration-fix.patch
posix timers: fix double-reporting of timer expiration

hugh-01-no-SWAP_ERROR.patch
swap 01/13 no SWAP_ERROR

hugh-02-try_to_unmap-CONFIG_SWAP.patch
Subject: [PATCH] swap 02/13 !CONFIG_SWAP try_to_unmap

hugh-03-add_to_swap_cache.patch
swap 03/13 add_to_swap_cache

hugh-04-page_convert_anon-ENOMEM.patch
swap 04/13 page_convert_anon -ENOMEM

hugh-05-page_convert_anon-unlocking.patch
swap 05/13 page_convert_anon unlocking

hugh-06-wrap-below-vm_start.patch
swap 06/13 wrap below vm_start

hugh-07-objrmap-page_table_lock.patch
swap 07/13 objrmap page_table_lock

hugh-08-rmap-comments.patch
swap 08/13 rmap comments

hugh-09-tmpfs-truncation.patch
swap 09/13 tmpfs truncation

hugh-10-tmpfs-atomics.patch
swap 10/13 tmpfs atomics

hugh-11-fix-unuse_pmd-fixme.patch
swap 11/13 fix unuse_pmd fixme

hugh-12-vm_enough_memory-double-counts.patch
swap 12/13 vm_enough_memory double counts

ext3-max-file-size-fix.patch
ext3: fix max file size

ext2-no-lock_super-ng.patch

ext2-ialloc-no-lock_super-ng.patch

linear-oops-fix-1.patch
md/linear oops fix

dev_t-32-bit.patch
[for playing only] change type of dev_t

dev_t-remove-B_FREE.patch
dev_t: eliminate B_FREE

dev_t-drm-warnings.patch
dev_t: fix drm printk warnings

sg-dev_t-fix.patch
32-bit dev_t fix for sg

oops-dump-preceding-code.patch
i386 oops output: dump preceding code

x86-clock-override-option.patch
x86 clock override boot option

tty_io-cleanup.patch
tty_io cleanup

page_to_pfn-in-blk_queue_bounce.patch
Subject: use page_to_pfn() in __blk_queue_bounce()

init_inode_once-bloat-fix.patch
Subject: init_inode_once() wants sizeof(struct hlist_head)

conntrack-use-after-free-fix.patch
fix use-after-free in ip_conntrack

VM_DONTEXPAND-fix.patch
honour VM_DONTEXPAND in vma merging

compound-page-warning-fix.patch
Fix 64bit warnings in mm/page_alloc.c

cdevname-irq-safety-fix.patch
make cdevname() callable from interrupts

register_chrdev_region-leak-fix.patch
register_chrdev_region() leak and race fix

slab-cache-sizes-cleanup.patch
slab: cache sizes cleanup

stat_t-larger-dev_t.patch
struct stat - support larger dev_t

acpi-build-fix.patch
ACPI build fix

sync_blockdev-on-final-close.patch
sync blockdevs on the final close only

ext3_mark_inode_dirty-speedup.patch
ext3_mark_inode_dirty() speedup

ext3_mark_inode_dirty-less-calls.patch
ext3_commit_write speedup

ext3-handle-cache.patch
ext3: create a slab cache for transaction handles

ext3-no-bkl.patch

journal_dirty_metadata-speedup.patch

journal_get_write_access-speedup.patch

ext3-concurrent-block-inode-allocation.patch
Subject: [PATCH] concurrent block/inode allocation for EXT3

ext3-concurrent-block-allocation-fix-1.patch




2003-03-28 01:54:22

by Ed Tomlinson

Subject: Re: 2.5.66-mm1

Hi Andrew,

Got this oops after about 20 hours with mm1 (65-mm3 lasted 5 days
until I rebooted).

Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c011516d
*pde = 00000000
Oops: 0002 [#1]
CPU: 0
EIP: 0060:[<c011516d>] Not tainted VLI
EFLAGS: 00010097
EIP is at schedule+0x8d/0x3a0
eax: 00000001 ebx: cf5e99c0 ecx: cf5e99c0 edx: ffffffff
esi: 00000000 edi: c031de00 ebp: cf5ebf08 esp: cf5ebef0
ds: 007b es: 007b ss: 0068
Process newsplex (pid: 1205, threadinfo=cf5ea000 task=cf5e99c0)
Stack: c011fbd7 c02bbc40 00000246 05261e41 cf5ebf14 cf5ebf50 cf5ebf3c c0120754
cf5ebf14 c02bc538 c02bc538 05261e41 4b87ad6e c01206e0 cf5e99c0 c02bbc40
c015abd6 000007d1 00000000 cf5ebf60 c015ac19 cf5ea000 cf5ea000 00000000
Call Trace:
[<c011fbd7>] add_timer+0x57/0xa0
[<c0120754>] schedule_timeout+0x54/0xa0
[<c01206e0>] process_timeout+0x0/0x20
[<c015abd6>] do_poll+0x56/0xc0
[<c015ac19>] do_poll+0x99/0xc0
[<c015ad88>] sys_poll+0x148/0x220
[<c013eb3b>] sys_mprotect+0x21b/0x22f
[<c01079ec>] sys_clone+0x2c/0x60
[<c015a200>] __pollwait+0x0/0xc0
[<c0109277>] syscall_call+0x7/0xb

Code: 40 17 04 75 4d 8b 03 85 c0 74 47 48 0f 84 da 02 00 00 ff 0d 00 de 31 c0 8b 43 68 ff 08 8b 03 83 f8 02 0f 84 b6 02 00 00 8b 73 28 <ff> 4e 00 8b 53 24 8b 43 20 89 50 04 89 02 8b 4b 18 8d 14 ce 8d
<6>note: newsplex[1205] exited with preempt_count 2
Debug: sleeping function called from illegal context at include/linux/rwsem.h:43
Call Trace:
[<c01168d3>] __might_sleep+0x53/0x60
[<c01198d5>] profile_exit_task+0x15/0x60
[<c011aee6>] do_exit+0x86/0x460
[<c0109ab5>] die+0x75/0x80
[<c0113854>] do_page_fault+0x134/0x45e
[<c0114798>] try_to_wake_up+0x138/0x240
[<c011fde4>] mod_timer+0x124/0x180
[<c012a520>] nanosleep_wake_up+0x0/0x20
[<c0131feb>] buffered_rmqueue+0xab/0x140
[<c0132103>] __alloc_pages+0x83/0x280
[<c0113720>] do_page_fault+0x0/0x45e
[<c01094dd>] error_code+0x2d/0x40
[<c011516d>] schedule+0x8d/0x3a0
[<c011fbd7>] add_timer+0x57/0xa0
[<c0120754>] schedule_timeout+0x54/0xa0
[<c01206e0>] process_timeout+0x0/0x20
[<c015abd6>] do_poll+0x56/0xc0
[<c015ac19>] do_poll+0x99/0xc0
[<c015ad88>] sys_poll+0x148/0x220
[<c013eb3b>] sys_mprotect+0x21b/0x22f
[<c01079ec>] sys_clone+0x2c/0x60
[<c015a200>] __pollwait+0x0/0xc0
[<c0109277>] syscall_call+0x7/0xb

Hope this helps

Ed Tomlinson


2003-03-28 04:47:07

by Andrew Morton

Subject: Re: 2.5.66-mm1

Ed Tomlinson wrote:
>
> Hi Andrew,
>
> Got this oops after about 20 hours with mm1 (65-mm3 lasted 5 days
> until I rebooted).
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip:
> c011516d
> *pde = 00000000
> Oops: 0002 [#1]
> CPU: 0
> EIP: 0060:[<c011516d>] Not tainted VLI
> EFLAGS: 00010097
> EIP is at schedule+0x8d/0x3a0
> eax: 00000001 ebx: cf5e99c0 ecx: cf5e99c0 edx: ffffffff
> esi: 00000000 edi: c031de00 ebp: cf5ebf08 esp: cf5ebef0
> ds: 007b es: 007b ss: 0068
> Process newsplex (pid: 1205, threadinfo=cf5ea000 task=cf5e99c0)
> Stack: c011fbd7 c02bbc40 00000246 05261e41 cf5ebf14 cf5ebf50 cf5ebf3c c0120754
> cf5ebf14 c02bc538 c02bc538 05261e41 4b87ad6e c01206e0 cf5e99c0 c02bbc40
> c015abd6 000007d1 00000000 cf5ebf60 c015ac19 cf5ea000 cf5ea000 00000000
> Call Trace:
> [<c011fbd7>] add_timer+0x57/0xa0
> [<c0120754>] schedule_timeout+0x54/0xa0
> [<c01206e0>] process_timeout+0x0/0x20
> [<c015abd6>] do_poll+0x56/0xc0
> [<c015ac19>] do_poll+0x99/0xc0
> [<c015ad88>] sys_poll+0x148/0x220
> [<c013eb3b>] sys_mprotect+0x21b/0x22f
> [<c01079ec>] sys_clone+0x2c/0x60
> [<c015a200>] __pollwait+0x0/0xc0
> [<c0109277>] syscall_call+0x7/0xb
>
> Code: 40 17 04 75 4d 8b 03 85 c0 74 47 48 0f 84 da 02 00 00 ff 0d 00 de 31 c0 8b 43 68 ff 08 8b 03 83 f8 02 0f 84 b6 02 00 00 8b 73 28 <ff> 4e 00 8b 53 24 8b 43 20 89 50 04 89 02 8b 4b 18 8d 14 ce 8d

That longer Code: line is really handy.

You died in schedule()->deactivate_task()->dequeue_task().

static inline void dequeue_task(struct task_struct *p, prio_array_t *array)
{
	array->nr_active--;

`array' is zero.

I'm going to Cc Ingo and run away. Ed uses preempt.
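
For readers following along, the tail of the Code: line can be checked by hand. The byte in angle brackets is the one at EIP. In `8b 73 28 <ff> 4e 00`, opcode 0x8b is a 32-bit register load and opcode 0xff with a /1 reg field is a 32-bit decrement of a memory operand, so what remains is to decode the two ModR/M bytes. The helper below is a minimal sketch for doing that; its names are made up for illustration and it ignores SIB and 16-bit addressing.

```c
#include <stdint.h>

/* The three fields of an x86 ModR/M byte. */
struct modrm {
	unsigned int mod;  /* bits 7..6: addressing mode (1 = base + disp8) */
	unsigned int reg;  /* bits 5..3: register, or opcode extension (/n) */
	unsigned int rm;   /* bits 2..0: base register (4 would mean a SIB byte) */
};

static struct modrm decode_modrm(uint8_t byte)
{
	struct modrm m;

	m.mod = byte >> 6;
	m.reg = (byte >> 3) & 7;
	m.rm = byte & 7;
	return m;
}

/* 32-bit general register names in their hardware encoding order. */
static const char *reg32_name(unsigned int n)
{
	static const char *const names[8] = {
		"eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
	};

	return names[n & 7];
}
```

Decoding 0x73 gives mod 1, reg esi, rm ebx, so `8b 73 28` loads %esi from 0x28(%ebx), presumably p->array given the analysis above. Decoding 0x4e gives mod 1, /1, rm esi, so `ff 4e 00` is a decrement of the word at 0x0(%esi). The register dump shows esi: 00000000, which matches a NULL `array` at `array->nr_active--`.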

2003-03-28 10:34:29

by Ingo Molnar

Subject: Re: 2.5.66-mm1


On Thu, 27 Mar 2003, Andrew Morton wrote:

> That longer Code: line is really handy.
>
> You died in schedule()->deactivate_task()->dequeue_task().
>
> static inline void dequeue_task(struct task_struct *p, prio_array_t *array)
> {
> array->nr_active--;
>
> `array' is zero.
>
> I'm going to Cc Ingo and run away. Ed uses preempt.

hm, this is an 'impossible' scenario from the scheduler code POV. Whenever
we deactivate a task, we remove it from the runqueue and set p->array to
NULL. Whenever we activate a task again, we set p->array to non-NULL. A
double-deactivate is not possible. I tried to reproduce it with various
scheduler workloads, but didn't succeed.
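
The invariant described in the paragraph above can be written down as a small userspace sketch. The struct and function names are hypothetical stand-ins, not the ones in kernel/sched.c, and the NULL check in the deactivate path is something the sketch adds so that the "impossible" case is reported instead of crashing.

```c
#include <stddef.h>

struct array_sketch {
	unsigned int nr_active;        /* number of tasks currently queued */
};

struct task_sketch {
	struct array_sketch *array;    /* non-NULL exactly while the task is queued */
};

/* Activation: queue the task and record which array it is on. */
static void activate_sketch(struct task_sketch *p, struct array_sketch *a)
{
	a->nr_active++;
	p->array = a;
}

/*
 * Deactivation: unqueue the task and clear p->array.
 * Returns 0 on success, or -1 if the task was not queued, which is the
 * double-deactivate case that should never happen.
 */
static int deactivate_sketch(struct task_sketch *p)
{
	if (p->array == NULL)
		return -1;

	p->array->nr_active--;
	p->array = NULL;
	return 0;
}
```

Under this model a second deactivate without an intervening activate is the only way to reach the decrement with a NULL array, which is why the oops looks like a double-deactivate from the scheduler's point of view.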

Mike, do you have a backtrace of the crash you saw?

Ingo

2003-03-28 14:11:09

by Mike Galbraith

Subject: Re: 2.5.66-mm1

At 11:45 AM 3/28/2003 +0100, Ingo Molnar wrote:

>On Thu, 27 Mar 2003, Andrew Morton wrote:
>
> > That longer Code: line is really handy.
> >
> > You died in schedule()->deactivate_task()->dequeue_task().
> >
> > static inline void dequeue_task(struct task_struct *p, prio_array_t *array)
> > {
> > array->nr_active--;
> >
> > `array' is zero.
> >
> > I'm going to Cc Ingo and run away. Ed uses preempt.
>
>hm, this is an 'impossible' scenario from the scheduler code POV. Whenever
>we deactivate a task, we remove it from the runqueue and set p->array to
>NULL. Whenever we activate a task again, we set p->array to non-NULL. A
>double-deactivate is not possible. I tried to reproduce it with various
>scheduler workloads, but didn't succeed.
>
>Mike, do you have a backtrace of the crash you saw?

No, I didn't save it due to "grubby fingerprints".

-Mike

2003-03-28 14:49:21

by Zwane Mwaikambo

Subject: Re: 2.5.66-mm1

On Fri, 28 Mar 2003, Mike Galbraith wrote:

> >hm, this is an 'impossible' scenario from the scheduler code POV. Whenever
> >we deactivate a task, we remove it from the runqueue and set p->array to
> >NULL. Whenever we activate a task again, we set p->array to non-NULL. A
> >double-deactivate is not possible. I tried to reproduce it with various
> >scheduler workloads, but didn't succeed.
> >
> >Mike, do you have a backtrace of the crash you saw?
>
> No, I didn't save it due to "grubby fingerprints".

Hmm i think i may have hit this one but i never posted due to being unable
to reproduce it on a vanilla kernel or the same kernel afterwards (which
was hacked so i won't vouch for its cleanliness). I think preempt
might have bitten him in a bad place (mine is also CONFIG_PREEMPT), is it
possible that when we did the task_rq_unlock we got preempted and when we
got back we used the local variable requeue_waker which was set before
dropping the lock, and therefore might not be valid anymore due to
scheduler decisions done after dropping the runqueue lock?

Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c011b8d9
*pde = 00000000
Oops: 0000 [#1]
CPU: 0
EIP: 0060:[<c011b8d9>] Not tainted
EFLAGS: 00010046
EIP is at try_to_wake_up+0x1e9/0x4f0
eax: c055a000 ebx: c04e5aa0 ecx: c0552fc0 edx: c04e5aa0
esi: 00000000 edi: 00000000 ebp: c055bee4 esp: c055beb8
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, threadinfo=c055a000 task=c04e5aa0)
Stack: 00000001 c055a000 c0552fc0 00000000 cb1a0000 00000001 00000001 00000002
00000000 c04e88e4 00000001 c055bf08 c011d172 c1694700 00000001 00000000
c04e88e4 c04e88dc c055a000 00000001 c055bf3c c011d203 c04e88dc 00000001
Call Trace:
[<c011d172>] __wake_up_common+0x32/0x60
[<c011d203>] __wake_up+0x63/0xb0
[<c0122fb5>] release_console_sem+0x165/0x170
[<c0122d7b>] printk+0x1eb/0x270
[<c015e210>] invalidate_bh_lru+0x0/0x60
[<c015e210>] invalidate_bh_lru+0x0/0x60
[<c015e210>] invalidate_bh_lru+0x0/0x60
[<c01163f2>] smp_call_function_interrupt+0x42/0xb0
[<c015e210>] invalidate_bh_lru+0x0/0x60
[<c0106eb0>] default_idle+0x0/0x40
[<c010a41a>] call_function_interrupt+0x1a/0x20
[<c0106eb0>] default_idle+0x0/0x40
[<c0106ede>] default_idle+0x2e/0x40
[<c0106f6a>] cpu_idle+0x3a/0x50
[<c0105000>] rest_init+0x0/0x80

Code: 8b 06 48 89 06 8b 4a 24 8b 42 20 89 01 89 48 04 8b 4a 18 8d

0xc011b8d9 is in try_to_wake_up (kernel/sched.c:282).
277 /*
278 * Adding/removing a task to/from a priority array:
279 */
280 static inline void dequeue_task(struct task_struct *p, prio_array_t *array)
281 {
282 array->nr_active--;
283 list_del(&p->run_list);
284 if (list_empty(array->queue + p->prio))
285 __clear_bit(p->prio, array->bitmap);
286 }
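
As a reading aid for the listing above, here is a self-contained userspace sketch of a priority array with the same three pieces of state: a count of queued tasks, one list per priority, and a bitmap with a bit set for every non-empty list. Names, the tiny priority range and the hand-rolled list are all simplifications of mine, not the kernel's definitions.

```c
#include <stddef.h>

#define MAX_PRIO_DEMO 8   /* hypothetical; far smaller than the real range */

struct node_demo {
	struct node_demo *next, *prev;
};

struct array_demo {
	unsigned int nr_active;                 /* tasks queued in this array */
	unsigned long bitmap;                   /* bit n set: queue[n] is non-empty */
	struct node_demo queue[MAX_PRIO_DEMO];  /* circular list heads */
};

struct task_demo {
	struct node_demo run_list;
	int prio;
};

/* Make every per-priority list an empty circular list. */
static void array_init_demo(struct array_demo *a)
{
	int i;

	a->nr_active = 0;
	a->bitmap = 0;
	for (i = 0; i < MAX_PRIO_DEMO; i++)
		a->queue[i].next = a->queue[i].prev = &a->queue[i];
}

/* Append the task to the tail of its priority list and mark the list used. */
static void enqueue_demo(struct task_demo *p, struct array_demo *a)
{
	struct node_demo *head = &a->queue[p->prio];

	p->run_list.prev = head->prev;
	p->run_list.next = head;
	head->prev->next = &p->run_list;
	head->prev = &p->run_list;
	a->bitmap |= 1UL << p->prio;
	a->nr_active++;
}

/* Same order of operations as the listing: count, unlink, maybe clear bit. */
static void dequeue_demo(struct task_demo *p, struct array_demo *a)
{
	a->nr_active--;
	p->run_list.prev->next = p->run_list.next;
	p->run_list.next->prev = p->run_list.prev;
	if (a->queue[p->prio].next == &a->queue[p->prio])
		a->bitmap &= ~(1UL << p->prio);
}
```

The first statement of the dequeue path dereferences the array pointer, so a NULL array faults there before any list manipulation happens, consistent with both oopses pointing at line 282.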

(gdb) list *__wake_up_common+0x32
0xc011d1b2 is in __wake_up_common (kernel/sched.c:1424).
1419 list_for_each_safe(tmp, next, &q->task_list) {
1420 wait_queue_t *curr;
1421 unsigned flags;
1422 curr = list_entry(tmp, wait_queue_t, task_list);
1423 flags = curr->flags;
1424 if (curr->func(curr, mode, sync) &&
1425 (flags & WQ_FLAG_EXCLUSIVE) &&
1426 !--nr_exclusive)
1427 break;
1428 }

(gdb) list *__wake_up+0x62
0xc011d242 is in __wake_up (kernel/sched.c:1445).
1440
1441 if (unlikely(!q))
1442 return;
1443
1444 spin_lock_irqsave(&q->lock, flags);
1445 __wake_up_common(q, mode, nr_exclusive, 0);
1446 spin_unlock_irqrestore(&q->lock, flags);
1447 }
1448
1449 /*


--
function.linuxpower.ca

2003-03-28 15:15:00

by Ingo Molnar

Subject: Re: 2.5.66-mm1


On Fri, 28 Mar 2003, Zwane Mwaikambo wrote:

> Hmm i think i may have hit this one but i never posted due to being
> unable to reproduce it on a vanilla kernel or the same kernel afterwards
> (which was hacked so i won't vouch for its cleanliness). I think
> preempt might have bitten him in a bad place (mine is also
> CONFIG_PREEMPT), is it possible that when we did the task_rq_unlock we
> got preempted and when we got back we used the local variable
> requeue_waker which was set before dropping the lock, and therefore
> might not be valid anymore due to scheduler decisions done after
> dropping the runqueue lock?

yes, this one was my only suspect, but it should really never cause any
problems. We might change sleep_avg during the wakeup, and carry the
requeue_waker flag over a preemptible window, but the requeueing itself
re-takes the runqueue lock, and does not take anything for granted. The
flag could very well be random as well, and the code should still be
correct - there's no requirement to recalculate the priority every time we
change sleep_avg. (in fact we at times intentionally keep those values
detached.)
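
The argument in the paragraph above is an instance of a general pattern: a flag computed before a lock is dropped is treated only as a hint, and the code that acts on it re-takes the lock and re-derives everything from current state. The sketch below models that pattern in userspace. It is not the try_to_wake_up() code; the structs, the priority formula and the lock flag are hypothetical stand-ins.

```c
struct rq_sketch {
	int lock_held;          /* stands in for the runqueue spinlock */
};

struct waker_sketch {
	int queued;             /* is the task on a runqueue right now? */
	int sleep_avg;
	int prio;
};

/* Hypothetical mapping from sleep_avg to priority, for illustration only. */
static int effective_prio_sketch(const struct waker_sketch *t)
{
	return 20 - t->sleep_avg / 10;
}

/*
 * Act on a possibly stale hint. Returns 1 if the task was requeued.
 * Nothing computed before the lock was dropped is trusted: the queued
 * state and the priority are both re-read under the lock.
 */
static int requeue_if_needed(struct rq_sketch *rq, struct waker_sketch *t,
			     int requeue_hint)
{
	int requeued = 0;

	if (!requeue_hint)
		return 0;

	rq->lock_held = 1;              /* re-take the lock */
	if (t->queued) {                /* revalidate before touching queues */
		t->prio = effective_prio_sketch(t);
		requeued = 1;
	}
	rq->lock_held = 0;              /* drop it again */
	return requeued;
}
```

If the hint is wrong in either direction, the worst outcome in this model is a skipped or redundant priority recalculation, never a dequeue of a task that is not queued.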

Ingo

2003-03-28 15:46:03

by Mike Galbraith

Subject: Re: 2.5.66-mm1

At 09:56 AM 3/28/2003 -0500, Zwane Mwaikambo wrote:
>On Fri, 28 Mar 2003, Mike Galbraith wrote:
>
> > >hm, this is an 'impossible' scenario from the scheduler code POV. Whenever
> > >we deactivate a task, we remove it from the runqueue and set p->array to
> > >NULL. Whenever we activate a task again, we set p->array to non-NULL. A
> > >double-deactivate is not possible. I tried to reproduce it with various
> > >scheduler workloads, but didn't succeed.
> > >
> > >Mike, do you have a backtrace of the crash you saw?
> >
> > No, I didn't save it due to "grubby fingerprints".
>
>Hmm i think i may have hit this one but i never posted due to being unable
>to reproduce it on a vanilla kernel or the same kernel afterwards (which
>was hacked so i won't vouch for its cleanliness). I think preempt
>might have bitten him in a bad place (mine is also CONFIG_PREEMPT), is it
>possible that when we did the task_rq_unlock we got preempted and when we
>got back we used the local variable requeue_waker which was set before
>dropping the lock, and therefore might not be valid anymore due to
>scheduler decisions done after dropping the runqueue lock?

Dunno. I did have one lying around. The attached one was while printing
out array switch latency after starvation timeout. Others happened while
printing wakeup stats for p->state > 1 tasks in scheduler_tick() [under
lock w/ wakeup disabled in printk.c]. It's nothing I did to the scheduler
;-) I don't think, but this was in 65-mm3-twiddle-twiddle-twiddle.

>Unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip:
>c011b8d9
>*pde = 00000000
>Oops: 0000 [#1]
>CPU: 0
>EIP: 0060:[<c011b8d9>] Not tainted
>EFLAGS: 00010046
>EIP is at try_to_wake_up+0x1e9/0x4f0
>eax: c055a000 ebx: c04e5aa0 ecx: c0552fc0 edx: c04e5aa0
>esi: 00000000 edi: 00000000 ebp: c055bee4 esp: c055beb8
>ds: 007b es: 007b ss: 0068
>Process swapper (pid: 0, threadinfo=c055a000 task=c04e5aa0)
>Stack: 00000001 c055a000 c0552fc0 00000000 cb1a0000 00000001 00000001 00000002
> 00000000 c04e88e4 00000001 c055bf08 c011d172 c1694700 00000001 00000000
> c04e88e4 c04e88dc c055a000 00000001 c055bf3c c011d203 c04e88dc 00000001
>Call Trace:
> [<c011d172>] __wake_up_common+0x32/0x60
> [<c011d203>] __wake_up+0x63/0xb0
> [<c0122fb5>] release_console_sem+0x165/0x170
> [<c0122d7b>] printk+0x1eb/0x270
> [<c015e210>] invalidate_bh_lru+0x0/0x60
> [<c015e210>] invalidate_bh_lru+0x0/0x60
> [<c015e210>] invalidate_bh_lru+0x0/0x60
> [<c01163f2>] smp_call_function_interrupt+0x42/0xb0
> [<c015e210>] invalidate_bh_lru+0x0/0x60
> [<c0106eb0>] default_idle+0x0/0x40
> [<c010a41a>] call_function_interrupt+0x1a/0x20
> [<c0106eb0>] default_idle+0x0/0x40
> [<c0106ede>] default_idle+0x2e/0x40
> [<c0106f6a>] cpu_idle+0x3a/0x50
> [<c0105000>] rest_init+0x0/0x80
>
>Code: 8b 06 48 89 06 8b 4a 24 8b 42 20 89 01 89 48 04 8b 4a 18 8d
>
>0xc011b8d9 is in try_to_wake_up (kernel/sched.c:282).
>277 /*
>278 * Adding/removing a task to/from a priority array:
>279 */
>280 static inline void dequeue_task(struct task_struct *p, prio_array_t *array)
>281 {
>282 array->nr_active--;
>283 list_del(&p->run_list);
>284 if (list_empty(array->queue + p->prio))
>285 __clear_bit(p->prio, array->bitmap);
>286 }

Same spot.

-Mike


Attachments:
oops.txt (3.11 kB)

2003-03-28 15:50:05

by Mike Galbraith

Subject: Re: 2.5.66-mm1

At 04:25 PM 3/28/2003 +0100, Ingo Molnar wrote:

>On Fri, 28 Mar 2003, Zwane Mwaikambo wrote:
>
> > Hmm i think i may have hit this one but i never posted due to being
> > unable to reproduce it on a vanilla kernel or the same kernel afterwards
> > (which was hacked so i won't vouch for its cleanliness). I think
> > preempt might have bitten him in a bad place (mine is also
> > CONFIG_PREEMPT), is it possible that when we did the task_rq_unlock we
> > got preempted and when we got back we used the local variable
> > requeue_waker which was set before dropping the lock, and therefore
> > might not be valid anymore due to scheduler decisions done after
> > dropping the runqueue lock?
>
>yes, this one was my only suspect, but it should really never cause any
>problems. We might change sleep_avg during the wakeup, and carry the
>requeue_waker flag over a preemptible window, but the requeueing itself
>re-takes the runqueue lock, and does not take anything for granted. The
>flag could very well be random as well, and the code should still be
>correct - there's no requirement to recalculate the priority every time we
>change sleep_avg. (in fact we at times intentionally keep those values
>detached.)

In my 66-twiddle tree, I moved that under the lock out of pure paranoia. I
can try to see if printing under hefty (very) load will still trigger the
occasional explosion.

-Mike