2002-09-16 14:22:04

by Peter Waechtler

Subject: Oops in sched.c on PPro SMP

I suffered from lockups on PPro SMP when I switched from 2.4.18-SuSE
to 2.4.19 and 2.4.20-pre7.

First I thought it was caused by bttv, overheating or a stack overflow -
but to make it short: I think it's a bug in schedule() or
something broken with CONFIG_X86_PPRO_FENCE.

I captured three Oopses at exactly the same instruction:
ksymoops 2.4.3 on i686 2.4.20-pre7. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.20-pre7/ (default)
-m /boot/System.map-2.4.20-pre7 (specified)

Unable to handle kernel NULL pointer dereference at virtual address 00000020
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0117f10>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00210207
eax: 00000005 ebx: 00000014 ecx: 00000000 edx: 00000006
esi: c0be0000 edi: c51d803c ebp: c0be1fbc esp: c0be1f8c
ds: 0018 es: 0018 ss: 0018
Process setiathome (pid: 4674, stackpage=c0be1000)
Stack: c0be0000 000040b1 000040b0 cd406ea0 c02c1400 00000000 00000001 00000001
00000000 00000000 c0be0000 c03079c0 bffff0cc c0108d09 00000000 0000000c
409d0040 000040b1 000040b0 bffff0cc 000c40b1 0000002b 0020002b ffffffef
Call Trace: [<c0108d09>]
Code: 8b 51 20 d1 fa 89 d8 2b 41 24 c1 f8 02 8d 54 10 01 89 51 20

>>EIP; c0117f10 <schedule+248/550> <=====
Trace; c0108d08 <reschedule+4/c>
Code; c0117f10 <schedule+248/550>
00000000 <_EIP>:
Code; c0117f10 <schedule+248/550> <=====
0: 8b 51 20 mov 0x20(%ecx),%edx <=====
Code; c0117f12 <schedule+24a/550>
3: d1 fa sar %edx
Code; c0117f14 <schedule+24c/550>
5: 89 d8 mov %ebx,%eax
Code; c0117f16 <schedule+24e/550>
7: 2b 41 24 sub 0x24(%ecx),%eax
Code; c0117f1a <schedule+252/550>
a: c1 f8 02 sar $0x2,%eax
Code; c0117f1c <schedule+254/550>
d: 8d 54 10 01 lea 0x1(%eax,%edx,1),%edx
Code; c0117f20 <schedule+258/550>
11: 89 51 20 mov %edx,0x20(%ecx)

---------- Oops 2 -------------
ksymoops 2.4.3 on i686 2.4.19. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.19/ (default)
-m /boot/System.map-2.4.19 (default)

Unable to handle kernel NULL pointer dereference at virtual address 00000020
*pde = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[<c0117f80>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00210207
eax: 00000005 ebx: 00000014 ecx: 00000000 edx: 00000006
esi: c6d82000 edi: c6d8203c ebp: c6d83fbc esp: c6d83f8c
ds: 0018 es: 0018 ss: 0018
Process setiathome (pid: 2005, stackpage=c6d83000)
Stack: c6d82000 40ef2040 0000013c c7125080 c02b9900 00000004 c011590f 00000002
00000000 00000001 c6d82000 c02ff9e0 bffff05c c0108a99 40efa2fc 40ef7874
40ef4ed4 40ef2040 0000013c bffff05c 00000000 c010002b 0000002b ffffffef
Call Trace: [<c011590f>] [<c0108a99>]
Code: 8b 51 20 d1 fa 89 d8 2b 41 24 c1 f8 02 8d 54 10 01 89 51 20

>>EIP; c0117f80 <schedule+228/530> <=====
Trace; c011590e <smp_apic_timer_interrupt+f2/114>
Trace; c0108a98 <reschedule+4/c>
Code; c0117f80 <schedule+228/530>
00000000 <_EIP>:
Code; c0117f80 <schedule+228/530> <=====
0: 8b 51 20 mov 0x20(%ecx),%edx <=====
---------- Oops 3 (not sure about kernel version) -------------
Unable to handle kernel NULL pointer dereference at virtual address 00000020
*pde = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[<c0117f80>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00210207
eax: 00000005 ebx: 00000014 ecx: 00000000 edx: 00000006
esi: c529a000 edi: c529a03c ebp: c529bfbc esp: c529bf8c
ds: 0018 es: 0018 ss: 0018
Process setiathome (pid: 2001, stackpage=c529b000)
Stack: c529a000 000c2703 00184e08 cd43a7a0 c02b9900 00000004 c011590f 00000002
00000000 00000001 c529a000 c02ff9e0 bffff0bc c0108a99 00184e06 0807ac20
0000009f 000c2703 00184e08 bffff0bc 001842f0 c010002b 0000002b ffffffef
Call Trace: [<c011590f>] [<c0108a99>]
Code: 8b 51 20 d1 fa 89 d8 2b 41 24 c1 f8 02 8d 54 10 01 89 51 20

>>EIP; c0117f80 <schedule+2b8/550> <=====
Trace; c011590e <nmi_watchdog_tick+4e/100>
Trace; c0108a98 <do_signal+22c/294>
Code; c0117f80 <schedule+2b8/550>
00000000 <_EIP>:
Code; c0117f80 <schedule+2b8/550> <=====
0: 8b 51 20 mov 0x20(%ecx),%edx <=====
---------- Oops 4 -------------
While typing this: Oops with ecx!=NULL:
Unable to handle kernel paging request< >t rirtial adpress e1170009
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0117f10>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00210206
eax: f7e60005 ebx: 00000014 ecx: e9000009 edx: fa345205
esi: c45d2000 edi: c45d203c ebp: c45d3fbc esp: c45d3f8c
ds: 0018 es: 0018 ss: 0018
Process setiathome (pid: 2035, stackpage=c45d3000)
Stack: c45d2000 00013e0c 40ec2040 c26efc20 c02c1400 00000000 00000001 00000001
00000000 00000000 c45d2000 c03079c0 bfffefec c0108d09 0001360c 0001320c
40f0b870 00013e0c 40ec2040 bfffefec 00012e0c 0000002b 0000002b ffffffef
Call Trace: [<c0108d09>]
Code: 8b 51 20 d1 fa 89 d8 2b 41 24 c1 f8 02 8d 54 10 01 89 51 20

>>EIP; c0117f10 <schedule+248/550> <=====
Trace; c0108d08 <reschedule+4/c>
Code; c0117f10 <schedule+248/550>
00000000 <_EIP>:
Code; c0117f10 <schedule+248/550> <=====
0: 8b 51 20 mov 0x20(%ecx),%edx <=====
[...]


Here is the relevant intermixed assembly/source from objdump:
gcc -I ../include -I ../include/linux -I ../include/asm-i386 \
-g -O2 -c sched.c -D__KERNEL__
objdump -S sched.o >sched.txt

/* Do we need to re-calculate counters? */
if (unlikely(!c)) {
62d: 83 7d f0 00 cmpl $0x0,0xfffffff0(%ebp)
631: 75 6d jne 6a0 <schedule+0x290>
:"0" (oldval) : "memory"

static inline void spin_unlock(spinlock_t *lock)
{
char oldval = 1;
633: b0 01 mov $0x1,%al
#if SPINLOCK_DEBUG
if (lock->magic != SPINLOCK_MAGIC)
BUG();
if (!spin_is_locked(lock))
BUG();
#endif
__asm__ __volatile__(
635: 86 05 00 00 00 00 xchg %al,0x0
struct task_struct *p;

spin_unlock_irq(&runqueue_lock);
63b: fb sti <===== INTERRUPTS ENABLE
*/
/* the spinlock helpers are in arch/i386/kernel/semaphore.c */
static inline void read_lock(rwlock_t *rw)
{
63c: b8 00 00 00 00 mov $0x0,%eax
#if SPINLOCK_DEBUG
if (rw->magic != RWLOCK_MAGIC)
BUG();
#endif
__build_read_lock(rw, "__read_lock_failed");
641: f0 83 28 01 lock subl $0x1,(%eax)
645: 0f 88 59 13 00 00 js 19a4 <Letext+0x40>
read_lock(&tasklist_lock);
for_each_task(p)
64b: 8b 0d 48 00 00 00 mov 0x48,%ecx
651: 81 f9 00 00 00 00 cmp $0x0,%ecx
657: 74 26 je 67f <schedule+0x26f>
659: bb 14 00 00 00 mov $0x14,%ebx
65e: 89 f6 mov %esi,%esi
p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
660: 8b 51 20 mov 0x20(%ecx),%edx <= CRASH
663: d1 fa sar %edx
665: 89 d8 mov %ebx,%eax
667: 2b 41 24 sub 0x24(%ecx),%eax
66a: c1 f8 02 sar $0x2,%eax
66d: 8d 54 10 01 lea 0x1(%eax,%edx,1),%edx
671: 89 51 20 mov %edx,0x20(%ecx)
674: 8b 49 48 mov 0x48(%ecx),%ecx
677: 81 f9 00 00 00 00 cmp $0x0,%ecx
67d: 75 e1 jne 660 <schedule+0x250>
read_unlock(&tasklist_lock);
67f: f0 ff 05 00 00 00 00 lock incl 0x0
spin_lock_irq(&runqueue_lock);
686: fa cli
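
To make the crash site easier to read, here is a mapping of the registers
back to the C (the offsets are taken from the disassembly above; the field
names are the 2.4 task_struct members I take to sit at those offsets, so
treat the layout as an inference). It also explains why the fault address
is exactly 00000020:

/* assuming %ecx = p (a struct task_struct *):
 *   0x20(%ecx) ~ p->counter    <- the faulting load; with p == NULL this
 *                                 is the NULL dereference at 00000020
 *   0x24(%ecx) ~ p->nice
 *   0x48(%ecx) ~ p->next_task  <- the list link followed at 674
 *
 * for_each_task() expands (2.4 include/linux/sched.h, from memory) to:
 */
#define for_each_task(p) \
	for (p = &init_task ; (p = p->next_task) != &init_task ; )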


First I thought the readlocks were broken by the compiler, due
to syntax changes. But staring at the code I wondered how
%ecx could become zero at 660: - from this code it's impossible!
But wait: we enabled interrupts again...

So my explanation is as follows: the scheduler is interrupted
and entry.S calls:

ENTRY(ret_from_intr)
GET_CURRENT(%ebx)
ret_from_exception:
movl EFLAGS(%esp),%eax # mix EFLAGS and CS
movb CS(%esp),%al
testl $(VM_MASK | 3),%eax # return to VM86 mode or non-supervisor?
jne ret_from_sys_call
jmp restore_all

ALIGN
reschedule:
call SYMBOL_NAME(schedule) # test
jmp ret_from_sys_call

ENTRY(ret_from_sys_call)
cli # need_resched and signals atomic test
cmpl $0,need_resched(%ebx)
jne reschedule
cmpl $0,sigpending(%ebx)
jne signal_return
restore_all:
RESTORE_ALL


The spin_unlock_irq(&runqueue_lock) was already there in 2.4.16.
I can't check when it was introduced - perhaps it's been there since
the beginning? But this explains the crashes in the scheduler I
reported half a year ago - I think it always happened when switching
from a SuSE kernel to a Linus tree.

So there are 2 possibilities: either the spin_unlock_irq(&runqueue_lock)
in the scheduler is wrong - but then more SMP users should have
noticed - or CONFIG_X86_PPRO_FENCE does not work as expected.
I will replace the spin_unlock_irq with spin_unlock for a test spin..


This is a snippet from
diff -Nu linux-2.4.18.SuSE/kernel/sched.c linux-2.4.19/kernel/sched.c

 	/* Do we need to re-calculate counters? */
 	if (unlikely(!c)) {
-		++rcl_curr;
-		list_for_each(tmp, numa_runqueue_head(numa_node_id())) {
-			p = list_entry(tmp, struct task_struct, run_list);
-			p->time_slice = TASK_TIMESLICE(p);
-		}
-#ifdef CONFIG_NUMA_SCHED
-		if (recalculate_all) {
-			int nid;
-
-			for (nid = 0; nid < numnodes; nid++) {
-				if (nid == numa_node_id())
-					continue;
-				list_for_each(tmp, numa_runqueue_head(nid)) {
-					p = list_entry(tmp, struct task_struct, run_list);
-					p->time_slice = TASK_TIMESLICE(p);
-				}
-			}
-		}
-#endif
+		struct task_struct *p;
+
+		spin_unlock_irq(&runqueue_lock);
+		read_lock(&tasklist_lock);
+		for_each_task(p)
+			p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
+		read_unlock(&tasklist_lock);
+		spin_lock_irq(&runqueue_lock);
 		goto repeat_schedule;
 	}
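
For reference, NICE_TO_TICKS resolves to the constants visible in the
disassembly above (quoted from memory from 2.4 include/linux/sched.h, so
double-check against the tree; with HZ=100 the first branch applies, which
matches the mov $0x14, sar $0x2 and lea 0x1(...) sequence):

#if HZ < 200
#define TICK_SCALE(x)	((x) >> 2)
#elif HZ < 400
#define TICK_SCALE(x)	((x) >> 1)
#elif HZ < 800
#define TICK_SCALE(x)	(x)
#elif HZ < 1600
#define TICK_SCALE(x)	((x) << 1)
#else
#define TICK_SCALE(x)	((x) << 2)
#endif

#define NICE_TO_TICKS(nice)	(TICK_SCALE(20-(nice))+1)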


--- snippet .config ---
CONFIG_M686=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_L1_CACHE_SHIFT=5
CONFIG_X86_HAS_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_PGE=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_PPRO_FENCE=y
CONFIG_X86_F00F_WORKS_OK=y
CONFIG_X86_MCE=y

--- /proc/cpuinfo ---
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 1
model name : Pentium Pro
stepping : 7
cpu MHz : 200.457
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
bogomips : 399.76

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 1
model name : Pentium Pro
stepping : 9
cpu MHz : 200.457
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
bogomips : 400.58


2002-09-16 14:42:11

by Alan

Subject: Re: Oops in sched.c on PPro SMP

On Mon, 2002-09-16 at 15:25, Peter Waechtler wrote:
> I suffered from lockups on PPro SMP when I switched from 2.4.18-SuSE
> to 2.4.19 and 2.4.20-pre7

What compiler did you build it with? I've seen oopses like this from
gcc 3.0.x that went away with gcc 3.2, 2.95 or 2.96.

Also does turning off the nmi watchdog junk make the box stable?

2002-09-16 15:37:23

by Andrea Arcangeli

Subject: Re: Oops in sched.c on PPro SMP

On Mon, Sep 16, 2002 at 04:25:03PM +0200, Peter Waechtler wrote:
> Process setiathome (pid: 2035, stackpage=c45d3000)
^^^^^^^^^^ stress the cpu

> 651: 81 f9 00 00 00 00 cmp $0x0,%ecx
^^^^^^^^^^^^^^^^^
> 657: 74 26 je 67f <schedule+0x26f>
> 659: bb 14 00 00 00 mov $0x14,%ebx
> 65e: 89 f6 mov %esi,%esi
> p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
> 660: 8b 51 20 mov 0x20(%ecx),%edx <= CRASH
> 663: d1 fa sar %edx
> 665: 89 d8 mov %ebx,%eax
> 667: 2b 41 24 sub 0x24(%ecx),%eax
> 66a: c1 f8 02 sar $0x2,%eax
> 66d: 8d 54 10 01 lea 0x1(%eax,%edx,1),%edx
> 671: 89 51 20 mov %edx,0x20(%ecx)
> 674: 8b 49 48 mov 0x48(%ecx),%ecx
> 677: 81 f9 00 00 00 00 cmp $0x0,%ecx
^^^^^^^^^^^^^^^^
> 67d: 75 e1 jne 660 <schedule+0x250>
> read_unlock(&tasklist_lock);
> 67f: f0 ff 05 00 00 00 00 lock incl 0x0
> spin_lock_irq(&runqueue_lock);
> 686: fa cli

as you said, apparently ecx cannot be zero. But I bet the 0x00 is really
waiting for relocation at link time. In fact it shouldn't be zero; it
should be the address of init_task (&init_task).

Can you disassemble the .o object using objdump -Dr, or disassemble
this piece of code from the vmlinux instead of compiling with the -S
flag, to verify that? If it really checks against zero then it's a
miscompilation; it should check against &init_task as said above.

> First I thought the readlocks were broken by the compiler, due
> to syntax changes. But staring at the code I wondered how
> %ecx could become zero at 660: - from this code it's impossible!
> But wait: we enabled interrupts again...
>
> So my explanation is as follows: the scheduler is interrupted
> and entry.S calls:

interrupts cannot clobber %ecx or change the tasklist; if they clobbered
%ecx or modified the tasklist, that would be the real bug.

> So there are 2 possibilities: either the spin_unlock_irq(&runqueue_lock)
> in the scheduler is wrong - but then more SMP users should have
> noticed - or CONFIG_X86_PPRO_FENCE does not work as expected.

the PPRO_FENCE one is stricter than the non-FENCE one. However, the
corruption you notice here is in the tasklist, and the read/write locks
are not affected by the FENCE option, so FENCE isn't likely to explain it.

If anything, I would suspect something wrong in the read/write
spinlocks. To rule out that possibility you could, for example, replace
all the read_lock and write_lock around the tasklist with
spin_lock_irqsave/spin_unlock_irqrestore. That way you would use the
FENCE xchg functionality around the tasklist too, and you would also
make sure that no irq can happen in between, just to be 100% sure that
if it crashes it's because the memory is corrupted somehow. But really,
the read/write locks just use the lock on the bus when they modify the
spinlock value, so I'd be *very* surprised if that didn't work on ppro.
The non-FENCE case on recent cpus is used to skip the lock on the bus
during the unlock operation, to exploit the release semantics of all
writes to memory in writeback cache mode on recent x86 (which allows
unrelated speculative reads outside the critical section to slip
inside).
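
(Concretely, the only thing CONFIG_X86_PPRO_FENCE changes here is the
unlock store - this is a paraphrase from memory of the two 2.4
include/asm-i386/spinlock.h variants, so check the exact tree; the xchg
variant is what your objdump already shows for spin_unlock, the
"xchg %al,..." above:)

/* normal case: a plain store is enough, stores have release
 * semantics on x86 */
#define spin_unlock_string \
	"movb $1,%0"

/* CONFIG_X86_PPRO_FENCE (and CONFIG_X86_OOSTORE): use a serializing
 * xchg instead, to work around PPro store-ordering errata */
#define spin_unlock_string \
	"xchgb %b0,%1"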

I really suspect a hardware fault here. If you can reproduce easily,
you could try dropping a dimm of ram and retesting; you can also try
memtst from the cerberus testsuite or/and memtest86 from the lilo menu.

the tasklist walking very likely triggers very quick cacheline
bouncing and lots of ram I/O; 99% of hardware bugs trigger while
walking lists, because of the cpu-dcache thrashing that the ram cannot
cope with. Probably the O(1) scheduler would hide the problem because
it drops that tight loop. Note that the 2.4.18 SuSE kernel scheduler
algorithm is O(N) where N is the number of running tasks, never the
total number of tasks in the system, while in mainline the scheduler is
O(N) where N is the total number of tasks in the system. This means
that in mainline you can end up walking a list with 100/500 elements,
while with the SuSE kernel you'll walk a list of around 2/3 elements,
depending on the workload of course. The O(1) scheduler included in 8.1
reduces N to a constant 1. So if you cannot reproduce with the SuSE 8.0
kernel, it could simply be because you have lots of tasks in your
system but only a few of them run at the same time. That's a dramatic
optimization in the SuSE kernel, but it's not a bugfix: it only hides
the corruption in your case methinks, like the O(1) scheduler in 8.1
will hide it even better, even if you have lots of tasks running at the
same time ;). It is true that to walk the runqueue we need the
runqueue_lock, which needs irqs disabled, but regardless, irqs must not
screw up anything in the tasklist.

So I would say it's either a hardware issue, or random memory
corruption generated by some driver. Just some guesses. And if it's the
irq clobbering %ecx or the tasklist, then something is very wrong in
the irq code or in the hardware (I'd exclude that possibility, but you
can try adding _irq to the read_lock/read_unlock of the tasklist_lock
to disable irqs and see if you can still reproduce). If %ecx is really
checked against zero, something is very wrong as well, but in the
compiler, and that would explain things too (you're recommended to use
either 2.95 or 3.2).

Hope this helps,

Andrea

2002-09-16 15:39:22

by Andrea Arcangeli

Subject: Re: Oops in sched.c on PPro SMP

On Mon, Sep 16, 2002 at 03:49:27PM +0100, Alan Cox wrote:
> Also does turning off the nmi watchdog junk make the box stable?

good idea, I didn't think of this one since I've only heard of the nmi
locking up boxes hard after hours of load, never generating any
malfunction, but the nmi handling probably isn't one of the most
exercised hardware paths in the cpus, so it's a good idea to try to
reproduce with it turned off (OTOH I guess you probably turned it on
explicitly only after you got these troubles, in order to debug them).

Andrea

2002-09-16 21:05:32

by Peter Waechtler

Subject: Re: Oops in sched.c on PPro SMP

On Monday, September 16, 2002, at 16:49, Alan Cox wrote:

> On Mon, 2002-09-16 at 15:25, Peter Waechtler wrote:
>> I suffered from lockups on PPro SMP when I switched from 2.4.18-SuSE
>> to 2.4.19 and 2.4.20-pre7
>
> What compiler did you build it with ? I've seen oopses like this from
> gcc 3.0.x that went away with gcc 3.2, 2.95 or 2.96
>
> Also does turning off the nmi watchdog junk make the box stable ?
>

It's 2.95.3:

Reading specs from /usr/lib/gcc-lib/i486-suse-linux/2.95.3/specs
gcc version 2.95.3 20010315 (SuSE)

But I will check the readlocks more closely.

2002-09-16 21:11:29

by Peter Waechtler

Subject: Re: Oops in sched.c on PPro SMP

On Monday, September 16, 2002, at 17:44, Andrea Arcangeli wrote:

> On Mon, Sep 16, 2002 at 03:49:27PM +0100, Alan Cox wrote:
>> Also does turning off the nmi watchdog junk make the box stable?
>
> good idea, I didn't think of this one since I've only heard of the nmi
> locking up boxes hard after hours of load, never generating any
> malfunction, but the nmi handling probably isn't one of the most
> exercised hardware paths in the cpus, so it's a good idea to try to
> reproduce with it turned off (OTOH I guess you probably turned it on
> explicitly only after you got these troubles, in order to debug them).
>

I only turned the nmi watchdog on for the one "unknown" version Oops.

This box was running fine with 2.4.18-SuSE, with uptimes of 40+ days.
_Now_ I am almost sure that it's _not_ a hardware problem (FENCE
counting as software here - since there is a software workaround).

I had 3 lockups in 2 days when I switched to 2.4.19 - and at even lower
room temperature. No, there _must_ be a bug :)

About the relocation you are right - I thought it would test against
NULL :-(

I think the tasklist gets broken in between - possibly due to broken
readlocks (no working FENCE on PPro)

Can someone explain to me the difference between labels 1 and 2?
Why is the "js 2f" there? I don't fully understand this -
it looks broken to me.

include/asm-i386/rwlock.h

#define __build_read_lock_ptr(rw, helper) \
asm volatile(LOCK "subl $1,(%0)\n\t" \
"js 2f\n" \
"1:\n" \
LOCK_SECTION_START("") \
"2:\tcall " helper "\n\t" \
"jmp 1b\n" \
LOCK_SECTION_END \
::"a" (rw) : "memory")

2002-09-16 23:07:48

by Andrea Arcangeli

Subject: Re: Oops in sched.c on PPro SMP

On Mon, Sep 16, 2002 at 11:16:20PM +0200, Peter Waechtler wrote:
> On Monday, September 16, 2002, at 17:44, Andrea Arcangeli wrote:
>
> >On Mon, Sep 16, 2002 at 03:49:27PM +0100, Alan Cox wrote:
> >>Also does turning off the nmi watchdog junk make the box stable?
> >
> >good idea, I didn't think of this one since I've only heard of the nmi
> >locking up boxes hard after hours of load, never generating any
> >malfunction, but the nmi handling probably isn't one of the most
> >exercised hardware paths in the cpus, so it's a good idea to try to
> >reproduce with it turned off (OTOH I guess you probably turned it on
> >explicitly only after you got these troubles, in order to debug them).
> >
>
> I only turned the nmi watchdog on for the one "unknown" version Oops.
>
> This box was running fine with 2.4.18-SuSE, with uptimes of 40+ days.
> _Now_ I am almost sure that it's _not_ a hardware problem (FENCE
> counting as software here - since there is a software workaround).
>
> I had 3 lockups in 2 days when I switched to 2.4.19 - and at even lower
> room temperature. No, there _must_ be a bug :)

possible. Which kernel was running on the machine before 2.4.18-SuSE?

It could be an unlucky thrashing pattern run by the tasklist walking that
is eliminated in the SuSE kernel, so it may not be a bug in mainline,
but we don't know for sure.

> About the relocation you are right - I thought it would test against
> NULL :-(
>
> I think the tasklist gets broken in between - possibly due to broken
> readlocks (no working FENCE on PPro)

The SuSE kernel should have exactly the same read/write locks, and you
said the SuSE kernel works fine for you (and the read/write locks aren't
used only for the tasklist).

> Can someone explain to me the difference between labels 1 and 2?
> Why is the "js 2f" there? I don't fully understand this -
> it looks broken to me.

it's correct; if not, we would have noticed long ago ;)

What it does is subtract 1 from the lock; if the result goes negative
(signed) it jumps into the looping slow path (label 2). When it finally
stops looping because it acquired the lock, it jumps back to 1 and
enters the critical section. The slow path takes care of acquiring the
lock internally, first polling without requiring the cacheline
exclusive before retrying the trylock.

>
> include/asm-i386/rwlock.h
>
> #define __build_read_lock_ptr(rw, helper) \
> asm volatile(LOCK "subl $1,(%0)\n\t" \
> "js 2f\n" \
> "1:\n" \
> LOCK_SECTION_START("") \
> "2:\tcall " helper "\n\t" \
> "jmp 1b\n" \
> LOCK_SECTION_END \
> ::"a" (rw) : "memory")

As said, the oops shows clear corruption in the tasklist, which contains
null pointers or random values. So I still tend to think this does not
show up with the SuSE kernel because it doesn't need to loop through the
whole long tasklist to make the scheduling decision, which significantly
reduces the cacheline thrashing and memory I/O. You can also try to run
a loop like:

import os

while 1:
    x = os.listdir('/proc')

that will trigger frequent tasklist walking too, even if it's not a
tight loop like the one schedule() triggers. You could try to simulate
it with a kernel module too. If you can make it easily reproducible,
it'll be easier to find what's wrong.
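
Such a module could look roughly like this (a sketch against the 2.4
APIs used in sched.c above - for_each_task and tasklist_lock - untested;
the module name, iteration count and the summing of p->counter are all
made up for illustration):

/* tasklist_walk.c: hammer the tasklist the way schedule()'s
 * recalculation loop does */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/sched.h>

static int __init walk_init(void)
{
	struct task_struct *p;
	long sum = 0;
	int i;

	for (i = 0; i < 1000000; i++) {
		read_lock(&tasklist_lock);
		for_each_task(p)
			sum += p->counter;  /* touch what schedule() touches */
		read_unlock(&tasklist_lock);
	}
	printk(KERN_INFO "tasklist walked, sum=%ld\n", sum);
	return 0;
}

static void __exit walk_exit(void)
{
}

module_init(walk_init);
module_exit(walk_exit);
MODULE_LICENSE("GPL");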

You can also try backing out the dynsched patch from the 8.0 kernel and
see if it becomes reproducible then (you should find the patch in the
srpm or in my kernel.org directory).

You can also compare the number of tasks in the system with the number
of simultaneously runnable tasks. And also have a look at whether the
problem goes away if you stop setiathome.

Hope this helps,

Andrea

2002-09-17 02:46:28

by Adam Kropelin

Subject: Re: Oops in sched.c on PPro SMP

I just built and booted 2.4.20-pre7 on my SMP ppro. If there's some specific
workload I can run to try to reproduce this, I'd be happy to try. Otherwise I'll
just run my normal tasks and see if anything goes haywire. It's been stable for
four hours so far.

(Built with gcc 2.96, BTW.)

--Adam

2002-09-17 17:07:04

by Peter Waechtler

Subject: Re: Oops in sched.c on PPro SMP

On Monday, September 16, 2002, at 17:42, Andrea Arcangeli wrote:

> Can you disassemble the .o object using objdump -Dr, or disassemble
> this piece of code from the vmlinux instead of compiling with the -S
> flag, to verify that? If it really checks against zero then it's a
> miscompilation; it should check against &init_task as said above.
>
>

objdump -Dr vmlinux
--- schedule snippet ---
if (unlikely(!c)) {
c0117edb: 83 7d f0 00 cmpl $0x0,0xfffffff0(%ebp)
c0117edf: 75 6f jne c0117f50 <schedule+0x288>

spin_unlock_irq(&runqueue_lock);
c0117ee1: b0 01 mov $0x1,%al
c0117ee3: 86 05 80 79 30 c0 xchg %al,0xc0307980
c0117ee9: fb sti

read_lock(&tasklist_lock);
c0117eea: b8 a0 79 30 c0 mov $0xc03079a0,%eax
c0117eef: f0 83 28 01 lock subl $0x1,(%eax)
c0117ef3: 0f 88 90 12 00 00 js c0119189 <.text.lock.sched+0x40>

for_each_task(p)
p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
c0117ef9: 8b 0d 48 c0 2d c0 mov 0xc02dc048,%ecx
c0117eff: 81 f9 00 c0 2d c0 cmp $0xc02dc000,%ecx
c0117f05: 74 28 je c0117f2f <schedule+0x267>
c0117f07: bb 14 00 00 00 mov $0x14,%ebx
c0117f0c: 8d 74 26 00 lea 0x0(%esi,1),%esi ; for alignment

c0117f10: 8b 51 20 mov 0x20(%ecx),%edx
c0117f13: d1 fa sar %edx
c0117f15: 89 d8 mov %ebx,%eax
c0117f17: 2b 41 24 sub 0x24(%ecx),%eax
c0117f1a: c1 f8 02 sar $0x2,%eax
c0117f1d: 8d 54 10 01 lea 0x1(%eax,%edx,1),%edx
c0117f21: 89 51 20 mov %edx,0x20(%ecx)
c0117f24: 8b 49 48 mov 0x48(%ecx),%ecx
c0117f27: 81 f9 00 c0 2d c0 cmp $0xc02dc000,%ecx
c0117f2d: 75 e1 jne c0117f10 <schedule+0x248>

read_unlock(&tasklist_lock);
c0117f2f: f0 ff 05 a0 79 30 c0 lock incl 0xc03079a0
c0117f36: fa cli

spin_lock_irq(&runqueue_lock);
c0117f37: f0 fe 0d 80 79 30 c0 lock decb 0xc0307980
c0117f3e: 0f 88 4f 12 00 00 js c0119193 <.text.lock.sched+0x4a>

goto repeat_schedule;
c0117f43: e9 e4 fe ff ff jmp c0117e2c <schedule+0x164>


-----
<.text.lock.sched+0x40>
c0119189: e8 66 e9 fe ff call c0107af4 <__read_lock_failed>
c011918e: e9 65 ed ff ff jmp c0117ef8 <schedule+0x230>


c0107af4 <__read_lock_failed>:
c0107af4: f0 ff 00 lock incl (%eax)
c0107af7: f3 90 repz nop
c0107af9: 83 38 01 cmpl $0x1,(%eax)
c0107afc: 78 f9 js c0107af7 <__read_lock_failed+0x3>
c0107afe: f0 ff 08 lock decl (%eax)
c0107b01: 0f 88 ed ff ff ff js c0107af4 <__read_lock_failed>
c0107b07: c3 ret
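
In C, __read_lock_failed is doing roughly this (my sketch only: the
++/-- stand for the locked incl/decl above, and the real code branches
on the flags of the atomic op instead of re-reading the lock word):

void __read_lock_failed(volatile int *rw)
{
	do {
		(*rw)++;        /* lock incl: give back the failed decrement */
		while (*rw < 1) /* poll while a writer holds it (value <= 0) */
			;       /* rep;nop is a polite spin                  */
		(*rw)--;        /* lock decl: take a reader slot again       */
	} while (*rw < 0);      /* a writer got in first: go around again    */
}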


--- do_fork snippet ---
write_lock_irq(&tasklist_lock);

c011a416: fa cli
c011a417: ba a0 79 30 c0 mov $0xc03079a0,%edx
c011a41c: 89 d0 mov %edx,%eax
c011a41e: f0 81 28 00 00 00 01 lock subl $0x1000000,(%eax)
c011a425: 0f 85 6c 03 00 00 jne c011a797 <.text.lock.fork+0xc4>
; we got the lock

write_unlock_irq(&tasklist_lock);
c011a53b: f0 81 05 a0 79 30 c0 lock addl $0x1000000,0xc03079a0
c011a542: 00 00 00 01
c011a546: fb sti

<.text.lock.fork+0xc4>
c011a797: e8 38 d3 fe ff call c0107ad4 <__write_lock_failed>
c011a79c: e9 8a fc ff ff jmp c011a42b <do_fork+0x597>

c0107ad4 <__write_lock_failed>:
c0107ad4: f0 81 00 00 00 00 01 lock addl $0x1000000,(%eax)
c0107adb: f3 90 repz nop
c0107add: 81 38 00 00 00 01 cmpl $0x1000000,(%eax)
c0107ae3: 75 f6 jne c0107adb <__write_lock_failed+0x7>
c0107ae5: f0 81 28 00 00 00 01 lock subl $0x1000000,(%eax)
c0107aec: 0f 85 e2 ff ff ff jne c0107ad4 <__write_lock_failed>
c0107af2: c3 ret
c0107af3: 90 nop
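
And __write_lock_failed analogously (same caveats as my read sketch;
0x01000000 is RW_LOCK_BIAS, visible in the instructions above):

void __write_lock_failed(volatile int *rw)
{
	do {
		*rw += 0x01000000;        /* lock addl: undo the failed grab */
		while (*rw != 0x01000000) /* poll until nobody holds it      */
			;                 /* rep;nop                         */
		*rw -= 0x01000000;        /* lock subl: try the grab again   */
	} while (*rw != 0);               /* someone slipped in: loop        */
}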


--- do_exit even more complicated but looks OK ---

> I really suspect a hardware fault here. If you can reproduce easily,
> you could try dropping a dimm of ram and retesting; you can also try
> memtst from the cerberus testsuite or/and memtest86 from the lilo menu.
>

I did that in January, when I encountered these lockups the first time.
I removed some SIMMs, crashed again, exchanged the SIMMs, and so on.
memtest86 didn't find anything in one day.

Once I had a machine check exception - since then I have lowered the CPU
clock. After the box was running fine at 180MHz I switched to 200MHz
(yes, I overclocked the CPUs at 233MHz for 2 or 3 years - without
problems). 2.4.18-SuSE was running fine with the same load over the
summer. NO lockup in 2 months.

Am I now to believe that the silicon has aged and that tight loop causes
errors? There are similar loops in kernel/exit.c in 2.4.18-SuSE, but it
_always_ crashes in schedule().

I also checked the bttv driver and used the 2.4.18-SuSE one in 2.4.19 -
but it still crashes. I run the driver almost all the time - I have two
cards, one for watching TV or listening to the radio, the other grabs
EPG data.

But the box also crashed when I unloaded the drivers overnight.
Argh, it seems that I'm forced to use the O(1) scheduler!

> So I would say it's either a hardware issue, or random memory
> corruption generated by some driver. Just some guesses. And if it's the
> irq clobbering %ecx or the tasklist, then something is very wrong in
> the irq code or in the hardware (I'd exclude that possibility, but you
> can try adding _irq to the read_lock/read_unlock of the tasklist_lock
> to disable irqs and see if you can still reproduce). If %ecx is really
> checked against zero, something is very wrong as well, but in the
> compiler, and that would explain things too (you're recommended to use
> either 2.95 or 3.2).
>
I use 2.95.3 from SuSE 8.0 - everything seems to be compiled correctly.

Perhaps I should add a MAGIC to the tasklist to notice when it gets
trashed? Puh...
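
(Such a check could look like the sketch below - purely illustrative:
the magic field, its value and the hook points are all made up, and
task_struct would grow by a word:)

/* hypothetical debug aid: set p->magic = TASK_MAGIC when the task is
 * created, clear it when the task is freed, then verify while walking */
#define TASK_MAGIC 0x5ec0de5a

static void check_tasklist(void)
{
	struct task_struct *p;

	read_lock(&tasklist_lock);
	for_each_task(p)
		if (p->magic != TASK_MAGIC)	/* the new field */
			BUG();
	read_unlock(&tasklist_lock);
}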

> Hope this helps,
>
Thanx for your effort.

2002-09-17 17:13:31

by Peter Waechtler

Subject: Re: Oops in sched.c on PPro SMP

On Tuesday, September 17, 2002, at 01:13, Andrea Arcangeli wrote:

> On Mon, Sep 16, 2002 at 11:16:20PM +0200, Peter Waechtler wrote:
>> On Monday, September 16, 2002, at 17:44, Andrea Arcangeli wrote:
>>
>>> On Mon, Sep 16, 2002 at 03:49:27PM +0100, Alan Cox wrote:
>>>> Also does turning off the nmi watchdog junk make the box stable?
>>>
>>> good idea, I didn't think of this one since I've only heard of the
>>> nmi locking up boxes hard after hours of load, never generating any
>>> malfunction, but the nmi handling probably isn't one of the most
>>> exercised hardware paths in the cpus, so it's a good idea to try to
>>> reproduce with it turned off (OTOH I guess you probably turned it on
>>> explicitly only after you got these troubles, in order to debug them).
>>>
>>
>> I only turned the nmi watchdog on for the one "unknown" version Oops.
>>
>> This box was running fine with 2.4.18-SuSE, with uptimes of 40+ days.
>> _Now_ I am almost sure that it's _not_ a hardware problem (FENCE
>> counting as software here - since there is a software workaround).
>>
>> I had 3 lockups in 2 days when I switched to 2.4.19 - and at even lower
>> room temperature. No, there _must_ be a bug :)
>
> possible. Which kernel was running on the machine before 2.4.18-SuSE?
>

I guess 2.4.10-SuSE from 7.3. In January I switched to 2.4.14 and
2.4.17 and applied the xfs patches. At exactly the same instruction:
kaboom!

http://marc.theaimsgroup.com/?l=linux-kernel&m=101113532211430&w=2


>> Can someone explain to me the difference between labels 1 and 2?
>> Why is the "js 2f" there? I don't fully understand this -
>> it looks broken to me.
>
> it's correct; if not, we would have noticed long ago ;)
>
> What it does is subtract 1 from the lock; if the result goes negative
> (signed) it jumps into the looping slow path (label 2). When it finally
> stops looping because it acquired the lock, it jumps back to 1 and
> enters the critical section. The slow path takes care of acquiring the
> lock internally, first polling without requiring the cacheline
> exclusive before retrying the trylock.

After studying the disassembly I now see the "trick" with the jump
into a separate section.

>>
>> include/asm-i386/rwlock.h
>>
>> #define __build_read_lock_ptr(rw, helper) \
>> asm volatile(LOCK "subl $1,(%0)\n\t" \
>> "js 2f\n" \
>> "1:\n" \
>> LOCK_SECTION_START("") \
>> "2:\tcall " helper "\n\t" \
>> "jmp 1b\n" \
>> LOCK_SECTION_END \
>> ::"a" (rw) : "memory")

2002-09-17 17:36:02

by Andrea Arcangeli

Subject: Re: Oops in sched.c on PPro SMP

On Tue, Sep 17, 2002 at 07:11:52PM +0200, Peter Waechtler wrote:
> Once I had a machine check exception - since then I have lowered the
> CPU clock. After the box was running fine at 180MHz I switched to
> 200MHz (yes, I overclocked the CPUs at 233MHz for 2 or 3 years -
> without problems).

I guess this explains the corruption. Please make sure the cpus are not
overclocked at all and then try to reproduce. You cannot choose 180MHz
or 200MHz at random based on which kernel crashes or not: if the cpus
are 180MHz ppros you should use 180MHz only; 200MHz will break. It won't
break as easily as at 233MHz, but it will - the timings are strict on
smp. So please try to reproduce at 180MHz if the cpus are meant to run
at 180MHz.

I don't want to sound boring, but please next time check whether you
can reproduce on non-overclocked hardware before reporting anything to
l-k.

Andrea

2002-09-18 10:56:02

by Peter Waechtler

Subject: Re: Oops in sched.c on PPro SMP

On Tuesday, September 17, 2002, at 19:41, Andrea Arcangeli wrote:

> On Tue, Sep 17, 2002 at 07:11:52PM +0200, Peter Waechtler wrote:
>> Once I had a machine check exception - since then I have lowered the
>> CPU clock. After the box was running fine at 180MHz I switched to
>> 200MHz (yes, I overclocked the CPUs at 233MHz for 2 or 3 years -
>> without problems).
>
> I guess this explains the corruption. Please make sure the cpus are not
> overclocked at all and then try to reproduce. You cannot choose 180MHz
> or 200MHz at random based on which kernel crashes or not: if the cpus
> are 180MHz ppros you should use 180MHz only; 200MHz will break. It won't
> break as easily as at 233MHz, but it will - the timings are strict on
> smp. So please try to reproduce at 180MHz if the cpus are meant to run
> at 180MHz.
>
> I don't want to sound boring, but please next time check whether you
> can reproduce on non-overclocked hardware before reporting anything to
> l-k.
>

They are 200MHz PPros. So don't be bored.
To minimize the risk, I underclocked them, eh?

I slowed down the memory timings - the box survived one night - but it
would be too early to count on that.

2002-09-18 17:47:12

by Andrea Arcangeli

Subject: Re: Oops in sched.c on PPro SMP

On Wed, Sep 18, 2002 at 01:00:58PM +0200, Peter Waechtler wrote:
> They are 200MHz PPros. So don't be bored.
> To minimize the risk, I underclocked them, eh?

ok ;), I misunderstood your previous sentence then.

> I slowed down the memory timings - the box survived one night - but it
> would be too early to count on that.

maybe there's hope ;)

Andrea