LinuxLists.cc - [OOPS] 2.4.20-pre1-ac3, SMP (Dual PIII)

2002-08-14 11:51:05

Subject: [OOPS] 2.4.20-pre1-ac3, SMP (Dual PIII)

Oopsed soon after boot up. Stable with vanilla 2.4.19. The board is Intel
SDS2. dnetc was running.

ksymoops 2.4.5 on i686 2.4.19-rc5. Options used
-V (default)
-K (specified)
-l /proc/modules (default)
-o /lib/modules/2.4.20-pre1-ac3 (specified)
-m /boot/System.map-2.4.20-pre1-ac3 (specified)

No modules in ksyms, skipping objects
No ksyms, skipping lsmod
Unable to handle kernel NULL pointer dereference at virtual address 0000002a
c0115d4c
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0115d4c>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010003
eax: 0000008c ebx: ffffffd6 ecx: c03268a4 edx: f721c000
esi: c0326880 edi: f721c02c ebp: f721dfa4 esp: f721df88
ds: 0018 es: 0018 ss: 0018
Process distributed-net (pid: 521, stackpage=f721d000)
Stack: f721c000 00000000 f721c02c c0112f5f 00000000 f721c000 f721c000 f721dfbc
c0116cff f721c000 00000043 0003f7a0 c0326880 bffff944 c0106f4b 00000000
00000000 40026004 00000043 0003f7a0 bffff944 0000009e 0000002b 0000002b
Call Trace: [<c0112f5f>] [<c0116cff>] [<c0106f4b>]
Code: 8b 4b 54 89 4d f4 8b 72 58 85 c9 75 37 89 73 58 f0 ff 46 14

>>EIP; c0115d4c <schedule+198/384> <=====

>>ebx; ffffffd6 <END_OF_CODE+3fc5a89a/????>
>>ecx; c03268a4 <runqueues+24/14000>
>>edx; f721c000 <END_OF_CODE+36e768c4/????>
>>esi; c0326880 <runqueues+0/14000>
>>edi; f721c02c <END_OF_CODE+36e768f0/????>
>>ebp; f721dfa4 <END_OF_CODE+36e78868/????>
>>esp; f721df88 <END_OF_CODE+36e7884c/????>

Trace; c0112f5f <smp_apic_timer_interrupt+f3/114>
Trace; c0116cff <sys_sched_yield+113/11c>
Trace; c0106f4b <system_call+33/38>

Code; c0115d4c <schedule+198/384>
00000000 <_EIP>:
Code; c0115d4c <schedule+198/384> <=====
0: 8b 4b 54 mov 0x54(%ebx),%ecx <=====
Code; c0115d4f <schedule+19b/384>
3: 89 4d f4 mov %ecx,0xfffffff4(%ebp)
Code; c0115d52 <schedule+19e/384>
6: 8b 72 58 mov 0x58(%edx),%esi
Code; c0115d55 <schedule+1a1/384>
9: 85 c9 test %ecx,%ecx
Code; c0115d57 <schedule+1a3/384>
b: 75 37 jne 44 <_EIP+0x44> c0115d90 <schedule+1dc/384>
Code; c0115d59 <schedule+1a5/384>
d: 89 73 58 mov %esi,0x58(%ebx)
Code; c0115d5c <schedule+1a8/384>
10: f0 ff 46 14 lock incl 0x14(%esi)

00:00.0 Host bridge: ServerWorks CNB20HE Host Bridge (rev 23)
00:00.1 Host bridge: ServerWorks CNB20HE Host Bridge (rev 01)
00:00.2 Host bridge: ServerWorks: Unknown device 0006 (rev 01)
00:00.3 Host bridge: ServerWorks: Unknown device 0006 (rev 01)
00:02.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:03.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 0d)
00:04.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 0d)
00:0f.0 ISA bridge: ServerWorks CSB5 South Bridge (rev 92)
00:0f.1 IDE interface: ServerWorks CSB5 IDE Controller (rev 92)
00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 05)
00:0f.3 Host bridge: ServerWorks: Unknown device 0230
02:04.0 SCSI storage controller: Adaptec 7899P (rev 01)
02:04.1 SCSI storage controller: Adaptec 7899P (rev 01)

--
Antti Salmela

2002-08-14 12:35:45

by Alan

[permalink] [raw]

Subject: Re: [OOPS] 2.4.20-pre1-ac3, SMP (Dual PIII)

On Wed, 2002-08-14 at 12:54, Antti Salmela wrote:
> Oopsed soon after boot up. Stable with vanilla 2.4.19. The board is Intel
> SDS2. dnetc was running.

Does vanilla 2.4.20pre1 run ok ?

2002-08-14 13:07:03

by Antti Salmela

[permalink] [raw]

Subject: Re: [OOPS] 2.4.20-pre1-ac3, SMP (Dual PIII)

On Wed, Aug 14, 2002 at 01:37:10PM +0100, Alan Cox wrote:
> On Wed, 2002-08-14 at 12:54, Antti Salmela wrote:
> > Oopsed soon after boot up. Stable with vanilla 2.4.19. The board is Intel
> > SDS2. dnetc was running.
>
> Does vanilla 2.4.20pre1 run ok ?

Seems to work just fine.

--
Antti Salmela

2002-08-14 13:13:05

by Steffen Persvold

[permalink] [raw]

Subject: pci-dma bug in pci_alloc_consistent on i386 ?

Hi all,

I _think_ I found a little snag in the pci_alloc_consistent code for i386.
What I've discovered is that modern GbE drivers (such as the tg3 driver
written by David and the e1000 driver by Intel), does something like this
in setup :

if (!pci_set_dma_mask(pdev, (u64) 0xffffffffffffffff)) {
pci_using_dac = 1;
} else {
err = pci_set_dma_mask(pdev, (u64) 0xffffffff);
if (err) {
printk(KERN_ERR PFX "No usable DMA configuration, "
"aborting.\n");
goto err_out_free_res;
}
pci_using_dac = 0;
}

if (pci_using_dac)
dev->features |= NETIF_F_HIGHDMA;

On i386 the first pci_set_dma_mask will succeed, because :

(in include/asm-i386/pci.h)
static inline int pci_dma_supported(struct pci_dev *hwdev, u64 mask)
{
/*
* we fall back to GFP_DMA when the mask isn't all 1s,
* so we can't guarantee allocations that must be
* within a tighter range than GFP_DMA..
*/
if(mask < 0x00ffffff)
return 0;

return 1;
}

(in drivers/pci/pci.c)
int
pci_set_dma_mask(struct pci_dev *dev, u64 mask)
{
if (!pci_dma_supported(dev, mask))
return -EIO;

dev->dma_mask = mask;

return 0;
}

And this is just fine, the ethernet adapter will be able to DMA directly
to any memory (even highmem).

However when they are allocating RX and TX descriptors (and thus using
pci_alloc_consistent), they are getting GFP_DMA pages. This is why :

(in arch/i386/kernel/pci-dma.c)
void *pci_alloc_consistent(struct pci_dev *hwdev, size_t size,
dma_addr_t *dma_handle)
{
void *ret;
int gfp = GFP_ATOMIC;

if (hwdev == NULL || hwdev->dma_mask != 0xffffffff)
gfp |= GFP_DMA;
ret = (void *)__get_free_pages(gfp, get_order(size));

if (ret != NULL) {
memset(ret, 0, size);
*dma_handle = virt_to_bus(ret);
}
return ret;
}

IMHO the criteria for when to select GFP_DMA pages is wrong, it should be :

if (hwdev == NULL || hwdev->dma_mask < 0xffffffff)
gfp |= GFP_DMA;

Here's a patch :

--- pci-dma.c.~1~ Wed Aug 14 15:06:49 2002
+++ pci-dma.c Wed Aug 14 15:08:29 2002
@@ -19,7 +19,7 @@
void *ret;
int gfp = GFP_ATOMIC;

- if (hwdev == NULL || hwdev->dma_mask != 0xffffffff)
+ if (hwdev == NULL || hwdev->dma_mask < 0xffffffff)
gfp |= GFP_DMA;
ret = (void *)__get_free_pages(gfp, get_order(size));

Regards,
--
Steffen Persvold | Scalable Linux Systems | Try out the world's best
mailto:[email protected] | http://www.scali.com | performing MPI implementation:
Tel: (+47) 2262 8950 | Olaf Helsets vei 6 | - ScaMPI 1.13.8 -
Fax: (+47) 2262 8951 | N0621 Oslo, NORWAY | >320MBytes/s and <4uS latency

2002-08-14 13:25:44

by Alan

[permalink] [raw]

Subject: Re: [OOPS] 2.4.20-pre1-ac3, SMP (Dual PIII)

On Wed, 2002-08-14 at 14:10, Antti Salmela wrote:
> On Wed, Aug 14, 2002 at 01:37:10PM +0100, Alan Cox wrote:
> > On Wed, 2002-08-14 at 12:54, Antti Salmela wrote:
> > > Oopsed soon after boot up. Stable with vanilla 2.4.19. The board is Intel
> > > SDS2. dnetc was running.
> >
> > Does vanilla 2.4.20pre1 run ok ?
>
> Seems to work just fine.

Really we need to find which kernel the problem started with then. If
you've got the time to spend on this try 2.4.19-ac1

2002-08-14 15:51:57

by Antti Salmela

[permalink] [raw]

Subject: Re: [OOPS] 2.4.20-pre1-ac3, SMP (Dual PIII)

On Wed, Aug 14, 2002 at 02:27:09PM +0100, Alan Cox wrote:
> On Wed, 2002-08-14 at 14:10, Antti Salmela wrote:
> > On Wed, Aug 14, 2002 at 01:37:10PM +0100, Alan Cox wrote:
> > > On Wed, 2002-08-14 at 12:54, Antti Salmela wrote:
> > > > Oopsed soon after boot up. Stable with vanilla 2.4.19. The board is Intel
> > > > SDS2. dnetc was running.
> > >
> > > Does vanilla 2.4.20pre1 run ok ?
> >
> > Seems to work just fine.
>
> Really we need to find which kernel the problem started with then. If
> you've got the time to spend on this try 2.4.19-ac1

2.4.19-rc1-ac2 appears to be the first one that does not work.

ksymoops 2.4.5 on i686 2.4.19-rc1-ac1. Options used
-V (default)
-K (specified)
-L (specified)
-o /lib/modules/2.4.19-rc1-ac2 (specified)
-m /boot/System.map-2.4.19-rc1-ac2 (specified)

No modules in ksyms, skipping objects
Unable to handle kernel NULL pointer dereference at virtual address 0000002a
c0116f0c
*pde = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[<c0116f0c>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010003
eax: 0000008c ebx: c0327680 ecx: c03276a4 edx: f6760000
esi: ffffffd6 edi: f676002c ebp: f6761fa4 esp: f6761f88
ds: 0018 es: 0018 ss: 0018
Process distributed-net (pid: 511, stackpage=f6761000)
Stack: f6760000 00000a00 f676002c 00000001 c011428f f6760000 f6760000 f6761fbc
c0117eef f6760000 000000b5 000b2390 c0327680 bffff934 c01088eb 00000000
00000000 40026004 000000b5 000b2390 bffff934 0000009e c010002b 0000002b
Call Trace: [<c011428f>] [<c0117eef>] [<c01088eb>]
Code: 8b 7e 54 8b 4a 58 89 4d f4 85 ff 75 37 89 4e 58 f0 ff 41 14

>>EIP; c0116f0c <schedule+198/394> <=====

>>ebx; c0327680 <runqueues+a00/14000>
>>ecx; c03276a4 <runqueues+a24/14000>
>>edx; f6760000 <END_OF_CODE+363b9844/????>
>>esi; ffffffd6 <END_OF_CODE+3fc5981a/????>
>>edi; f676002c <END_OF_CODE+363b9870/????>
>>ebp; f6761fa4 <END_OF_CODE+363bb7e8/????>
>>esp; f6761f88 <END_OF_CODE+363bb7cc/????>

Trace; c011428f <smp_apic_timer_interrupt+f3/114>
Trace; c0117eef <sys_sched_yield+113/11c>
Trace; c01088eb <system_call+33/38>

Code; c0116f0c <schedule+198/394>
00000000 <_EIP>:
Code; c0116f0c <schedule+198/394> <=====
0: 8b 7e 54 mov 0x54(%esi),%edi <=====
Code; c0116f0f <schedule+19b/394>
3: 8b 4a 58 mov 0x58(%edx),%ecx
Code; c0116f12 <schedule+19e/394>
6: 89 4d f4 mov %ecx,0xfffffff4(%ebp)
Code; c0116f15 <schedule+1a1/394>
9: 85 ff test %edi,%edi
Code; c0116f17 <schedule+1a3/394>
b: 75 37 jne 44 <_EIP+0x44> c0116f50 <schedule+1dc/394>
Code; c0116f19 <schedule+1a5/394>
d: 89 4e 58 mov %ecx,0x58(%esi)
Code; c0116f1c <schedule+1a8/394>
10: f0 ff 41 14 lock incl 0x14(%ecx)

--
Antti Salmela

2002-08-14 17:27:05

by Christian Ehrhardt

[permalink] [raw]

Subject: Re: [OOPS] 2.4.20-pre1-ac3, SMP (Dual PIII)

Hi,

I invested some time analyzing the Ooops and thought I'd share
what I think I found out:

The code where it Oopses is line 451 in context_switch:
449 static inline task_t * context_switch(task_t *prev, task_t *next)
450 {
451 struct mm_struct *mm = next->mm;

0x54 is the offset of task->mm
At this point next is in %esi (%ebx in the earlier Oops posted). The
value of next is calculated by this code in schedule:
867 idx = sched_find_first_bit(array->bitmap);
868 queue = array->queue + idx;
869 next = list_entry(queue->next, task_t, run_list);

At this point idx is in %eax, i.e. it has a value of 0x8c == 140
in both of the Oopsen. Investigating further on the value of next
(0xffffffd6) shows that this value is the result of
list_entry (0x02, task_t, run_list),
i.e. queue->next == 0x02. Getting back to %eax shows that 140 (== MAX_PRIO
is actually NOT a valid index for array->queue above, i.e. it seems that we
overrun this array by one. Putting a ``BUG_ON(idx >= MAX_PRIO);'' between
lines 867 and 868 above should proof this.

HTH, I have no more time to investigate this now.

regards Christian Ehrhardt

------ Oooops preserved for reference --------------------
On Wed, Aug 14, 2002 at 06:55:05PM +0300, Antti Salmela wrote:
> ksymoops 2.4.5 on i686 2.4.19-rc1-ac1. Options used
> -V (default)
> -K (specified)
> -L (specified)
> -o /lib/modules/2.4.19-rc1-ac2 (specified)
> -m /boot/System.map-2.4.19-rc1-ac2 (specified)
>
> No modules in ksyms, skipping objects
> Unable to handle kernel NULL pointer dereference at virtual address 0000002a
> c0116f0c
> *pde = 00000000
> Oops: 0000
> CPU: 1
> EIP: 0010:[<c0116f0c>] Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010003
> eax: 0000008c ebx: c0327680 ecx: c03276a4 edx: f6760000
> esi: ffffffd6 edi: f676002c ebp: f6761fa4 esp: f6761f88
> ds: 0018 es: 0018 ss: 0018
> Process distributed-net (pid: 511, stackpage=f6761000)
> Stack: f6760000 00000a00 f676002c 00000001 c011428f f6760000 f6760000 f6761fbc
> c0117eef f6760000 000000b5 000b2390 c0327680 bffff934 c01088eb 00000000
> 00000000 40026004 000000b5 000b2390 bffff934 0000009e c010002b 0000002b
> Call Trace: [<c011428f>] [<c0117eef>] [<c01088eb>]
> Code: 8b 7e 54 8b 4a 58 89 4d f4 85 ff 75 37 89 4e 58 f0 ff 41 14
>
>
> >>EIP; c0116f0c <schedule+198/394> <=====
>
> >>ebx; c0327680 <runqueues+a00/14000>
> >>ecx; c03276a4 <runqueues+a24/14000>
> >>edx; f6760000 <END_OF_CODE+363b9844/????>
> >>esi; ffffffd6 <END_OF_CODE+3fc5981a/????>
> >>edi; f676002c <END_OF_CODE+363b9870/????>
> >>ebp; f6761fa4 <END_OF_CODE+363bb7e8/????>
> >>esp; f6761f88 <END_OF_CODE+363bb7cc/????>
>
> Trace; c011428f <smp_apic_timer_interrupt+f3/114>
> Trace; c0117eef <sys_sched_yield+113/11c>
> Trace; c01088eb <system_call+33/38>
>
> Code; c0116f0c <schedule+198/394>
> 00000000 <_EIP>:
> Code; c0116f0c <schedule+198/394> <=====
> 0: 8b 7e 54 mov 0x54(%esi),%edi <=====
> Code; c0116f0f <schedule+19b/394>
> 3: 8b 4a 58 mov 0x58(%edx),%ecx
> Code; c0116f12 <schedule+19e/394>
> 6: 89 4d f4 mov %ecx,0xfffffff4(%ebp)
> Code; c0116f15 <schedule+1a1/394>
> 9: 85 ff test %edi,%edi
> Code; c0116f17 <schedule+1a3/394>
> b: 75 37 jne 44 <_EIP+0x44> c0116f50 <schedule+1dc/394>
> Code; c0116f19 <schedule+1a5/394>
> d: 89 4e 58 mov %ecx,0x58(%esi)
> Code; c0116f1c <schedule+1a8/394>
> 10: f0 ff 41 14 lock incl 0x14(%ecx)
>
> --
> Antti Salmela
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
--
****************************************************************************
** Christian Ehrhardt ** e-Mail: [email protected] *********
****************************************************************************

THAT'S ALL FOLKS!

2002-08-15 01:32:04

by Alan

[permalink] [raw]

Subject: Re: [OOPS] 2.4.20-pre1-ac3, SMP (Dual PIII)

Thanks - your analysis is informative to say the least. It looks like
the PIV load balancing code is the problem.

2002-08-16 14:13:24

by Christian Ehrhardt

[permalink] [raw]

Subject: Re: [OOPS] 2.4.20-pre1-ac3, SMP (Dual PIII)

On Thu, Aug 15, 2002 at 02:33:02AM +0100, Alan Cox wrote:
> Thanks - your analysis is informative to say the least. It looks like
> the PIV load balancing code is the problem.

A few more observations and maybe a solution for the problem
(kernel is 2.4.20-pre1-ac3):

I ran Richard Gooch's scheduler benchmark[1] as a normal user with
num_running set to 50 (./a.out 50). The box locks up hard within a few
seconds. There is no Ooops, Magic-SysRq doesn't work anymore, neither does
console switching or Ctrl-Alt-Delete. NMI-Watchdog is enabled and failed
to reboot the box.

This suggests that sched_yield, nice or sched_setscheduler is involved
with sched_yield beeing #1 candidate.

Further investigation showed that adding BUG_ON(p->array != array);
in dequeue_task would have given some interesting results.
At least the following is quite possible and doesn't even require SMP:

Task Interrupt
Calls sys_sched_yield
======> Timer Interrupt
Timer Interrupt decreases times lice,
the time slice expires and the task is
moved to the expired array.
Continues with yield.
Assume current->prio == MAX_PRIO-1,
current->time_slice <= 1 is satisifed
anyway, i.e. wie do:
dequeue_task(current, rq->active);
enqueue_task(current, rq->expired);
However, the task has already been moved
from the active to the expired array
by the timer interrupt, i.e.
dequeue_task and enqueue_task will get
the nr_active counts and the bitmaps
wrong because they remove the task from
the wrong array. --> BOOOM

The (untested) patch below should correct this problem along with
a locking oddity (last hunk) that IMHO either needs fixing or a BIG
comment. Be prepared for a few (up to 4) lines of fuzz due to additional
BUG_ONs in both versions of the file.

regards Christian Ehrhardt

[1] http://www.atnf.csiro.au/people/rgooch/benchmarks/linux-scheduler.html

--- /usr/src/linux-2.4.20-pre1-ac3/kernel/sched.c Thu Aug 15 20:03:01 2002
+++ sched.c Fri Aug 16 16:15:57 2002
@@ -769,7 +772,7 @@
set_tsk_need_resched(p);

/* put it at the end of the queue: */
- dequeue_task(p, rq->active);
+ dequeue_task(p, p->array);
enqueue_task(p, rq->active);
}
goto out;
@@ -785,7 +788,7 @@
if (p->sleep_avg)
p->sleep_avg--;
if (!--p->time_slice) {
- dequeue_task(p, rq->active);
+ dequeue_task(p, p->array);
set_tsk_need_resched(p);
p->prio = effective_prio(p);
p->time_slice = TASK_TIMESLICE(p);
@@ -1396,7 +1399,7 @@
*/
if (likely(current->prio == MAX_PRIO-1)) {
if (current->time_slice <= 1) {
- dequeue_task(current, rq->active);
+ dequeue_task(current, array);
enqueue_task(current, rq->expired);
} else
current->time_slice--;
@@ -1411,7 +1414,7 @@
list_add_tail(&current->run_list, array->queue + current->prio);
__set_bit(current->prio, array->bitmap);
}
- spin_unlock(&rq->lock);
+ rq_unlock (rq);

schedule();

--
THAT'S ALL FOLKS!

2002-08-16 16:47:21

by Antti Salmela

[permalink] [raw]

Subject: Re: [OOPS] 2.4.20-pre1-ac3, SMP (Dual PIII)

On Fri, Aug 16, 2002 at 04:17:18PM +0200, Christian Ehrhardt wrote:
> On Thu, Aug 15, 2002 at 02:33:02AM +0100, Alan Cox wrote:
> > Thanks - your analysis is informative to say the least. It looks like
> > the PIV load balancing code is the problem.
>
> The (untested) patch below should correct this problem along with
> a locking oddity (last hunk) that IMHO either needs fixing or a BIG
> comment. Be prepared for a few (up to 4) lines of fuzz due to additional
> BUG_ONs in both versions of the file.

With this patch I could boot 2.4.20-pre2-ac3 and it has now run nearly an
hour without any problems.

> regards Christian Ehrhardt
>
> [1] http://www.atnf.csiro.au/people/rgooch/benchmarks/linux-scheduler.html
>
>
> --- /usr/src/linux-2.4.20-pre1-ac3/kernel/sched.c Thu Aug 15 20:03:01 2002
> +++ sched.c Fri Aug 16 16:15:57 2002
> @@ -769,7 +772,7 @@
> set_tsk_need_resched(p);
>
> /* put it at the end of the queue: */
> - dequeue_task(p, rq->active);
> + dequeue_task(p, p->array);
> enqueue_task(p, rq->active);
> }
> goto out;
> @@ -785,7 +788,7 @@
> if (p->sleep_avg)
> p->sleep_avg--;
> if (!--p->time_slice) {
> - dequeue_task(p, rq->active);
> + dequeue_task(p, p->array);
> set_tsk_need_resched(p);
> p->prio = effective_prio(p);
> p->time_slice = TASK_TIMESLICE(p);
> @@ -1396,7 +1399,7 @@
> */
> if (likely(current->prio == MAX_PRIO-1)) {
> if (current->time_slice <= 1) {
> - dequeue_task(current, rq->active);
> + dequeue_task(current, array);
> enqueue_task(current, rq->expired);
> } else
> current->time_slice--;
> @@ -1411,7 +1414,7 @@
> list_add_tail(&current->run_list, array->queue + current->prio);
> __set_bit(current->prio, array->bitmap);
> }
> - spin_unlock(&rq->lock);
> + rq_unlock (rq);
>
> schedule();
>

--
Antti Salmela