2004-04-27 10:17:30

by Juergen Stohr

Subject: [BUG]linux-2.4.26 Quad-Opteron: panic when init scsi

Hi,

I have a problem similar to the one discussed in the thread
"[BUG]linux-2.4.24 with k8 numa support panic when init scsi": when
booting 2.4.26 on a quad-Opteron box, the kernel crashes in most cases
while initializing SCSI.
This looks like a race to me, since the machine does manage to boot
in a few cases.

The machine always boots if I set maxcpus=1.

If I append numa=off to the command line the kernel crashes with a
"Machine Check Exception" (but is able to initialize SCSI; perhaps
this is another bug?)

Does anybody know how to solve this problem or is anybody working on it?
Can someone give me a hint where to start when debugging this race?

I will attach the config of my kernel, the syslogs and the output of
ksymoops.

Regards,
Jürgen


Attachments:
(No filename) (784.00 B)
config-2.4.26.gz (981.00 B)
syslog.txt.gz (3.84 kB)
ksymoops.txt.gz (992.00 B)

2004-04-28 23:12:22

by Marcelo Tosatti

Subject: Re: [BUG]linux-2.4.26 Quad-Opteron: panic when init scsi

On Tue, Apr 27, 2004 at 12:17:20PM +0200, Juergen Stohr wrote:
> Hi,
>
> I have a problem similar to the one discussed in the thread
> "[BUG]linux-2.4.24 with k8 numa support panic when init scsi": when
> booting 2.4.26 on a quad-Opteron box, the kernel crashes in most cases
> while initializing SCSI.
> This looks like a race to me, since the machine does manage to boot
> in a few cases.
>
> The machine always boots if I set maxcpus=1.
>
> If I append numa=off to the command line the kernel crashes with a
> "Machine Check Exception" (but is able to initialize SCSI; perhaps
> this is another bug?)
>
> Does anybody know how to solve this problem or is anybody working on it?
> Can someone give me a hint where to start when debugging this race?
>
> I will attach the config of my kernel, the syslogs and the output of
> ksymoops.

Andi,

Have you seen Juergen's ksymoops trace?

It seems some BUG() (maybe BAD_RANGE()) is triggering at startup in
__alloc_pages().

2004-04-29 09:24:21

by Alexander v. Buelow

Subject: Re: [BUG]linux-2.4.26 Quad-Opteron: panic when init scsi

Hi,

> It looks like you compiled this on a SuSE 9.0/64bit system, right?
> I presume this means the SuSE 2.4.21 smp kernel worked on the same
> box, right?

Yes, that's right.

> Can you perhaps try to narrow down where it broke between (mainline)
> 2.4.21 and 2.4.26 ?

We tried:
  2.4.21 -> ok
  2.4.22 -> ok
  2.4.23 -> not ok !!

Then we tried to track down the error, and I made the following change
in mm/numa.c:

--- linux-2.4.26/mm/numa.c 2001-09-18 01:15:02.000000000 +0200
+++ linux-2.4.26-recoms/mm/numa.c 2004-04-27 18:25:28.000000000 +0200
@@ -105,6 +105,11 @@
         return NULL;
 #ifdef CONFIG_NUMA
     temp = NODE_DATA(numa_node_id());
+    if((gfp_mask & GFP_DMA) == GFP_DMA)
+    {
+        printk(KERN_WARNING "RECOMS: Umleitung DMA auf CPU 0\n");
+        temp = NODE_DATA(0);
+    }
 #else
     spin_lock_irqsave(&node_lock, flags);
     if (!next) next = pgdat_list;

In mm/page_alloc.c I added the two lines marked with "+" to
void __init free_area_init_core(..):

*gmap = pgdat->node_mem_map = lmem_map;
pgdat->node_size = totalpages;
pgdat->node_start_paddr = zone_start_paddr;
pgdat->node_start_mapnr = (lmem_map - mem_map);
pgdat->nr_zones = 0;
+// Alex:
+pgdat->node_id = nid;

offset = lmem_map - mem_map;
for (j = 0; j < MAX_NR_ZONES; j++) {

This seemed to work, the scsi error didn't occur any more.
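
For readers who want to see the effect of the mm/numa.c change in
isolation, here is a minimal user-space sketch of the node selection it
introduces. The printk text "Umleitung DMA auf CPU 0" is German for
"redirecting DMA to CPU 0". Everything prefixed with toy_ or TOY_ is a
hypothetical stand-in invented for this illustration (including the flag
value and the node count); it is not the kernel's code, and it models
only the branch taken under CONFIG_NUMA.

```c
#include <assert.h>

/* Assumed stand-ins for illustration only, not kernel definitions. */
#define TOY_GFP_DMA   0x01u   /* hypothetical "needs DMA memory" bit */
#define TOY_MAX_NODES 4       /* a quad-socket box: one node per CPU */

struct toy_pgdat {
    int node_id;              /* which NUMA node this descriptor is for */
};

static struct toy_pgdat toy_nodes[TOY_MAX_NODES] = {
    { 0 }, { 1 }, { 2 }, { 3 }
};

/*
 * Model of the patched selection: start from the node of the CPU that
 * is allocating, but redirect requests carrying the DMA bit to node 0.
 */
struct toy_pgdat *toy_pick_node(unsigned int gfp_mask, int current_node)
{
    struct toy_pgdat *temp;

    assert(current_node >= 0 && current_node < TOY_MAX_NODES);

    temp = &toy_nodes[current_node];
    if ((gfp_mask & TOY_GFP_DMA) == TOY_GFP_DMA)
        temp = &toy_nodes[0];   /* DMA request: always use node 0 */
    return temp;
}
```

In this model a call such as toy_pick_node(TOY_GFP_DMA, 3) yields the
descriptor for node 0, while toy_pick_node(0, 3) stays on node 3. Note
that the sketch says nothing about whether the redirect is the right
fix; it only restates what the patch does.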

But then we ran into various other problems: sometimes we got MCEs (GART
TLB) and other kernel errors such as page faults, NULL pointer
dereferences and general protection faults. These errors cannot be
reproduced reliably, but they occur frequently.

I hope you will find a solution!

Regards,

Jürgen and Alexander

--
-----------------------------------------------------------------------
Dipl.-Ing. Alexander von Buelow http://www.rcs.ei.tum.de
Institute for Real-Time Computersystems (RCS) fon +49/89-289-23556
Technische Universitaet Muenchen, D-80290 Muenchen fax +49/89-289-23555