Hi,
I have got a problem similar to the one discussed in the thread
"[BUG]linux-2.4.24 with k8 numa support panic when init scsi": When
booting 2.4.26 on a quad Opteron box, the kernel crashes in most
cases while initializing SCSI.
It seems to me that this bug is caused by a race, as in a few
cases the machine is able to boot.
The machine always boots if I set maxcpus=1.
If I append numa=off to the command line the kernel crashes with a
"Machine Check Exception" (but is able to initialize SCSI; perhaps
this is another bug?)
Does anybody know how to solve this problem or is anybody working on it?
Can someone give me a hint where to start when debugging this race?
I will attach the config of my kernel, the syslogs and the output of
ksymoops.
Regards,
Jürgen
On Tue, Apr 27, 2004 at 12:17:20PM +0200, Juergen Stohr wrote:
> Hi,
>
> I have got a problem similar to the one discussed in the thread
> "[BUG]linux-2.4.24 with k8 numa support panic when init scsi": When
> booting the 2.4.26 on a quad Opteron box, in most of the cases the
> kernel crashes when initializing SCSI.
> It seems to me that this bug is caused by a race, as in a few
> cases the machine is able to boot.
>
> The machine always boots if I set maxcpus=1.
>
> If I append numa=off to the command line the kernel crashes with a
> "Machine Check Exception" (but is able to initialize SCSI; perhaps
> this is another bug?)
>
> Does anybody know how to solve this problem or is anybody working on it?
> Can someone give me a hint where to start when debugging this race?
>
> I will attach the config of my kernel, the syslogs and the output of
> ksymoops.
Andi,
Have you seen Juergen's ksymoops trace?
It seems some BUG() (maybe BAD_RANGE()) is triggering at startup in
__alloc_pages().
Hi,
> It looks like you compiled this on a SuSE 9.0/64bit system, right?
> I presume this means the SuSE 2.4.21 smp kernel worked on the same
> box, right?
Yes, that's right.
> Can you perhaps try to narrow down where it broke between (mainline)
> 2.4.21 and 2.4.26 ?
We tried:
  2.4.21 -> ok
  2.4.22 -> ok
  2.4.23 -> not ok !!
Then we tried to find the error and I changed in mm/numa.c:
--- linux-2.4.26/mm/numa.c 2001-09-18 01:15:02.000000000 +0200
+++ linux-2.4.26-recoms/mm/numa.c 2004-04-27 18:25:28.000000000 +0200
@@ -105,6 +105,11 @@
                 return NULL;
 #ifdef CONFIG_NUMA
         temp = NODE_DATA(numa_node_id());
+        if((gfp_mask & GFP_DMA) == GFP_DMA)
+        {
+                printk(KERN_WARNING "RECOMS: redirecting DMA to CPU 0\n");
+                temp = NODE_DATA(0);
+        }
 #else
         spin_lock_irqsave(&node_lock, flags);
         if (!next) next = pgdat_list;
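For anyone following the thread, the effect of the numa.c change can be sketched in plain userspace C. This is only an illustration, not kernel code: `struct sketch_node`, `SKETCH_GFP_DMA` and `sketch_pick_node()` are hypothetical names made up for this example, and the flag value is an assumption. It shows only the selection rule: requests carrying the DMA flag are served from node 0, everything else from the local node.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's DMA allocation flag (value assumed). */
#define SKETCH_GFP_DMA 0x01u

/* Hypothetical stand-in for a per-node descriptor. */
struct sketch_node {
    int node_id;
};

/*
 * Choose the node to allocate from: node 0 when the request carries
 * the DMA flag, otherwise the caller's local node.
 */
const struct sketch_node *
sketch_pick_node(const struct sketch_node *nodes, int nr_nodes,
                 int local_node, unsigned int gfp_mask)
{
    assert(nr_nodes > 0);
    assert(local_node >= 0 && local_node < nr_nodes);

    if ((gfp_mask & SKETCH_GFP_DMA) == SKETCH_GFP_DMA)
        return &nodes[0];        /* redirect DMA requests to node 0 */
    return &nodes[local_node];   /* everything else stays local */
}
```

With four nodes and a caller on node 2, a request with `SKETCH_GFP_DMA` set would be served from node 0, while a request without it would stay on node 2.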
And in mm/page_alloc.c, in free_area_init_core(), I added the
assignment of pgdat->node_id:
         *gmap = pgdat->node_mem_map = lmem_map;
         pgdat->node_size = totalpages;
         pgdat->node_start_paddr = zone_start_paddr;
         pgdat->node_start_mapnr = (lmem_map - mem_map);
         pgdat->nr_zones = 0;
+// Alex:
+        pgdat->node_id = nid;
         offset = lmem_map - mem_map;
         for (j = 0; j < MAX_NR_ZONES; j++) {
This seemed to work: the SCSI error no longer occurred.
But then we ran into various other problems. Sometimes we got MCEs (GART
TLB), and we also saw assorted kernel errors such as page faults, NULL
pointer dereferences and general protection faults. We cannot trigger
these errors on demand, but they occur frequently.
I hope you will find a solution!
Regards,
Jürgen and Alexander
--
-----------------------------------------------------------------------
Dipl.-Ing. Alexander von Buelow http://www.rcs.ei.tum.de
Institute for Real-Time Computersystems (RCS) fon +49/89-289-23556
Technische Universitaet Muenchen, D-80290 Muenchen fax +49/89-289-23555