Hi.
The new code in rc3aa3 makes a dual Xeon box hang on boot just
when starting the migration threads. I get two simultaneous oopses, one
each for migration_thread=1 and =2. Decoded oops for one of them:
*pde = 00000000
CPU: 1
EIP: 0010:[<80119f9d>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010006
eax: 00002700 ebx: 000000ff ecx: 00000004 edx: 00000000
esi: 00000000 edi: 00000004 ebp: bffe7f80 esp: bffe7f40
ds: 0018 es: 0018 ss: 0018
Process migration_CPU1 (pid: 3, stackpage=bffe7000)
Stack: 00000000 00000004 0000000a bffe7fc8 ffffffff 00000001
bffe634b ffffffff 00000001 00000000 0000000a 00000001
802fe240 00000000 bffe6000 802fe240 bffe7fc0 8011a7e7
802fe240 00000001 802fe240 8025c116 bffe633e bffe6000
Call Trace: [<8011a7e7>] [<8025c116>] [<8025c134>] [<8011c009>]
[<80105000>] [<80105000>] [<80107256>] [<8011bec0>]
Code: 8b 42 14 c7 42 14 01 00 00 00 85 c0 0f 85 88 03 00 00 8b 52
>>EIP; 80119f9c <load_balance+ec/490> <=====
Trace; 8011a7e6 <schedule+126/3a0>
Trace; 8025c116 <vsprintf+16/20>
Trace; 8025c134 <sprintf+14/20>
Trace; 8011c008 <migration_thread+148/320>
Trace; 80105000 <_stext+0/0>
Trace; 80105000 <_stext+0/0>
Trace; 80107256 <kernel_thread+26/30>
Trace; 8011bec0 <migration_thread+0/320>
Code; 80119f9c <load_balance+ec/490>
00000000 <_EIP>:
Code; 80119f9c <load_balance+ec/490> <=====
0: 8b 42 14 mov 0x14(%edx),%eax <=====
Code; 80119f9e <load_balance+ee/490>
3: c7 42 14 01 00 00 00 movl $0x1,0x14(%edx)
Code; 80119fa6 <load_balance+f6/490>
a: 85 c0 test %eax,%eax
Code; 80119fa8 <load_balance+f8/490>
c: 0f 85 88 03 00 00 jne 39a <_EIP+0x39a> 8011a336 <load_balance+486/490>
Code; 80119fae <load_balance+fe/490>
12: 8b 52 00 mov 0x0(%edx),%edx
On Mon, Jul 29, 2002 at 07:42:38PM +0200, J.A. Magallon wrote:
> Hi.
>
> The new code in rc3aa3 makes a dual Xeon box hang on boot just
> when starting the migration threads. I get two simultaneous oopses, one
> each for migration_thread=1 and =2. Decoded oops for one of them:
Can you find out the exact line of C code that oopses (i.e. what edx is
supposed to be)? If you can't find it, please send me the disassembly
of the function load_balance, thanks.
Also please try to reproduce with Ingo's latest, I merged a few fixes
for the migration thread startup from his latest update.
Andrea
On 20020729 Andrea Arcangeli wrote:
> On Mon, Jul 29, 2002 at 07:42:38PM +0200, J.A. Magallon wrote:
> > Hi.
> >
> > The new code in rc3aa3 makes a dual Xeon box hang on boot just
> > when starting the migration threads. I get two simultaneous oopses, one
> > each for migration_thread=1 and =2. Decoded oops for one of them:
>
> Can you find out the exact line of C code that oopses (i.e. what edx is
> supposed to be)? If you can't find it, please send me the disassembly
> of the function load_balance, thanks.
>
Assembler listing for load_balance attached, got by objdump -d in
/usr/src/linux/vmlinux (correct procedure ?).
> Also please try to reproduce with Ingo's latest, I merged a few fixes
> for the migration thread startup from his latest update.
>
Does this mean I can merge Ingo's updates in -aa ? Don't they use any
infrastructure not present in 2.4 ?
TIA
On Tue, Jul 30, 2002 at 12:35:39AM +0200, J.A. Magallon wrote:
> On 20020729 Andrea Arcangeli wrote:
> > On Mon, Jul 29, 2002 at 07:42:38PM +0200, J.A. Magallon wrote:
> > > Hi.
> > >
> > > The new code in rc3aa3 makes a dual Xeon box hang on boot just
> > > when starting the migration threads. I get two simultaneous oopses, one
> > > each for migration_thread=1 and =2. Decoded oops for one of them:
> >
> > Can you find out the exact line of C code that oopses (i.e. what edx is
> > supposed to be)? If you can't find it, please send me the disassembly
> > of the function load_balance, thanks.
> >
>
> Assembler listing for load_balance attached, got by objdump -d in
> /usr/src/linux/vmlinux (correct procedure ?).
it's not attached but never mind :) and yes it's the correct procedure.
btw, is it a hyperthreading CPU? Did you have any problems with aa2?
>
> > Also please try to reproduce with Ingo's latest, I merged a few fixes
> > for the migration thread startup from his latest update.
> >
>
> Does this mean I can merge Ingo's updates in -aa ? Don't they use any
> infrastructure not present in 2.4 ?
I just merged all of Ingo's updates except the new features, SCHED_BATCH
and SCHED_IDLE (they're not even in 2.5 yet); I don't feel they're needed
in 2.4. The O(1) scheduler is finally stable after the last fixes, which
apparently improved tbench by another 10% and should avoid the sluggish
behaviour under high load on SMP. sched_yield also no longer hangs, now
that it refiles the task to the expired queue.
Andrea
On 20020730 J.A. Magallon wrote:
>
> Assembler listing for load_balance attached, got by objdump -d in
> /usr/src/linux/vmlinux (correct procedure ?).
>
Ahem...
Here it goes, for real this time...
On 20020730 Andrea Arcangeli wrote:
> On Tue, Jul 30, 2002 at 12:35:39AM +0200, J.A. Magallon wrote:
>
> it's not attached but never mind :) and yes it's the correct procedure.
>
Realized...
> btw, is it a hyperthreading CPU? Did you have any problems with aa2?
>
Full story: -rc3-jam1 (==aa1) works ok. Also -rc3-jam2 works.
But -rc3-jam3 bombed on the dual-p4xeon box, but works on a PIII laptop.
So I tried plain -rc3-aa3 on the big box. It crashes with that
pair of oopses.
I can't test on my home dual-PII (SMP but not Xeon),
because its power supply fried recently...
The laptop kernel is built with gcc-3.2 (latest Mandrake cooker), and the
dual-xeon one is built with gcc-2.96 (plain Mandrake 8.2).
On Tue, Jul 30, 2002 at 01:12:03AM +0200, J.A. Magallon wrote:
> But -rc3-jam3 bombed on the dual-p4xeon box, but works on a PIII laptop.
I decoded the oops and in short rq_target->idle is NULL, so
resched_task then oopses while reading p->need_resched.
In fact it's the hyperthreading support that is buggy.
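For reference, the faulting instruction reads 0x14(%edx) with edx = 0, which is what a read of p->need_resched through a NULL p would look like on i386. A minimal sketch of that arithmetic, assuming the leading task_struct fields are laid out as in stock 2.4 (struct task_head32 and both helper names are made up for illustration, this is not kernel code):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the first task_struct fields as they would be
 * laid out on 32-bit i386. Fixed-width types are used so the offsets are
 * the same on any host. The field order is an assumption based on stock 2.4.
 */
struct task_head32 {
	int32_t  state;         /* offset 0x00 */
	uint32_t flags;         /* offset 0x04 */
	int32_t  sigpending;    /* offset 0x08 */
	uint32_t addr_limit;    /* offset 0x0c */
	uint32_t exec_domain;   /* offset 0x10, a 4-byte pointer on i386 */
	int32_t  need_resched;  /* offset 0x14 */
};

/* Offset of need_resched inside the mirrored layout. */
static size_t need_resched_offset(void)
{
	return offsetof(struct task_head32, need_resched);
}

/* Address touched when reading p->need_resched for a given p. */
static uint32_t need_resched_address(uint32_t p)
{
	return p + (uint32_t)need_resched_offset();
}
```

With p = 0 (the edx value in the register dump), need_resched_address() gives 0x14, matching the displacement in the first decoded instruction.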
I had a look and this should fix it (the first hunk fixes a merely
theoretical bug: the code is under an i386 ifdef, where cpu_number_map is
an identity; I think the ++ thing was the real cause). Can you test it?
--- 2.4.19rc3aa3/kernel/sched.c.~1~ Sun Jul 28 18:12:19 2002
+++ 2.4.19rc3aa3/kernel/sched.c Tue Jul 30 01:42:08 2002
@@ -490,7 +490,7 @@ static inline void pull_task(runqueue_t
  */
 static inline int find_idle_package(int this_cpu)
 {
-	int i = this_cpu + 1;
+	int i = cpu_number_map(this_cpu) + 1;
 
 	if (i == smp_num_cpus)
 		i = 0;
@@ -500,7 +500,7 @@ static inline int find_idle_package(int
 		physical = cpu_logical_map(i);
 		sibling = cpu_sibling_map[physical];
 
-		if (i++ == smp_num_cpus)
+		if (++i == smp_num_cpus)
 			i = 0;
 		if (idle_cpu(physical) && idle_cpu(sibling))
 			return physical;
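To illustrate the ++ change, here is a small userspace sketch of the wrap-around. It is not the kernel code: NCPUS stands in for smp_num_cpus, max_index_scanned() is a made-up helper, and the while (i != this_cpu) loop shape is an assumption, since the patch context does not show the loop header.

```c
#define NCPUS 4  /* hypothetical stand-in for smp_num_cpus */

/*
 * Walk the CPU indexes starting after this_cpu, wrapping at NCPUS, and
 * return the largest index that would have been used for a map lookup.
 * pre_increment selects the patched (++i) or the original (i++) test.
 */
static int max_index_scanned(int this_cpu, int pre_increment)
{
	int i = this_cpu + 1;
	int max_seen = -1;

	if (i == NCPUS)
		i = 0;

	while (i != this_cpu) {
		/* This is where cpu_logical_map(i) would be evaluated. */
		if (i > max_seen)
			max_seen = i;

		if (pre_increment) {
			if (++i == NCPUS)  /* compares the NEW value: wraps in time */
				i = 0;
		} else {
			if (i++ == NCPUS)  /* compares the OLD value: wraps one step late */
				i = 0;
		}
	}
	return max_seen;
}
```

With NCPUS = 4 and this_cpu = 0, the post-increment form lets i reach 4 before wrapping, one past the last valid index, while the pre-increment form never goes beyond 3.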
Andrea
On 20020730 Andrea Arcangeli wrote:
> On Tue, Jul 30, 2002 at 01:12:03AM +0200, J.A. Magallon wrote:
> > But -rc3-jam3 bombed on the dual-p4xeon box, but works on a PIII laptop.
>
> I decoded the oops and in short rq_target->idle is NULL, so
> resched_task then oopses while reading p->need_resched.
>
> In fact it's the hyperthreading support that is buggy.
>
> I had a look and this should fix it (the first hunk fixes a merely
> theoretical bug: the code is under an i386 ifdef, where cpu_number_map is
> an identity; I think the ++ thing was the real cause). Can you test it?
>
Thanks, I will try it tomorrow on the real box.
Bye