Message-ID: <478E2662.2070606@siemens.com>
Date: Wed, 16 Jan 2008 16:44:34 +0100
From: Jan Kiszka <jan.kiszka@siemens.com>
User-Agent: Thunderbird 2.0.0.9 (X11/20070801)
MIME-Version: 1.0
To: Jason Wessel <jason.wessel@windriver.com>
CC: Jan Kiszka <jan.kiszka@web.de>,
       Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: State of kgdb on x86-64
References: <478BB35B.9060507@siemens.com> <478BB74E.6020506@windriver.com> <478C786A.3090709@siemens.com> <478CB724.3000900@windriver.com> <478CFF08.1090608@web.de> <478D839A.4010201@windriver.com>
In-Reply-To: <478D839A.4010201@windriver.com>
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4059
Lines: 101

Jason Wessel wrote:
> Jan Kiszka wrote:
>> Jason Wessel wrote:
>>   
>>> Jan Kiszka wrote:
>>>     
>>>> Jason Wessel wrote:
>>>>   
>>>>       
>>>>> It was working at the point that I tested it with the 2.6.24-rc5 on
>>>>> x86_64.  However I suspect my kernel config may differ drastically from
>>>>> what you are using.
>>>>>
>>>>> Without any other context provided than the generic message, it is hard
>>>>> to know what might have happened. 
>>>>>     
>>>>>         
>>>> Here is the promised .config. I could also dig out the backtrace of the
>>>> panic as kgdb sees it if that helps, just let me know.
>>>>
>>>> Jan
>>>>
>>>>   
>>>>       
>>> The backtrace might be very telling as to what happened.  More
>>> information is always better than less :-)
>>>
>>>     
>> My primary test box is again out of reach, but meanwhile I was able to
>> reproduce some kind of problem under QEMU - that one at least is
>> triggered by SMP. With only one CPU -> all apparently fine. Once booting
>> QEMU with "-smp 2" -> this happens:
>>
>> (gdb) tar remote /dev/pts/6
>> Remote debugging using /dev/pts/6
>> Not all CPUs have been synced for KGDB
>> breakpoint () at kernel/kgdb.c:1895
>> 1895            wmb(); /* Sync point after breakpoint */
>> (gdb) c
>> Continuing.
>> Not all CPUs have been synced for KGDB
>> [New Thread 32769]
>>
>> Program received signal SIGFPE, Arithmetic exception.
>> [Switching to Thread 32769]
>> 0xffffffff8020adb7 in default_idle () at include/asm/irqflags_64.h:140
>> 140             __asm__ __volatile__("sti; hlt" : : : "memory");
>> (gdb) bt
>> #0  0xffffffff8020adb7 in default_idle () at include/asm/irqflags_64.h:140
>> #1  0xffffffff8020ae65 in cpu_idle () at arch/x86/kernel/process_64.c:225
>> #2  0xffffffff8021ccb9 in start_secondary () at arch/x86/kernel/smpboot_64.c:375
>> #3  0x0000000000000000 in ?? ()
>> (gdb)                                                                                     
>>
>> The problem seems to be related to continuing SMP boxes. I'm able to
>> boot my box up if I leave kgdb unattached. But when I then later attach
>> and continue execution, I get the same crash. Any ideas what goes wrong,
>> any suggestion where to start digging? Maybe at "Not all CPUs have been
>> synced"?
>>   
> 
> Generally speaking when you get an error that the CPUs have not been
> synced, it means that the IPI which was sent to all the non-master
> processors failed.  I took a quick look and it appears that the DIE_TRAP
> is occurring after kgdb sends the IPI to the non-master cores with the call:
> 
>     send_IPI_allbutself(APIC_DM_NMI);
> 
> In prior kernels that ultimately resulted in an NMI trap.  I am not sure
> of the cause of the DIE_TRAP as a result of the IPI.  For now, if you
> add the statement "case DIE_TRAP:" right before "    case
> DIE_NMIWATCHDOG:" in arch/x86/kernel/kgdb_64.c it will sync the
> processors, however the kernel should not be trapping for this error
> code from the IPI event.  I suspect there has been some kind of change
> to the way the IPI/NMI handling is being done in the latest kernels.
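
For reference, this is how I read the suggested workaround: a stacked
case label, so that DIE_TRAP falls through into the existing
DIE_NMIWATCHDOG branch. What follows is only a user-space sketch of that
fall-through. The SK_* enum values, the function name and the
with_workaround flag are stand-ins made up for illustration; this is not
the code in arch/x86/kernel/kgdb_64.c.

```c
/*
 * Stand-in event codes (hypothetical). The real DIE_* values are
 * defined by the kernel, not here.
 */
enum sk_die_val {
    SK_DIE_OOPS = 1,
    SK_DIE_TRAP,
    SK_DIE_NMIWATCHDOG
};

/*
 * Returns 1 if the event would be handled as the CPU sync request,
 * 0 otherwise. with_workaround models whether the extra
 * "case DIE_TRAP:" label has been added in front of the watchdog case.
 */
int is_sync_event(int cmd, int with_workaround)
{
    switch (cmd) {
    case SK_DIE_TRAP:
        if (!with_workaround)
            return 0;   /* unpatched: DIE_TRAP is not a sync event */
        /* fall through: patched code shares the watchdog branch */
    case SK_DIE_NMIWATCHDOG:
        return 1;
    default:
        return 0;
    }
}
```

With the label in place, is_sync_event(SK_DIE_TRAP, 1) takes the same
path as the watchdog case, which is why the processors would sync even
though the trap itself remains unexplained.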

Things I found out so far:

 - delivery of this IPI under QEMU somehow doesn't work
   (I would dare to say: at emulation level. But I'm not sure yet.)

 - the breakdown of my Xeon box with kgdb is a separate issue

 - stuffing my Xeon kernel into QEMU triggers the original bug there
   too, even under UP => debugging with QEMU should be feasible

But I'm first trying to identify the config switch that makes it pop
up, because that mostly means compiling and quick testing (I can't
spend my full time on it yet). I will let you know about any news, and
I would appreciate hearing from you if you have any updates.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux