LinuxLists.cc - State of kgdb on x86-64

2008-01-14 19:09:42

Subject: State of kgdb on x86-64

Hi Jason,

what is the state of the kgdb git repos? What should work, what might be
broken?

I'm asking as today I tried to get kgdb up and running on a 4-way x86-64
Xeon box with both 2.6.24-rc6 and -rc7. Once kgdb is enabled in .config,
the boot stops early with this panic:

Kernel panic - not syncing: Attempted to kill init!

Might I have more success with 2.6.23? Was x86-64 ever tested and found
working? I can dig deeper into the above issue, but before starting
blindly, I would like to assess whether there could be more issues ahead
on this arch.

Thanks,
Jan

--
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

2008-01-14 19:26:41

by Jason Wessel

Subject: Re: State of kgdb on x86-64

It was working at the point that I tested it with the 2.6.24-rc5 on
x86_64. However I suspect my kernel config may differ drastically from
what you are using.

Without any other context provided than the generic message, it is hard
to know what might have happened.

Jason.

Jan Kiszka wrote:
> Hi Jason,
>
> what is the state of the kgdb git repos? What should work, what might be
> broken?
>
> I'm asking as today I tried to get kgdb up and running on a 4-way x86-64
> Xeon box with both 2.6.24-rc6 and -rc7. Once kgdb is enabled in .config,
> the boot stops early with this panic:
>
> Kernel panic - not syncing: Attempted to kill init!
>
> Might I have more success with 2.6.23? Was x86-64 ever tested and found
> working? I can dig deeper into the above issue, but before starting
> blindly, I would like to assess whether there could be more issues ahead
> on this arch.
>
> Thanks,
> Jan
>
>

2008-01-14 20:00:19

by Jan Kiszka

Subject: Re: State of kgdb on x86-64

Jason Wessel wrote:
> It was working at the point that I tested it with the 2.6.24-rc5 on
> x86_64. However I suspect my kernel config may differ drastically from
> what you are using.

Yeah, that might be the case. The only thing I tried to vary so far was
applying maxcpus=1, but without success.

>
> Without any other context provided than the generic message, it is hard
> to know what might have happened.

OK, I will send my .config over tomorrow or on Wednesday; it's out of
reach ATM. And if you have a reference .config, I would appreciate it if
you could send it over.

Thanks,
Jan

2008-01-15 18:44:41

by Jan Kiszka

Subject: Re: State of kgdb on x86-64

Jason Wessel wrote:
> Jan Kiszka wrote:
>> Jason Wessel wrote:
>>
>>> It was working at the point that I tested it with the 2.6.24-rc5 on
>>> x86_64. However I suspect my kernel config may differ drastically from
>>> what you are using.
>>>
>>> Without any other context provided than the generic message, it is hard
>>> to know what might have happened.
>>>
>> Here is the promised .config. I could also dig out the backtrace of the
>> panic as kgdb sees it if that helps, just let me know.
>>
>> Jan
>>
>>
> The backtrace might be very telling as to what happened. More
> information is always better than less :-)
>

My primary test box is again out of reach, but meanwhile I was able to
reproduce some kind of problem under QEMU - that one at least is
triggered by SMP. With only one CPU -> all apparently fine. Once booting
QEMU with "-smp 2" -> this happens:

(gdb) tar remote /dev/pts/6
Remote debugging using /dev/pts/6
Not all CPUs have been synced for KGDB
breakpoint () at kernel/kgdb.c:1895
1895 wmb(); /* Sync point after breakpoint */
(gdb) c
Continuing.
Not all CPUs have been synced for KGDB
[New Thread 32769]

Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 32769]
0xffffffff8020adb7 in default_idle () at include/asm/irqflags_64.h:140
140 __asm__ __volatile__("sti; hlt" : : : "memory");
(gdb) bt
#0 0xffffffff8020adb7 in default_idle () at include/asm/irqflags_64.h:140
#1 0xffffffff8020ae65 in cpu_idle () at arch/x86/kernel/process_64.c:225
#2 0xffffffff8021ccb9 in start_secondary () at arch/x86/kernel/smpboot_64.c:375
#3 0x0000000000000000 in ?? ()
(gdb)

The problem seems to be related to continuing execution on SMP boxes. I'm
able to boot my box if I leave kgdb unattached. But when I later attach
and continue execution, I get the same crash. Any idea what goes wrong,
or any suggestion where to start digging? Maybe at "Not all CPUs have been
synced"?

Jan

2008-01-16 04:10:26

by Jason Wessel

Subject: Re: State of kgdb on x86-64

Jan Kiszka wrote:
> My primary test box is again out of reach, but meanwhile I was able to
> reproduce some kind of problem under QEMU - that one at least is
> triggered by SMP. With only one CPU -> all apparently fine. Once booting
> QEMU with "-smp 2" -> this happens:
>
> (gdb) tar remote /dev/pts/6
> Remote debugging using /dev/pts/6
> Not all CPUs have been synced for KGDB
> breakpoint () at kernel/kgdb.c:1895
> 1895 wmb(); /* Sync point after breakpoint */
> (gdb) c
> Continuing.
> Not all CPUs have been synced for KGDB
> [New Thread 32769]
>
> Program received signal SIGFPE, Arithmetic exception.
> [Switching to Thread 32769]
> 0xffffffff8020adb7 in default_idle () at include/asm/irqflags_64.h:140
> 140 __asm__ __volatile__("sti; hlt" : : : "memory");
> (gdb) bt
> #0 0xffffffff8020adb7 in default_idle () at include/asm/irqflags_64.h:140
> #1 0xffffffff8020ae65 in cpu_idle () at arch/x86/kernel/process_64.c:225
> #2 0xffffffff8021ccb9 in start_secondary () at arch/x86/kernel/smpboot_64.c:375
> #3 0x0000000000000000 in ?? ()
> (gdb)
>
> The problem seems to be related to continuing execution on SMP boxes. I'm
> able to boot my box if I leave kgdb unattached. But when I later attach
> and continue execution, I get the same crash. Any idea what goes wrong,
> or any suggestion where to start digging? Maybe at "Not all CPUs have been
> synced"?
>

Generally speaking, when you get an error that the CPUs have not been
synced, it means that the IPI which was sent to all the non-master
processors failed. I took a quick look, and it appears that the DIE_TRAP
is occurring after kgdb sends the IPI to the non-master cores with the call:

send_IPI_allbutself(APIC_DM_NMI);
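
To illustrate why a lost IPI shows up as "Not all CPUs have been synced", here is a minimal user-space sketch of a rendezvous in which a master waits for every other CPU to check in. It is not the kgdb implementation: `struct cpu_state`, `secondary_poll()` and `wait_for_secondaries()` are hypothetical names, and IPI delivery is modelled as a plain boolean flag.

```c
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/* Hypothetical per-CPU state for this sketch; not a kgdb data structure. */
struct cpu_state {
    bool nmi_delivered;  /* did the simulated IPI reach this CPU? */
    bool checked_in;     /* has this CPU parked itself for the debugger? */
};

/* Secondary side: a CPU can only check in if the IPI actually arrived. */
static void secondary_poll(struct cpu_state *cpu)
{
    if (cpu->nmi_delivered)
        cpu->checked_in = true;
}

/*
 * Master side: poll for a bounded number of rounds until every other CPU
 * has checked in. Returns the number of CPUs still missing.
 */
static int wait_for_secondaries(struct cpu_state cpus[], int ncpus,
                                int master, int max_rounds)
{
    int missing = ncpus - 1;

    for (int round = 0; round < max_rounds && missing > 0; round++) {
        missing = 0;
        for (int i = 0; i < ncpus; i++) {
            if (i == master)
                continue;
            secondary_poll(&cpus[i]);
            if (!cpus[i].checked_in)
                missing++;
        }
    }
    if (missing > 0)
        printf("Not all CPUs have been synced (%d missing)\n", missing);
    return missing;
}
```

In this model, a CPU whose `nmi_delivered` flag stays false never checks in, so the master gives up after `max_rounds` and reports the failure.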

In prior kernels, that ultimately resulted in an NMI trap. I am not sure
what causes the DIE_TRAP as a result of the IPI. For now, if you
add the statement "case DIE_TRAP:" right before "case
DIE_NMIWATCHDOG:" in arch/x86/kernel/kgdb_64.c, it will sync the
processors; however, the kernel should not be trapping with this error
code from the IPI event. I suspect there has been some kind of change
to the way the IPI/NMI handling is being done in the latest kernels.
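
The effect of that extra case label can be sketched as a fall-through in a switch statement. This is a stand-alone illustration, not the code in arch/x86/kernel/kgdb_64.c: the enum values are arbitrary, `handle_die_event()` and `enum sketch_action` are hypothetical, and it is only an assumption here that the DIE_NMIWATCHDOG case is the one that parks a CPU for the debugger.

```c
#include <stdbool.h>

/* Event codes for this sketch only; the numeric values are arbitrary. */
enum die_event { DIE_OOPS = 1, DIE_TRAP, DIE_NMI, DIE_NMIWATCHDOG };

/* Hypothetical outcome of the handler in this sketch. */
enum sketch_action { EVENT_IGNORED, CPU_PARKED };

/*
 * apply_workaround models whether "case DIE_TRAP:" has been added in
 * front of "case DIE_NMIWATCHDOG:". With it, DIE_TRAP falls through and
 * is treated like the event the debugger already handles.
 */
static enum sketch_action handle_die_event(enum die_event ev,
                                           bool debugger_active,
                                           bool apply_workaround)
{
    switch (ev) {
    case DIE_TRAP:
        if (!apply_workaround)
            return EVENT_IGNORED;
        /* fall through */
    case DIE_NMIWATCHDOG:
        /* Assumed behaviour: park the CPU only while debugging. */
        return debugger_active ? CPU_PARKED : EVENT_IGNORED;
    default:
        return EVENT_IGNORED;
    }
}
```

As noted above, this only papers over the symptom: it lets the secondary CPUs sync, but does not explain why the IPI arrives as DIE_TRAP in the first place.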

Jason.

2008-01-16 15:44:50

by Jan Kiszka

Subject: Re: State of kgdb on x86-64

Jason Wessel wrote:
> Generally speaking, when you get an error that the CPUs have not been
> synced, it means that the IPI which was sent to all the non-master
> processors failed. I took a quick look, and it appears that the DIE_TRAP
> is occurring after kgdb sends the IPI to the non-master cores with the call:
>
> send_IPI_allbutself(APIC_DM_NMI);
>
> In prior kernels, that ultimately resulted in an NMI trap. I am not sure
> what causes the DIE_TRAP as a result of the IPI. For now, if you
> add the statement "case DIE_TRAP:" right before "case
> DIE_NMIWATCHDOG:" in arch/x86/kernel/kgdb_64.c, it will sync the
> processors; however, the kernel should not be trapping with this error
> code from the IPI event. I suspect there has been some kind of change
> to the way the IPI/NMI handling is being done in the latest kernels.

Things I found out so far:

- delivery of this IPI under QEMU somehow doesn't work
(I would dare to say: at emulation level. But I'm not sure yet.)

- the breakdown of my Xeon box with kgdb is a separate issue

- stuffing my Xeon kernel into QEMU triggers the original bug there
too, even under UP => debugging with QEMU should be feasible

But I'm first trying to identify the related config switch that makes it
pop up, because this mostly means compiling and quick testing (I can't
spend my full time on it yet). I will let you know about any news, and I
would appreciate hearing from you if you have any updates.
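
One possible way to narrow down such a config switch is a binary search over the options that differ between a working and a failing .config. The sketch below is not necessarily the procedure used here. It assumes that exactly one differing option triggers the bug and that enabling further options does not hide it. `find_culprit()`, `triggers_bug_fn` and `demo_triggers_bug()` are hypothetical names, and in practice the callback would be a manual build-and-boot step.

```c
#include <stddef.h>

/*
 * Hypothetical callback: returns nonzero if a kernel built with the first
 * `count` differing options enabled shows the bug.
 */
typedef int (*triggers_bug_fn)(size_t count, void *ctx);

/*
 * Binary search for the option that flips the build from good to bad.
 * Requires n >= 1, no bug with 0 options enabled, and the bug present
 * with all n options enabled. Returns the index of the culprit option.
 */
static size_t find_culprit(size_t n, triggers_bug_fn triggers_bug, void *ctx)
{
    size_t lo = 0;  /* known good: first lo options enabled */
    size_t hi = n;  /* known bad: first hi options enabled */

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;

        if (triggers_bug(mid, ctx))
            hi = mid;
        else
            lo = mid;
    }
    return hi - 1;
}

/* Demo stand-in for a real build-and-boot test: ctx points to the culprit index. */
static int demo_triggers_bug(size_t count, void *ctx)
{
    size_t culprit = *(size_t *)ctx;

    return count > culprit;
}
```

With n differing options, this needs roughly log2(n) build-and-boot cycles instead of n.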

Jan
