2003-07-18 19:51:51

by linas

Subject: KDB in the mainstream 2.4.x kernels?



Hi,

Will there be a day that I can expect to find KDB in the 2.4.x kernel?
I know that a traditional answer has been 'never', but I would like the
various influencers and decision makers to reconsider ...

I agree with Linus Torvalds that debuggers are 100% useless when you
are working on code that you know intimately. I know, I've written a
lot of code, I'm proud of it, and I sneer at people who use words like
'development environment'. Crap, if you can't figure out why your
code crashed, you shouldn't be a programmer. But these days, I am not
debugging my code. I'm debugging code that I've never seen before.
And for that, I use KDB.

Right now, I work in a job where the *only* thing that I do is to analyze
and sometimes (when I'm lucky) fix kernel crashes. It's all I do.
I don't write any new code, don't do any porting at all. I also don't
debug any 2.5/2.6 'unstable' kernels, nor do I handle any new/unstable
device drivers. I focus entirely on the 2.4.x kernels, and, with a
small team here, there are more than enough kernel bugs to keep us all
completely busy. The crashes are generated by a test team of 8 people
with dozens of machines. Ostensibly their mission is to test new
hardware, but in fact, almost all the crashes that they find are kernel
bugs. The *only* thing that the test team does is to run stress tests.
Basic stuff. Kernel stress. File create/delete/copy. Reiser, jfs, ext3,
swap, OOM, scsi. Network, nfs, samba. Some tests take hours to crash
the kernel, some take days. But the kernel crashes. It's always crashing.
Corruption, races, missing locks, typos, bad hardware, you name it.
When I get it, it has a KDB prompt in front of it. KDB is great.
I can figure out where it crashed, I can look at the assembly, I can
examine memory locations. I can chase pointers by hand. And I can
do it all symbolically, with the symbol names in front of me. Now,
KDB rarely points right at the bug, but it is invaluable for figuring
out where to start looking. Sometimes I even find the bug, often
I don't. But anyway, this is all academic, because it's at work, in
a controlled environment, where I have the time and resources I need.

But the real reason I write this note is that I want to have the same
capability at home. It suddenly occurred to me that the servers I run
at home sometimes (rarely) crash with the same symptoms as those at work.
Sure, I can probably blame buggy PC hardware. But .. I dunno. I've been
consistently ignoring these crashes 'cause it's just too much of a hassle
to try to debug them. It's not worth the effort. But hey ... if I had
KDB at home... maybe it would be worth looking into the hangs. I could
see getting motivated to look into some of these. At least get some
idea of where the machine got hung. Maybe no fix, but at least
somewhere to lay the blame.

Yes, of course I could just apply the KDB patches myself, but frankly
it's a hassle. I already play the patch game and I hate it. Every new
kernel, I have to try to remember where to find patch x, how to apply
it, fix up this and that... it's just plain painful.

I know that this is not a forceful argument. But crashes are a fact of
life, whatever the reason may be. And the crashes almost always happen
in a piece of code I have *never* laid eyes on before, so it's unrealistic
to try to puzzle it out with the small dollop of info from magic SysRq.
Debuggers can be useless, or worse than useless, when you are a developer
on a piece of code you know well. But when plunging into foreign territory,
all the tools and firepower that you can muster are worth every bit.
This is why KDB belongs in the mainstream kernel distros.

--linas



2003-07-18 20:31:23

by Andi Kleen

Subject: Re: KDB in the mainstream 2.4.x kernels?


One argument I have against it: KDB is incredibly ugly code.
Before it could be even considered for merging it would need quite a lot
of cleanup.

I actually started on porting the KDB backtracer recently to get
reliable frame pointer based backtraces, but it turns out the code
for that is so complicated and ugly that the chances of ever merging
it would be very slim.

As for your home crash debugging, I suspect you would be better off with LKCD
or Mcore (crash dump: saving a crash image on some partition, then examining
the crash dump after reboot).

KDB is usually not useful for debugging hangs on desktop boxes (and even
many servers) because you usually have X running. When the machine crashes and
goes into KDB you cannot see the text output or debug anything. I learned
to type "go<return>" blind when I still had a KDB-aware kernel, but
it's not very useful overall.

On a development machine you can avoid that by connecting a serial cable,
but that's usually not easily possible for a desktop box. Doing a post-mortem
after reboot is more practical. That is what LKCD/mcore do.

Disadvantage is that the current crash dump mechanisms (lkcd, mcore
crash, netdump) are all still not very reliable and have horrible
error handling. And you can eat a lot of disk space for the dumps and
they tend to overflow your file systems. But still it's the only
realistic option for "desktop bugs".

BTW debugging on the X server works on linuxppc/mac with xmon because it
has a fbcon based console and X server. The debugger can just
work on the X background while the X server is stopped. Very nifty.
But I don't see the x86 XFree86 switching to a similar fbcon model any
time soon, so it's unlikely to help.

-Andi

2003-07-19 00:17:03

by linas

Subject: Re: KDB in the mainstream 2.4.x kernels?

Hi Andi,

I'm happy to get a response...

On Fri, Jul 18, 2003 at 10:43:57PM +0200, Andi Kleen wrote:
>
> One argument i have against it: KDB is incredibly ugly code.
> Before it could be even considered for merging it would need quite a lot
> of cleanup.

What in particular? I just looked at kdb/kdbmain.c and kdb/kdb_bt.c
and it looks fine to me; fairly minimal even. I don't know about
arch-specific code. Is there a particular file you're complaining about?

(very very off-topic: I love reading about the neat things that
reiser v4 will do, but wow, every time I read reiserfs code, 'ugly'
is indeed the word that flies to mind).

> I actually started on porting the KDB backtracer recently to get
> reliable frame pointer based backtraces, but it turns out the code
> for that is so complicated and ugly that the chances of ever merging
> it would be very slim.
?

I have not (yet?) studied the code in detail, so point me at something
ugly; I'm not sure what you are talking about. Now, stack traces are
in general ugly because registers and args are splattered all over
the place, and the struct pt_regs are even worse. So there's some
inherent ugliness there ...

Since I live in KDB, I might have some spare time to clean up, fix, etc.,
so now's a good time to talk ..

> As for your home crash debugging I suspect you would be better of with LKCD
> or Mcore (crash dump, saving an crash image on some partition, with examining
> the crash dump after reboot)

I'll look ... given that I own lots of flaky IDE hardware, though, I'm cautious.
I get 'DriveReady SeekComplete Error' messages daily ... I learned the hard
way that these aren't necessarily the fault of the hard drive, and I have
suffered through corrupted fs's as a result ...

Generically, for servers, if you can just save the dump, reboot, and
let the server go on, and analyze the dump at leisure, that is the
preferred way to do things. Especially if you are doing customer support.
(Linux is at the dawn of the era of having customers who have actually
spent in excess of $100K or $1M on a server and who will be going
apoplectic when it crashes. This will put a spotlight on dump tools).

> KDB is usually not useful for debugging hangs on desktop boxes (and even
> many servers) because you have usually X running. When the machine crashes and
> goes in KDB you cannot see the text output and debug anything. I learned

I'm willing to put console on serial port. I've got enough machines
& serial cables, this doesn't bother me.

> Disadvantage is that the current crash dump mechanisms (lkcd, mcore
> crash, netdump) are all still not very reliable and have horrible
> error handling.

This statement makes me nervous. One of the worst feelings one can get
when debugging is not being able to trust the data you are looking at.
It's too easy to lose a lot of time (and credibility) making incorrect
hypotheses based on bad data.

Dedicating a partition that is unformatted, and whose sole purpose
in life is to record a dump -- that is a viable option, at least on
servers, where high uptime is more important, and storage is cheap.

On my home machines, it's sort of the other way around: I don't trust
IDE, I don't have the disk space.

But you convinced me; I need more time on lkcd, etc.

--linas

2003-07-19 00:46:03

by Andi Kleen

Subject: Re: KDB in the mainstream 2.4.x kernels?

On Fri, Jul 18, 2003 at 07:31:08PM -0500, [email protected] wrote:
> > One argument i have against it: KDB is incredibly ugly code.
> > Before it could be even considered for merging it would need quite a lot
> > of cleanup.
>
> What in particular? I just looked at kdb/kdbmain.c and kdb/kdb_bt.c
> and it looks fine to me; fairly minimal even. I don't know about
> arch-specific code. Is there a particular file you're complaining about?

Check the kdbsupport.c code too.

All the code together for the i386 backtracer is approaching 1000 LOC and
it's quite ugly.

> Dedicating a partition that is unformated, and whose sole purpose
> in life is to record a dump -- that is a viable option, at least on
> servers, where high uptime is more important, and storage is cheap.

Typically you don't need a dedicated partition, you can dump on swap.
netdump can also dump over the network. This may be the safer choice
when you don't trust your block subsystem after crashes.

-Andi

2003-07-20 12:40:46

by Keith Owens

Subject: Re: KDB in the mainstream 2.4.x kernels?

On Fri, 18 Jul 2003 22:43:57 +0200,
>I actually started on porting the KDB backtracer recently to get
>reliable frame pointer based backtraces, but it turns out the code
>for that is so complicated and ugly that the chances of ever merging
>it would be very slim.

Mainly because the kernel is full of special cases and i386 provides no
unwind data to help decode those special cases, so all the special case
code ends up in kdba_bt.c. Compare the complexity of i386 kdba_bt.c
with ia64 kdba_bt.c; the latter is significantly simpler because ia64
mandates unwind data. Without unwind data, kdb has to use lots of
awkward heuristics to even guess at an accurate backtrace. Don't blame
kdb for the lack of i386 unwind data.

2003-07-20 13:16:57

by David Miller

Subject: Re: KDB in the mainstream 2.4.x kernels?

On Sun, 20 Jul 2003 22:55:18 +1000
Keith Owens <[email protected]> wrote:

> i386 provides no unwind data

We could tell gcc to emit dwarf2 unwind tables on x86 for debugging
kernel builds.

2003-07-20 22:12:53

by Keith Owens

Subject: Re: KDB in the mainstream 2.4.x kernels?

On Sun, 20 Jul 2003 06:31:37 -0700,
"David S. Miller" <[email protected]> wrote:
>On Sun, 20 Jul 2003 22:55:18 +1000
>Keith Owens <[email protected]> wrote:
>
>> i386 provides no unwind data
>
>We could tell gcc to emit dwarf2 unwind tables on x86 for debugging
>kernel builds.

C code is not really an issue. Most of the unwind complexity is
handling the special case asm code, interrupt handlers, out of line
lock contention paths, anything in entry.S. Much of the IA64 asm code
has explicit unwind directives in the asm code, i386 asm would need
equivalent kernel changes.

2003-07-21 14:51:47

by Andi Kleen

Subject: Re: KDB in the mainstream 2.4.x kernels?

On Sun, Jul 20, 2003 at 10:55:18PM +1000, Keith Owens wrote:
> On Fri, 18 Jul 2003 22:43:57 +0200,
> >I actually started on porting the KDB backtracer recently to get
> >reliable frame pointer based backtraces, but it turns out the code
> >for that is so complicated and ugly that the chances of ever merging
> >it would be very slim.
>
> Mainly because the kernel is full of special cases and i386 provides no

Yes I agree. It is an ugly problem, which usually results in ugly
solutions too.

-Andi

2003-07-29 19:44:45

by Robin Holt

Subject: Re: KDB in the mainstream 2.4.x kernels?

On Fri, Jul 18, 2003 at 10:43:57PM +0200, Andi Kleen wrote:
>
> One argument i have against it: KDB is incredibly ugly code.
> Before it could be even considered for merging it would need quite a lot
> of cleanup.
> ...

I believe there was a consensus reached that the ugliness was out of
necessity. I hope my understanding is correct.

>
> As for your home crash debugging I suspect you would be better of with LKCD
> or Mcore (crash dump, saving an crash image on some partition, with examining
> the crash dump after reboot)

I personally think _BOTH_ belong in a kernel. KDB for locating problems
in my code during development and a crash dump facility so I can look at
the problems that I don't believe are part of my code at a later time.
Additionally, when I am no longer debugging the code, I would rather
turn KDB off and leave LKCD in there to capture the particular
problem I am chasing. How many problems have been encountered on
user machines which have resulted in a reboot and ignore method?
This is no way to fix problems!

For me, and I believe I am not alone, the most infuriating problem is the
one line change in your code which "tickles" someone else's bug and I
spend 3 days trying to find what really went wrong.

I am sure that nearly anyone working in the bowels of the kernel has
also had a user land program that terminated with just a bus error.

>
> KDB is usually not useful for debugging hangs on desktop boxes (and even
> many servers) because you have usually X running. When the machine crashes and
> goes in KDB you cannot see the text output and debug anything. I learned
> to type "go<return>" blind when I had still an KDB aware kernel, but
> it's not very useful overall.

I believe that this could be addressed once KDB gets into the kernel. If
this were clearly stated as a condition for getting KDB in, then I am
sure someone can figure a method out.

>
> On a development machine you can avoid that by connecting a serial cable,
> but that's usually not easily possible for a desktop box. Doing a post-mortem
> after reboot is more practical. That is what LKCD/mcore do.
>
> Disadvantage is that the current crash dump mechanisms (lkcd, mcore
> crash, netdump) are all still not very reliable and have horrible
> error handling. And you can eat a lot of disk space for the dumps and
> they tend to overflow your file systems. But still it's the only
> realistic option for "desktop bugs"

I am not sure what you mean by LKCD is not reliable. I use LKCD to
create crash dumps at work all the time. The only problems I have
are when the NMI doesn't propagate correctly into LKCD and initiate
the dump. I believe Keith Owens has improved LKCD's understanding
of being called for KDB and now that seems to always work as well.

I think the error handling being referred to is the device driver itself.
That should probably get fixed on a case-by-case basis instead of
putting the blame on LKCD. I don't see a driving force for fixing
drivers for a crash dump facility without a clear direction as to
which facility will be accepted.

As for space, I view that as an admin problem. If you are selling
a machine with a lot of memory, you need to size for the dump you
will typically get. I recently initiated an LKCD dump on a machine
with 8GB of memory, with dumping of kernel pages (including buffers) enabled.
The uncompressed dump output was still only 161MB. I have also taken
a dump from a machine with over 128GB of memory in 64K pages shortly
after booting with init=/bin/sh. It took only 108MB. That doesn't
seem too large in the days of 120GB drives. This can be reduced even
further by using compression (Already supported by LKCD) and not dumping
kernel buffers.

>
> BTW debugging on the X server works on linuxppc/mac with xmon because it
> has a fbcon based console and X server. The debugger can just
> work on the X background while the X server is stopped. Very nifty.
> But I don't see the x86 XFree86 switching to a similar fbcon model any
> time soon, so it's unlikely to help.
>
> -Andi

I believe that crash dumps and in core debugging are useful for a lot of
others as well. If not, why have Red Hat's netdump and United Linux's
LKCD implementations been done? It seems the only people saying "no"
to dumps and in core debugging are the people who claim to benefit from
it least. I view that as equally arrogant to the PPC people being able
to say no to adding machine vectors required by ia64 on a "just because"
basis without justifying based on technical merit.

I agree with _EVERYTHING_ that Linas from IBM pointed out. What needs to
be done to get this in?

If you want to hear user testimonials, I have used KDB for the last
year. It has been invaluable in locating race conditions with spin locks
when you have large processor counts and you are stress testing a machine.

I have complained bitterly (ask Keith Owens) that LKCD needs to be
working. One of my primary uses of LKCD since Keith got it working for
our machine has been: hitting a problem; doing a cursory check to see
if it is something I introduced and debugging with KDB. When I am certain I
am done using KDB, I take the dump and reanalyze using lcrash to ensure
the answer is consistent and do a cursory check for other failures which
may be hiding behind this bug. It also gives me a method of determining
if my fix introduced a new bug or if a future failure was from a long
running behavior which just now got bad enough. When I hit something new,
I now have "perfect recall" of the previous failure.

I believe this should be in the 2.6 kernel as well. I believe it
will save time during the bringup effort for countless people.

Robin Holt

2003-08-13 04:40:48

by Martin Pool

Subject: Re: KDB in the mainstream 2.4.x kernels?

On Fri, 18 Jul 2003 22:43:57 +0200, Andi Kleen wrote:

> KDB is usually not useful for debugging hangs on desktop boxes (and even
> many servers) because you have usually X running. When the machine crashes
> and goes in KDB you cannot see the text output and debug anything. I
> learned to type "go<return>" blind when I had still an KDB aware kernel,
> but it's not very useful overall.

Perhaps in the case where the console is on a vt, kdb could try to
switch to the right vc before presenting its prompt? I realize calling into
the vc code might be risky but it seems like there's not much to lose.
(If you do have a bug in say the agp driver then you need a serial
console...) If it works, you'll be able to debug and continue.

It could even set the colors to white on blue. :-)

--
Martin

2003-08-13 11:04:58

by Andi Kleen

Subject: Re: KDB in the mainstream 2.4.x kernels?

On Wed, Aug 13, 2003 at 02:40:31PM +1000, Martin Pool wrote:
> On Fri, 18 Jul 2003 22:43:57 +0200, Andi Kleen wrote:
>
> > KDB is usually not useful for debugging hangs on desktop boxes (and even
> > many servers) because you have usually X running. When the machine crashes
> > and goes in KDB you cannot see the text output and debug anything. I
> > learned to type "go<return>" blind when I had still an KDB aware kernel,
> > but it's not very useful overall.
>
> Perhaps in the case where the console is on a vt, kdb could try to
> switch to the right vc before presenting its prompt? I realize calling into
> the vc code might be risky but it seems like there's not much to lose.
> (If you do have a bug in say the agp driver then you need a serial
> console...) If it works, you'll be able to debug and continue.

Only the X server can switch away, because only it knows how
to talk to the graphic chipset. And running user space here is
far too risky.

It's possible when the resolutions are controlled by the kernel
in fbcon. That's the case on linux/ppc and you can indeed debug on
top of an X server there. But it's unlikely to happen for linux/x86, the
xfree86 people don't want to move parts of their drivers into the kernel.

-Andi

2003-08-25 12:16:44

by Greg Stark

Subject: Re: KDB in the mainstream 2.4.x kernels?


Andi Kleen <[email protected]> writes:

> Only the X server can switch away, because only it knows how
> to talk to the graphic chipset. And running user space here is
> far too risky.

There was a proposal a long ways back to allow X to download instructions to
the kernel on how to restore the video mode. The proposal was to code the
instructions as a forth program that frobbed registers appropriately. The
kernel would have a small forth interpreter to run it. Then switching
resolutions could happen safely in the kernel.

--
greg

2003-08-25 16:23:20

by Andi Kleen

Subject: Re: KDB in the mainstream 2.4.x kernels?

On Mon, Aug 25, 2003 at 08:16:41AM -0400, Greg Stark wrote:
> There was a proposal a long ways back to allow X to download instructions to
> the kernel on how to restore the video mode. The proposal was to code the
> instructions as a forth program that frobbed registers appropriately. The
> kernel would have a small forth interpretor to run it. Then switching
> resolutions could happen safely in the kernel.

Did the proposal come with working code?

-Andi

2003-08-26 13:39:18

by Greg Stark

Subject: Re: KDB in the mainstream 2.4.x kernels?


Andi Kleen <[email protected]> writes:

> Did the proposal come with working code?

Not that I recall. I'm going back, uh, probably 10-15 years.
But it seems as relevant today as it was then.


--
greg

2003-08-27 13:51:13

by Alan

Subject: Re: KDB in the mainstream 2.4.x kernels?

On Llu, 2003-08-25 at 17:23, Andi Kleen wrote:
> > instructions as a forth program that frobbed registers appropriately. The
> > kernel would have a small forth interpretor to run it. Then switching
> > resolutions could happen safely in the kernel.
>
> Did the proposal come with working code?

I've seen workable non-forth versions of the proposal, yes. It isn't
actually that hard to do for most video cards.

2003-08-28 17:08:57

by Tolentino, Matthew E

Subject: RE: KDB in the mainstream 2.4.x kernels?

> On Llu, 2003-08-25 at 17:23, Andi Kleen wrote:
> > > instructions as a forth program that frobbed registers
> appropriately. The
> > > kernel would have a small forth interpretor to run it.
> Then switching
> > > resolutions could happen safely in the kernel.
> >
> > Did the proposal come with working code?
>
> I've seen workable non forth versions of the proposal yes. It isnt
> actually that hard to do for most video cards

Interesting. So did the interpreted forth (or other) program then interact with the VGA BIOS or was it more generic?

matt

2003-08-28 20:27:11

by Alan

Subject: RE: KDB in the mainstream 2.4.x kernels?

On Iau, 2003-08-28 at 18:08, Tolentino, Matthew E wrote:
> > I've seen workable non forth versions of the proposal yes. It isnt
> > actually that hard to do for most video cards
>
> Interesting. So did the interpreted forth (or other) program then interact with the VGA BIOS or was it more generic?

It consisted simply of a list of in/out values. That's sufficient for
most cards, it turned out. It expected the X server to dump the sequence
of values to the kernel.

A BIOS32/ACPI/whatever is currently trendy service to save/restore video
states would actually be a real help to a lot of things. I guess the
perfect API would support something like

SaveCurrentMode
SetMode (some properties)
GetLinearFBDetails()
RestoreSavedMode
LoadColor() [for 8bit modes]

i.e. roughly what the VESA BIOS provides. Given the cost of executing a
virtual machine like ACPI, it's less clear whether cards could describe
basic acceleration this way, at least if it was something like ACPI
or forth which is hard to compile. A bytecode description that can
be turned into native code obviously has different properties.

2003-08-30 10:45:07

by Pavel Machek

Subject: Re: KDB in the mainstream 2.4.x kernels?

Hi!

> > > instructions as a forth program that frobbed registers appropriately. The
> > > kernel would have a small forth interpretor to run it. Then switching
> > > resolutions could happen safely in the kernel.
> >
> > Did the proposal come with working code?
>
> I've seen workable non forth versions of the proposal yes. It isnt
> actually that hard to do for most video cards

We could make them use code for the ACPI interpreter; that's already in
and has the advantage that graphics people might eventually ship it in
card ROMs....
Pavel


--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

2003-09-02 20:40:42

by Tolentino, Matthew E

Subject: RE: KDB in the mainstream 2.4.x kernels?


> > > > instructions as a forth program that frobbed registers
> appropriately. The
> > > > kernel would have a small forth interpretor to run it.
> Then switching
> > > > resolutions could happen safely in the kernel.
> > >
> > > Did the proposal come with working code?
> >
> > I've seen workable non forth versions of the proposal yes. It isnt
> > actually that hard to do for most video cards
>
> We could make them use code for ACPI interpretter, that's already in
> and has advantage that graphics people might eventually ship it in
> card roms....

The reason I was asking before was because I've been working on a kernel implementation of the EBC (EFI Byte Code) interpreter so that one could employ the use of the UGA (Universal Graphics Adapter) at OS runtime instead of having to rely on VGA (BIOS or hardware) support. UGA is essentially an EFI driver (aka option ROM) that is intended to be used in pre-OS boot space as well as during OS runtime. When built as an EBC image the driver can be interpreted and thus used on any platform.

The UGA protocols defined in the EFI spec enable the capability to perform the mode switching mentioned above. I hate to keep pointing at ia64, but Tiger systems currently ship with a minimal UGA driver for the embedded ATI controller (this can be seen with the EFI command drivers) and x86 systems with EFI firmware will as well (in addition to traditional VGA support).

Although this doesn't resolve the immediate issue, this might provide the support needed in the future...

matt