2002-10-31 20:16:43

by Andreas Herrmann

Subject: Re: [lkcd-devel] Re: What's left over.


Linus Torvalds
10/31/02 04:46 PM

On Wed, 30 Oct 2002, Matt D. Robinson wrote:

> People have to realize that my kernel is not for random new
> features. The stuff I consider important are things that people
> use on their own, or stuff that is the base for other work.

A dump mechanism within the kernel is a base for much easier
kernel debugging.
IMHO, analyzing a dump is much more effective than guessing
a kernel bug solely with help of an oops message.
Using lkcd/lcrash, I've debugged many problems in
kernel modules that would otherwise have been quite hard to pin down.
It is hard to understand why developers do not want the
aid of dumps and dump analysis for kernel development.


Regards,

Andreas


2002-10-31 20:34:19

by Linus Torvalds

Subject: Re: [lkcd-devel] Re: What's left over.


On Thu, 31 Oct 2002, Andreas Herrmann wrote:
>
> A dump mechanism within the kernel is a base for much easier
> kernel debugging.
> IMHO, analyzing a dump is much more effective than guessing
> a kernel bug solely with help of an oops message.

And imnsho, debugging the kernel on a source level is the way to do it.

Which is why it's not going to be me who merges it.

Read my emails.

Linus

2002-10-31 20:46:04

by Patrick Finnegan

Subject: Re: [lkcd-devel] Re: What's left over.

On Thu, 31 Oct 2002, Linus Torvalds wrote:

> On Thu, 31 Oct 2002, Andreas Herrmann wrote:
> >
> > A dump mechanism within the kernel is a base for much easier
> > kernel debugging.
> > IMHO, analyzing a dump is much more effective than guessing
> > a kernel bug solely with help of an oops message.
>
> And imnsho, debugging the kernel on a source level is the way to do it.
>
> Which is why it's not going to be me who merges it.

But LKCD is also useful for tracing crashes back to the hardware that
causes them. It's really hard to find problems in hardware using source
code, since the source code DOESN'T have anything to do with the problems.

Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu

http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif



2002-10-31 21:01:36

by Benjamin LaHaise

Subject: Re: [lkcd-devel] Re: What's left over.

On Thu, Oct 31, 2002 at 12:40:28PM -0800, Linus Torvalds wrote:
> And imnsho, debugging the kernel on a source level is the way to do it.
>
> Which is why it's not going to be me who merges it.
>
> Read my emails.

That is one of the reasons that crash dumps are useful. Quite a few
problems that customers hit are not easy to reproduce, but when they
provide a dump file that can be loaded into gdb with the original
kernel debugging info and the backtrace command issued and various
bits of internal structures examined, usually a good hypothesis can
be made for the cause. Feed that back into a code audit and you end
up fixing problems that are decidedly challenging.

-ben
--
"Do you seek knowledge in time travel?"

2002-10-31 21:57:49

by Bernhard Kaindl

Subject: Re: [lkcd-devel] Re: What's left over.

On Thu, 31 Oct 2002, Benjamin LaHaise wrote:
> On Thu, Oct 31, 2002 at 12:40:28PM -0800, Linus Torvalds wrote:
> > And imnsho, debugging the kernel on a source level is the way to do it.
> >
> > Which is why it's not going to be me who merges it.
> >
> > Read my emails.
>
> That is one of the reasons that crash dumps are useful. Quite a few
> problems that customers hit are not easy to reproduce, but when they
> provide a dump file that can be loaded into gdb with the original
> kernel debugging info and the backtrace command issued and various
> bits of internal structures examined, usually a good hypothesis can
> be made for the cause. Feed that back into a code audit and you end
> up fixing problems that are decidedly challenging.
>
> -ben

I could not have said it better. I have a good real-life example for it,
one which really happened, and an analogy to paint a picture.

[ I'm not an expert, I'm just writing about my experience ]
[ in order to try to make linux even better than it is ]

About debugging at source level:

Dump analysis does not mean that you are not debugging at the source
level. With a vmlinux compiled with -g (which could be stripped before
making the image), crash analysis tools can operate at the source level,
depending on the compiler's reorderings of course. The assumption that
-O2 maps source to binary 1:1 is of course not of this world.

An analogy to doctors, hospitals and patients:

Dump analysis means you don't need a living patient
in order to cure a disease. You may have been asleep on the
other side of the world while the disease killed your friend
at home. But since you don't want it to happen again to someone
else, you want a remote lab that gives you all the information
you need to know what might have killed him.

The dump tools are this remote lab. If you don't have it, you
may need to fly to the site of the disease, monitor the next
patient and try to find out what's happening, and you may not
find out what's wrong without at least one more dead patient
at the end.

But the hospital does not want even a single dead patient more
than necessary (ideally zero), and would choose a doctor who has
the remote lab, where he can quickly check what's wrong and find
a cure *before* the next patient gets ill.

Back to the computer world, this would mean that an OS that has
the remote lab (dump tools) would be favoured over an OS that
doesn't. The same goes for LTT and Dynamic Probes.

Back to crash dumps: in some environments, such as laboratory or
blood bank information systems, you need computers in order to
efficiently process, store and distribute data, and to organize
the handling of blood. In such environments, people's lives
can depend on a fast, efficient and stably working organisation.

Of course you need to be able to recover and keep such an
organisation running even while the laboratory information system
is down for a reboot or maintenance.

But you simply cannot go there, halt all the distributed information
retrieval and automated job control of the laboratory apparatus,
and block all the users (maybe thousands) while you debug the kernel
and check what is going on with the whole hospital waiting for you.

Of course you can do this, but only once, or only at a time
when every use of the system can be organized to bypass it and
use paper, in-house mail and phone to do the things the system
normally does. A hospital with thousands of patients cannot
wait while you debug.

> Which is why it's not going to be me who merges it.

Sure, but it would help Linux World Domination if the base
kernel supported it as well.

Bernd

PS: Sorry for the extreme example, but this is an example
I know from my previous work, and I've tried to describe
it as realistically as possible.

2002-11-01 00:27:39

by Werner Almesberger

Subject: Re: [lkcd-devel] Re: What's left over.

Bernhard Kaindl wrote:
> An analogy to doctors, hospitals and patients:

I have a simpler medical analogy:

- in many cases, all you know is that the patient died
(e.g. think of a router - it has no console, no user
interacting with it, etc.)
- the Oops tells you the patient died of heart failure
(NULL pointer dereferenced in this or that function, called
from ...)
- but it's only the autopsy (the crash dump) that reveals that
the patient was poisoned, and that this is not a routine
case

I view crash dumps as a tool that helps me imagine what the
machine was doing. Without that, I can learn many interesting
things about the code, but I won't necessarily find the actual
bug.

Examples of non-obvious bugs can be found in the various module
unload race discussions. There, usually competent people
suggested incorrect designs, simply because they failed to
imagine some constellations, and no amount of staring at the
source could have helped this lack of imagination.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/