2003-05-08 00:56:05

by Randy.Dunlap

Subject: garbled oopsen


I have several oopses that are garbled. Part of the problem is that
page fault code (x86: arch/i386/mm/fault.c) does not attempt to
serialize the "Unable to handle kernel ... at virtual address ..."
messages, since it's considered better to get _some_ messages out
than no messages. (and serialize it with what?)

However, after untwisting these, I can tell you that unraveling
them is not fun.

Can these be cleaned up in any reasonable way?
Any suggestions?

This is on 2.5.68 and 2.5.69.


(sample 1)
i
de-sUcnsaibl:e hdtod :h asuncd l1e 80ke22rn0e1l96 p3aging request at virtual address 6b6b6b8b

(sample 2)
i<de1->Usncsaibl:e h dtod: h saunc dl18e 0k2e20r1ne96l3 p

which decodes into:
i de - s cs i : h d d: s u c 18 0 2 20 1 96 3
< 1 >U n a bl e to h a n dl e k e r ne l p

(sample 3, much longer)
scsi_eh_<pr4>t_hfdadi: lA_TstAaPIts r: e2se:t0: 0co:m0 plcmetdse
failiedde:- sc0s,i :c anRecaeclh: ed1
idTeosctsali_ pofc_ 1in ctro mminantdersr oupnt 1 hdanedvliceres
rePaqcuikrete ceho mmwoanrdk
cosmcplsiet_ehe_d,2 : 0 abboyrttesin tgr canmdsf:e0rxrf7eddb
f1didc
e-scidsei-: shcsddi:: achbeorckt icognndoiretdi
on sfocsr i_1e41h_
2: iabdoe-rstcisngi : chmddd :f aqilueedue: 0xcmf7dd b= f1[d c3
0 0s cs0i _4e0 h_02 :]
Seinddie-ngs cBsiD:R Rsdeaevc:h ed0xf i7ddde5sccs0i0_
pci_dien-trs cisin:te rdrevuiptc e harensdelte ri
gnoidreed-s
csiid: eR-escaschi:ed h iddd:e scqsuei_ 1p4c1_i, nctmrd i=n te[r 0 ru0p 0t h0 a0nd 0le r]

/end/


2003-05-08 01:28:16

by Andrew Morton

Subject: Re: garbled oopsen

"Randy.Dunlap" wrote:
>
> I have several oopses that are garbled.

Use kgdb.

> Can these be cleaned up in any reasonable way?

It needs some additional spinlock in there. People have moaned for over a
year, patches have been floating about but nobody has taken the time to
finish one off and submit it.

It's never bothered me, because availability of a serial console equates to
availability of kgdb.

> Any suggestions?

A Greek-to-English dictionary?


2003-05-08 02:07:55

by Martin J. Bligh

Subject: Re: garbled oopsen

>> Can these be cleaned up in any reasonable way?
>
> It needs some additional spinlock in there. People have moaned for over a
> year, patches have been floating about but nobody has taken the time to
> finish one off and submit it.

I tried it a while back; the obvious lock approach didn't seem to work, but
I can't seem to find the patch right now. IIRC, printk should be atomic,
so formatting each line into a buffer first, and then printk'ing the buffer
(prefaced by cpu number) *should* work. Maybe. I think.

M.
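
A minimal userspace sketch of the line-buffer idea above, not the kernel's printk path. It assumes that a single output call per complete line is the unit that does not get interleaved. The names format_cpu_line() and emit_line() and the "[cpuN] " prefix format are hypothetical, chosen only for illustration.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build one complete line in buf, prefixed with the
 * CPU number. Returns the number of bytes stored, excluding the NUL.
 * Output that does not fit is truncated, never overrun.
 */
size_t format_cpu_line(char *buf, size_t size, int cpu, const char *fmt, ...)
{
	va_list ap;
	int n, m;

	if (size == 0)
		return 0;

	n = snprintf(buf, size, "[cpu%d] ", cpu);
	if (n < 0) {
		buf[0] = '\0';
		return 0;
	}
	if ((size_t)n >= size)
		return size - 1;	/* the prefix alone was truncated */

	va_start(ap, fmt);
	m = vsnprintf(buf + n, size - (size_t)n, fmt, ap);
	va_end(ap);
	if (m < 0) {
		buf[n] = '\0';
		return (size_t)n;
	}
	if ((size_t)m >= size - (size_t)n)
		return size - 1;	/* the message was truncated */
	return (size_t)n + (size_t)m;
}

/*
 * Hypothetical helper: hand the finished line to the output in one call,
 * so fragments of the same line are never emitted separately.
 */
void emit_line(FILE *out, const char *line)
{
	fputs(line, out);
}
```

A caller would format into a local buffer and then call emit_line(stderr, buf) once per line. Whether a single printk() of a whole line is really atomic in the kernel is exactly the open question in the message above.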

2003-05-08 02:39:57

by Andi Kleen

Subject: Re: garbled oopsen

Andrew Morton writes:

>> Can these be cleaned up in any reasonable way?
>
> It needs some additional spinlock in there. People have moaned for over a
> year, patches have been floating about but nobody has taken the time to
> finish one off and submit it.

I considered it for x86-64 and even implemented it, but never submitted
in fear of deadlocks e.g. when an oops recurses. For this a
spinlock_timeout() would be useful. Print anyways when you cannot get the
lock in a second or two.

-Andi
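
A rough userspace sketch of what a spinlock with a timeout could look like, using C11 atomics. spinlock_timeout() here is only a stand-in with an assumed signature, since no such kernel function is defined in this thread. The wall clock from timespec_get() is also an assumption. A kernel version would need a time source that still works while the machine is oopsing.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Current time in nanoseconds (assumption: the wall clock is good enough). */
static uint64_t now_ns(void)
{
	struct timespec ts;

	timespec_get(&ts, TIME_UTC);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical spinlock_timeout(): spin until the lock is taken or until
 * timeout_ns has elapsed. Returns true if the lock was acquired, false on
 * timeout. On false the caller prints anyway, without holding the lock.
 */
bool spinlock_timeout(atomic_flag *lock, uint64_t timeout_ns)
{
	uint64_t deadline = now_ns() + timeout_ns;

	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
		if (now_ns() >= deadline)
			return false;
	}
	return true;
}

/* Release a lock that spinlock_timeout() reported as acquired. */
void spinlock_release(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}
```

The caller must remember the return value and only release the lock if it was actually acquired. With a timeout of a second or two, a recursive oops on the lock holder costs a delay instead of a deadlock.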

2003-05-08 04:21:40

by Randy.Dunlap

Subject: Re: garbled oopsen

> "Randy.Dunlap" wrote:
>>
>> I have several oopses that are garbled.
>
> Use kgdb.
>
>> Can these be cleaned up in any reasonable way?
>
> It needs some additional spinlock in there. People have moaned for over a
> year, patches have been floating about but nobody has taken the time to
> finish one off and submit it.
>
> It's never bothered me, because availability of a serial console equates to
> availability of kgdb.

I'm more interested in having it clean for people who use 2.6.x.
Yes, I can get by without it or by using kgdb, but that's not the point IMO.

~Randy



2003-05-08 05:33:51

by Martin J. Bligh

Subject: Re: garbled oopsen

>>> Can these be cleaned up in any reasonable way?
>>
>> It needs some additional spinlock in there. People have moaned for over a
>> year, patches have been floating about but nobody has taken the time to
>> finish one off and submit it.
>
> I considered it for x86-64 and even implemented it, but never submitted
> in fear of deadlocks e.g. when an oops recurses. For this a
> spinlock_timeout() would be useful. Print anyways when you cannot get the
> lock in a second or two.

The trouble is that the subsystems you want may be broken (eg timers).
IMHO it's better to just spew whatever you can (the current crap) ...
wait a couple of seconds, then have another go at doing it properly.

That way people can't complain it's worse than it is now in any way ;-)

M.

2003-05-08 05:40:28

by Andrew Morton

Subject: Re: garbled oopsen

"Martin J. Bligh" wrote:
>
> >>> Can these be cleaned up in any reasonable way?
> >>
> >> It needs some additional spinlock in there. People have moaned for over a
> >> year, patches have been floating about but nobody has taken the time to
> >> finish one off and submit it.
> >
> > I considered it for x86-64 and even implemented it, but never submitted
> > in fear of deadlocks e.g. when an oops recurses. For this a
> > spinlock_timeout() would be useful. Print anyways when you cannot get the
> > lock in a second or two.
>
> The trouble is that the subsystems you want may be broken (eg timers).
> IMHO it's better to just spew whatever you can (the current crap) ...
> wait a couple of seconds, then have another go at doing it properly.

A recursive oops is easy enough to detect anyway.

	preempt_disable();
	if (oops_cpu == -1 || oops_cpu != smp_processor_id()) {
		_raw_spin_lock(&oops_lock);
		oops_cpu = smp_processor_id();
	}
	<current stuff>
	oops_cpu = -1;
	spin_lock_init(&oops_lock);
	preempt_enable();

or something like that.

> That way people can't complain it's worse than it is now in any way ;-)

Too many complaints, too few unified diffs on this one.
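
A compilable userspace analogue of the owner-CPU check sketched above, using C11 atomics in place of kernel spinlocks. The function names oops_enter() and oops_exit() are hypothetical, and the CPU number is passed in by the caller because there is no smp_processor_id() outside the kernel. This is an illustration of the idea, not a proposed patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag oops_lock = ATOMIC_FLAG_INIT;
static atomic_int oops_cpu = -1;	/* -1: no CPU is currently oopsing */

/*
 * Hypothetical entry hook. Returns true if this call took the lock,
 * false if the same CPU already owns it (a recursive oops), in which
 * case we do not spin and simply carry on printing.
 */
bool oops_enter(int cpu)
{
	if (atomic_load(&oops_cpu) == cpu)
		return false;

	while (atomic_flag_test_and_set_explicit(&oops_lock,
						 memory_order_acquire))
		;	/* another CPU is printing its oops: wait */

	atomic_store(&oops_cpu, cpu);
	return true;
}

/*
 * Hypothetical exit hook. It drops ownership unconditionally, which
 * mirrors the forced re-initialisation of the lock in the pseudocode.
 */
void oops_exit(void)
{
	atomic_store(&oops_cpu, -1);
	atomic_flag_clear_explicit(&oops_lock, memory_order_release);
}
```

Because the owner is recorded only after the lock is taken, a CPU that faults again while printing sees its own number and skips the lock. A different CPU sees a foreign number, or -1, and spins as usual.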

2003-05-08 06:32:35

by Andi Kleen

Subject: Re: garbled oopsen

On Thu, May 08, 2003 at 05:32:04AM +0200, Martin J. Bligh wrote:
> The trouble is that the subsystems you want may be broken (eg timers).

rdtsc/get_cycles() should still work. If that's broken too you have a really
serious problem. It's only on the local CPU, so you don't need any complications
for bro^wunsynced SMP systems.

-Andi

2003-05-08 07:16:55

by Andi Kleen

Subject: Re: garbled oopsen

On Thu, May 08, 2003 at 07:53:10AM +0200, Andrew Morton wrote:
> A recursive oops is easy enough to detect anyway.
>
> 	preempt_disable();
> 	if (oops_cpu == -1 || oops_cpu != smp_processor_id()) {
> 		_raw_spin_lock(&oops_lock);
> 		oops_cpu = smp_processor_id();
> 	}
> 	<current stuff>
> 	oops_cpu = -1;
> 	spin_lock_init(&oops_lock);
> 	preempt_enable();
>
> or something like that.

Yes, I did it this way in my old 2.4 x86-64 patch. But I never
felt comfortable enough about it to commit it.

(The in_interrupt thing was to avoid an interrupt stack problem on
x86-64; it is not needed anymore, or on i386.)

But I would prefer the spinlock timeout, I think. It's a safer and more
obviously correct algorithm.

-Andi

Index: arch/x86_64/mm/fault.c
===================================================================
RCS file: /home/cvs/Repository/linux/arch/x86_64/mm/fault.c,v
retrieving revision 1.33
diff -u -u -r1.33 fault.c
--- arch/x86_64/mm/fault.c 2002/10/02 15:41:14 1.33
+++ arch/x86_64/mm/fault.c 2003/01/13 08:42:35
@@ -30,6 +30,9 @@
 #include <asm/proto.h>
 #include <asm/kdebug.h>
 
+spinlock_t pcrash_lock;
+int crashing_cpu;
+
 extern spinlock_t console_lock, timerlist_lock;
 
 void bust_spinlocks(int yes)
@@ -251,6 +254,14 @@
 	console_verbose();
 	bust_spinlocks(1);
 
+	if (!in_interrupt()) {
+		if (!spin_trylock(&pcrash_lock)) {
+			if (crashing_cpu != smp_processor_id())
+				spin_lock(&pcrash_lock);
+		}
+		crashing_cpu = smp_processor_id();
+	}
+
 	if (address < PAGE_SIZE)
 		printk(KERN_ALERT "Unable to handle kernel NULL pointer dereference");
 	else
@@ -259,7 +270,14 @@
 	printk(" printing rip:\n");
 	printk("%016lx\n", regs->rip);
 	dump_pagetable(address);
+
 	die("Oops", regs, error_code);
+
+	if (!in_interrupt()) {
+		crashing_cpu = -1; /* small harmless window */
+		spin_unlock(&pcrash_lock);
+	}
+
 	bust_spinlocks(0);
 	do_exit(SIGKILL);



2003-05-19 07:07:20

by Keith Owens

Subject: Re: garbled oopsen

On Wed, 7 May 2003 18:05:30 -0700,
"Randy.Dunlap" wrote:
>I have several oopses that are garbled. Part of the problem is that
>page fault code (x86: arch/i386/mm/fault.c) does not attempt to
>serialize the "Unable to handle kernel ... at virtual address ..."
>messages, since it's considered better to get _some_ messages out
>than no messages. (and serialize it with what?)
>
>However, after untwisting these, I can tell you that unraveling
>them is not fun.
>
>Can these be cleaned up in any reasonable way?
>Any suggestions?

kdb_printf() has this:

	/* Serialize kdb_printf if multiple cpus try to write at once.
	 * But if any cpu goes recursive in kdb, just print the output,
	 * even if it is interleaved with any other text.
	 */
	if (!KDB_STATE(PRINTF_LOCK)) {
		KDB_STATE_SET(PRINTF_LOCK);
		spin_lock(&kdb_printf_lock);
	}
	....
	if (KDB_STATE(PRINTF_LOCK)) {
		spin_unlock(&kdb_printf_lock);
		KDB_STATE_CLEAR(PRINTF_LOCK);
	}

KDB_STATE() is a per-cpu set of flags; PRINTF_LOCK indicates whether this
cpu has got or is trying to get the kdb_printf_lock. I get no
interleave problems, except when somebody prints a line in multiple
calls to kdb_printf(): each fragment is printed as one chunk, but the
fragments from different cpus can be interleaved with each other.