2006-11-21 13:34:05

by Kirill Korotaev

Subject: [SPARC64]: resumable error decoding

David,

Running stress tests on OpenVZ 2.6.18 sparc64 kernel we hit the following:
------- cut --------
[285401.094964] RESUMABLE ERROR: Reporting on cpu 0
[285401.626736] RESUMABLE ERROR: err_handle[410000000000c6f] err_stick[103921ee2007c] err_type[00000004:warning resumable]
[285402.869015] RESUMABLE ERROR: err_attrs[00000020: ]
[285403.491920] RESUMABLE ERROR: err_raddr[0000000000000000] err_size[0] err_cpu[0]
[285404.347508] TSTATE: 0000004480001602 TPC: 000000000041931c TNPC: 0000000000419320 Y: 00000000 Not tainted
[285405.496613] TPC: <cpu_idle+0x84/0xc0>
[285405.892615] g0: 00000000006e2531 g1: 0000000000000016 g2: 0000000000000014 g3: 00000000006def80
[285406.550536] g4: 00000000006e2f80 g5: fffff8000449bd40 g6: 00000000006def80 g7: 0000038000004000
[285406.884717] o0: 0000000000000000 o1: 00000000006def88 o2: 0000000000004000 o3: 4000000000000000
[285407.214724] o4: 0000000000001290 o5: 0000000000000012 sp: 00000000006e2531 ret_pc: 0000000000419308
[285407.562135] RPC: <cpu_idle+0x70/0xc0>
[285407.701342] l0: 00000000006de800 l1: 0000000000000027 l2: 0000000000000000 l3: 00000001ff000000
[285408.029282] l4: 0000000040004110 l5: 00000000fff74080 l6: 00000000fff4d701 l7: 00000000f0254040
[285408.348195] i0: 0000000100000000 i1: 0000000000000000 i2: 0000000000000000 i3: 0000000100000000
[285408.681920] i4: 0000000000000080 i5: 0000000000000080 i6: 00000000006e25f1 i7: 00000000007a67ec
[285409.010870] I7: <start_kernel+0x294/0x300>
------- cut --------

it looks like the hardware reports some problem and
the most interesting field is err_attrs...

u32 err_attrs;
#define SUN4V_ERR_ATTRS_PROCESSOR 0x00000001
#define SUN4V_ERR_ATTRS_MEMORY 0x00000002
#define SUN4V_ERR_ATTRS_PIO 0x00000004
#define SUN4V_ERR_ATTRS_INT_REGISTERS 0x00000008
#define SUN4V_ERR_ATTRS_FPU_REGISTERS 0x00000010
#define SUN4V_ERR_ATTRS_USER_MODE 0x01000000
#define SUN4V_ERR_ATTRS_PRIV_MODE 0x02000000
#define SUN4V_ERR_ATTRS_RES_QUEUE_FULL 0x80000000

... which should explain which subsystem is faulty.
However, the 2.6.18 kernel knows nothing about the value 0x20 :/
I also didn't find anything about it in the available documentation.
Can you shed some light on this, please?
A link to the doc or some hint would be very much appreciated.
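
To make the question concrete, here is a minimal user-space sketch of decoding err_attrs against the bits listed above. The table, the printed names and the helper decode_err_attrs() are illustrative assumptions, not the kernel's own code; only the SUN4V_ERR_ATTRS_* values come from the 2.6.18 definitions quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bit values copied from the 2.6.18 definitions quoted above. */
#define SUN4V_ERR_ATTRS_PROCESSOR	0x00000001
#define SUN4V_ERR_ATTRS_MEMORY		0x00000002
#define SUN4V_ERR_ATTRS_PIO		0x00000004
#define SUN4V_ERR_ATTRS_INT_REGISTERS	0x00000008
#define SUN4V_ERR_ATTRS_FPU_REGISTERS	0x00000010
#define SUN4V_ERR_ATTRS_USER_MODE	0x01000000
#define SUN4V_ERR_ATTRS_PRIV_MODE	0x02000000
#define SUN4V_ERR_ATTRS_RES_QUEUE_FULL	0x80000000

/* Hypothetical lookup table; the names are chosen for this sketch only. */
static const struct {
	uint32_t bit;
	const char *name;
} known_attrs[] = {
	{ SUN4V_ERR_ATTRS_PROCESSOR,      "processor" },
	{ SUN4V_ERR_ATTRS_MEMORY,         "memory" },
	{ SUN4V_ERR_ATTRS_PIO,            "pio" },
	{ SUN4V_ERR_ATTRS_INT_REGISTERS,  "int-registers" },
	{ SUN4V_ERR_ATTRS_FPU_REGISTERS,  "fpu-registers" },
	{ SUN4V_ERR_ATTRS_USER_MODE,      "user-mode" },
	{ SUN4V_ERR_ATTRS_PRIV_MODE,      "priv-mode" },
	{ SUN4V_ERR_ATTRS_RES_QUEUE_FULL, "queue-full" },
};

/*
 * Write the space-separated names of all known bits set in attrs into
 * buf (truncating if needed) and return the mask of bits that are set
 * but have no entry in the table.
 */
static uint32_t decode_err_attrs(uint32_t attrs, char *buf, size_t len)
{
	uint32_t unknown = attrs;
	size_t used = 0;
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(known_attrs) / sizeof(known_attrs[0]); i++) {
		if (!(attrs & known_attrs[i].bit))
			continue;
		unknown &= ~known_attrs[i].bit;
		if (used < len - 1) {
			int n = snprintf(buf + used, len - used, "%s%s",
					 used ? " " : "", known_attrs[i].name);
			if (n > 0)
				used += (size_t)n < len - used ?
					(size_t)n : len - used - 1;
		}
	}
	return unknown;
}
```

For the value in the log, decode_err_attrs(0x20, buf, sizeof(buf)) leaves buf empty and returns 0x20, which matches the blank name list printed in err_attrs[00000020: ].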

Thanks,
Kirill


2006-11-22 00:11:58

by David Miller

Subject: Re: [SPARC64]: resumable error decoding

From: Kirill Korotaev <[email protected]>
Date: Tue, 21 Nov 2006 16:42:47 +0300

> Running stress tests on OpenVZ 2.6.18 sparc64 kernel we hit the following:
> ------- cut --------
> [285401.094964] RESUMABLE ERROR: Reporting on cpu 0
> [285401.626736] RESUMABLE ERROR: err_handle[410000000000c6f] err_stick[103921ee2007c] err_type[00000004:warning resumable]
> [285402.869015] RESUMABLE ERROR: err_attrs[00000020: ]
> [285403.491920] RESUMABLE ERROR: err_raddr[0000000000000000] err_size[0] err_cpu[0]

This is a power-off request, did someone push the power-off button
or give the power-off command from the System Controller console?

I should add proper support for this, this report is a good reminder
:-)

All resumable errors of type 0x4 are power-off requests.
Unfortunately these encodings are not in any of the publicly published
documents.
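
As a rough illustration of that rule, a handler could classify the report before deciding whether to log it as an error. This is only a user-space sketch: struct resum_report and is_powerdown_request() are hypothetical names, and the value 4 is taken from the err_type field in the log quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* err_type value seen in the log above ("warning resumable"). */
#define ERR_TYPE_WARNING_RES	4u

/* Hypothetical, cut-down stand-in for a resumable error report. */
struct resum_report {
	uint32_t err_type;
	uint32_t err_attrs;
};

/* True when the report is a power-off request rather than a fault. */
static bool is_powerdown_request(const struct resum_report *rep)
{
	assert(rep != NULL);
	return rep->err_type == ERR_TYPE_WARNING_RES;
}
```

With the report from the log (err_type 4, err_attrs 0x20), this check returns true, so the report would be handled as a shutdown request instead of being logged as a hardware error.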

2006-11-22 10:10:40

by Kirill Korotaev

Subject: Re: [SPARC64]: resumable error decoding

>>Running stress tests on OpenVZ 2.6.18 sparc64 kernel we hit the following:
>>------- cut --------
>>[285401.094964] RESUMABLE ERROR: Reporting on cpu 0
>>[285401.626736] RESUMABLE ERROR: err_handle[410000000000c6f] err_stick[103921ee2007c] err_type[00000004:warning resumable]
>>[285402.869015] RESUMABLE ERROR: err_attrs[00000020: ]
>>[285403.491920] RESUMABLE ERROR: err_raddr[0000000000000000] err_size[0] err_cpu[0]
>
>
> This is a power-off request, did someone push the power-off button
> or give the power-off command from the System Controller console?
ahh, looks like that's it :)
One of our users reproduced an issue which causes both mainstream and
2.6.18 OVZ kernels to hang on sparc :/ Will investigate...
He probably reset the box after the hang :)

> I should add proper support for this, this report is a good reminder
> :-)
would be nice :@)

> All resumable errors of type 0x4 are power-off requests.
> Unfortunately these encodings are not in any of the publicly published
> documents.
thanks a lot for the explanation!

Thanks,
Kirill

2006-11-30 20:29:46

by David Miller

Subject: Re: [SPARC64]: resumable error decoding

From: Kirill Korotaev <[email protected]>
Date: Wed, 22 Nov 2006 13:19:28 +0300

> > I should add proper support for this, this report is a good reminder
> > :-)
> would be nice :@)

I tested the following patch and it worked fine for me on a T2000, let
me know if it works for you too:

commit 035f09edbbc921b9688a65ec58c0f49b822e605c
Author: David S. Miller <[email protected]>
Date: Wed Nov 29 21:16:21 2006 -0800

[SPARC64]: Run ctrl-alt-del action for sun4v powerdown request.

Signed-off-by: David S. Miller <[email protected]>

diff --git a/arch/sparc64/kernel/traps.c b/arch/sparc64/kernel/traps.c
index ec7a601..ad67784 100644
--- a/arch/sparc64/kernel/traps.c
+++ b/arch/sparc64/kernel/traps.c
@@ -10,7 +10,7 @@
  */
 
 #include <linux/module.h>
-#include <linux/sched.h>	/* for jiffies */
+#include <linux/sched.h>
 #include <linux/kernel.h>
 #include <linux/kallsyms.h>
 #include <linux/signal.h>
@@ -1873,6 +1873,16 @@ void sun4v_resum_error(struct pt_regs *r
 
 	put_cpu();
 
+	if (ent->err_type == SUN4V_ERR_TYPE_WARNING_RES) {
+		/* If err_type is 0x4, it's a powerdown request.  Do
+		 * not do the usual resumable error log because that
+		 * makes it look like some abnormal error.
+		 */
+		printk(KERN_INFO "Power down request...\n");
+		kill_cad_pid(SIGINT, 1);
+		return;
+	}
+
 	sun4v_log_error(regs, &local_copy, cpu,
 			KERN_ERR "RESUMABLE ERROR",
 			&sun4v_resum_oflow_cnt);