2016-04-17 21:30:19

by Bob Tracy

Subject: [BUG] machine check Oops on Alpha

Apologies in advance for the "poor" quality of this bug report. No idea
how to proceed, because the issue historically has been intermittent to
non-existent for reasons unknown.

Within 24 hours of booting my Alpha (PWS 433au), I'm pretty much
guaranteed to see a "machine check" Oops which typically will occur
during a period of high disk activity (for example, during an "apt-get
update / upgrade"). If I want a huge mess to clean up afterward, "git
pull" on the kernel source tree will generally suffice as well :-(.

As long as the "Oops" trace doesn't include evidence of filesystem write
activity (calls to ext3/4 functions), the machine is perfectly stable
afterward for as long as I care to let it run -- days, weeks, whatever
-- no further Oopses will occur, regardless of how hard I flog the
machine. A "bad" Oops will cause an immediate system lockup if any
process attempts to access the region of disk that was active at the
time the Oops occurred.

While a "machine check" is normally indicative of an underlying hardware
issue, the fact this is a one-time-per-boot issue has me thinking
otherwise. I suspect a code path being traversed prior to the Oops that
gets bypassed afterward. As previously mentioned, there have been months-
long intervals in the past where the issue has either been masked or non-
existent. Currently, the issue has persisted through several 4.X kernel
release candidates and releases.

Attached is an example of precisely what I'm talking about as far as a
"good" Oops. It occurred within a day of the last reboot, and the
machine has been running fine since. Been flogging the devil out of it,
too: lots of updates (hundreds of megabytes), kernel builds, etc.

While any and all help tracking this down will be appreciated, please
know that kernel rebuilds (to turn on debugging or for whatever reason)
are an overnight affair on this system. In other words, turnaround time
on diagnostic iterations involving kernel modifications will be slow.

--Bob


Attachments:
(No filename) (1.96 kB)
good_oops (3.63 kB)

2016-04-18 01:32:59

by Maciej W. Rozycki

Subject: Re: [BUG] machine check Oops on Alpha

On Sun, 17 Apr 2016, Bob Tracy wrote:

> While a "machine check" is normally indicative of an underlying hardware
> issue, the fact this is a one-time-per-boot issue has me thinking
> otherwise. I suspect a code path being traversed prior to the Oops that
> gets bypassed afterward. As previously mentioned, there have been months-
> long intervals in the past where the issue has either been masked or non-
> existent. Currently, the issue has persisted through several 4.X kernel
> release candidates and releases.

It may or may not be a hardware issue, it would seem; there's this comment
in `process_mcheck_info':

	/*
	 * See if the machine check is due to a badaddr() and if so,
	 * ignore it.
	 */

> Attached is an example of precisely what I'm talking about as far as a
> "good" Oops. It occurred within a day of the last reboot, and the
> machine has been running fine since. Been flogging the devil out of it,
> too: lots of updates (hundreds of megabytes), kernel builds, etc.

So from this dump it looks like the immediate problem is not the machine
check itself but rather a null pointer dereference (offset by 0x10, so
likely a structure member access):

Unable to handle kernel paging request at virtual address 0000000000000010

which happens at:

pc is at process_mcheck_info+0x54/0x370

and the offending instruction is:

10 00 89 a2 ldl a4,16(s0)

and s0 is indeed null. To me it looks like we're here:

	printk(KERN_CRIT "%s machine check: vector=0x%lx pc=0x%lx code=0x%x\n",
	       machine, vector, get_irq_regs()->pc, mchk_header->code);

(so not a benign MCE after all) trying to fetch `mchk_header->code', which
means `la_ptr' is null for some reason. This value is passed down from
`cia_machine_check', from `do_entInt', and originally comes from PALcode,
supposed to point to the logout area.

The SCB vector, still present in a0 it would seem, is 0x630, which looks
legitimate, means "Processor correctable machine check" and is used for
signalling Istream or Dstream correctable ECC errors. These are dealt
with IIUC by PALcode before the machine check is dispatched, which would
explain why, except for the Oops observed, the system continues to operate
normally.

So question is whether it's PALcode doing something weird or is it a
register getting corrupted due to a bug somewhere, either in our code or
GCC. Hmm...

I'd be tempted to run with the patch below to see what's the value of
`la_ptr' early on in processing (`entInt' code in entry.S looks sane to
me, doesn't touch a2). NB a rebuild doesn't have to be costly if you only
poke at a single file or a few which aren't e.g. headers included from
everywhere.

Maciej

diff --git a/arch/alpha/kernel/irq_alpha.c b/arch/alpha/kernel/irq_alpha.c
index 1c8625c..6773bab 100644
--- a/arch/alpha/kernel/irq_alpha.c
+++ b/arch/alpha/kernel/irq_alpha.c
@@ -46,6 +46,9 @@ do_entInt(unsigned long type, unsigned long vector,
 {
 	struct pt_regs *old_regs;
 
+	if (type == 2)
+		printk(KERN_CRIT "machine check: LA: %016lx\n", la_ptr);
+
 	/*
 	 * Disable interrupts during IRQ handling.
 	 * Note that there is no matching local_irq_enable() due to
2016-04-18 03:58:58

by Bob Tracy

Subject: Re: [BUG] machine check Oops on Alpha

On Mon, Apr 18, 2016 at 02:32:54AM +0100, Maciej W. Rozycki wrote:
> I'd be tempted to run with the patch below to see what's the value of
> `la_ptr' early on in processing (`entInt' code in entry.S looks sane to
> me, doesn't touch a2). NB a rebuild doesn't have to be costly if you only
> poke at a single file or a few which aren't e.g. headers included from
> everywhere.

Applied. Build started. Report to follow in a day or so: I've applied
other patches to my kernel source tree in the meantime, so a full build
is unavoidable at this point... I'll hold off applying any updates
after this to minimize what must be rebuilt while this issue is being
worked. Thank you for your time and trouble!

--Bob

2016-04-18 12:31:45

by Bob Tracy

Subject: Re: [BUG] machine check Oops on Alpha

On Sun, Apr 17, 2016 at 10:58:48PM -0500, Bob Tracy wrote:
> On Mon, Apr 18, 2016 at 02:32:54AM +0100, Maciej W. Rozycki wrote:
> > I'd be tempted to run with the patch below to see what's the value of
> > `la_ptr' early on in processing (`entInt' code in entry.S looks sane to
> > me, doesn't touch a2). NB a rebuild doesn't have to be costly if you only
> > poke at a single file or a few which aren't e.g. headers included from
> > everywhere.
>
> Applied. Build started. Report to follow in a day or so: I've applied
> other patches to my kernel source tree in the meantime, so a full build
> is unavoidable at this point... I'll hold off applying any updates
> after this to minimize what must be rebuilt while this issue is being
> worked. Thank you for your time and trouble!

Build delayed slightly. Ran into "fs/binfmt_em86.o" build failure
patched by Daniel Wagner back in February (incompatible-pointer-types
warning treated as error by compiler). Is Daniel's patch queued for
incorporation into the main kernel source tree?

--Bob

2016-04-18 13:47:43

by Maciej W. Rozycki

Subject: Re: [BUG] machine check Oops on Alpha

On Mon, 18 Apr 2016, Bob Tracy wrote:

> Build delayed slightly. Ran into "fs/binfmt_em86.o" build failure
> patched by Daniel Wagner back in February (incompatible-pointer-types
> warning treated as error by compiler). Is Daniel's patch queued for
> incorporation into the main kernel source tree?

No idea. I've had a peek at the patch though and it groups unrelated
changes together and also mixes obvious semantics fixes (missing `const'
qualifier) with semantic changes (`i_arg' removal) which may need further
consideration. I think splitting that proposal into ~3 self-contained
changes may raise the likelihood of at least the critical parts being
accepted.

Maciej

2016-04-19 02:52:52

by Bob Tracy

Subject: Re: [BUG] machine check Oops on Alpha

On Mon, Apr 18, 2016 at 02:47:40PM +0100, Maciej W. Rozycki wrote:
> On Mon, 18 Apr 2016, Bob Tracy wrote:
>
> > Build delayed slightly. Ran into "fs/binfmt_em86.o" build failure
> > patched by Daniel Wagner back in February (incompatible-pointer-types
> > warning treated as error by compiler). Is Daniel's patch queued for
> > incorporation into the main kernel source tree?
>
> No idea. I've had a peek at the patch though and it groups unrelated
> changes together and also mixes obvious semantics fixes (missing `const'
> qualifier) with semantic changes (`i_arg' removal) which may need further
> consideration. I think splitting that proposal into ~3 self-contained
> changes may rise the likelihood of at least the critical parts being
> accepted.

4.6.0-rc4 build complete, including suggested (by Alan Young) "Verbose
Machine Checks" option set to level 2 by default. System rebooted, and
now we wait... Thanks for everyone's continued patience.

--Bob

2016-04-19 23:57:01

by Bob Tracy

Subject: Re: [BUG] machine check Oops on Alpha

On Mon, Apr 18, 2016 at 09:52:43PM -0500, Bob Tracy wrote:
> 4.6.0-rc4 build complete, including suggested (by Alan Young) "Verbose
> Machine Checks" option set to level 2 by default. System rebooted, and
> now we wait... Thanks for everyone's continued patience.

Within three minutes of rebooting, I got a machine check, but perhaps
significantly, no "Oops". I'm guessing the only reason I'm seeing the
ECC errors now (haven't seen them before) is because of the stepped-up
debug output. Syslog output attached...

Machine has been stable since the machine check. Kernel is 4.6.0-rc4.

--Bob


Attachments:
(No filename) (599.00 B)
machine_check (5.46 kB)

2016-04-20 00:46:23

by Maciej W. Rozycki

Subject: Re: [BUG] machine check Oops on Alpha

On Tue, 19 Apr 2016, Bob Tracy wrote:

> > 4.6.0-rc4 build complete, including suggested (by Alan Young) "Verbose
> > Machine Checks" option set to level 2 by default. System rebooted, and
> > now we wait... Thanks for everyone's continued patience.
>
> Within three minutes of rebooting, I got a machine check, but perhaps
> significantly, no "Oops". I'm guessing the only reason I'm seeing the
> ECC errors now (haven't seen them before) is because of the stepped-up
> debug output. Syslog output attached...

If this is a code generation bug, which I now suspect even more highly
than before, then the debug verbosity configuration change may well have
made the compiler behave indeed. As you can see from the log the logout
area pointer is not null:

machine check: LA: fffffc0000006000

(of course the lone insertion of this `printk' call may have covered the
bug, regardless of the debug verbosity change). Consequently further
information is printed -- the:

CIA machine check: vector=0x630 pc=0xfffffc00005b66ac code=0x86

line would have been printed anyway -- in fact the Oops previously
happened in an attempt to retrieve `code' to print with this line.

I can see if I can find anything suspicious there if you send me original
copies (i.e. those that oopsed) of arch/alpha/kernel/irq_alpha.o and
arch/alpha/kernel/core_cia.o.

> Machine has been stable since the machine check. Kernel is 4.6.0-rc4.

Yeah, it was a correctable error after all.

Maciej

2016-04-20 03:58:04

by Bob Tracy

Subject: Re: [BUG] machine check Oops on Alpha

On Wed, Apr 20, 2016 at 01:46:13AM +0100, Maciej W. Rozycki wrote:
> I can see if I can find anything suspicious there if you send me original
> copies (i.e. those that oopsed) of arch/alpha/kernel/irq_alpha.o and
> arch/alpha/kernel/core_cia.o.
>
> > Machine has been stable since the machine check. Kernel is 4.6.0-rc4.
>
> Yeah, it was a correctable error after all.

:-)

Regrettably, the constituent object files of the 4.5.0 kernel that was
generating the "Oopses" are no longer available. I *do* have the
"vmlinux.gz" image, the corresponding "System.map-4.5.0", and all the
related modules if those would be of any use. With a bit of guidance, I
could probably extract the desired objects from the kernel image.
Alternatively, if there's an upload location where I could leave you the
image and map files, that might work as well.

Pending your reply, I'll see if I can figure out how to dump/extract the
requested object code from the kernel image file.

--Bob