LinuxLists.cc - [PATCH] EDAC, MCE, AMD: Fix code to prevent NULL dereference

2014-02-17 18:08:10

Subject: [PATCH] EDAC, MCE, AMD: Fix code to prevent NULL dereference

If MCE decoding support does not exist for a particular family/model,
and if one tries to inject errors using mce_amd_inj module, it leads
to kernel OOPS. Especially if we inject errors to MC0, MC1, MC2 banks.

Sample:
[ 60.567808] [Hardware Error]: MC0 Error:
[ 60.567826] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 60.567840] IP: [<ffffffffa0019f36>] amd_decode_mce+0x526/0x900 [edac_mce_amd]
[ 60.567855] PGD ba665067 PUD 37168067 PMD 0
[ 60.567865] Oops: 0000 [#1] SMP
[ 60.567872] Modules linked in: mce_amd_inj amd64_edac_mod edac_core edac_mce_amd r8169
[ 60.567889] CPU: 2 PID: 2011 Comm: sh Not tainted 3.14.0-rc3.spinoff_ML+ #7
[ 60.567898] Hardware name: AMD Lamar/Lamar, BIOS WLA3904N_Weekly_13_09_0 09/04/2013
[ 60.567907] task: ffff88040a58e040 ti: ffff8800bb206000 task.ti: ffff8800bb206000
[ 60.567916] RIP: 0010:[<ffffffffa0019f36>] [<ffffffffa0019f36>] amd_decode_mce+0x526/0x900 [edac_mce_amd]
[ 60.567930] RSP: 0018:ffff8800bb207dc8 EFLAGS: 00010206
[ 60.567937] RAX: 0000000000000000 RBX: ffffffffa0014300 RCX: 00000000000010a5
[ 60.567945] RDX: 0000000000002825 RSI: 0000000000000001 RDI: 0000000000000f0f
[ 60.567953] RBP: ffff8800bb207e48 R08: 0000000000000000 R09: 0000000000000370
[ 60.567961] R10: ffffffff81a6ace0 R11: f000000000000000 R12: 0000000000012980
[ 60.567968] R13: a000000000010f0f R14: 0000000000000001 R15: ffff88041fc00000
[ 60.567978] FS: 00007f4709286700(0000) GS:ffff88041fd00000(0000) knlGS:0000000000000000
[ 60.567988] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 60.567995] CR2: 0000000000000000 CR3: 00000000bb1b3000 CR4: 00000000000407e0
[ 60.568003] Stack:
[ 60.568007] ffffea0000daa000 0000000000000000 ffff88040a592020 0000000000c504a8
[ 60.568021] ffff880409e7b6d0 ffff8800bb207e60 ffff88040d99aba0 ffff8800bb207e38
[ 60.568033] ffffffff813ce4fc 0000000a00000006 0000000000c504a8 0000000000000002
[ 60.568045] Call Trace:
[ 60.568059] [<ffffffff813ce4fc>] ? _kstrtoull+0x2c/0x90
[ 60.568069] [<ffffffffa0012056>] edac_inject_bank_store+0x56/0x90 [mce_amd_inj]
[ 60.568083] [<ffffffff81272b10>] ? kernfs_fop_write+0x50/0x150
[ 60.568094] [<ffffffff813bc56f>] kobj_attr_store+0xf/0x20
[ 60.568104] [<ffffffff8126ee25>] sysfs_kf_write+0x45/0x60
[ 60.568114] [<ffffffff81272b9e>] kernfs_fop_write+0xde/0x150
[ 60.568125] [<ffffffff811fb422>] vfs_write+0xc2/0x1d0
[ 60.568134] [<ffffffff811fb8f2>] SyS_write+0x52/0xa0
[ 60.568144] [<ffffffff8184361e>] ? do_page_fault+0xe/0x10
[ 60.568154] [<ffffffff81848512>] system_call_fastpath+0x16/0x1b
[ 60.568162] Code: c7 17 ba 01 a0 31 c0 4a 8b 34 ed e0 b3 01 a0 e8 12 64 81 e1 4c 8b 2b e9 3f fb ff ff 48 8b 05 7a 33 00 00 41 0f b6 f6 41 0f b7 fd <ff> 10 84 c0 0f 85 fc fd ff ff 48 c7 c7 28 c3 01 a0 31 c0 e8 e3
[ 60.568228] RIP [<ffffffffa0019f36>] amd_decode_mce+0x526/0x900 [edac_mce_amd]
[ 60.568240] RSP <ffff8800bb207dc8>
[ 60.568245] CR2: 0000000000000000
[ 60.568252] ---[ end trace 6ba951fb82ecbc10 ]---

In this patch, we fix the bug by checking if fam_ops struct has been
alloc-ed before we proceed with fam/model specific decoding.

Signed-off-by: Aravind Gopalakrishnan <[email protected]>
---
drivers/edac/mce_amd.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 30f7309..9b03daa 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -281,6 +281,12 @@ static void decode_mc0_mce(struct mce *m)

pr_emerg(HW_ERR "MC0 Error: ");

+ if (!fam_ops) {
+ pr_err("fam_ops structure not alloc-ed."
+ " Cannot provide detailed family/model"
+ " specific error decoding.\n");
+ return;
+ }
/* TLB error signatures are the same across families */
if (TLB_ERROR(ec)) {
if (TT(ec) == TT_DATA) {
@@ -391,6 +397,12 @@ static void decode_mc1_mce(struct mce *m)

pr_emerg(HW_ERR "MC1 Error: ");

+ if (!fam_ops) {
+ pr_err("fam_ops structure not alloc-ed."
+ " Cannot provide detailed family/model"
+ " specific error decoding.\n");
+ return;
+ }
if (TLB_ERROR(ec))
pr_cont("%s TLB %s.\n", LL_MSG(ec),
(xec ? "multimatch" : "parity error"));
@@ -522,6 +534,12 @@ static void decode_mc2_mce(struct mce *m)

pr_emerg(HW_ERR "MC2 Error: ");

+ if (!fam_ops) {
+ pr_err("fam_ops structure not alloc-ed."
+ " Cannot provide detailed family/model"
+ " specific error decoding.\n");
+ return;
+ }
if (!fam_ops->mc2_mce(ec, xec))
pr_cont(HW_ERR "Corrupted MC2 MCE info?\n");
}
--
1.7.10.4

2014-02-17 18:27:33

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH] EDAC, MCE, AMD: Fix code to prevent NULL dereference

On Mon, Feb 17, 2014 at 11:49:51AM -0600, Aravind Gopalakrishnan wrote:
> If MCE decoding support does not exist for a particular family/model,
> and if one tries to inject errors using mce_amd_inj module, it leads
> to kernel OOPS. Especially if we inject errors to MC0, MC1, MC2 banks.

Well, we shouldn't even be loading the module on unsupported hw, i.e.,
something like that:

--
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 30f7309446a6..3ef997bfb89d 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -816,10 +816,10 @@ static int __init mce_amd_init(void)
struct cpuinfo_x86 *c = &boot_cpu_data;

if (c->x86_vendor != X86_VENDOR_AMD)
- return 0;
+ return -ENODEV;

if (c->x86 < 0xf || c->x86 > 0x16)
- return 0;
+ return -ENODEV;

fam_ops = kzalloc(sizeof(struct amd_decoder_ops), GFP_KERNEL);
if (!fam_ops)

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-17 19:26:35

by Aravind Gopalakrishnan

[permalink] [raw]

Subject: Re: [PATCH] EDAC, MCE, AMD: Fix code to prevent NULL dereference

On 2/17/2014 12:27 PM, Borislav Petkov wrote:
> On Mon, Feb 17, 2014 at 11:49:51AM -0600, Aravind Gopalakrishnan wrote:
>> If MCE decoding support does not exist for a particular family/model,
>> and if one tries to inject errors using mce_amd_inj module, it leads
>> to kernel OOPS. Especially if we inject errors to MC0, MC1, MC2 banks.
> Well, we shouldn't even be loading the module on unsupported hw, i.e.,
> something like that:
>
>
> if (c->x86_vendor != X86_VENDOR_AMD)
> - return 0;
> + return -ENODEV;
>
> if (c->x86 < 0xf || c->x86 > 0x16)
> - return 0;
> + return -ENODEV;
>
Ah. you are right. (I had tested against different family/model checks
that happen to throw NULL dereference)
You can ignore the patch.

Apologies for the trouble.

-Aravind.

2014-02-17 19:42:00

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH] EDAC, MCE, AMD: Fix code to prevent NULL dereference

On Mon, Feb 17, 2014 at 01:26:20PM -0600, Aravind Gopalakrishnan wrote:
> > if (c->x86_vendor != X86_VENDOR_AMD)
> >- return 0;
> >+ return -ENODEV;
> > if (c->x86 < 0xf || c->x86 > 0x16)
> >- return 0;
> >+ return -ENODEV;
> Ah. you are right. (I had tested against different family/model
> checks that happen to throw NULL dereference)
> You can ignore the patch.
>
> Apologies for the trouble.

But we still need a fix, I guess the one I sent you does the job, yes,
no?

If not, I'd need more background info though - you're loading this on an
unsupported family, right?

If so, we need to fix the family check at the beginning:

if (c->x86 < 0xf || c->x86 > 0x16)

maybe even look at c->x86_model if we have to.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-17 22:57:31

by Aravind Gopalakrishnan

[permalink] [raw]

Subject: Re: [PATCH] EDAC, MCE, AMD: Fix code to prevent NULL dereference

On 2/17/2014 1:41 PM, Borislav Petkov wrote:
> On Mon, Feb 17, 2014 at 01:26:20PM -0600, Aravind Gopalakrishnan wrote:
>>> if (c->x86_vendor != X86_VENDOR_AMD)
>>> - return 0;
>>> + return -ENODEV;
>>> if (c->x86 < 0xf || c->x86 > 0x16)
>>> - return 0;
>>> + return -ENODEV;
>>
> But we still need a fix, I guess the one I sent you does the job, yes,
> no?

Actually, the changes you sent above does the job only if 'edac-mce-amd'
is configured as module.
If it is built-in (and looks like this is what Kconfig recommends as
well..), then -
* We can still modprobe mce_amd_inj
* inject error
* mce_amd_inj calls into amd_decode_mce and
* this for mc0, mc1 and mc2 will lead to NULL dereference if family
is unsupported.

So we'd still need the patch I sent earlier.
I guess what we really need is a mash-up of both changes..

> If not, I'd need more background info though - you're loading this on an
> unsupported family, right?
Yes. (more or less :) )
(background info to make things clear-)
someone did try this on unsupported HW and got kernel oops.
But since I can't get my hands on one, I am simulating it by using a
fam15, M30h box and setting the init condition as
if (c->x86 < 0xf || c->x86 > 0x14)

snapshot of the oops from simulating on my system:
[ 28.846200] [Hardware Error]: MC0 Error:
[ 28.846218] BUG: unable to handle kernel NULL pointer dereference
at (null)
[ 28.846232] IP: [<ffffffff81608526>] amd_decode_mce+0x526/0x900
[ 28.846247] PGD 40bc9e067 PUD 40c677067 PMD 0
[ 28.846257] Oops: 0000 [#1] SMP
[ 28.846264] Modules linked in: mce_amd_inj amd64_edac_mod r8169

here's a sample mc0 error injected after applying both sets of changes
to the code:
[ 94.109090] [Hardware Error]: MC0 Error:
[ 94.109103] fam_ops structure not alloc-ed. Cannot provide detailed
family/model specific error decoding.
[ 94.109119] [Hardware Error]: Error Status: Uncorrected, software
containable error.
[ 94.109132] [Hardware Error]: CPU:0 (15:30:0)
MC0_STATUS[-|UE|-|-|-|-|-]: 0xa000000000010f0f
[ 94.109146] [Hardware Error]: cache level: L3/GEN, mem/io: GEN,
mem-tx: GEN, part-proc: GEN (timed out)

Shall I work up the patch with both sets of changes and resend?

Thanks,
-Aravind.

2014-02-18 00:36:59

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH] EDAC, MCE, AMD: Fix code to prevent NULL dereference

On Mon, Feb 17, 2014 at 04:36:24PM -0600, Aravind Gopalakrishnan wrote:
> snapshot of the oops from simulating on my system:
> [ 28.846200] [Hardware Error]: MC0 Error:
> [ 28.846218] BUG: unable to handle kernel NULL pointer dereference
> at (null)
> [ 28.846232] IP: [<ffffffff81608526>] amd_decode_mce+0x526/0x900

Ok, I see it now.

mce_amd_inj.c calls directly amd_decode_mce() which is wrong. What it
should do, instead, is what mce_log does:

ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
...

and send the mce struct through the notifier chain so that if we haven't
registered on it, we don't decode the error.

> Shall I work up the patch with both sets of changes and resend?

Just replace the amd_decode_mce() call in mce_amd_inj.c with the above
call to the notifier and it should work, AFAICT. We should still add the
hunk I sent you earlier to take care of the module case. Oh, and make
amd_decode_mce() static so that nothing outside of mce_amd.c uses it.

I might be missing something but it is too late now and I'm going to bed
- more fun tomorrow :-)

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-18 08:46:42

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH] EDAC, MCE, AMD: Fix code to prevent NULL dereference

Ok, let's try a simpler thing. Only build-tested here:

---
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 30f7309446a6..30592838f1da 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -738,6 +738,9 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
struct cpuinfo_x86 *c = &cpu_data(m->extcpu);
int ecc;

+ if (!fam_ops)
+ return NOTIFY_DONE;
+
if (amd_filter_mce(m))
return NOTIFY_STOP;

@@ -816,10 +819,10 @@ static int __init mce_amd_init(void)
struct cpuinfo_x86 *c = &boot_cpu_data;

if (c->x86_vendor != X86_VENDOR_AMD)
- return 0;
+ return -ENODEV;

if (c->x86 < 0xf || c->x86 > 0x16)
- return 0;
+ return -ENODEV;

fam_ops = kzalloc(sizeof(struct amd_decoder_ops), GFP_KERNEL);
if (!fam_ops)
@@ -874,6 +877,7 @@ static int __init mce_amd_init(void)
default:
printk(KERN_WARNING "Huh? What family is it: 0x%x?!\n", c->x86);
kfree(fam_ops);
+ fam_ops = NULL;
return -EINVAL;
}

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-18 18:27:29

by Aravind Gopalakrishnan

[permalink] [raw]

Subject: Re: [PATCH] EDAC, MCE, AMD: Fix code to prevent NULL dereference

On 2/18/2014 2:46 AM, Borislav Petkov wrote:
> Ok, let's try a simpler thing. Only build-tested here:
>
>
> + if (!fam_ops)
> + return NOTIFY_DONE;
> +
> if (amd_filter_mce(m))
> return NOTIFY_STOP;
>
> @@ -816,10 +819,10 @@ static int __init mce_amd_init(void)
> struct cpuinfo_x86 *c = &boot_cpu_data;
>
> if (c->x86_vendor != X86_VENDOR_AMD)
> - return 0;
> + return -ENODEV;
>
> if (c->x86 < 0xf || c->x86 > 0x16)
> - return 0;
> + return -ENODEV;
>
> fam_ops = kzalloc(sizeof(struct amd_decoder_ops), GFP_KERNEL);
> if (!fam_ops)
> @@ -874,6 +877,7 @@ static int __init mce_amd_init(void)
> default:
> printk(KERN_WARNING "Huh? What family is it: 0x%x?!\n", c->x86);
> kfree(fam_ops);
> + fam_ops = NULL;
> return -EINVAL;
> }
>
>

This works. But a drawback is that you wouldn't get the output from more
generic error decoding that happens after the 'switch' in amd_decode_mce:

pr_emerg(HW_ERR "Error Status: %s\n", decode_error_status(m))
(etc..) (etc..)
amd_decode_err_code(m->status & 0xffff);

A quick fix for this is to rearrange the above chunk of code to happen
before the 'switch'
Tried it on local machine.Here's some sample outputs:

on unsupported h/w:
[ 46.822828] [Hardware Error]: Error Status: Uncorrected, software
containable error.
[ 46.822846] [Hardware Error]: CPU:0 (15:30:0)
MC0_STATUS[-|UE|-|-|-|-|-]: 0xa000000000010f0f
[ 46.822858] [Hardware Error]: cache level: L3/GEN, mem/io: GEN,
mem-tx: GEN, part-proc: GEN (timed out)

on supported h/w:(a MC0 error)
[ 84.305292] [Hardware Error]: Error Status: Uncorrected, software
containable error.
[ 84.305312] [Hardware Error]: CPU:0 (15:30:0)
MC0_STATUS[-|UE|-|-|-|-|-]: 0xa000000000010f0f
[ 84.305327] [Hardware Error]: cache level: L3/GEN, mem/io: GEN,
mem-tx: GEN, part-proc: GEN (timed out)
[ 84.305343] [Hardware Error]: MC0 Error: Internal error condition
type 1.

on supported h/w:(a MC4 ECC error)
[ 128.942878] [Hardware Error]: Error Status: System Fatal error.
[ 128.942897] [Hardware Error]: CPU:0 (15:30:0)
MC4_STATUS[-|UE|-|PCC|AddrV|-|-|UECC]: 0xa600200000080a23
[ 128.942914] [Hardware Error]: MC4_ADDR: 0x0000000000000000
[ 128.942922] [Hardware Error]: cache level: L3/GEN, mem/io: MEM,
mem-tx: WR, part-proc: RES (no timeout)
[ 128.942939] [Hardware Error]: MC4 Error (node 0): DRAM ECC error
detected on the NB.
[ 128.942971] EDAC MC0: 1 UE on mc#0csrow#2channel#0 (csrow:2 channel:0
page:0x0 offset:0x0 grain:0)

A word about your earlier suggestion of using amd_notifier_call_chain in
mce_amd_inj:
The changes will need to be more involved..
- Firstly, x86_mce_decoder_chain is defined in mce.c. So we'd need to
move it to somewhere in asm/mce.h
- include notifier.h in asm/mce.h (build error saying there are multiple
definitions of 'x86_mce_decoder_chain' when I tried this.. haven't
figured out why yet..)
- You'd need to change i_mce to pointer type which in turn will need
changes in the manner we reference the struct variables in the code

Not sure if you need these many changes, not to mention - touch common
mce code.
Simpler solution might be to rearrange the code in amd_decode_mce and
use your hunk..
Thoughts?

Thanks,
-Aravind.

2014-02-20 09:32:15

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH] EDAC, MCE, AMD: Fix code to prevent NULL dereference

On Tue, Feb 18, 2014 at 12:27:19PM -0600, Aravind Gopalakrishnan wrote:
> This works. But a drawback is that you wouldn't get the output from
> more generic error decoding that happens after the 'switch' in
> amd_decode_mce:
>
> pr_emerg(HW_ERR "Error Status: %s\n", decode_error_status(m))
> (etc..) (etc..)
> amd_decode_err_code(m->status & 0xffff);
>
> A quick fix for this is to rearrange the above chunk of code to
> happen before the 'switch'

Is that better (I also dropped the "Error Status: " prefix because it is
not needed):

- [ 46.822828] [Hardware Error]: Error Status: Uncorrected, software containable error.
+ [ 46.822828] [Hardware Error]: Uncorrected, software containable error.

--
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 30f7309446a6..528b0c4998d9 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -741,6 +741,36 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
if (amd_filter_mce(m))
return NOTIFY_STOP;

+ pr_emerg(HW_ERR "%s\n", decode_error_status(m));
+
+ pr_emerg(HW_ERR "CPU:%d (%x:%x:%x) MC%d_STATUS[%s|%s|%s|%s|%s",
+ m->extcpu,
+ c->x86, c->x86_model, c->x86_mask,
+ m->bank,
+ ((m->status & MCI_STATUS_OVER) ? "Over" : "-"),
+ ((m->status & MCI_STATUS_UC) ? "UE" : "CE"),
+ ((m->status & MCI_STATUS_MISCV) ? "MiscV" : "-"),
+ ((m->status & MCI_STATUS_PCC) ? "PCC" : "-"),
+ ((m->status & MCI_STATUS_ADDRV) ? "AddrV" : "-"));
+
+ if (c->x86 == 0x15 || c->x86 == 0x16)
+ pr_cont("|%s|%s",
+ ((m->status & MCI_STATUS_DEFERRED) ? "Deferred" : "-"),
+ ((m->status & MCI_STATUS_POISON) ? "Poison" : "-"));
+
+ /* do the two bits[14:13] together */
+ ecc = (m->status >> 45) & 0x3;
+ if (ecc)
+ pr_cont("|%sECC", ((ecc == 2) ? "C" : "U"));
+
+ pr_cont("]: 0x%016llx\n", m->status);
+
+ if (m->status & MCI_STATUS_ADDRV)
+ pr_emerg(HW_ERR "MC%d_ADDR: 0x%016llx\n", m->bank, m->addr);
+
+ if (!fam_ops)
+ goto err_code;
+
switch (m->bank) {
case 0:
decode_mc0_mce(m);
@@ -774,33 +804,7 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
break;
}

- pr_emerg(HW_ERR "Error Status: %s\n", decode_error_status(m));
-
- pr_emerg(HW_ERR "CPU:%d (%x:%x:%x) MC%d_STATUS[%s|%s|%s|%s|%s",
- m->extcpu,
- c->x86, c->x86_model, c->x86_mask,
- m->bank,
- ((m->status & MCI_STATUS_OVER) ? "Over" : "-"),
- ((m->status & MCI_STATUS_UC) ? "UE" : "CE"),
- ((m->status & MCI_STATUS_MISCV) ? "MiscV" : "-"),
- ((m->status & MCI_STATUS_PCC) ? "PCC" : "-"),
- ((m->status & MCI_STATUS_ADDRV) ? "AddrV" : "-"));
-
- if (c->x86 == 0x15 || c->x86 == 0x16)
- pr_cont("|%s|%s",
- ((m->status & MCI_STATUS_DEFERRED) ? "Deferred" : "-"),
- ((m->status & MCI_STATUS_POISON) ? "Poison" : "-"));
-
- /* do the two bits[14:13] together */
- ecc = (m->status >> 45) & 0x3;
- if (ecc)
- pr_cont("|%sECC", ((ecc == 2) ? "C" : "U"));
-
- pr_cont("]: 0x%016llx\n", m->status);
-
- if (m->status & MCI_STATUS_ADDRV)
- pr_emerg(HW_ERR "MC%d_ADDR: 0x%016llx\n", m->bank, m->addr);
-
+ err_code:
amd_decode_err_code(m->status & 0xffff);

return NOTIFY_STOP;
@@ -816,10 +820,10 @@ static int __init mce_amd_init(void)
struct cpuinfo_x86 *c = &boot_cpu_data;

if (c->x86_vendor != X86_VENDOR_AMD)
- return 0;
+ return -ENODEV;

if (c->x86 < 0xf || c->x86 > 0x16)
- return 0;
+ return -ENODEV;

fam_ops = kzalloc(sizeof(struct amd_decoder_ops), GFP_KERNEL);
if (!fam_ops)
@@ -874,6 +878,7 @@ static int __init mce_amd_init(void)
default:
printk(KERN_WARNING "Huh? What family is it: 0x%x?!\n", c->x86);
kfree(fam_ops);
+ fam_ops = NULL;
return -EINVAL;
}
--

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-20 16:07:43

by Aravind Gopalakrishnan

[permalink] [raw]

Subject: Re: [PATCH] EDAC, MCE, AMD: Fix code to prevent NULL dereference

On 2/20/2014 3:32 AM, Borislav Petkov wrote:
> Is that better (I also dropped the "Error Status: " prefix because it is
> not needed):
>
> - [ 46.822828] [Hardware Error]: Error Status: Uncorrected, software containable error.
> + [ 46.822828] [Hardware Error]: Uncorrected, software containable error.

Okay.

> --
> diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
> index 30f7309446a6..528b0c4998d9 100644
> --- a/drivers/edac/mce_amd.c
> +++ b/drivers/edac/mce_amd.c
> @@ -741,6 +741,36 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
> if (amd_filter_mce(m))
> return NOTIFY_STOP;
>
> + pr_emerg(HW_ERR "%s\n", decode_error_status(m));
> +
> + pr_emerg(HW_ERR "CPU:%d (%x:%x:%x) MC%d_STATUS[%s|%s|%s|%s|%s",
> + m->extcpu,
> + c->x86, c->x86_model, c->x86_mask,
> + m->bank,
> + ((m->status & MCI_STATUS_OVER) ? "Over" : "-"),
> + ((m->status & MCI_STATUS_UC) ? "UE" : "CE"),
> + ((m->status & MCI_STATUS_MISCV) ? "MiscV" : "-"),
> + ((m->status & MCI_STATUS_PCC) ? "PCC" : "-"),
> + ((m->status & MCI_STATUS_ADDRV) ? "AddrV" : "-"));
> +
> + if (c->x86 == 0x15 || c->x86 == 0x16)
> + pr_cont("|%s|%s",
> + ((m->status & MCI_STATUS_DEFERRED) ? "Deferred" : "-"),
> + ((m->status & MCI_STATUS_POISON) ? "Poison" : "-"));
> +
> + /* do the two bits[14:13] together */
> + ecc = (m->status >> 45) & 0x3;
> + if (ecc)
> + pr_cont("|%sECC", ((ecc == 2) ? "C" : "U"));
> +
> + pr_cont("]: 0x%016llx\n", m->status);
> +
> + if (m->status & MCI_STATUS_ADDRV)
> + pr_emerg(HW_ERR "MC%d_ADDR: 0x%016llx\n", m->bank, m->addr);
> +
> + if (!fam_ops)
> + goto err_code;
> +
> switch (m->bank) {
> case 0:
> decode_mc0_mce(m);
> @@ -774,33 +804,7 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
> break;
> }
>
> - pr_emerg(HW_ERR "Error Status: %s\n", decode_error_status(m));
> -
> - pr_emerg(HW_ERR "CPU:%d (%x:%x:%x) MC%d_STATUS[%s|%s|%s|%s|%s",
> - m->extcpu,
> - c->x86, c->x86_model, c->x86_mask,
> - m->bank,
> - ((m->status & MCI_STATUS_OVER) ? "Over" : "-"),
> - ((m->status & MCI_STATUS_UC) ? "UE" : "CE"),
> - ((m->status & MCI_STATUS_MISCV) ? "MiscV" : "-"),
> - ((m->status & MCI_STATUS_PCC) ? "PCC" : "-"),
> - ((m->status & MCI_STATUS_ADDRV) ? "AddrV" : "-"));
> -
> - if (c->x86 == 0x15 || c->x86 == 0x16)
> - pr_cont("|%s|%s",
> - ((m->status & MCI_STATUS_DEFERRED) ? "Deferred" : "-"),
> - ((m->status & MCI_STATUS_POISON) ? "Poison" : "-"));
> -
> - /* do the two bits[14:13] together */
> - ecc = (m->status >> 45) & 0x3;
> - if (ecc)
> - pr_cont("|%sECC", ((ecc == 2) ? "C" : "U"));
> -
> - pr_cont("]: 0x%016llx\n", m->status);
> -
> - if (m->status & MCI_STATUS_ADDRV)
> - pr_emerg(HW_ERR "MC%d_ADDR: 0x%016llx\n", m->bank, m->addr);
> -
> + err_code:
> amd_decode_err_code(m->status & 0xffff);
>
> return NOTIFY_STOP;
> @@ -816,10 +820,10 @@ static int __init mce_amd_init(void)
> struct cpuinfo_x86 *c = &boot_cpu_data;
>
> if (c->x86_vendor != X86_VENDOR_AMD)
> - return 0;
> + return -ENODEV;
>
> if (c->x86 < 0xf || c->x86 > 0x16)
> - return 0;
> + return -ENODEV;
>
> fam_ops = kzalloc(sizeof(struct amd_decoder_ops), GFP_KERNEL);
> if (!fam_ops)
> @@ -874,6 +878,7 @@ static int __init mce_amd_init(void)
> default:
> printk(KERN_WARNING "Huh? What family is it: 0x%x?!\n", c->x86);
> kfree(fam_ops);
> + fam_ops = NULL;
> return -EINVAL;
> }
> --
>

Tested the above a final time on local machine and it works fine..

Thanks,
-Aravind.

2014-02-20 16:13:56

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH] EDAC, MCE, AMD: Fix code to prevent NULL dereference

On Thu, Feb 20, 2014 at 10:07:33AM -0600, Aravind Gopalakrishnan wrote:
> Tested the above a final time on local machine and it works fine..

Ok, I'll queue it up with your Tested-by. Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-21 14:23:26

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH] EDAC, MCE, AMD: Fix code to prevent NULL dereference

On Thu, Feb 20, 2014 at 05:13:51PM +0100, Borislav Petkov wrote:
> On Thu, Feb 20, 2014 at 10:07:33AM -0600, Aravind Gopalakrishnan wrote:
> > Tested the above a final time on local machine and it works fine..
>
> Ok, I'll queue it up with your Tested-by. Thanks.

So I dropped the family check altogether, modulo the warning that says
that we're getting loaded on unsupported hardware. We want to allow
loading, even if we don't have family ops.

Ok?

--
From: Borislav Petkov <[email protected]>
Subject: [PATCH] MCE, AMD: Fix decoding module loading on unsupported hw

We want to still be able to issue some error information on systems for
which there is no decoding support (think older distro kernels here,
for example). Therefore, we allow module registration but skip the
per-family bank-specific decoders and issue the general information
only, i.e.:

[ 46.822828] [Hardware Error]: Error Status: Uncorrected, software containable error.
[ 46.822846] [Hardware Error]: CPU:0 (15:30:0) MC0_STATUS[-|UE|-|-|-|-|-]: 0xa000000000010f0f
[ 46.822858] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)

with the hope that it still contains helpful useful bits.

Suggested-by: Aravind Gopalakrishnan <[email protected]>
Tested-by: Aravind Gopalakrishnan <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
---
drivers/edac/mce_amd.c | 65 +++++++++++++++++++++++++-------------------------
1 file changed, 33 insertions(+), 32 deletions(-)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 30f7309446a6..51b9caa0b024 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -741,6 +741,36 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
if (amd_filter_mce(m))
return NOTIFY_STOP;

+ pr_emerg(HW_ERR "%s\n", decode_error_status(m));
+
+ pr_emerg(HW_ERR "CPU:%d (%x:%x:%x) MC%d_STATUS[%s|%s|%s|%s|%s",
+ m->extcpu,
+ c->x86, c->x86_model, c->x86_mask,
+ m->bank,
+ ((m->status & MCI_STATUS_OVER) ? "Over" : "-"),
+ ((m->status & MCI_STATUS_UC) ? "UE" : "CE"),
+ ((m->status & MCI_STATUS_MISCV) ? "MiscV" : "-"),
+ ((m->status & MCI_STATUS_PCC) ? "PCC" : "-"),
+ ((m->status & MCI_STATUS_ADDRV) ? "AddrV" : "-"));
+
+ if (c->x86 == 0x15 || c->x86 == 0x16)
+ pr_cont("|%s|%s",
+ ((m->status & MCI_STATUS_DEFERRED) ? "Deferred" : "-"),
+ ((m->status & MCI_STATUS_POISON) ? "Poison" : "-"));
+
+ /* do the two bits[14:13] together */
+ ecc = (m->status >> 45) & 0x3;
+ if (ecc)
+ pr_cont("|%sECC", ((ecc == 2) ? "C" : "U"));
+
+ pr_cont("]: 0x%016llx\n", m->status);
+
+ if (m->status & MCI_STATUS_ADDRV)
+ pr_emerg(HW_ERR "MC%d_ADDR: 0x%016llx\n", m->bank, m->addr);
+
+ if (!fam_ops)
+ goto err_code;
+
switch (m->bank) {
case 0:
decode_mc0_mce(m);
@@ -774,33 +804,7 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
break;
}

- pr_emerg(HW_ERR "Error Status: %s\n", decode_error_status(m));
-
- pr_emerg(HW_ERR "CPU:%d (%x:%x:%x) MC%d_STATUS[%s|%s|%s|%s|%s",
- m->extcpu,
- c->x86, c->x86_model, c->x86_mask,
- m->bank,
- ((m->status & MCI_STATUS_OVER) ? "Over" : "-"),
- ((m->status & MCI_STATUS_UC) ? "UE" : "CE"),
- ((m->status & MCI_STATUS_MISCV) ? "MiscV" : "-"),
- ((m->status & MCI_STATUS_PCC) ? "PCC" : "-"),
- ((m->status & MCI_STATUS_ADDRV) ? "AddrV" : "-"));
-
- if (c->x86 == 0x15 || c->x86 == 0x16)
- pr_cont("|%s|%s",
- ((m->status & MCI_STATUS_DEFERRED) ? "Deferred" : "-"),
- ((m->status & MCI_STATUS_POISON) ? "Poison" : "-"));
-
- /* do the two bits[14:13] together */
- ecc = (m->status >> 45) & 0x3;
- if (ecc)
- pr_cont("|%sECC", ((ecc == 2) ? "C" : "U"));
-
- pr_cont("]: 0x%016llx\n", m->status);
-
- if (m->status & MCI_STATUS_ADDRV)
- pr_emerg(HW_ERR "MC%d_ADDR: 0x%016llx\n", m->bank, m->addr);
-
+ err_code:
amd_decode_err_code(m->status & 0xffff);

return NOTIFY_STOP;
@@ -816,10 +820,7 @@ static int __init mce_amd_init(void)
struct cpuinfo_x86 *c = &boot_cpu_data;

if (c->x86_vendor != X86_VENDOR_AMD)
- return 0;
-
- if (c->x86 < 0xf || c->x86 > 0x16)
- return 0;
+ return -ENODEV;

fam_ops = kzalloc(sizeof(struct amd_decoder_ops), GFP_KERNEL);
if (!fam_ops)
@@ -874,7 +875,7 @@ static int __init mce_amd_init(void)
default:
printk(KERN_WARNING "Huh? What family is it: 0x%x?!\n", c->x86);
kfree(fam_ops);
- return -EINVAL;
+ fam_ops = NULL;
}

pr_info("MCE: In-kernel MCE decoding enabled.\n");
--
1.8.5.2.192.g7794a68

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-21 16:46:19

by Aravind Gopalakrishnan

[permalink] [raw]

Subject: Re: [PATCH] EDAC, MCE, AMD: Fix code to prevent NULL dereference

On 2/21/2014 8:23 AM, Borislav Petkov wrote:
> On Thu, Feb 20, 2014 at 05:13:51PM +0100, Borislav Petkov wrote:
>> On Thu, Feb 20, 2014 at 10:07:33AM -0600, Aravind Gopalakrishnan wrote:
>>> Tested the above a final time on local machine and it works fine..
>> Ok, I'll queue it up with your Tested-by. Thanks.
> So I dropped the family check altogether, modulo the warning that says
> that we're getting loaded on unsupported hardware. We want to allow
> loading, even if we don't have family ops.
>
> Ok?

Yes, makes sense.

> --
> From: Borislav Petkov <[email protected]>
> Subject: [PATCH] MCE, AMD: Fix decoding module loading on unsupported hw
>
> We want to still be able to issue some error information on systems for
> which there is no decoding support (think older distro kernels here,
> for example). Therefore, we allow module registration but skip the
> per-family bank-specific decoders and issue the general information
> only, i.e.:
>
> [ 46.822828] [Hardware Error]: Error Status: Uncorrected, software containable error.
> [ 46.822846] [Hardware Error]: CPU:0 (15:30:0) MC0_STATUS[-|UE|-|-|-|-|-]: 0xa000000000010f0f
> [ 46.822858] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)
>
> with the hope that it still contains helpful useful bits.
>
> Suggested-by: Aravind Gopalakrishnan <[email protected]>
> Tested-by: Aravind Gopalakrishnan <[email protected]>
> Signed-off-by: Borislav Petkov <[email protected]>
> ---
> drivers/edac/mce_amd.c | 65 +++++++++++++++++++++++++-------------------------
> 1 file changed, 33 insertions(+), 32 deletions(-)
>
>

Works for me:

[ 0.321623] Huh? What family is it: 0x15?!
[ 0.321629] MCE: In-kernel MCE decoding enabled.
...
[ 79.294568] [Hardware Error]: Uncorrected, software containable error.
[ 79.294587] [Hardware Error]: CPU:0 (15:30:0)
MC0_STATUS[-|UE|-|-|-|-|-]: 0xa000000000010f0f
[ 79.294602] [Hardware Error]: cache level: L3/GEN, mem/io: GEN,
mem-tx: GEN, part-proc: GEN (timed out)

Thanks,
-Aravind