2018-01-10 23:06:58

by Gabriel C

Subject: EDAC-AMD64: what is the ecc_msg good for ?


Hi Borislav,

While doing some testing with an EPYC box I noticed
these strange messages when a Node is disabled.

I really do think the message is confusing, since
we say 'Node X: ... disabled' followed by an
INFO from the EDAC driver which says the driver will not load.

Even worse, we then suggest using ecc_enable_override,
which can cause worse things. We really should not suggest
something like this by default.

So why is this still needed?
I think it is clear what 'Node X: .... disabled' means?

Also, if it is still needed, I suggest changing it a bit, like:


static const char *ecc_msg =
	"No ECC capability or ECC disabled in BIOS, module will not load.\n";


then add the node to the amd64_info()

....

if (!ecc_en || !nb_mce_en) {
	amd64_info("Node %d: %s", nid, ecc_msg);
....

Or move all of that to edac_dbg()?
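
A minimal userspace sketch of the per-node variant suggested above. It only
mimics what amd64_info() would print: the helper name format_node_ecc_msg()
is hypothetical, and the "EDAC amd64: " prefix is assumed from the log output
rather than taken from the driver source.

```c
#include <assert.h>
#include <stdio.h>

/* Shortened message text as proposed, without the override hint. */
static const char *ecc_msg =
	"No ECC capability or ECC disabled in BIOS, module will not load.\n";

/*
 * Userspace stand-in for amd64_info() (hypothetical helper): writes the
 * assumed "EDAC amd64: " prefix followed by the per-node message into buf.
 * Returns what snprintf() returns, i.e. the length of the full message.
 */
static int format_node_ecc_msg(char *buf, size_t len, int nid)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "EDAC amd64: Node %d: %s", nid, ecc_msg);
}
```

Calling format_node_ecc_msg(buf, sizeof(buf), 6) would put the node number
in front of the message, so each line can be attributed to one node.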


Regards,

Gabriel C


2018-01-10 23:13:02

by Borislav Petkov

Subject: Re: EDAC-AMD64: what is the ecc_msg good for ?

On Thu, Jan 11, 2018 at 12:06:49AM +0100, Gabriel C wrote:
> While doing some testing with an EPYC box I noticed
> these strange messages when a Node is disabled.
>
> I really do think the message is confusing, since
> we say 'Node X: ... disabled' followed by an
> INFO from the EDAC driver which says the driver will not load.

And that is confusing because?

> Even worse, we then suggest using ecc_enable_override,
> which can cause worse things. We really should not suggest
> something like this by default.

That is a remnant from the old times. Family 0x17 and newer don't
allow that anymore.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-01-10 23:31:12

by Gabriel C

Subject: Re: EDAC-AMD64: what is the ecc_msg good for ?

On 11.01.2018 00:12, Borislav Petkov wrote:
> On Thu, Jan 11, 2018 at 12:06:49AM +0100, Gabriel C wrote:
>> While doing some testing with an EPYC box I noticed
>> these strange messages when a Node is disabled.
>>
>> I really do think the message is confusing, since
>> we say 'Node X: ... disabled' followed by an
>> INFO from the EDAC driver which says the driver will not load.
>
> And that is confusing because?

Because we see the following:

[ 4.694948] EDAC amd64: Node 6: DRAM ECC disabled.
[ 4.694949] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)

The first one says the Node is disabled; the second is a
KERN_INFO message saying the *module* will not load.

Saying that the *module* will not load for 'this Node' would be clear to everyone.

Don't get me wrong, for me it is clear what this means. I don't need the
second message at all, but I have folks here who didn't understand what that means.

>
>> Even worse, we then suggest using ecc_enable_override,
>> which can cause worse things. We really should not suggest
>> something like this by default.
>
> That is a remnant from the old times. Family 0x17 and newer don't
> allow that anymore.
>

So do we need a < fam17h check for that message then?
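
As a rough userspace sketch of what such a check could look like: the
function name format_ecc_disabled_msg() is hypothetical, and the 0x17 cutoff
is an assumption based only on the "Family 0x17 and newer" remark above, not
on the driver source. The message strings are the ones from the log output.

```c
#include <assert.h>
#include <stdio.h>

/* First part of the current message, as seen in the log output. */
static const char *base_msg =
	"ECC disabled in the BIOS or no ECC capability, module will not load.\n";

/* Override hint, only meaningful where the override is still allowed. */
static const char *override_hint =
	"Either enable ECC checking or force module loading by setting "
	"'ecc_enable_override'.\n";

/*
 * Hypothetical helper: formats the message into buf and appends the
 * override hint only for families older than 0x17 (assumed cutoff).
 * Returns what snprintf() returns, i.e. the length of the full message.
 */
static int format_ecc_disabled_msg(char *buf, size_t len, unsigned int family)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%s%s", base_msg,
			family < 0x17 ? override_hint : "");
}
```

With this, a family 0x15 system would still get the hint, while a family
0x17 system would only see the first sentence.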

2018-01-10 23:45:40

by Borislav Petkov

Subject: Re: EDAC-AMD64: what is the ecc_msg good for ?

On Thu, Jan 11, 2018 at 12:31:08AM +0100, Gabriel C wrote:
> Because we see the following:
>
> [ 4.694948] EDAC amd64: Node 6: DRAM ECC disabled.
> [ 4.694949] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
> Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
> (Note that use of the override may cause unknown side effects.)
>
> The first one says the Node is disabled

The first one says *DRAM ECC* is disabled on that node - not the node
itself. Looks like it confuses you too.

> the second is a
> KERN_INFO message saying the *module* will not load.
>
> Saying that the *module* will not load for 'this Node' would be clear to everyone.

So this is a purely informational message. There's a lot of messages
like that in the kernel. I still don't understand what your problem is
with this particular one.

> Don't get me wrong, for me it is clear what this means. I don't need the
> second message at all, but I have folks here who didn't understand what that means.

"ECC disabled in the BIOS or no ECC capability, module will not load." -
I think that sentence explains the situation pretty well:

either ECC checking is disabled in the BIOS

or

ECC capability cannot be detected.

What do you think it should say instead?

> So do we need a < fam17h check for that message then?

That message says:

"Either enable ECC checking or ..."

That first part refers to the user going into the BIOS and enabling ECC
checking. It says so above too: "ECC disabled in the BIOS".

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-01-11 00:27:21

by Gabriel C

Subject: Re: EDAC-AMD64: what is the ecc_msg good for ?

On 11.01.2018 00:45, Borislav Petkov wrote:
> On Thu, Jan 11, 2018 at 12:31:08AM +0100, Gabriel C wrote:
>> Because we see the following:
>>
>> [ 4.694948] EDAC amd64: Node 6: DRAM ECC disabled.
>> [ 4.694949] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>> Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
>> (Note that use of the override may cause unknown side effects.)
>>
>> The first one says the Node is disabled
>
> The first one says *DRAM ECC* is disabled on that node - not the node
> itself. Looks like it confuses you too.

Yes, of course :) that is what I meant :)

>
>> the second is a
>> KERN_INFO message saying the *module* will not load.
>>
>> Saying that the *module* will not load for 'this Node' would be clear to everyone.
>
> So this is a purely informational message. There's a lot of messages
> like that in the kernel. I still don't understand what your problem is
> with this particular one.

Nothing against messages, but I get this one twice for each node that is disabled,
with an INFO I partially don't even need on that platform.

>
>> Don't get me wrong, for me it is clear what this means. I don't need the
>> second message at all, but I have folks here who didn't understand what that means.
>
> "ECC disabled in the BIOS or no ECC capability, module will not load." -
> I think that sentence explains the situation pretty well:
>
> either ECC checking is disabled in the BIOS
>
> or
>
> ECC capability cannot be detected.
>
> What do you think it should say instead?

The message is fine if you add a Node prefix to it and leave
that ecc_enable_override stuff out.

Something like this looks better and less confusing, don't you think?

EDAC amd64: Node 4: DRAM ECC disabled.
EDAC amd64: Node 4: DRAM ECC disabled in the BIOS or no ECC capability, module will not load.
EDAC amd64: Node 5: DRAM ECC enabled.
EDAC amd64: F17h detected (node 5).
EDAC MC: UMC0 chip selects:
....