Hi Borislav,
while doing some testing with an EPYC box I noticed
these strange messages when a Node is disabled.
I really do think the message is confusing, since
we print 'Node X: ... disabled' followed by an
INFO from the EDAC driver which says the driver will not load.
Even worse, we then suggest using ecc_enable_override,
which can cause worse things. We really should not suggest
something like this by default.
So why is this still needed?
I think it is clear what 'Node X: .... disabled' means?
Also, if it is still needed, I suggest changing it a bit, like:
static const char *ecc_msg =
"No ECC capability or ECC disabled in BIOS, module will not load.\n";
then add the node to the amd64_info() call:
....
if (!ecc_en || !nb_mce_en) {
amd64_info("Node %d: %s", nid, ecc_msg);
....
Or move all of that to edac_dbg()?
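To make the suggestion concrete, here is a minimal userspace sketch of how the node-prefixed line would come out. It is not the driver code: format_node_msg() is a hypothetical stand-in for amd64_info() that writes into a buffer instead of the kernel log, and the "EDAC amd64: " prefix is simply copied from the log output, assuming the real macro adds it.

```c
#include <stdio.h>

/* Proposed wording from this mail, without the override hint. */
static const char *ecc_msg =
	"No ECC capability or ECC disabled in BIOS, module will not load.\n";

/*
 * Hypothetical stand-in for amd64_info(): formats the line into buf
 * instead of printing to the kernel log. Returns snprintf()'s result.
 */
static int format_node_msg(char *buf, size_t len, int nid, const char *msg)
{
	return snprintf(buf, len, "EDAC amd64: Node %d: %s", nid, msg);
}
```

With nid = 6 the buffer holds one line that names the node and the reason together, so the two messages no longer read as unrelated.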
Regards,
Gabriel C
On Thu, Jan 11, 2018 at 12:06:49AM +0100, Gabriel C wrote:
> while doing some testing with an EPYC box I noticed
> these strange messages when a Node is disabled.
>
> I really do think the message is confusing, since
> we print 'Node X: ... disabled' followed by an
> INFO from the EDAC driver which says the driver will not load.
And that is confusing because?
> Even worse, we then suggest using ecc_enable_override,
> which can cause worse things. We really should not suggest
> something like this by default.
That is a remnant from the old times. Family 0x17 and newer don't
allow that anymore.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On 11.01.2018 00:12, Borislav Petkov wrote:
> On Thu, Jan 11, 2018 at 12:06:49AM +0100, Gabriel C wrote:
>> while doing some testing with an EPYC box I noticed
>> these strange messages when a Node is disabled.
>>
>> I really do think the message is confusing, since
>> we print 'Node X: ... disabled' followed by an
>> INFO from the EDAC driver which says the driver will not load.
>
> And that is confusing because?
Because we see the following:
[ 4.694948] EDAC amd64: Node 6: DRAM ECC disabled.
[ 4.694949] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
The first one says the Node is disabled, the second is a
KERN_INFO message saying the *module* will not load.
Saying then that the *module* will not load for 'this Node' should be clear to everyone.
Don't get me wrong, to me it is clear what this means. I don't need the
second message at all, but I have folks here who didn't understand wth that means.
>
>> Even worse, we then suggest using ecc_enable_override,
>> which can cause worse things. We really should not suggest
>> something like this by default.
>
> That is a remnant from the old times. Family 0x17 and newer don't
> allow that anymore.
>
So do we need a < fam17h check for that message then?
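Roughly what I have in mind, as a standalone sketch only. Both helper names are hypothetical and nothing here is taken from the driver; the only assumption is the statement above that family 0x17 and newer do not allow the override.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true only for families older than 0x17. */
static bool override_hint_applies(unsigned int family)
{
	/* Assumption from this thread: 0x17 and newer ignore the override. */
	return family < 0x17;
}

/* Hypothetical: returns the extra hint for older families, NULL otherwise. */
static const char *override_hint(unsigned int family)
{
	if (!override_hint_applies(family))
		return NULL;

	return "Either enable ECC checking or force module loading by setting "
	       "'ecc_enable_override'.\n";
}
```

A caller would print the base message unconditionally and append the hint only when override_hint() returns non-NULL.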
On Thu, Jan 11, 2018 at 12:31:08AM +0100, Gabriel C wrote:
> Because we see the following:
>
> [ 4.694948] EDAC amd64: Node 6: DRAM ECC disabled.
> [ 4.694949] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
> Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
> (Note that use of the override may cause unknown side effects.)
>
> The first one says the Node is disabled
The first one says *DRAM ECC* is disabled on that node - not the node
itself. Looks like it confuses you too.
> the second is a
> KERN_INFO message saying the *module* will not load.
>
> Saying then that the *module* will not load for 'this Node' should be clear to everyone.
So this is a purely informational message. There are a lot of messages
like that in the kernel. I still don't understand what your problem is
with this particular one.
> Don't get me wrong, to me it is clear what this means. I don't need the
> second message at all, but I have folks here who didn't understand wth that means.
"ECC disabled in the BIOS or no ECC capability, module will not load." -
I think that sentence explains the situation pretty well:
either ECC checking is disabled in the BIOS
or
ECC capability cannot be detected.
What do you think it should say instead?
> So do we need a < fam17h check for that message then?
That message says:
"Either enable ECC checking or ..."
That first part refers to the user going into the BIOS and enabling ECC
checking. It says so above too: "ECC disabled in the BIOS".
--
Regards/Gruss,
Boris.
On 11.01.2018 00:45, Borislav Petkov wrote:
> On Thu, Jan 11, 2018 at 12:31:08AM +0100, Gabriel C wrote:
>> Because we see the following:
>>
>> [ 4.694948] EDAC amd64: Node 6: DRAM ECC disabled.
>> [ 4.694949] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>> Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
>> (Note that use of the override may cause unknown side effects.)
>>
>> The first one says the Node is disabled
>
> The first one says *DRAM ECC* is disabled on that node - not the node
> itself. Looks like it confuses you too.
Yes, ofc :) that is what I meant :)
>
>> the second is a
>> KERN_INFO message saying the *module* will not load.
>>
>> Saying then that the *module* will not load for 'this Node' should be clear to everyone.
>
> So this is a purely informational message. There are a lot of messages
> like that in the kernel. I still don't understand what your problem is
> with this particular one.
Nothing against messages, but I get this one twice for each node that is disabled,
with an INFO I partially don't even need on that platform.
>
>> Don't get me wrong, to me it is clear what this means. I don't need the
>> second message at all, but I have folks here who didn't understand wth that means.
>
> "ECC disabled in the BIOS or no ECC capability, module will not load." -
> I think that sentence explains the situation pretty well:
>
> either ECC checking is disabled in the BIOS
>
> or
>
> ECC capability cannot be detected.
>
> What do you think it should say instead?
The message is fine if you add a Node prefix to it and leave
that ecc_enable_override stuff out.
Something like this looks better and less confusing, don't you think?
EDAC amd64: Node 4: DRAM ECC disabled.
EDAC amd64: Node 4: DRAM ECC disabled in the BIOS or no ECC capability, module will not load.
EDAC amd64: Node 5: DRAM ECC enabled.
EDAC amd64: F17h detected (node 5).
EDAC MC: UMC0 chip selects:
....