LinuxLists.cc - [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering

2021-03-26 18:35:19

Subject: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering

From: William Roche <[email protected]>

The Corrected Error events collected by the cec_add_elem() have to be
consistently filtered out.
We fix the case where the value of find_elem() to find the slot of a pfn
was mistakenly used as the return value of the function.
Now the MCE notifiers chain relying on MCE_HANDLED_CEC would only report
filtered corrected errors that reached the action threshold.

Signed-off-by: William Roche <[email protected]>
---

Notes:
Some machines are reporting Corrected Errors events without any
information about a PFN Soft-offlining or Invalid pfn (report given by
the EDAC module or the mcelog daemon).

A research showed that it reflected the first occurrence of a CE error
on the system which should have been filtered by the RAS_CEC component.
We could also notice that if 2 PFNs are impacted by CE errors, the PFN
on the non-zero slot gets its CE errors reported every-time instead of
being filtered out.

This problem has appeared with the introduction of commit
de0e0624d86ff9fc512dedb297f8978698abf21a where the filtering logic has
been modified.

Could you please review this small suggested fix ?

Thanks in advance for any feedback you could have.
William.

drivers/ras/cec.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index ddecf25..fdb9762 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -313,7 +313,7 @@ static int cec_add_elem(u64 pfn)
{
struct ce_array *ca = &ce_arr;
unsigned int to = 0;
- int count, ret = 0;
+ int count, ret;

/*
* We can be called very early on the identify_cpu() path where we are
@@ -372,6 +372,9 @@ static int cec_add_elem(u64 pfn)
goto unlock;
}

+ /* action threshold not reached */
+ ret = 0;
+
ca->decay_count++;

if (ca->decay_count >= CLEAN_ELEMS)
--
1.8.3.1

2021-03-26 19:07:04

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering

On Fri, Mar 26, 2021 at 02:30:29PM -0400, “William Roche wrote:
> From: William Roche <[email protected]>
>
> The Corrected Error events collected by the cec_add_elem() have to be
> consistently filtered out.
> We fix the case where the value of find_elem() to find the slot of a pfn
> was mistakenly used as the return value of the function.
> Now the MCE notifiers chain relying on MCE_HANDLED_CEC would only report
> filtered corrected errors that reached the action threshold.
>
> Signed-off-by: William Roche <[email protected]>
> ---
>
> Notes:
> Some machines are reporting Corrected Errors events without any
> information about a PFN Soft-offlining or Invalid pfn (report given by
> the EDAC module or the mcelog daemon).
>
> A research showed that it reflected the first occurrence of a CE error
> on the system which should have been filtered by the RAS_CEC component.
> We could also notice that if 2 PFNs are impacted by CE errors, the PFN
> on the non-zero slot gets its CE errors reported every-time instead of
> being filtered out.
>
> This problem has appeared with the introduction of commit
> de0e0624d86ff9fc512dedb297f8978698abf21a where the filtering logic has
> been modified.
>
> Could you please review this small suggested fix ?
>
> Thanks in advance for any feedback you could have.
> William.
>
> drivers/ras/cec.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)

AFAIU, I think you want something like the below untested hunk:

You set it to 0 when it cannot find an element and that "ret = 1" we can
remove because callers don't care about the offlining threshold - the
only caller that looks at its retval wants to know whether it added the
VA successfully to note that it handled the error.

Makes sense?

---
diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index ddecf25b5dd4..a29994d726d8 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -341,6 +341,8 @@ static int cec_add_elem(u64 pfn)

ca->array[to] = pfn << PAGE_SHIFT;
ca->n++;
+
+ ret = 0;
}

/* Add/refresh element generation and increment count */
@@ -363,12 +365,6 @@ static int cec_add_elem(u64 pfn)

del_elem(ca, to);

- /*
- * Return a >0 value to callers, to denote that we've reached
- * the offlining threshold.
- */
- ret = 1;
-
goto unlock;
}
---

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-03-26 22:27:05

by William Roche

[permalink] [raw]

Subject: Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering

On 26/03/2021 20:02, Borislav Petkov wrote:
> On Fri, Mar 26, 2021 at 02:30:29PM -0400, “William Roche wrote:
>> From: William Roche <[email protected]>
>>
>> The Corrected Error events collected by the cec_add_elem() have to be
>> consistently filtered out.
>> We fix the case where the value of find_elem() to find the slot of a pfn
>> was mistakenly used as the return value of the function.
>> Now the MCE notifiers chain relying on MCE_HANDLED_CEC would only report
>> filtered corrected errors that reached the action threshold.
>>
>> Signed-off-by: William Roche <[email protected]>
>> ---
>>
>> Notes:
>> Some machines are reporting Corrected Errors events without any
>> information about a PFN Soft-offlining or Invalid pfn (report given by
>> the EDAC module or the mcelog daemon).
>>
>> A research showed that it reflected the first occurrence of a CE error
>> on the system which should have been filtered by the RAS_CEC component.
>> We could also notice that if 2 PFNs are impacted by CE errors, the PFN
>> on the non-zero slot gets its CE errors reported every-time instead of
>> being filtered out.
>>
>> This problem has appeared with the introduction of commit
>> de0e0624d86ff9fc512dedb297f8978698abf21a where the filtering logic has
>> been modified.
>>
>> Could you please review this small suggested fix ?
>>
>> Thanks in advance for any feedback you could have.
>> William.
>>
>> drivers/ras/cec.c | 5 ++++-
>> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> AFAIU, I think you want something like the below untested hunk:
>
> You set it to 0 when it cannot find an element and that "ret = 1" we can
> remove because callers don't care about the offlining threshold - the
> only caller that looks at its retval wants to know whether it added the
> VA successfully to note that it handled the error.
>
> Makes sense?

What we want is to make cec_add_elem() to return !0 value only
when the given pfn triggered an action, so that its callers should
log the error.

This is what I understand from the code comment :

* When an element reaches it's max count of action_threshold, we try
to poison
* it by assuming that errors triggered action_threshold times in a
single page
* are excessive and that page shouldn't be used anymore.
action_threshold is
* initialized to COUNT_MASK which is the maximum.
*
* That error event entry causes cec_add_elem() to return !0 value and thus
* signal to its callers to log the error.

This is not what I understand from your the below suggestion, as
find_elem returns the index of the pfn if found.

What I'm expecting from ras_cec is to "hide" CEs until they reach the
action threshold where an action is tried against the impacted PFN, and
it's now the time to log the error with the entire notifiers chain.

Am I missing something ?

>
> ---
> diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
> index ddecf25b5dd4..a29994d726d8 100644
> --- a/drivers/ras/cec.c
> +++ b/drivers/ras/cec.c
> @@ -341,6 +341,8 @@ static int cec_add_elem(u64 pfn)
>
> ca->array[to] = pfn << PAGE_SHIFT;
> ca->n++;
> +
> + ret = 0;
> }
>
> /* Add/refresh element generation and increment count */
> @@ -363,12 +365,6 @@ static int cec_add_elem(u64 pfn)
>
> del_elem(ca, to);
>
> - /*
> - * Return a >0 value to callers, to denote that we've reached
> - * the offlining threshold.
> - */
> - ret = 1;
> -
> goto unlock;
> }
> ---
>
> Thx.

Thanks again for your feedback.
W.

2021-03-26 22:44:40

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering

On Fri, Mar 26, 2021 at 11:24:43PM +0100, William Roche wrote:
> What we want is to make cec_add_elem() to return !0 value only
> when the given pfn triggered an action, so that its callers should
> log the error.

No, this is not what the CEC does - it collects those errors and when it
reaches the threshold for any pfn, it offlines the corresponding page. I
know, the comment above talks about:

* That error event entry causes cec_add_elem() to return !0 value and thus
* signal to its callers to log the error.

but it doesn't do that. Frankly, I don't see the point of logging the
error - it already says

pr_err("Soft-offlining pfn: 0x%llx\n", pfn);

which pfn it has offlined. And that is probably only mildly interesting
to people - so what, 4K got offlined, servers have so much memory
nowadays.

The only moment one should start worrying is if one gets those pretty
often but then you're probably better off simply scheduling maintenance
and replacing the faulty DIMM - problem solved.

> What I'm expecting from ras_cec is to "hide" CEs until they reach the
> action threshold where an action is tried against the impacted PFN,

That it does.

> and it's now the time to log the error with the entire notifiers
> chain.

And I'm not sure why we'd want to do that. It simply offlines the page.

But maybe you could explain what you're trying to achieve...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-03-29 09:46:26

by William Roche

[permalink] [raw]

Subject: Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering

On 26/03/2021 23:43, Borislav Petkov wrote:
> On Fri, Mar 26, 2021 at 11:24:43PM +0100, William Roche wrote:
>> What we want is to make cec_add_elem() to return !0 value only
>> when the given pfn triggered an action, so that its callers should
>> log the error.
>
> No, this is not what the CEC does - it collects those errors and when it
> reaches the threshold for any pfn, it offlines the corresponding page. I
> know, the comment above talks about:
>
> * That error event entry causes cec_add_elem() to return !0 value and thus
> * signal to its callers to log the error.
>
> but it doesn't do that. Frankly, I don't see the point of logging the
> error - it already says
>
> pr_err("Soft-offlining pfn: 0x%llx\n", pfn);
>
> which pfn it has offlined. And that is probably only mildly interesting
> to people - so what, 4K got offlined, servers have so much memory
> nowadays.
>
> The only moment one should start worrying is if one gets those pretty
> often but then you're probably better off simply scheduling maintenance
> and replacing the faulty DIMM - problem solved.

I totally agree with you, and in order to schedule a replacement, MCEs
information (enriched by the notifiers chain) are more meaningful than
only PFN values.

>
>> What I'm expecting from ras_cec is to "hide" CEs until they reach the
>> action threshold where an action is tried against the impacted PFN,
>
> That it does.
>
>> and it's now the time to log the error with the entire notifiers
>> chain.
>
> And I'm not sure why we'd want to do that. It simply offlines the page.
>
> But maybe you could explain what you're trying to achieve...

My little fix proposition as 2 goals:

1/ Giving back ras_cec a consistent behavior where the first occurrence
of a CE doesn't generate an MCE message from the MCE_HANDLED_CEC
notifiers, and a consistent behavior between the slot 0 and the other
pfn slots.

2/ Give the CE MCE information when the action threshold is reached to
help the administrator identify what generated the PFN "Soft-offlining"
or "Invalid pfn" message.

When ras_cec is enabled it hides most of the CE errors, but when the
action threshold is reached all notifiers can generate their indication
about the error that appeared too often.

An administrator getting too many action threshold CE errors can
schedule a replacement based on the indications provided by his EDAC
module etc...

>
> Thx.
>

I think it could be more useful this way than systematically hiding all CEs.

But this decision is yours :)

HTH.

2021-04-01 18:12:47

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering

On Mon, Mar 29, 2021 at 11:44:05AM +0200, William Roche wrote:
> I totally agree with you, and in order to schedule a replacement, MCEs
> information (enriched by the notifiers chain) are more meaningful than
> only PFN values.

Well, if you want to collect errors and analyze patterns in order to
detect hw going bad, you're probably better off disabling the CEC
altogether - either disable it in Kconfig or boot with ras=cec_disable.

> 1/ Giving back ras_cec a consistent behavior where the first occurrence
> of a CE doesn't generate an MCE message from the MCE_HANDLED_CEC
> notifiers, and a consistent behavior between the slot 0 and the other
> pfn slots.

If by this you mean the issue with the return value, then sure.

If you mean something else, you'd have to be more specific.

> 2/ Give the CE MCE information when the action threshold is reached to
> help the administrator identify what generated the PFN "Soft-offlining"
> or "Invalid pfn" message.
>
> When ras_cec is enabled it hides most of the CE errors, but when the
> action threshold is reached all notifiers can generate their indication
> about the error that appeared too often.
>
> An administrator getting too many action threshold CE errors can
> schedule a replacement based on the indications provided by his EDAC
> module etc...

Well, this works probably only in theory.

First of all, the CEC sees the error first, before the EDAC drivers.

But, in order to map from the virtual address to the actual DIMM, you
need the EDAC drivers to have a go at the error. In many cases not even
the EDAC drivers can give you that mapping because, well, hw/fw does its
own stuff underneath, predictive fault bla, added value crap, whatever,
so that we can't even get a "DIMM X on processor Y caused the error."

I know, your assumption is that if a page gets offlined by the CEC, then
all the errors' addresses are coming from the same physical DIMM. And
that is probably correct in most cases but I'm not convinced for all.

In any case, what we could do - which is pretty easy and cheap - is to
fix the retval of cec_add_elem() to communicate to the caller that it
offlined a page and this way tell the notifier chain that the error
needs to be printed into dmesg with a statement sayin that DIMM Y got
just one more page offlined.

Over time, if a DIMM is going bad, one should be able to grep dmesg and
correlate all those offlined pages to DIMMs and then maybe see a pattern
and eventually schedule a downtime.

A lot of ifs, I know. :-\

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-04-02 16:02:16

by William Roche

[permalink] [raw]

Subject: Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering

On 01/04/2021 18:12, Borislav Petkov wrote:
> On Mon, Mar 29, 2021 at 11:44:05AM +0200, William Roche wrote:
>> I totally agree with you, and in order to schedule a replacement, MCEs
>> information (enriched by the notifiers chain) are more meaningful than
>> only PFN values.
>
> Well, if you want to collect errors and analyze patterns in order to
> detect hw going bad, you're probably better off disabling the CEC
> altogether - either disable it in Kconfig or boot with ras=cec_disable.

Corrected Errors are not the best indicators for a failing DIMM, and I
agree that enabling ras_cec is a good thing to have in production.

>
>> 1/ Giving back ras_cec a consistent behavior where the first occurrence
>> of a CE doesn't generate an MCE message from the MCE_HANDLED_CEC
>> notifiers, and a consistent behavior between the slot 0 and the other
>> pfn slots.
>
> If by this you mean the issue with the return value, then sure.
>
> If you mean something else, you'd have to be more specific.

No I just want to fix the return value.
And I expressed the consequences of this fix.

>
>> 2/ Give the CE MCE information when the action threshold is reached to
>> help the administrator identify what generated the PFN "Soft-offlining"
>> or "Invalid pfn" message.
>>
>> When ras_cec is enabled it hides most of the CE errors, but when the
>> action threshold is reached all notifiers can generate their indication
>> about the error that appeared too often.
>>
>> An administrator getting too many action threshold CE errors can
>> schedule a replacement based on the indications provided by his EDAC
>> module etc...
>
> Well, this works probably only in theory.
>
> First of all, the CEC sees the error first, before the EDAC drivers.
>
> But, in order to map from the virtual address to the actual DIMM, you
> need the EDAC drivers to have a go at the error. In many cases not even
> the EDAC drivers can give you that mapping because, well, hw/fw does its
> own stuff underneath, predictive fault bla, added value crap, whatever,
> so that we can't even get a "DIMM X on processor Y caused the error."
>
> I know, your assumption is that if a page gets offlined by the CEC, then
> all the errors' addresses are coming from the same physical DIMM. And
> that is probably correct in most cases but I'm not convinced for all.
>
> In any case, what we could do - which is pretty easy and cheap - is to
> fix the retval of cec_add_elem() to communicate to the caller that it
> offlined a page and this way tell the notifier chain that the error
> needs to be printed into dmesg with a statement sayin that DIMM Y got
> just one more page offlined.

Let's just fix cec_add_elem() with this patch first.

I would love to have a mechanism indicating what physical DIMM had the
page off-lined but I know, as you said earlier, that "hw/fw does its
own stuff underneath" and that we may not have the right information in
the kernel.

>
> Over time, if a DIMM is going bad, one should be able to grep dmesg and
> correlate all those offlined pages to DIMMs and then maybe see a pattern
> and eventually schedule a downtime.
>
> A lot of ifs, I know. :-\
>

For the moment we will have the CE MCE handled my the MCE_HANDLED_CEC
aware notifiers only when a page is off-lined, like it used to be.

Can we start with that small fix ?

2021-04-02 17:08:28

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH v1] RAS/CEC: Memory Corrected Errors consistent event filtering

On Fri, Apr 02, 2021 at 06:00:42PM +0200, William Roche wrote:
> Corrected Errors are not the best indicators for a failing DIMM

In the OS, errors reported through different mechanisms is all we have.

> For the moment we will have the CE MCE handled my the MCE_HANDLED_CEC
> aware notifiers only when a page is off-lined, like it used to be.
>
> Can we start with that small fix ?

Sure but do two variables pls - an "err" one which catches the
function's retval and a "ret" one which ce_add_elem() itself returns so
that there's no confusion like it was before:

---
diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index ddecf25b5dd4..b926c679cdaf 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -312,8 +312,8 @@ static bool sanity_check(struct ce_array *ca)
static int cec_add_elem(u64 pfn)
{
struct ce_array *ca = &ce_arr;
+ int count, err, ret = 0;
unsigned int to = 0;
- int count, ret = 0;

/*
* We can be called very early on the identify_cpu() path where we are
@@ -330,8 +330,8 @@ static int cec_add_elem(u64 pfn)
if (ca->n == MAX_ELEMS)
WARN_ON(!del_lru_elem_unlocked(ca));

- ret = find_elem(ca, pfn, &to);
- if (ret < 0) {
+ err = find_elem(ca, pfn, &to);
+ if (err < 0) {
/*
* Shift range [to-end] to make room for one more element.
*/

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-04-07 06:45:49

by William Roche

[permalink] [raw]

Subject: [PATCH v2] RAS/CEC: Memory Corrected Errors consistent event filtering

From: William Roche <[email protected]>

The Corrected Error events collected by the cec_add_elem() have to be
consistently filtered out.
We fix the case where the value of find_elem() to find the slot of a pfn
was mistakenly used as the return value of the function.
Now the MCE notifiers chain relying on MCE_HANDLED_CEC would only report
filtered corrected errors that reached the action threshold.

Signed-off-by: William Roche <[email protected]>
---

Notes:
This is the new patch version using an additional 'err' variable.
Unit tested it on a VM instance and a "Bare Metal" machine.

No reporting is done by the MCE_HANDLED_CEC aware notifiers until
the action threshold is reached.

drivers/ras/cec.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index ddecf25..b926c67 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -312,8 +312,8 @@ static bool sanity_check(struct ce_array *ca)
static int cec_add_elem(u64 pfn)
{
struct ce_array *ca = &ce_arr;
+ int count, err, ret = 0;
unsigned int to = 0;
- int count, ret = 0;

/*
* We can be called very early on the identify_cpu() path where we are
@@ -330,8 +330,8 @@ static int cec_add_elem(u64 pfn)
if (ca->n == MAX_ELEMS)
WARN_ON(!del_lru_elem_unlocked(ca));

- ret = find_elem(ca, pfn, &to);
- if (ret < 0) {
+ err = find_elem(ca, pfn, &to);
+ if (err < 0) {
/*
* Shift range [to-end] to make room for one more element.
*/
--
1.8.3.1

2021-04-07 20:44:06

by tip-bot2 for Alexey Makhalov

[permalink] [raw]

Subject: [tip: x86/urgent] RAS/CEC: Correct ce_add_elem()'s returned values

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 3a62583c2853b0ab37a57dde79decea210b5fb89
Gitweb: https://git.kernel.org/tip/3a62583c2853b0ab37a57dde79decea210b5fb89
Author: William Roche <[email protected]>
AuthorDate: Tue, 06 Apr 2021 11:28:59 -04:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Wed, 07 Apr 2021 11:52:26 +02:00

RAS/CEC: Correct ce_add_elem()'s returned values

ce_add_elem() uses different return values to signal a result from
adding an element to the collector. Commit in Fixes: broke the case
where the element being added is not found in the array. Correct that.

[ bp: Rewrite commit message, add kernel-doc comments. ]

Fixes: de0e0624d86f ("RAS/CEC: Check count_threshold unconditionally")
Signed-off-by: William Roche <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
drivers/ras/cec.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index ddecf25..d7894f1 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -309,11 +309,20 @@ static bool sanity_check(struct ce_array *ca)
return ret;
}

+/**
+ * cec_add_elem - Add an element to the CEC array.
+ * @pfn: page frame number to insert
+ *
+ * Return values:
+ * - <0: on error
+ * - 0: on success
+ * - >0: when the inserted pfn was offlined
+ */
static int cec_add_elem(u64 pfn)
{
struct ce_array *ca = &ce_arr;
+ int count, err, ret = 0;
unsigned int to = 0;
- int count, ret = 0;

/*
* We can be called very early on the identify_cpu() path where we are
@@ -330,8 +339,8 @@ static int cec_add_elem(u64 pfn)
if (ca->n == MAX_ELEMS)
WARN_ON(!del_lru_elem_unlocked(ca));

- ret = find_elem(ca, pfn, &to);
- if (ret < 0) {
+ err = find_elem(ca, pfn, &to);
+ if (err < 0) {
/*
* Shift range [to-end] to make room for one more element.
*/