2023-10-11 16:34:13

by Filippo Sironi

[permalink] [raw]
Subject: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8 pages

On some of our large servers and some of our most sorry servers ( :) ),
we're seeing the kernel reporting the warning in mce_gen_pool_add: "MCE
records pool full!". Let's increase the amount of memory that we use to
store the MCE records from 2 to 8 pages to prevent this from happening
and be able to collect useful information.

Signed-off-by: Filippo Sironi <[email protected]>
---
arch/x86/kernel/cpu/mce/genpool.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..870dcb7011cd 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -17,9 +17,9 @@
*
* This memory pool is only to be used to save MCE records in MCE context.
* MCE events are rare, so a fixed size memory pool should be enough. Use
- * 2 pages to save MCE events for now (~80 MCE records at most).
+ * 8 pages to save MCE events for now (~320 MCE records at most).
*/
-#define MCE_POOLSZ (2 * PAGE_SIZE)
+#define MCE_POOLSZ (8 * PAGE_SIZE)

static struct gen_pool *mce_evt_pool;
static LLIST_HEAD(mce_event_llist);
--
2.33.0




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




2023-10-11 17:33:04

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8 pages

On 10/11/23 09:33, Filippo Sironi wrote:
> On some of our large servers and some of our most sorry servers ( ???? ),
> we're seeing the kernel reporting the warning in mce_gen_pool_add: "MCE
> records pool full!". Let's increase the amount of memory that we use to
> store the MCE records from 2 to 8 pages to prevent this from happening
> and be able to collect useful information.

MCE_POOLSZ is used to size gen_pool_buf[] which was a line out of your
diff context:

> #define MCE_POOLSZ (2 * PAGE_SIZE)
>
> static struct gen_pool *mce_evt_pool;
> static LLIST_HEAD(mce_event_llist);
> static char gen_pool_buf[MCE_POOLSZ];

That's in .bss which means it eats up memory for *everyone*. It seems a
little silly to eat up an extra 6 pages of memory for *everyone* in
order to get rid of a message on what I assume is a relatively small set
of "sorry servers".

Is there any way that the size of the pool can be more automatically
determined? Is the likelihood of a bunch errors proportional to the
number of CPUs or amount of RAM or some other aspect of the hardware?

Could the pool be emptied more aggressively so that it does not fill up?

Last, what is the _actual_ harm caused by missing this "useful
information"? Is collecting that information collectively really worth
24kb*NR_X86_SYSTEMS_ON_EARTH? Is it really that valuable to know that
the system got 4,000 ECC errors on a DIMM versus 1,000?

If there's no other choice and this extra information is *CRITICAL*,
then by all means let's enlarge the buffer. But, let's please do it for
a known, tangible benefit.

2023-10-12 11:46:50

by Filippo Sironi

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8 pages

On 10/11/23 19:33, Dave Hansen wrote:
> On 10/11/23 09:33, Filippo Sironi wrote:
> > On some of our large servers and some of our most sorry servers ( ???? ),
> > we're seeing the kernel reporting the warning in mce_gen_pool_add: "MCE
> > records pool full!". Let's increase the amount of memory that we use to
> > store the MCE records from 2 to 8 pages to prevent this from happening
> > and be able to collect useful information.
>
>
> MCE_POOLSZ is used to size gen_pool_buf[] which was a line out of your
> diff context:
>
>
> > #define MCE_POOLSZ (2 * PAGE_SIZE)
> >
> > static struct gen_pool *mce_evt_pool;
> > static LLIST_HEAD(mce_event_llist);
> > static char gen_pool_buf[MCE_POOLSZ];
>
>
> That's in .bss which means it eats up memory for *everyone*. It seems a
> little silly to eat up an extra 6 pages of memory for *everyone* in
> order to get rid of a message on what I assume is a relatively small set
> of "sorry servers".

There's correlation across the errors that we're seeing, indeed, we're
looking at the same row being responsible for multiple CPUs tripping and
running into #MC. I still don't like the full lack of visibility; it's not
uncommon in a large fleet to see to take a server out of production,
replace a DIMM and shortly after taking it out of production again to
replace another DIMM just because some of the errors weren't properly
logged.

> Is there any way that the size of the pool can be more automatically
> determined? Is the likelihood of a bunch errors proportional to the
> number of CPUs or amount of RAM or some other aspect of the hardware?
>
>
> Could the pool be emptied more aggressively so that it does not fill up?
>
>
> Last, what is the _actual_ harm caused by missing this "useful
> information"? Is collecting that information collectively really worth
> 24kb*NR_X86_SYSTEMS_ON_EARTH? Is it really that valuable to know that
> the system got 4,000 ECC errors on a DIMM versus 1,000?
>
>
> If there's no other choice and this extra information is *CRITICAL*,
> then by all means let's enlarge the buffer. But, let's please do it for
> a known, tangible benefit.

I'm happy to make this a compile time configuration where the default is
still 2 pages, to avoid changing the status quo.

Filippo






Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879


2023-10-12 11:53:48

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8 pages

On Thu, Oct 12, 2023 at 11:46:13AM +0000, Sironi, Filippo wrote:
> I'm happy to make this a compile time configuration where the default is
> still 2 pages, to avoid changing the status quo.

Why don't you keep this patch in your kernels, simply, as this seems to
be an issue only for you currently?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-10-12 15:52:36

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8 pages

On 10/12/23 04:46, Sironi, Filippo wrote:
> There's correlation across the errors that we're seeing, indeed,
> we're looking at the same row being responsible for multiple CPUs
> tripping and running into #MC. I still don't like the full lack of
> visibility; it's not uncommon in a large fleet to see to take a
> server out of production, replace a DIMM and shortly after taking it
> out of production again to replace another DIMM just because some of
> the errors weren't properly logged.

So you had two nearly simultaneous DIMM failures. The first failed,
filled up the buffer and then the second failed, but there was no room.
The second failed *SO* soon after the first that there was no
opportunity to empty the buffer between.

Right?

How do you know that storing 8 pages of records will catch this case as
opposed to storing 2?

>> Is there any way that the size of the pool can be more automatically
>> determined? Is the likelihood of a bunch errors proportional to the
>> number of CPUs or amount of RAM or some other aspect of the hardware?
>>
>> Could the pool be emptied more aggressively so that it does not fill up?

You didn't really address the additional questions I posed there.

I'll add one more: how many of the messages are duplicates or
*effectively* duplicates? Or is that hard to determine at the time that
the entries are being made that they are duplicates?

It _should_ also be fairly easy to enlarge the buffer on demand, say, if
it got half full. What's the time scale over which the buffer filled
up? Did a single #MC fill it up?

I really think we need to understand what the problem is and have _some_
confidence that the proposed solution will fix that, even if we're just
talking about a new config option.

2023-10-16 14:14:48

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8 pages

On 10/12/23 11:49 AM, Dave Hansen wrote:
> On 10/12/23 04:46, Sironi, Filippo wrote:
>> There's correlation across the errors that we're seeing, indeed,
>> we're looking at the same row being responsible for multiple CPUs
>> tripping and running into #MC. I still don't like the full lack of
>> visibility; it's not uncommon in a large fleet to see to take a
>> server out of production, replace a DIMM and shortly after taking it
>> out of production again to replace another DIMM just because some of
>> the errors weren't properly logged.
>
> So you had two nearly simultaneous DIMM failures. The first failed,
> filled up the buffer and then the second failed, but there was no room.
> The second failed *SO* soon after the first that there was no
> opportunity to empty the buffer between.
>
> Right?
>
> How do you know that storing 8 pages of records will catch this case as
> opposed to storing 2?
>
>>> Is there any way that the size of the pool can be more automatically
>>> determined? Is the likelihood of a bunch errors proportional to the
>>> number of CPUs or amount of RAM or some other aspect of the hardware?
>>>
>>> Could the pool be emptied more aggressively so that it does not fill up?
>
> You didn't really address the additional questions I posed there.
>
> I'll add one more: how many of the messages are duplicates or
> *effectively* duplicates? Or is that hard to determine at the time that
> the entries are being made that they are duplicates?
>
> It _should_ also be fairly easy to enlarge the buffer on demand, say, if
> it got half full. What's the time scale over which the buffer filled
> up? Did a single #MC fill it up?
>
> I really think we need to understand what the problem is and have _some_
> confidence that the proposed solution will fix that, even if we're just
> talking about a new config option.

I've seen a similar issue, and it's not just related to memory errors.
In my experience it was MCA errors from a variety of hardware blocks.
For example, a bad link internal to an SoC could spew MCA errors
regardless of the scale of RAM or CPUs. Same thing is possible for a bad
cache, etc.

These were during pre-production testing, and the easy workaround is to
increase the MCE genpool size at build time.

I don't think this needs to be the default though.

How about this to start?

1) Keep the current config size for boot time.
2) Add a kernel parameter and/or sysfs file to allow users to request
additional genpool capacity.
3) Use gen_pool_add(), or whichever, to add the capacity based on user
input.

Maybe this can be expanded later to be automatic. But I think it simpler
to start with explicit user input.

Thanks,
Yazen

2023-10-16 14:25:00

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8 pages

On 10/16/23 07:14, Yazen Ghannam wrote:
> 1) Keep the current config size for boot time.
> 2) Add a kernel parameter
> and/or sysfs file to allow users to request additional genpool capacity.
> 3) Use gen_pool_add(), or whichever, to add the capacity based on user
> input. Maybe this can be expanded later to be automatic. But I think it
> simpler to start with explicit user input.

I guarantee virtually nobody will ever use an explicit kernel interface
to bump the size up. It'll be the same exact folks that recompile their
kernels.

An automatic resizing one doesn't have to be fancy and only has to
expand once:

static bool expanded = false;

...

if (full && !expanded) {
expand();
expanded = true;
}

It might be a _wee_ bit worse than that because you might have to queue
some work outside of #MC context but seriously we're talking 10-ish
lines of code. It'd probably be even smaller than doing it when poked
by userspace and wouldn't involve new ABI.

2023-10-16 14:41:27

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8 pages

On Mon, Oct 16, 2023 at 07:24:34AM -0700, Dave Hansen wrote:
> On 10/16/23 07:14, Yazen Ghannam wrote:
> > 1) Keep the current config size for boot time.
> > 2) Add a kernel parameter
> > and/or sysfs file to allow users to request additional genpool capacity.
> > 3) Use gen_pool_add(), or whichever, to add the capacity based on user
> > input. Maybe this can be expanded later to be automatic. But I think it
> > simpler to start with explicit user input.
>
> I guarantee virtually nobody will ever use an explicit kernel interface
> to bump the size up. It'll be the same exact folks that recompile their
> kernels.
>
> An automatic resizing one doesn't have to be fancy and only has to
> expand once:

Can we slow down here pls?

You expand once and then that stuck bit error increases its rate of MCEs
reported and all of a sudden that raised size is overflowed again.

Or, you add logic which thresholds duplicate errors similar to the cmci
storm logic we have.

*If* that is the case at all.

So can we analyze first what type of errors are reported and why they're
overflowing the pool before we do ad-hoc hackery?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-10-16 14:47:40

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8 pages

On 10/16/23 10:24 AM, Dave Hansen wrote:
> On 10/16/23 07:14, Yazen Ghannam wrote:
>> 1) Keep the current config size for boot time.
>> 2) Add a kernel parameter
>> and/or sysfs file to allow users to request additional genpool capacity.
>> 3) Use gen_pool_add(), or whichever, to add the capacity based on user
>> input. Maybe this can be expanded later to be automatic. But I think it
>> simpler to start with explicit user input.
>
> I guarantee virtually nobody will ever use an explicit kernel interface
> to bump the size up. It'll be the same exact folks that recompile their
> kernels.

Right, I agree. My intent was to help avoid a recompile if, for example,
the config size was not enough. Meaning even if a fleet has a custom
kernel with 8 pages of size, it may happen that more is needed. But "may
happen" isn't a strong reason to make a change here. :)

>
> An automatic resizing one doesn't have to be fancy and only has to
> expand once:
>

Why once?

> static bool expanded = false;
>
> ...
>
> if (full && !expanded) {
> expand();
> expanded = true;
> }
>
> It might be a _wee_ bit worse than that because you might have to queue
> some work outside of #MC context but seriously we're talking 10-ish
> lines of code. It'd probably be even smaller than doing it when poked
> by userspace and wouldn't involve new ABI.

Okay, I'm with you here.

Thanks,
Yazen

2023-10-16 16:26:18

by Tony Luck

[permalink] [raw]
Subject: RE: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8 pages

> > An automatic resizing one doesn't have to be fancy and only has to
> > expand once:
> >
>
> Why once?
>
> > static bool expanded = false;
> >
> > ...
> >
> > if (full && !expanded) {
> > expand();
> > expanded = true;
> > }
> >
> > It might be a _wee_ bit worse than that because you might have to queue
> > some work outside of #MC context but seriously we're talking 10-ish
> > lines of code. It'd probably be even smaller than doing it when poked
> > by userspace and wouldn't involve new ABI.
>
> Okay, I'm with you here.

I'm not. The reason that pool exists is so that an error record (struct mce)
can be copied into it during machine check context and atomically added
to the mce_event_llist. See mce_gen_pool_add().

I don't see how your "expand()" function is going to safely allocate additional
pages to the pool while executing in the machine check handler.

Just make it one of:

1) CONFIG option
2) boot command line option.
3) Dynamic based on num_online_cpus()

-Tony