2024-02-07 22:58:15

by Avadhut Naik

Subject: [PATCH 0/2] Extend size of the MCE Records pool

This patchset extends the size of the existing MCE Records pool to ensure
that MCE records are not dropped, particularly on systems with high
CPU counts.

This is a follow-up to the discussion below:
https://lore.kernel.org/linux-edac/[email protected]/

The first patch extends the size of the MCE Records pool in accordance
with the CPU count of the system.

The second patch adds a new command line parameter to extend the size of
the MCE Records pool by a predetermined number of pages.

Avadhut Naik (2):
x86/MCE: Extend size of the MCE Records pool
x86/MCE: Add command line option to extend MCE Records pool

.../admin-guide/kernel-parameters.txt | 2 +
arch/x86/kernel/cpu/mce/core.c | 3 ++
arch/x86/kernel/cpu/mce/genpool.c | 38 +++++++++++++++++++
arch/x86/kernel/cpu/mce/internal.h | 1 +
4 files changed, 44 insertions(+)


base-commit: 93e1e1fe2f97859cb079078b6b750542ebbfdea8
--
2.34.1



2024-02-07 22:58:31

by Avadhut Naik

Subject: [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool

Currently, 2 pages are allocated for the MCE Records pool during system
bootup. Records of MCEs (struct mce) occurring on the system are added to
the pool through mce_gen_pool_add() in MCE context. These records are then
decoded later, in process context, through notifier chains.

However, on systems with high CPU counts, the 2 pages allocated for the
pool might not be sufficient in some instances. Successive MCEs received
back-to-back, before they are decoded through mce_gen_pool_process(), will
result in the pool getting exhausted. Consequently, some MCE records will
be missed. The issue intensifies further since the amount of memory in a
system typically increases with the CPU count, thereby increasing the
probability of MCEs being received.

The limit of 2 pages for the MCE Records pool was set more than 8 years
ago and has not been revised since. The CPU count and the amount of
memory in a system, however, have increased enormously since then.
Additionally, the size of an MCE record (struct mce) has grown from
88 bytes to 124 bytes.

Extend the size of the MCE Records pool to better serve modern systems.
The increase in size depends on the CPU count of the system. Currently,
since the size of struct mce is 124 bytes, each logical CPU of the system
will have space for at least 2 MCE records available in the pool. To get
around allocation woes during early boot, the extension is undertaken
using late_initcall().

Signed-off-by: Avadhut Naik <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 3 +++
arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++++++++
arch/x86/kernel/cpu/mce/internal.h | 1 +
3 files changed, 26 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index b5cc557cfc37..5d6d7994d549 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2901,6 +2901,9 @@ static int __init mcheck_late_init(void)
if (mca_cfg.recovery)
enable_copy_mc_fragile();

+ if (mce_gen_pool_extend())
+ pr_info("Couldn't extend MCE records pool!\n");
+
mcheck_debugfs_init();

/*
diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..aed01612d342 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -20,6 +20,7 @@
* 2 pages to save MCE events for now (~80 MCE records at most).
*/
#define MCE_POOLSZ (2 * PAGE_SIZE)
+#define CPU_GEN_MEMSZ 256

static struct gen_pool *mce_evt_pool;
static LLIST_HEAD(mce_event_llist);
@@ -116,6 +117,27 @@ int mce_gen_pool_add(struct mce *mce)
return 0;
}

+int mce_gen_pool_extend(void)
+{
+ unsigned long addr, len;
+ int ret = -ENOMEM;
+ u32 num_threads;
+
+ num_threads = num_present_cpus();
+ len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
+ addr = (unsigned long)kzalloc(len, GFP_KERNEL);
+
+ if (!addr)
+ goto out;
+
+ ret = gen_pool_add(mce_evt_pool, addr, len, -1);
+ if (ret)
+ kfree((void *)addr);
+
+out:
+ return ret;
+}
+
static int mce_gen_pool_create(void)
{
struct gen_pool *tmpp;
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 01f8f03969e6..81e35ec58ebc 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -33,6 +33,7 @@ void mce_gen_pool_process(struct work_struct *__unused);
bool mce_gen_pool_empty(void);
int mce_gen_pool_add(struct mce *mce);
int mce_gen_pool_init(void);
+int mce_gen_pool_extend(void);
struct llist_node *mce_gen_pool_prepare_records(void);

int mce_severity(struct mce *a, struct pt_regs *regs, char **msg, bool is_excp);
--
2.34.1


2024-02-07 22:58:47

by Avadhut Naik

Subject: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

Extension of MCE Records pool, based on system's CPU count, was undertaken
through the previous patch (x86/MCE: Extend size of the MCE Records pool).

Add a new command line parameter "mce-genpool-extend" to set the size of
MCE Records pool to a predetermined number of pages instead of system's
CPU count.

Signed-off-by: Avadhut Naik <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 2 ++
arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++---
2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3a1fa1f81d9d..62e7da4d9fda 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3135,6 +3135,8 @@

mce=option [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst

+ mce-genpool-extend= [X86-64] Number of pages to add to MCE Records pool.
+
md= [HW] RAID subsystems devices and level
See Documentation/admin-guide/md.rst.

diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index aed01612d342..d6e04fa5ee07 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -22,6 +22,7 @@
#define MCE_POOLSZ (2 * PAGE_SIZE)
#define CPU_GEN_MEMSZ 256

+static unsigned int mce_genpool_extend;
static struct gen_pool *mce_evt_pool;
static LLIST_HEAD(mce_event_llist);
static char gen_pool_buf[MCE_POOLSZ];
@@ -123,10 +124,14 @@ int mce_gen_pool_extend(void)
int ret = -ENOMEM;
u32 num_threads;

- num_threads = num_present_cpus();
- len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
- addr = (unsigned long)kzalloc(len, GFP_KERNEL);
+ if (mce_genpool_extend) {
+ len = mce_genpool_extend * PAGE_SIZE;
+ } else {
+ num_threads = num_present_cpus();
+ len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
+ }

+ addr = (unsigned long)kzalloc(len, GFP_KERNEL);
if (!addr)
goto out;

@@ -159,6 +164,17 @@ static int mce_gen_pool_create(void)
return ret;
}

+static int __init parse_mce_genpool_extend(char *str)
+{
+ if (*str == '=')
+ str++;
+
+ get_option(&str, &mce_genpool_extend);
+ return 1;
+}
+
+__setup("mce-genpool-extend", parse_mce_genpool_extend);
+
int mce_gen_pool_init(void)
{
/* Just init mce_gen_pool once. */
--
2.34.1


2024-02-08 00:03:13

by Luck, Tony

Subject: RE: [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool

> +#define CPU_GEN_MEMSZ 256

What is this define?

Why isn't this "sizeof(struct mce)"?

Or 2* that if you are trying to reserve enough space for two records per CPU.

-Tony

2024-02-08 18:12:35

by Naik, Avadhut

Subject: [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool

Hi

On 2/7/2024 18:02, Luck, Tony wrote:
>> +#define CPU_GEN_MEMSZ 256
>
> What is this define?
>
> Why isn't this "sizeof(struct mce)"?
>
> Or 2* that if you are trying to reserve enough space for two records per CPU.
>
That's the memory in bytes reserved for each logical CPU in the
extended MCE Records pool. At the current size of struct mce, that
equates to around 2 records.

Will change it to (2 * sizeof(struct mce)) though. Feels more
accurate. Thanks for the suggestion!

Do you have any additional concerns/comments on this patchset?

> -Tony

--
Thanks,
Avadhut Naik

2024-02-08 18:49:24

by Luck, Tony

Subject: RE: [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool

> Will change it to (2 * sizeof(struct mce)) though. Feels more
> accurate. Thanks for the suggestion!

Thanks.

> Do you have any additional concerns/comments on this patchset?

Overall this is an excellent addition. Reserved space to log errors does need to scale
up with the CPU count.

I think part 1 (unconditional increase based on CPU count) is a "must have" enhancement.
With the change to CPU_GEN_MEMSZ #define:

Reviewed-by: Tony Luck <[email protected]>


I'm less enthusiastic about part 2 adding a command line option to override the code in
part 1 with a bigger (or smaller?) amount. Can you describe some situation where a user
would need to use this?

-Tony

2024-02-08 21:09:39

by Sohil Mehta

Subject: Re: [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool

On 2/7/2024 2:56 PM, Avadhut Naik wrote:

> Extend the size of the MCE Records pool to better serve modern systems.
> The increase in size depends on the CPU count of the system. Currently,
> since the size of struct mce is 124 bytes, each logical CPU of the system
> will have space for at least 2 MCE records available in the pool. To get
> around allocation woes during early boot, the extension is undertaken
> using late_initcall().
>

I guess making this proportional to the number of CPUs is probably fine
assuming CPUs and memory capacity *would* generally increase in sync.

But, is there some logic to having 2 MCE records per logical CPU, or is
it just a heuristic? In practice, the pool is shared amongst all MCE
sources and can be filled by anyone, right?

> Signed-off-by: Avadhut Naik <[email protected]>
> ---
> arch/x86/kernel/cpu/mce/core.c | 3 +++
> arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++++++++
> arch/x86/kernel/cpu/mce/internal.h | 1 +
> 3 files changed, 26 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index b5cc557cfc37..5d6d7994d549 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -2901,6 +2901,9 @@ static int __init mcheck_late_init(void)
> if (mca_cfg.recovery)
> enable_copy_mc_fragile();
>
> + if (mce_gen_pool_extend())
> + pr_info("Couldn't extend MCE records pool!\n");
> +

Why do this unconditionally? For a vast majority of low core-count, low
memory systems the default 2 pages would be good enough.

Should there be a threshold beyond which the extension becomes active?
Let's say, for example, a check for num_present_cpus() > 32 (roughly
based on the 8KB pool and the 124 B * 2 estimate per logical CPU).

Whatever you choose, a comment above the code would be helpful
describing when the extension is expected to be useful.

> mcheck_debugfs_init();
>
> /*
> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
> index fbe8b61c3413..aed01612d342 100644
> --- a/arch/x86/kernel/cpu/mce/genpool.c
> +++ b/arch/x86/kernel/cpu/mce/genpool.c
> @@ -20,6 +20,7 @@
> * 2 pages to save MCE events for now (~80 MCE records at most).
> */
> #define MCE_POOLSZ (2 * PAGE_SIZE)
> +#define CPU_GEN_MEMSZ 256
>

The comment above MCE_POOLSZ probably needs a complete re-write. Right
now, it reads as follows:

* This memory pool is only to be used to save MCE records in MCE context.
* MCE events are rare, so a fixed size memory pool should be enough. Use
* 2 pages to save MCE events for now (~80 MCE records at most).

Apart from the numbers being incorrect since sizeof(struct mce) has
increased, this patch is based on the assumption that the current MCE
memory pool is no longer enough in certain cases.

> static struct gen_pool *mce_evt_pool;
> static LLIST_HEAD(mce_event_llist);
> @@ -116,6 +117,27 @@ int mce_gen_pool_add(struct mce *mce)
> return 0;
> }
>
> +int mce_gen_pool_extend(void)
> +{
> + unsigned long addr, len;

s/len/size/

> + int ret = -ENOMEM;
> + u32 num_threads;
> +
> + num_threads = num_present_cpus();
> + len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);

Nit: Can the use of the num_threads variable be avoided?
How about:

size = PAGE_ALIGN(num_present_cpus() * CPU_GEN_MEMSZ);



> + addr = (unsigned long)kzalloc(len, GFP_KERNEL);

Also, shouldn't the new allocation be incremental to the 2 pages already
present?

Let's say, for example, that you have a 40-CPU system and the calculated
size in this case comes out to 40 * 2 * 124b = 9920 bytes, i.e. 3 pages.
You only need to allocate 1 additional page to add to mce_evt_pool
instead of the 3 pages that the current code does.

Sohil

> +
> + if (!addr)
> + goto out;
> +
> + ret = gen_pool_add(mce_evt_pool, addr, len, -1);
> + if (ret)
> + kfree((void *)addr);
> +
> +out:
> + return ret;
> +}
> +
> static int mce_gen_pool_create(void)
> {
> struct gen_pool *tmpp;
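The incremental sizing in the 40-CPU example above can be checked with a
small userspace sketch. The constants are taken from the patch under
discussion; extra_pool_bytes() is a hypothetical helper, not kernel code:

```c
#include <assert.h>

/* Figures from the patch: 2 static pages, 256 bytes per logical CPU. */
#define PAGE_SIZE     4096UL
#define MCE_POOLSZ    (2 * PAGE_SIZE)   /* pool reserved at boot */
#define CPU_GEN_MEMSZ 256UL             /* ~2 records per logical CPU */

#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/* Bytes still needed on top of the static 2-page pool for nr_cpus CPUs. */
static unsigned long extra_pool_bytes(unsigned long nr_cpus)
{
        unsigned long want = PAGE_ALIGN(nr_cpus * CPU_GEN_MEMSZ);

        return want > MCE_POOLSZ ? want - MCE_POOLSZ : 0;
}
```

For 40 CPUs this wants 3 pages total, so only 1 extra page is allocated;
at or below 32 CPUs the static pool already suffices.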



2024-02-09 01:36:17

by Sohil Mehta

Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On 2/7/2024 2:56 PM, Avadhut Naik wrote:
> Extension of MCE Records pool, based on system's CPU count, was undertaken
> through the previous patch (x86/MCE: Extend size of the MCE Records pool).
>

This statement is unnecessary.

> Add a new command line parameter "mce-genpool-extend" to set the size of
> MCE Records pool to a predetermined number of pages instead of system's
> CPU count.
>

Like Tony, I am unsure of when this would be useful.

Also, why does it need to override the CPU count based extension
mechanism? Would just adding more pages not work for them?

If there really is a good reason to do this, how about changing
mce-genpool-extend to mce-genpool-add-pages and keeping the description
the same?

mce-genpool-add-pages= [X86-64] Number of pages to add to MCE Records pool.

Then you can simply add these many number of additional pages to the new
CPU based mechanism.

Sohil

> Signed-off-by: Avadhut Naik <[email protected]>
> ---
> .../admin-guide/kernel-parameters.txt | 2 ++
> arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++---
> 2 files changed, 21 insertions(+), 3 deletions(-)
>


2024-02-09 19:52:34

by Naik, Avadhut

Subject: [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool

Hi,

On 2/8/2024 15:09, Sohil Mehta wrote:
> On 2/7/2024 2:56 PM, Avadhut Naik wrote:
>
>> Extend the size of the MCE Records pool to better serve modern systems.
>> The increase in size depends on the CPU count of the system. Currently,
>> since the size of struct mce is 124 bytes, each logical CPU of the system
>> will have space for at least 2 MCE records available in the pool. To get
>> around allocation woes during early boot, the extension is undertaken
>> using late_initcall().
>>
>
> I guess making this proportional to the number of CPUs is probably fine
> assuming CPUs and memory capacity *would* generally increase in sync.
>
> But, is there some logic to having 2 MCE records per logical cpu or it
> is just a heuristic approach? In practice, the pool is shared amongst
> all MCE sources and can be filled by anyone, right?
>
Yes, the pool is shared among all MCE sources, but the logic for 256 is
that the genpool was set to 2 pages, i.e. 8192 bytes, in 2015.
Around that time, AFAIK, the max number of logical CPUs on a system was
32. So, in the maximum case, each CPU would have around 256 bytes
(8192/32) in the pool. That equates to approximately 2 MCE records since
sizeof(struct mce) back then was 88 bytes.
>> Signed-off-by: Avadhut Naik <[email protected]>
>> ---
>> arch/x86/kernel/cpu/mce/core.c | 3 +++
>> arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++++++++
>> arch/x86/kernel/cpu/mce/internal.h | 1 +
>> 3 files changed, 26 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index b5cc557cfc37..5d6d7994d549 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -2901,6 +2901,9 @@ static int __init mcheck_late_init(void)
>> if (mca_cfg.recovery)
>> enable_copy_mc_fragile();
>>
>> + if (mce_gen_pool_extend())
>> + pr_info("Couldn't extend MCE records pool!\n");
>> +
>
> Why do this unconditionally? For a vast majority of low core-count, low
> memory systems the default 2 pages would be good enough.
>
> Should there be a threshold beyond which the extension becomes active?
> Let's say, for example, a check for num_present_cpus() > 32 (Roughly
> based on 8Kb memory and 124b*2 estimate per logical CPU).
>
> Whatever you choose, a comment above the code would be helpful
> describing when the extension is expected to be useful.
>
I put it in unconditionally because, IMO, the increase in memory even for
low core-count systems didn't seem substantial: just an additional page
for systems with fewer than 16 CPUs.

But I do get your point. Will add a check in mcheck_late_init() for CPUs
present. Something like below:

@@ -2901,7 +2901,7 @@ static int __init mcheck_late_init(void)
if (mca_cfg.recovery)
enable_copy_mc_fragile();

- if (mce_gen_pool_extend())
+ if ((num_present_cpus() > 32) && mce_gen_pool_extend())
pr_info("Couldn't extend MCE records pool!\n");

Does this look good? The genpool extension will then be undertaken only
for systems with more than 32 CPUs. Will explain this in a comment.

>> mcheck_debugfs_init();
>>
>> /*
>> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
>> index fbe8b61c3413..aed01612d342 100644
>> --- a/arch/x86/kernel/cpu/mce/genpool.c
>> +++ b/arch/x86/kernel/cpu/mce/genpool.c
>> @@ -20,6 +20,7 @@
>> * 2 pages to save MCE events for now (~80 MCE records at most).
>> */
>> #define MCE_POOLSZ (2 * PAGE_SIZE)
>> +#define CPU_GEN_MEMSZ 256
>>
>
> The comment above MCE_POOLSZ probably needs a complete re-write. Right
> now, it reads as follows:
>
> * This memory pool is only to be used to save MCE records in MCE context.
> * MCE events are rare, so a fixed size memory pool should be enough. Use
> * 2 pages to save MCE events for now (~80 MCE records at most).
>
> Apart from the numbers being incorrect since sizeof(struct mce) has
> increased, this patch is based on the assumption that the current MCE
> memory pool is no longer enough in certain cases.
>
Yes, will change the comment to something like below:

* This memory pool is only to be used to save MCE records in MCE context.
* Though MCE events are rare, their frequency typically depends on the
* system's memory and CPU count.
* Allocate 2 pages to the MCE Records pool during early boot with the
* option to extend the pool, as needed, through the command line, for
* systems with a CPU count of more than 32.
* By default, each logical CPU can have around 2 MCE records in the pool
* at the same time.

Sounds good?

>> static struct gen_pool *mce_evt_pool;
>> static LLIST_HEAD(mce_event_llist);
>> @@ -116,6 +117,27 @@ int mce_gen_pool_add(struct mce *mce)
>> return 0;
>> }
>>
>> +int mce_gen_pool_extend(void)
>> +{
>> + unsigned long addr, len;
>
> s/len/size/
>
Noted.
>> + int ret = -ENOMEM;
>> + u32 num_threads;
>> +
>> + num_threads = num_present_cpus();
>> + len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
>
> Nit: Can the use of the num_threads variable be avoided?
> How about:
>
> size = PAGE_ALIGN(num_present_cpus() * CPU_GEN_MEMSZ);
>
Will do.
>
>
>> + addr = (unsigned long)kzalloc(len, GFP_KERNEL);
>
> Also, shouldn't the new allocation be incremental to the 2 pages already
> present?
>
> Let's say, for example, that you have a 40-CPU system and the calculated
> size in this case comes out to 40 * 2 * 124b = 9920 bytes, i.e. 3 pages.
> You only need to allocate 1 additional page to add to mce_evt_pool
> instead of the 3 pages that the current code does.
>
Will make it incremental when genpool extension is being undertaken through
the default means. Something like below:

@@ -129,6 +134,7 @@ int mce_gen_pool_extend(void)
} else {
num_threads = num_present_cpus();
len = PAGE_ALIGN(num_threads * CPU_GEN_MEMSZ);
+ len -= MCE_POOLSZ;

Does this sound good?

--
Thanks,
Avadhut Naik

> Sohil
>
>> +
>> + if (!addr)
>> + goto out;
>> +
>> + ret = gen_pool_add(mce_evt_pool, addr, len, -1);
>> + if (ret)
>> + kfree((void *)addr);
>> +
>> +out:
>> + return ret;
>> +}
>> +
>> static int mce_gen_pool_create(void)
>> {
>> struct gen_pool *tmpp;
>
>
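The arithmetic behind the 256-bytes-per-CPU figure explained earlier in
this message can be checked with a quick userspace sketch. The 2015
numbers come from the discussion above; the names are illustrative:

```c
#include <assert.h>

#define PAGE_SIZE     4096UL
#define POOL_SZ       (2 * PAGE_SIZE)   /* pool size picked in 2015 */
#define MAX_CPUS_2015 32UL              /* rough max logical CPUs then */
#define MCE_SZ_2015   88UL              /* sizeof(struct mce) in 2015 */
#define MCE_SZ_NOW    124UL             /* sizeof(struct mce) today */

/* Pool bytes available per CPU under the 2015 sizing. */
static unsigned long pool_bytes_per_cpu(void)
{
        return POOL_SZ / MAX_CPUS_2015;   /* 8192 / 32 */
}
```

With integer division, that per-CPU share covers about 2 records at
either struct size, which is where the "2 records per CPU" heuristic
comes from.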



2024-02-09 20:03:13

by Naik, Avadhut

Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

Hi,

On 2/8/2024 19:36, Sohil Mehta wrote:
> On 2/7/2024 2:56 PM, Avadhut Naik wrote:
>> Extension of MCE Records pool, based on system's CPU count, was undertaken
>> through the previous patch (x86/MCE: Extend size of the MCE Records pool).
>>
>
> This statement is unnecessary.
>
Noted.

>> Add a new command line parameter "mce-genpool-extend" to set the size of
>> MCE Records pool to a predetermined number of pages instead of system's
>> CPU count.
>>
>
> Like Tony, I am unsure of when this would be useful.
>
> Also, why does it need to override the CPU count based extension
> mechanism? Would just adding more pages not work for them?
>
> If there really is a good reason to do this, how about changing
> mce-genpool-extend to mce-genpool-add-pages and keeping the description
> the same?
>
> mce-genpool-add-pages= [X86-64] Number of pages to add to MCE Records pool.
>
> Then you can simply add these many number of additional pages to the new
> CPU based mechanism.
>
Is it safe to assume that users will always want to increase the size of the
pool and not decrease it?

IMO, the command-line option provides flexibility for users to choose the size of
the MCE Records pool in case they don't agree with the CPU count logic. I just
added it to ensure that we are not enforcing this increased memory footprint
across the board.

Would you agree?

--
Thanks,
Avadhut Naik

> Sohil
>
>> Signed-off-by: Avadhut Naik <[email protected]>
>> ---
>> .../admin-guide/kernel-parameters.txt | 2 ++
>> arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++---
>> 2 files changed, 21 insertions(+), 3 deletions(-)
>>
>


2024-02-09 20:09:44

by Borislav Petkov

Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Fri, Feb 09, 2024 at 02:02:49PM -0600, Naik, Avadhut wrote:
> Is it safe to assume that users will always want to increase the size
> of the pool and not decrease it?

Why don't you make the gen pool size a function of the number of CPUs on
the system and have it all work automagically?

Burdening the user with yet another cmdline switch is a bad idea. We
have way too many as it is.

This stuff should work out-of-the-box, without user intervention if
possible. And it is possible in this case.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-09 20:16:24

by Naik, Avadhut

Subject: Re: [PATCH 1/2] x86/MCE: Extend size of the MCE Records pool

Hi,

On 2/8/2024 12:39, Luck, Tony wrote:
>> Will change it to (2 * sizeof(struct mce)) though. Feels more
>> accurate. Thanks for the suggestion!
>
> Thanks.
>
>> Do you have any additional concerns/comments on this patchset?
>
> Overall this is an excellent addition. Reserved space to log errors does need to scale
> up with the CPU count.
>
> I think part 1 (unconditional increase based on CPU count) is a "must have" enhancement.
> With the change to CPU_GEN_MEMSZ #define:
>
> Reviewed-by: Tony Luck <[email protected]>
>
>
> I'm less enthusiastic about part 2 adding a command line option to override the code in
> part 1 with a bigger (or smaller?) amount. Can you describe some situation where a user
> would need to use this?
>
I added the command-line option to ensure that we have covered all bases and are not enforcing
this memory footprint on all users.

A system with 512 logical CPUs will, by the currently proposed logic, have 32 pages
allocated for the pool ((512 * 256) / 4096). Some users may feel that this is not
needed on their systems and that they can do with just, maybe, 16 pages. The command
line option gives them the flexibility to do so without having to change kernel
code, rebuild, and deploy.

Conversely, some users, wanting to err on the side of caution, might feel that the
above 32 pages are not enough for the pool and may want to allocate more, maybe,
48 pages. The command line option again provides them with the flexibility to do so.

Sounds reasonable?

--
Thanks,
Avadhut Naik

> -Tony



2024-02-09 20:17:02

by Sohil Mehta

Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On 2/9/2024 12:02 PM, Naik, Avadhut wrote:

> Is it safe to assume that users will always want to increase the size of the
> pool and not decrease it?
>
> IMO, the command-line option provides flexibility for users to choose the size of
> the MCE Records pool in case they don't agree with the CPU count logic. I just
> added it to ensure that we are not enforcing this increased memory footprint
> across the board.
>
> Would you agree?
>

Not really. Providing this level of configuration seems excessive and
unnecessary.

To me, it seems that we are over-compensating with the calculations in
the previous patch and then providing a mechanism to correct it here and
putting this burden on the user.

How about being more conservative with the allocations in the previous
patch so that we don't need to introduce this additional mechanism right
now? Later, if there is really a need for some specific usage, the patch
can be re-submitted then with the supporting data.

Sohil



2024-02-09 20:28:43

by Luck, Tony

Subject: RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

> How about being more conservative with the allocations in the previous
> patch so that we don't need to introduce this additional mechanism right
> now? Later, if there is really a need for some specific usage, the patch
> can be re-submitted then with the supporting data.

There used to be a rule-of-thumb when configuring systems to have at least
one GByte of memory per CPU. Anyone following that rule shouldn't be
worried about sub-kilobyte allocations per CPU.

-Tony
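To put rough numbers on that rule of thumb, the sketch below compares the
two per-CPU figures from this thread: 256 B of pool space per CPU (patch 1)
against 1 GiB of memory per CPU (the configuration rule). Both figures are
assumptions taken from the discussion, not measured values:

```c
#include <assert.h>

#define POOL_BYTES_PER_CPU 256ULL
#define MEM_BYTES_PER_CPU  (1024ULL * 1024ULL * 1024ULL)  /* 1 GiB/CPU */

/* Share of per-CPU memory consumed by the pool, in parts per billion. */
static unsigned long long pool_overhead_ppb(void)
{
        return POOL_BYTES_PER_CPU * 1000000000ULL / MEM_BYTES_PER_CPU;
}
```

The overhead works out to roughly 238 parts per billion of the per-CPU
memory, i.e. far below anything worth worrying about.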

2024-02-09 20:52:15

by Borislav Petkov

Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Fri, Feb 09, 2024 at 02:35:12PM -0600, Naik, Avadhut wrote:
> IIUC, this is exactly what the first patch in this series is trying to
> accomplish. Please correct me if I understood wrong.

Yes, you did.

I don't mean to extend it - I mean to allocate it from the very
beginning to

min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);

There's a sane minimum and one page pro logical CPU should be fine on
pretty much every configuration...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-09 20:55:59

by Naik, Avadhut

Subject: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

Hi,

On 2/9/2024 14:09, Borislav Petkov wrote:
> On Fri, Feb 09, 2024 at 02:02:49PM -0600, Naik, Avadhut wrote:
>> Is it safe to assume that users will always want to increase the size
>> of the pool and not decrease it?
>
> Why don't you make the gen pool size a function of the number of CPUs on
> the system and have it all work automagically?
>
IIUC, this is exactly what the first patch in this series is trying to accomplish.
Please correct me if I understood wrong.

> Burdening the user with yet another cmdline switch is a bad idea. We
> have way too many as it is.
>
> This stuff should work out-of-the-box, without user intervention if
> possible. And it is possible in this case.

Okay. Will drop the command line argument.

--
Thanks,
Avadhut Naik
> Thx.
>

2024-02-09 21:02:48

by Sohil Mehta

Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On 2/9/2024 12:28 PM, Luck, Tony wrote:
>> How about being more conservative with the allocations in the previous
>> patch so that we don't need to introduce this additional mechanism right
>> now? Later, if there is really a need for some specific usage, the patch
>> can be re-submitted then with the supporting data.
>
> There used to be a rule-of-thumb when configuring systems to have at least
> one GByte of memory per CPU. Anyone following that rule shouldn't be
> worried about sub-kilobyte allocations per CPU.
>

I meant, to avoid the need for this second patch we can always start
lower and increase it later.

256 bytes per CPU seems fine to me as done in the previous patch. But,
if that seems too high, as described by Avadhut below, then maybe we can
start with 200 bytes or any other number. It's just a heuristic, IIUC.

https://lore.kernel.org/lkml/[email protected]/

Sohil

2024-02-10 07:53:24

by Borislav Petkov

Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On February 9, 2024 9:51:11 PM GMT+01:00, Borislav Petkov <[email protected]> wrote:
>On Fri, Feb 09, 2024 at 02:35:12PM -0600, Naik, Avadhut wrote:
>> IIUC, this is exactly what the first patch in this series is trying to
>> accomplish. Please correct me if I understood wrong.
>
>Yes, you did.
>
>I don't mean to extend it - I mean to allocate it from the very
>beginning to
>
> min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);

max() ofc.

>There's a sane minimum and one page pro logical CPU should be fine on
>pretty much every configuration...
>
>Thx.
>


--
Sent from a small device: formatting sucks and brevity is inevitable.
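With the min()/max() mix-up resolved, the sizing Boris suggests (allocate
once, at pool creation) would look roughly like the sketch below;
mce_pool_size() is a hypothetical name for illustration, not the actual
kernel implementation:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Pool sized once at creation: at least 4 pages as a sane minimum,
 * and one page per possible logical CPU on bigger systems. */
static unsigned long mce_pool_size(unsigned long possible_cpus)
{
        return MAX(4 * PAGE_SIZE, possible_cpus * PAGE_SIZE);
}
```

Small systems get the 4-page floor; anything above 4 possible CPUs scales
linearly at one page per CPU.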

2024-02-10 21:15:44

by Naik, Avadhut

Subject: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

Hi,

On 2/10/2024 01:52, Borislav Petkov wrote:
> On February 9, 2024 9:51:11 PM GMT+01:00, Borislav Petkov <[email protected]> wrote:
>> On Fri, Feb 09, 2024 at 02:35:12PM -0600, Naik, Avadhut wrote:
>>> IIUC, this is exactly what the first patch in this series is trying to
>>> accomplish. Please correct me if I understood wrong.
>>
>> Yes, you did.
>>
>> I don't mean to extend it - I mean to allocate it from the very
>> beginning to
>>
>> min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);
>

IIUC, you wouldn't want to extend the pool through late_initcall().
Instead, you would want the memory to be allocated (on the heap) and the
size of the pool to be set at the very beginning, i.e. when the pool
is created (~2 seconds, according to dmesg timestamps).

Please correct me if I have understood wrong.

I was actually doing something similar initially (setting size of the
pool right after its creation) but then went with extending the pool
since I wasn't sure if heap allocations should be undertaken that
early during bootup.

> max() ofc.
>
Thanks! This does clear a one part of my confusion.

--
Thanks,
Avadhut Naik

>> There's a sane minimum and one page pro logical CPU should be fine on
>> pretty much every configuration...
>>
>> Thx.
>>
>
>



2024-02-11 11:15:24

by Borislav Petkov

Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Sat, Feb 10, 2024 at 03:15:26PM -0600, Naik, Avadhut wrote:
> IIUC, you wouldn't want to extend the pool through late_initcall().
> Instead, you would want for memory to be allocated (on the heap) and
> size of the pool to be set at the very beginning i.e. when the pool
> is created (~2 seconds, according to dmesg timestamps).
>
> Please correct me if I have understood wrong.

Nah, you got it right. I went, looked and realized that we have to do
this early dance because we have no allocator yet. And we can't move
this gen_pool allocation to later, when we *do* have an allocator
because MCA is up and logging already.

But your extending approach doesn't fly in all cases either:

gen_pool_add->gen_pool_add_virt->gen_pool_add_owner

it grabs the pool->lock spinlock and adds to &pool->chunks while, at the
exact same time, gen_pool_alloc(), in *NMI* context, iterates over that
same &pool->chunks in case we're logging an MCE at the very moment
you're extending the buffer.

And Tony already said that in the thread you're quoting:

https://lore.kernel.org/linux-edac/SJ1PR11MB60832922E4D036138FF390FAFCD7A@SJ1PR11MB6083.namprd11.prod.outlook.com/

So no, that doesn't work either.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-12 02:54:47

by Naik, Avadhut

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

Hi,

On 2/11/2024 05:14, Borislav Petkov wrote:
> On Sat, Feb 10, 2024 at 03:15:26PM -0600, Naik, Avadhut wrote:
>> IIUC, you wouldn't want to extend the pool through late_initcall().
>> Instead, you would want for memory to be allocated (on the heap) and
>> size of the pool to be set at the very beginning i.e. when the pool
>> is created (~2 seconds, according to dmesg timestamps).
>>
>> Please correct me if I have understood wrong.
>
> Nah, you got it right. I went, looked and realized that we have to do
> this early dance because we have no allocator yet. And we can't move
> this gen_pool allocation to later, when we *do* have an allocator
> because MCA is up and logging already.
>
Okay. Will make changes to allocate memory and set size of the pool
when it is created. Also, will remove the command line parameter and
resubmit.

> But your extending approach doesn't fly in all cases either:
>
> gen_pool_add->gen_pool_add_virt->gen_pool_add_owner
>
> it grabs the pool->lock spinlock and adds to &pool->chunks while, at the
> exact same time, gen_pool_alloc(), in *NMI* context iterates over that
> same &pool->chunks in the case we're logging an MCE at exact that same
> time when you're extending the buffer.
>
> And Tony already said that in the thread you're quoting:
>
> https://lore.kernel.org/linux-edac/SJ1PR11MB60832922E4D036138FF390FAFCD7A@SJ1PR11MB6083.namprd11.prod.outlook.com/
>
> So no, that doesn't work either.
>
> Thx.

Thanks for this explanation!
>

--
Thanks,
Avadhut Naik

2024-02-12 08:58:33

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Sun, Feb 11, 2024 at 08:54:29PM -0600, Naik, Avadhut wrote:
> Okay. Will make changes to allocate memory and set size of the pool
> when it is created. Also, will remove the command line parameter and
> resubmit.

Before you do, go read that original thread again but this time take
your time to grok it.

And then try answering those questions:

* Why are *you* fixing this? I know what the AWS reason is, what is
yours?

* Can you think of a slick deduplication scheme instead of blindly
raising the buffer size?

* What's wrong with not logging some early errors, can we live with that
too? If it were firmware-first, it cannot simply extend its buffer size
because it has limited space. So what does firmware do in such cases?

Think long and hard about the big picture, analyze the problem properly
and from all angles before you go and do patches.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-12 09:33:15

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Mon, Feb 12, 2024 at 09:58:01AM +0100, Borislav Petkov wrote:
> * Can you think of a slick deduplication scheme instead of blindly
> raising the buffer size?

And here's the simplest scheme: you don't extend the buffer. On
overflow, you say "Buffer full, here's the MCE" and you dump the error
log into dmesg. Problem solved.

A slicker deduplication scheme would be even better, tho. Maybe struct
mce.count which gets incremented instead of adding the error record to
the buffer again. And so on...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette
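The overflow fallback Boris describes above can be sketched in user-space C. This is a minimal sketch, not kernel code: `struct rec`, the fixed capacity, and the `fprintf()` stand in for `struct mce`, the genpool, and the dmesg dump.

```c
#include <stdio.h>

/* Illustrative stand-in for struct mce; the fields are assumptions. */
struct rec {
	unsigned long long status, addr;
};

#define POOL_CAP 4

static struct rec pool[POOL_CAP];
static int pool_used;

/* Mimics mce_gen_pool_add(): 0 on success, -1 when the pool is full. */
int pool_add(const struct rec *r)
{
	if (pool_used >= POOL_CAP)
		return -1;
	pool[pool_used++] = *r;
	return 0;
}

/* The fallback: don't grow the pool, dump the record instead. */
void log_record(const struct rec *r)
{
	fprintf(stderr, "pool full, dumping record: status=%llx addr=%llx\n",
		r->status, r->addr);
}

/* On overflow the record is still reported, just via the log. */
void handle(const struct rec *r)
{
	if (pool_add(r))
		log_record(r);
}
```

The point of the scheme is that a full pool degrades to a noisier but lossless path rather than silently dropping records.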

2024-02-12 17:35:25

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

> And here's the simplest scheme: you don't extend the buffer. On
> overflow, you say "Buffer full, here's the MCE" and you dump the error
> long into dmesg. Problem solved.
>
> A slicker deduplication scheme would be even better, tho. Maybe struct
> mce.count which gets incremented instead of adding the error record to
> the buffer again. And so on...

Walking the structures already allocated from the genpool in the #MC
handler may be possible, but what is the criteria for "duplicates"?
Do we avoid entering duplicates into the pool altogether? Or when the pool
is full overwrite a duplicate?

How about compile time allocation of extra space. Algorithm below for
illustrative purposes only. May need some more thought about how
to scale up.

-Tony

[Diff pasted into Outlook, chances that it will automatically apply = 0%]

diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..47bf677578ca 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -16,10 +16,15 @@
* used to save error information organized in a lock-less list.
*
* This memory pool is only to be used to save MCE records in MCE context.
- * MCE events are rare, so a fixed size memory pool should be enough. Use
- * 2 pages to save MCE events for now (~80 MCE records at most).
+ * MCE events are rare, but scale up with CPU count. Low CPU count
+ * systems allocate 2 pages (enough for ~64 "struct mce" records). Large
+ * systems scale up the allocation based on CPU count.
*/
+#if CONFIG_NR_CPUS < 128
#define MCE_POOLSZ (2 * PAGE_SIZE)
+#else
+#define MCE_POOLSZ (CONFIG_NR_CPUS / 64 * PAGE_SIZE)
+#endif

static struct gen_pool *mce_evt_pool;
static LLIST_HEAD(mce_event_llist);

2024-02-12 17:58:38

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Mon, Feb 12, 2024 at 05:29:31PM +0000, Luck, Tony wrote:
> Walking the structures already allocated from the genpool in the #MC
> handler may be possible, but what is the criteria for "duplicates"?

for each i in pool:
memcmp(mce[i], new_mce, sizeof(struct mce));

It'll probably need to mask out fields like ->time etc.

> Do we avoid entering duplicates into the pool altogether?

We could do

mce[i].count++;

in the same loop.

Dunno yet if we even need to state how many duplicates are there...

> Or when the pool is full overwrite a duplicate?

Nah, not overwrite the duplicate but not add the new one. Cheaper.

> How about compile time allocation of extra space. Algorithm below for
> illustrative purposes only. May need some more thought about how
> to scale up.

Yeah, it is too static IMO. Especially if NR_CPUS is a build-time thing
- needs to be based on the actual number of CPUs on the machine.

BUT, we don't have an allocator yet.

So we end up allocating it there on the heap.

Unless we can do something with memblock...

But then this still needs code audit whether num_possible_cpus() is
final at that point. Because if it is, that would be the optimal thing
to do a memblock_alloc() there...

> [Diff pasted into Outlook, chances that it will automatically apply = 0%]

How you even do patches with outschmook is mindboggling :-)

At least use Thunderbird. That's what I do for the company mail but then
again I don't even try to do patches over company mail...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette
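The per-record comparison loop Boris sketches above (memcmp with volatile fields like ->time masked out) might look like this in user-space C. A hedged sketch: `struct mce_rec` and its fields are illustrative stand-ins, and the stable fields are compared explicitly instead of memcmp()ing the whole struct.

```c
/* Minimal stand-in for struct mce; only the fields the sketch needs. */
struct mce_rec {
	unsigned long long status, addr, misc, time;
	int bank, count;
};

/* Compare only the stable fields; ->time is deliberately ignored. */
int same_error(const struct mce_rec *a, const struct mce_rec *b)
{
	return a->bank == b->bank && a->status == b->status &&
	       a->addr == b->addr && a->misc == b->misc;
}

/*
 * Returns 1 if the new record duplicated a pooled one; in that case
 * the existing record's count is bumped (the mce.count idea from the
 * thread) instead of adding the record again.
 */
int dedup_add(struct mce_rec *pool, int n, const struct mce_rec *nr)
{
	for (int i = 0; i < n; i++) {
		if (same_error(&pool[i], nr)) {
			pool[i].count++;
			return 1;
		}
	}
	return 0;
}
```

Whether such a loop is cheap enough for #MC context on a nearly full pool is exactly the open question raised later in the thread.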

2024-02-12 18:49:56

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On 2/11/2024 6:14 AM, Borislav Petkov wrote:
> On Sat, Feb 10, 2024 at 03:15:26PM -0600, Naik, Avadhut wrote:
>> IIUC, you wouldn't want to extend the pool through late_initcall().
>> Instead, you would want for memory to be allocated (on the heap) and
>> size of the pool to be set at the very beginning i.e. when the pool
>> is created (~2 seconds, according to dmesg timestamps).
>>
>> Please correct me if I have understood wrong.
>
> Nah, you got it right. I went, looked and realized that we have to do
> this early dance because we have no allocator yet. And we can't move
> this gen_pool allocation to later, when we *do* have an allocator
> because MCA is up and logging already.
>
> But your extending approach doesn't fly in all cases either:
>
> gen_pool_add->gen_pool_add_virt->gen_pool_add_owner
>
> it grabs the pool->lock spinlock and adds to &pool->chunks while, at the
> exact same time, gen_pool_alloc(), in *NMI* context iterates over that
> same &pool->chunks in the case we're logging an MCE at exact that same
> time when you're extending the buffer.
>
> And Tony already said that in the thread you're quoting:
>
> https://lore.kernel.org/linux-edac/SJ1PR11MB60832922E4D036138FF390FAFCD7A@SJ1PR11MB6083.namprd11.prod.outlook.com/
>
> So no, that doesn't work either.
>

I'm confused why it won't work.

X86 has ARCH_HAVE_NMI_SAFE_CMPXCHG. I expect atomics/caches will still
work in interrupt or #MC context. If not, then we'd have a fatal error
that causes a hardware reset or a kernel panic before we get to logging,
I think.

Or is the issue when running on the same CPU? In this case, either
&pool->chunks was updated before taking the #MC, so the extra memory
is there and can be used. Or it wasn't updated, so the extra memory is
not available during the #MC which is the same behavior as now.

I need to look more at the genpool code, but I thought I'd ask too.

Thanks,
Yazen





2024-02-12 19:11:01

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

> Yeah, it is too static IMO. Especially if NR_CPUS is a build-time thing
> - needs to be based on the actual number of CPUs on the machine.
>
> BUT, we don't have an allocator yet.
>
> So we end up allocating it there on the heap.
>
> Unless we can do something with memblock...
>
> But then this still needs code audit whether num_possible_cpus() is
> final at that point. Because if it is, that would be the optimal thing
> to do a memblock_alloc() there...

I threw a printk() into mce_gen_pool_init() and got:

[ 2.948285] mce: mce_gen_pool_init: num_possible_cpus = 144

So it is currently true that number of CPUs has been computed at this point.

So using memblock_alloc() based on number of CPUs may work

> > [Diff pasted into Outlook, chances that it will automatically apply = 0%]
>
> How you even do patches with outschmook is mindboggling :-)
>
> At least use Thunderbird. That's what I do for the company mail but then
> again I don't even try to do patches over company mail...

I use "git send-email" to send out patches, and usually "b4" to get them
to my desktop. I do have "mutt" on there, but IT have made it complex
for me to use it to simply read & reply. It requires separate mutt config
files to fetch mail and a different config to send mail because of the
firewall rules on the lab where my desktop lives.

-Tony

2024-02-12 19:14:49

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Mon, Feb 12, 2024 at 06:45:34PM +0000, Luck, Tony wrote:
> So it is currently true that number of CPUs has been computed at this point.

It needs a proper explanation why that's ok rather than an empirical
test only.

> I use "git send-email" to send out patches, and usually "b4" to get them
> to my desktop. I do have "mutt" on there, but IT have made it complex
> for me to use it to simply read & reply.

IT departments all around the world make sure of that. :)

> It requires separate mutt config files to fetch mail and a different
> config to send mail because of the firewall rules on the lab where my
> desktop lives.

Yeah, that's why I'm working with !company mail account. For my own
sanity.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-12 19:22:45

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

> I need to look more at the genpool code, but I thought I'd ask too.

Yazen,

gen_pool_add_owner() is the code that adds an extra chunk to an existing genpool.

This bit doesn't look obviously safe against a #MC at the wrong moment in the middle of
the list_add_rcu()

spin_lock(&pool->lock);
list_add_rcu(&chunk->next_chunk, &pool->chunks);
spin_unlock(&pool->lock);

-Tony




2024-02-12 19:43:43

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On 2/12/2024 1:58 PM, Luck, Tony wrote:
>> I need to look more at the genpool code, but I thought I'd ask too.
>
> Yazen,
>
> gen_pool_add_owner() is the code that adds an extra chunk to an existing genpool.
>
> This bit doesn't look obviously safe against a #MC at the wrong moment in the middle of
> the list_add_rcu()
>
> spin_lock(&pool->lock);
> list_add_rcu(&chunk->next_chunk, &pool->chunks);
> spin_unlock(&pool->lock);
>

Thanks Tony.

So the concern is not about traversal, but rather that the #MC can break the
list_add_rcu(). Is this correct?


Thanks,
Yazen

2024-02-12 19:44:48

by Naik, Avadhut

[permalink] [raw]
Subject: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

Hi,

On 2/12/2024 12:58, Luck, Tony wrote:
>> I need to look more at the genpool code, but I thought I'd ask too.
>
> Yazen,
>
> gen_pool_add_owner() is the code that adds an extra chunk to an existing genpool.
>
> This bit doesn't look obviously safe against a #MC at the wrong moment in the middle of
> the list_add_rcu()
>
> spin_lock(&pool->lock);
> list_add_rcu(&chunk->next_chunk, &pool->chunks);
> spin_unlock(&pool->lock);
>

Even I am somewhat confused by this.

The spinlock is mostly held to prevent other primitives
from modifying chunks within the genpool.

In #MC context however, we are not modifying the existing
chunks, per se.

While in the MC context, records will be added to the genpool
through gen_pool_alloc() which eventually drops down into
gen_pool_alloc_algo_owner().

gen_pool_alloc_algo_owner() iterates over the existing chunks
within the genpool through list_for_each_entry_rcu(), within
an RCU read-side critical section (rcu_read_lock()).

Now, the below description of list_for_each_entry_rcu():

* list_for_each_entry_rcu - iterate over rcu list of given type
* @pos: the type * to use as a loop cursor.
* @head: the head for your list.
* @member: the name of the list_head within the struct.
* @cond: optional lockdep expression if called from non-RCU protection.
*
* This list-traversal primitive may safely run concurrently with
* the _rcu list-mutation primitives such as list_add_rcu()
* as long as the traversal is guarded by rcu_read_lock().


Makes me wonder if the genpool can be extended and traversed
concurrently.

OFC, not sure if gen_pool_alloc_algo_owner() being in #MC context
makes a difference here or if I am missing something.

--
Thanks,
Avadhut Naik
> -Tony
>
>
>
>



2024-02-12 19:45:03

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

> It needs a proper explanation why that's ok rather than an empirical
> test only.

start_kernel()
  setup_arch()
    acpi stuff parses MADT and sets bits in possible map
  ...
  arch_cpu_finalize_init()
    calls mce_gen_pool_init()

-Tony

2024-02-12 19:49:57

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

I said:

> spin_lock(&pool->lock);
> list_add_rcu(&chunk->next_chunk, &pool->chunks);
> spin_unlock(&pool->lock);

Avadhut said:

> gen_pool_alloc_algo_owner() iterates over the existing chunks
> within the genpool through list_for_each_entry_rcu(), within
> an RCU read-side critical section (rcu_read_lock()).


> So the concern is not about traversal, but rather that the #MC can break the
> list_add_rcu(). Is this correct?

Yazen,

Yes. The question is whether a #MC that comes in the middle of list_add_rcu()
can safely do list_for_each_entry_rcu() on the same list.

RCU is black magic ... maybe it can do this? Adding Paul.

-Tony


2024-02-12 20:12:07

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Mon, Feb 12, 2024 at 07:49:43PM +0000, Luck, Tony wrote:
> Yes. The question is whether a #MC that comes in the middle of list_add_rcu()
> can safely do list_for_each_entry_rcu() on the same list.
>
> RCU is black magic ... maybe it can do this? Adding Paul.

Yeah, the list traversal might be ok as this is what that list_add does
- you can't encounter an inconsistent list - but we still take
a spinlock on addition and the commit which added it:

7f184275aa30 ("lib, Make gen_pool memory allocator lockless")

says

The lockless operation only works if there is enough memory available.
If new memory is added to the pool a lock has to be still taken. So
any user relying on locklessness has to ensure that sufficient memory
is preallocated.

and this is exactly what we're doing - adding new memory.

So, until we're absolutely sure that it is ok to interrupt a context
holding a spinlock with a #MC which is non-maskable, I don't think we
wanna do this.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-12 20:19:02

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Mon, Feb 12, 2024 at 01:40:21PM -0600, Naik, Avadhut wrote:
> The spinlock is mostly held to prevent other primitives
> from modifying chunks within the genpool.
>
> In #MC context however, we are not modifying the existing
> chunks, per se.

What if we decide to do

mce[i].count++;

in #MC context?

That's modifying the existing chunks.

TBH, I'm not sure what that spinlock is for. See my reply to that same
subthread.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-12 20:44:13

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Mon, Feb 12, 2024 at 09:10:38PM +0100, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 07:49:43PM +0000, Luck, Tony wrote:
> > Yes. The question is whether a #MC that comes in the middle of list_add_rcu()
> > can safely do list_for_each_entry_rcu() on the same list.
> >
> > RCU is black magic ... maybe it can do this? Adding Paul.
>
> Yeah, the list traversal might be ok as this is what that list_add does
> - you can't encounter an inconsistent list - but we still take
> a spinlock on addition and the commit which added it:
>
> 7f184275aa30 ("lib, Make gen_pool memory allocator lockless")
>
> says
>
> The lockless operation only works if there is enough memory available.
> If new memory is added to the pool a lock has to be still taken. So
> any user relying on locklessness has to ensure that sufficient memory
> is preallocated.
>
> and this is exactly what we're doing - adding new memory.

Is the #MC adding new memory, or is the interrupted context adding new
memory?

> So, until we're absolutely sure that it is ok to interrupt a context
> holding a spinlock with a #MC which is non-maskable, I don't think we
> wanna do this.

If it is the #MC adding new memory, agreed.

If the #MC is simply traversing the list, and the interrupted context
was in the midst of adding a new element, this should be no worse than
some other CPU traversing the list while this CPU is in the midst of
adding a new element.

Or am I missing a turn in here somewhere?

Thanx, Paul

2024-02-12 20:58:28

by Naik, Avadhut

[permalink] [raw]
Subject: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

Hi,

On 2/12/2024 14:18, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 01:40:21PM -0600, Naik, Avadhut wrote:
>> The spinlock is mostly held to prevent other primitives
>> from modifying chunks within the genpool.
>>
>> In #MC context however, we are not modifying the existing
>> chunks, per se.
>
> What if we decide to do
>
> mce[i].count++;
>
> in #MC context?
>
> That's modifying the existing chunks.
>
> TBH, I'm not sure what that spinlock is for. See my reply to that same
> subthread.
>
Yes, noticed your reply. Clears most of my confusion.

--
Thanks,
Avadhut Naik

2024-02-12 21:19:04

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

>> and this is exactly what we're doing - adding new memory.
>
> Is the #MC adding new memory, or is the interrupted context adding new
> memory?

The interrupted context is adding the memory.

>> So, until we're absolutely sure that it is ok to interrupt a context
>> holding a spinlock with a #MC which is non-maskable, I don't think we
>> wanna do this.
>
> If it is the #MC adding new memory, agreed.

Not what is happening.

> If the #MC is simply traversing the list, and the interrupted context
> was in the midst of adding a new element, this should be no worse than
> some other CPU traversing the list while this CPU is in the midst of
> adding a new element.
>
> Or am I missing a turn in here somewhere?

Not missing anything. I believe you've answered the question.

-Tony

2024-02-12 21:28:13

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Mon, Feb 12, 2024 at 12:44:06PM -0800, Paul E. McKenney wrote:
> If it is the #MC adding new memory, agreed.
>
> If the #MC is simply traversing the list, and the interrupted context
> was in the midst of adding a new element, this should be no worse than
> some other CPU traversing the list while this CPU is in the midst of
> adding a new element.

Right, Tony answered which context is doing what.

What I'm still scratching my head over is, why grab a spinlock around

list_add_rcu(&chunk->next_chunk, &pool->chunks);

?

That's the part that looks really weird.

And that's the interrupted context, yap.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-12 21:37:26

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Mon, Feb 12, 2024 at 07:41:03PM +0000, Luck, Tony wrote:
> > It needs a proper explanation why that's ok rather than an empirical
> > test only.
>
> start_kernel()
> ... setup_arch()
> .... acpi stuff parses MADT and sets bits in possible map
>
> ... arch_cpu_finalize_init()
> ... calls mce_gen_pool_init()

This made me question the "we don't have an allocator in
mce_gen_pool_init()". Because if we got through all the
ACPI stuff, we surely have an allocator.

Below patch doesn't explode at runtime.

-Tony

diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..81de877f2a51 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -16,14 +16,12 @@
* used to save error information organized in a lock-less list.
*
* This memory pool is only to be used to save MCE records in MCE context.
- * MCE events are rare, so a fixed size memory pool should be enough. Use
- * 2 pages to save MCE events for now (~80 MCE records at most).
+ * MCE events are rare, so a fixed size memory pool should be enough.
+ * Allocate on a sliding scale based on number of CPUs.
*/
-#define MCE_POOLSZ (2 * PAGE_SIZE)

static struct gen_pool *mce_evt_pool;
static LLIST_HEAD(mce_event_llist);
-static char gen_pool_buf[MCE_POOLSZ];

/*
* Compare the record "t" with each of the records on list "l" to see if
@@ -118,14 +116,23 @@ int mce_gen_pool_add(struct mce *mce)

static int mce_gen_pool_create(void)
{
+ int mce_numrecords, mce_poolsz;
struct gen_pool *tmpp;
int ret = -ENOMEM;
+ void *mce_pool;

tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
if (!tmpp)
goto out;

- ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
+ mce_numrecords = max(80, num_possible_cpus() * 8);
+ mce_poolsz = mce_numrecords * sizeof(struct mce_evt_llist);
+ mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
+ if (!mce_pool) {
+ gen_pool_destroy(tmpp);
+ goto out;
+ }
+ ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
if (ret) {
gen_pool_destroy(tmpp);
goto out;
--
2.43.0
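The sizing rule in the patch above can be sketched on its own: at least 80 records (roughly the historical 2-page default), scaled up by 8 records per possible CPU, then converted to bytes using the record size. A sketch under stated assumptions: `struct evt` is an illustrative stand-in for `struct mce_evt_llist`, whose real size the kernel would use.

```c
#include <stddef.h>

/* Illustrative record type; assumed ~128 bytes like struct mce_evt_llist. */
struct evt { unsigned long long payload[16]; };

/* max(80, ncpus * 8): keep the old floor, scale with CPU count. */
size_t pool_records(unsigned int ncpus)
{
	size_t n = (size_t)ncpus * 8;
	return n > 80 ? n : 80;
}

/* Pool size in bytes: one full record's worth of space per record. */
size_t pool_bytes(unsigned int ncpus)
{
	return pool_records(ncpus) * sizeof(struct evt);
}
```

On Tony's 144-CPU test box this works out to 1152 records, versus roughly 64-80 with the fixed two-page buffer.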


2024-02-12 22:09:33

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Mon, Feb 12, 2024 at 01:37:09PM -0800, Tony Luck wrote:
> This made me question the "we don't have an allocator in
> mce_gen_pool_init()". Because if we got through all the
> ACPI stuff, we surely have an allocator.
>
> Below patch doesn't explode at runtime.

Yap, I see it.

And I can't dig out why it didn't use to work and we had to do it this
way. The earliest thing I found is:

https://lore.kernel.org/all/[email protected]/T/#mf673ed669ee0d4c27b75bd48450c13c01cbb2cbf

I'll have to dig into my archives tomorrow, on a clear head...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-12 22:29:46

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Mon, Feb 12, 2024 at 11:08:33PM +0100, Borislav Petkov wrote:
> I'll have to dig into my archives tomorrow, on a clear head...

So I checked out 648ed94038c030245a06e4be59744fd5cdc18c40 which is
4.2-something.

And even back then, mcheck_cpu_init() gets called *after* mm_init()
which already initializes the allocators. So why did we allocate that
buffer statically?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-12 22:42:51

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Mon, Feb 12, 2024 at 11:19:13PM +0100, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 11:08:33PM +0100, Borislav Petkov wrote:
> > I'll have to dig into my archives tomorrow, on a clear head...
>
> So I checked out 648ed94038c030245a06e4be59744fd5cdc18c40 which is
> 4.2-something.
>
> And even back then, mcheck_cpu_init() gets called *after* mm_init()
> which already initializes the allocators. So why did we allocate that
> buffer statically?

Found it in my archives. You should have it too:

Date: Thu, 31 Jul 2014 02:51:25 -0400
From: "Chen, Gong" <[email protected]>
To: [email protected], [email protected]
Subject: Re: [RFC PATCH untest v2 1/4] x86, MCE: Provide a lock-less memory pool to save error record
Message-ID: <[email protected]>

and that's not on any ML that's why I can't find it on lore...

There's this fragment from Chen:

--------
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index bb92f38..a1b6841 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -2023,6 +2023,9 @@ int __init mcheck_init(void)
> {
> mcheck_intel_therm_init();
>
> + if (!mce_genpool_init())
> + mca_cfg.disabled = true;
> +
when setup_arch is called, memory subsystem hasn't been initialized,
which means I can't use regular page allocation function. So I still
need to put genpool init in mcheck_cpu_init.
--------

And that is still the case - mcheck_init() gets called in setup_arch()
and thus before mm_init(), which is called mm_core_init() now.

And on that same thread we agree that we should allocate it statically
but then the call to mce_gen_pool_init() ended up in mcheck_cpu_init()
which happens *after* mm_init().

What a big fscking facepalm. :-\

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-12 22:47:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Mon, Feb 12, 2024 at 10:27:41PM +0100, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 12:44:06PM -0800, Paul E. McKenney wrote:
> > If it is the #MC adding new memory, agreed.
> >
> > If the #MC is simply traversing the list, and the interrupted context
> > was in the midst of adding a new element, this should be no worse than
> > some other CPU traversing the list while this CPU is in the midst of
> > adding a new element.
>
> Right, Tony answered which context is doing what.
>
> What I'm still scratching my head over is, why grab a spinlock around
>
> list_add_rcu(&chunk->next_chunk, &pool->chunks);
>
> ?
>
> That's the part that looks really weird.
>
> And that's the interrupted context, yap.

The usual reason is to exclude other CPUs also doing list_add_rcu()
on the same list. Or is there other synchronization that is preventing
concurrent updates?

Thanx, Paul

2024-02-12 22:54:08

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

> The usual reason is to exclude other CPUs also doing list_add_rcu()
> on the same list. Or is there other synchronization that is preventing
> concurrent updates?

In this case the patch is trying to do a one-time bump of the pool size.
So no races here.

-Tony


2024-02-12 23:10:38

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Mon, Feb 12, 2024 at 02:46:57PM -0800, Paul E. McKenney wrote:
> The usual reason is to exclude other CPUs also doing list_add_rcu()
> on the same list.

Doh, it even says so in the comment above list_add_rcu().

And the traversal which is happening in NMI-like context is fine.

So phew, I think we should be fine here. Thanks!

And as it turns out, we're not going to need any of that after all as
it looks like we can allocate the proper size from the very beginning...

Lovely.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-13 01:45:41

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

On Tue, Feb 13, 2024 at 12:10:09AM +0100, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 02:46:57PM -0800, Paul E. McKenney wrote:
> > The usual reason is to exclude other CPUs also doing list_add_rcu()
> > on the same list.
>
> Doh, it even says so in the comment above list_add_rcu().
>
> And the traversal which is happening in NMI-like context is fine.
>
> So phew, I think we should be fine here. Thanks!
>
> And as it turns out, we're not going to need any of that after all as
> it looks like we can allocate the proper size from the very beginning...

Sounds even better! ;-)

Thanx, Paul

2024-02-15 20:18:25

by Naik, Avadhut

[permalink] [raw]
Subject: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

Hi,

On 2/12/2024 11:54, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 05:29:31PM +0000, Luck, Tony wrote:
>> Walking the structures already allocated from the genpool in the #MC
>> handler may be possible, but what is the criteria for "duplicates"?
>
> for each i in pool:
> memcmp(mce[i], new_mce, sizeof(struct mce));
>
> It'll probably need to mask out fields like ->time etc.
>
Also, some fields like cpuvendor, ppin and microcode will remain the same
for all MCEs received on a system. Can we avoid comparing them?

We already seem to have a function mce_cmp(), introduced back in
2016, which accomplishes something similar for fatal errors. But
it only checks for bank, status, addr and misc registers. Should
we just modify this function to compare MCEs? It should work for
fatal errors too.

For my own understanding:
The general motto for #MC or interrupt contexts is *keep it short
and sweet*. Though memcmp() is fairly optimized, we would still be
running a *for* loop in MC context. In case successive back-to-back
MCEs are being received and if the pool already has a fair number of
records, wouldn't this comparison significantly extend our stay
in #MC context?
I had discussed this with Yazen; IIUC, nested MCEs are not supported
on x86. Please correct me if I am wrong about this.

--
Thanks,
Avadhut Naik

2024-02-15 20:35:00

by Naik, Avadhut

[permalink] [raw]
Subject: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool



On 2/12/2024 03:32, Borislav Petkov wrote:
> On Mon, Feb 12, 2024 at 09:58:01AM +0100, Borislav Petkov wrote:
>> * Can you think of a slick deduplication scheme instead of blindly
>> raising the buffer size?
>
> And here's the simplest scheme: you don't extend the buffer. On
> overflow, you say "Buffer full, here's the MCE" and you dump the error
> log into dmesg. Problem solved.
>
Wouldn't this require logging in the #MC context?

We would know that the genpool is full in mce_gen_pool_add(), which, I
think, is executed in #MC context.

> A slicker deduplication scheme would be even better, tho. Maybe struct
> mce.count which gets incremented instead of adding the error record to
> the buffer again. And so on...
>

--
Thanks,
Avadhut Naik

2024-02-15 20:39:34

by Naik, Avadhut

[permalink] [raw]
Subject: [PATCH 2/2] x86/MCE: Add command line option to extend MCE Records pool

Hi,

On 2/12/2024 02:58, Borislav Petkov wrote:
> On Sun, Feb 11, 2024 at 08:54:29PM -0600, Naik, Avadhut wrote:
>> Okay. Will make changes to allocate memory and set size of the pool
>> when it is created. Also, will remove the command line parameter and
>> resubmit.
>
> Before you do, go read that original thread again but this time take
> your time to grok it.
>
> And then try answering those questions:
>
> * Why are *you* fixing this? I know what the AWS reason is, what is
> yours?
>
I think this issue of the genpool getting full with MCE records can occur
on AMD systems too, since the pool doesn't scale up with the number of
CPUs and the amount of memory in the system. The probability of occurrence
only increases as CPU count and memory increase. I feel that the genpool
size should be proportional to, at least, the CPU count of the system.

> * Can you think of a slick deduplication scheme instead of blindly
> raising the buffer size?
>
> * What's wrong with not logging some early errors, can we live with that
> too? If it were firmware-first, it cannot simply extend its buffer size
> because it has limited space. So what does firmware do in such cases?
>
I think that we can live with not logging some early errors, as long as
they are correctable.
I'm not entirely sure what you mean by firmware-first. Do you mean handling
of MCEs through HEST and GHES? Or something else?

> Think long and hard about the big picture, analyze the problem properly
> and from all angles before you go and do patches.
>
> Thx.
>

--
Thanks,
Avadhut Naik

2024-02-28 23:14:16

by Luck, Tony

[permalink] [raw]
Subject: [PATCH] x86/mce: Dynamically size space for machine check records

Systems with a large number of CPUs may generate a large
number of machine check records when things go seriously
wrong. But Linux has a fixed buffer that can only capture
a few dozen errors.

Allocate space based on the number of CPUs (with a minimum
value based on the historical fixed buffer that could store
80 records).

Signed-off-by: Tony Luck <[email protected]>
---

Discussion earlier concluded with the realization that it is
safe to dynamically allocate the mce_evt_pool at boot time.
So here's a patch to do that. Scaling algorithm here is a
simple linear "4 records per possible CPU" with a minimum
of 80 to match the legacy behavior. I'm open to other
suggestions.

Note that I threw in a "+1" to the return from ilog2() when
calling gen_pool_create(). From reading code, and running
some tests, it appears that the min_alloc_order argument
needs to be large enough to allocate one of the mce_evt_llist
structures.

Some other gen_pool users in Linux may also need this "+1".

arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..a1f0a8f29cf5 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -16,14 +16,13 @@
* used to save error information organized in a lock-less list.
*
* This memory pool is only to be used to save MCE records in MCE context.
- * MCE events are rare, so a fixed size memory pool should be enough. Use
- * 2 pages to save MCE events for now (~80 MCE records at most).
+ * MCE events are rare, so a fixed size memory pool should be enough.
+ * Allocate on a sliding scale based on number of CPUs.
*/
-#define MCE_POOLSZ (2 * PAGE_SIZE)
+#define MCE_MIN_ENTRIES 80

static struct gen_pool *mce_evt_pool;
static LLIST_HEAD(mce_event_llist);
-static char gen_pool_buf[MCE_POOLSZ];

/*
* Compare the record "t" with each of the records on list "l" to see if
@@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)

static int mce_gen_pool_create(void)
{
+ int mce_numrecords, mce_poolsz;
struct gen_pool *tmpp;
int ret = -ENOMEM;
+ void *mce_pool;
+ int order;

- tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
+ order = ilog2(sizeof(struct mce_evt_llist)) + 1;
+ tmpp = gen_pool_create(order, -1);
if (!tmpp)
goto out;

- ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
+ mce_numrecords = max(80, num_possible_cpus() * 4);
+ mce_poolsz = mce_numrecords * (1 << order);
+ mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
+ if (!mce_pool) {
+ gen_pool_destroy(tmpp);
+ goto out;
+ }
+ ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
if (ret) {
gen_pool_destroy(tmpp);
goto out;
--
2.43.0


2024-02-29 00:40:00

by Sohil Mehta

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Dynamically size space for machine check records

On 2/28/2024 3:14 PM, Tony Luck wrote:

> static int mce_gen_pool_create(void)
> {
> + int mce_numrecords, mce_poolsz;
> struct gen_pool *tmpp;
> int ret = -ENOMEM;
> + void *mce_pool;
> + int order;
>
> - tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
> + order = ilog2(sizeof(struct mce_evt_llist)) + 1;
> + tmpp = gen_pool_create(order, -1);
> if (!tmpp)
> goto out;
>
> - ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
> + mce_numrecords = max(80, num_possible_cpus() * 4);
> + mce_poolsz = mce_numrecords * (1 << order);
> + mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
> + if (!mce_pool) {
> + gen_pool_destroy(tmpp);
> + goto out;
> + }
> + ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
> if (ret) {
> gen_pool_destroy(tmpp);

Is this missing a kfree(mce_pool) here?

> goto out;


2024-02-29 00:44:27

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH] x86/mce: Dynamically size space for machine check records

>> + ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
>> if (ret) {
>> gen_pool_destroy(tmpp);
>
> Is this missing a kfree(mce_pool) here?
>
>> goto out;

Yes. Will add.

Thanks

-Tony

2024-02-29 02:06:51

by Sohil Mehta

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Dynamically size space for machine check records

A few other nits.

On 2/28/2024 3:14 PM, Tony Luck wrote:
> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
> index fbe8b61c3413..a1f0a8f29cf5 100644
> --- a/arch/x86/kernel/cpu/mce/genpool.c
> +++ b/arch/x86/kernel/cpu/mce/genpool.c
> @@ -16,14 +16,13 @@
> * used to save error information organized in a lock-less list.
> *
> * This memory pool is only to be used to save MCE records in MCE context.
> - * MCE events are rare, so a fixed size memory pool should be enough. Use
> - * 2 pages to save MCE events for now (~80 MCE records at most).
> + * MCE events are rare, so a fixed size memory pool should be enough.
> + * Allocate on a sliding scale based on number of CPUs.
> */
> -#define MCE_POOLSZ (2 * PAGE_SIZE)
> +#define MCE_MIN_ENTRIES 80
>
> static struct gen_pool *mce_evt_pool;
> static LLIST_HEAD(mce_event_llist);
> -static char gen_pool_buf[MCE_POOLSZ];
>
> /*
> * Compare the record "t" with each of the records on list "l" to see if
> @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
>
> static int mce_gen_pool_create(void)
> {
> + int mce_numrecords, mce_poolsz;

Should order be also declared in this line? That way we can have all the
uninitialized 'int's together.

> struct gen_pool *tmpp;
> int ret = -ENOMEM;
> + void *mce_pool;
> + int order;
>
> - tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
> + order = ilog2(sizeof(struct mce_evt_llist)) + 1;

I didn't exactly understand why a +1 is needed here. Do you have a
pointer to somewhere to help understand this?

Also, I think, a comment on top might be useful since this isn't obvious.

> + tmpp = gen_pool_create(order, -1);
> if (!tmpp)
> goto out;
>
> - ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
> + mce_numrecords = max(80, num_possible_cpus() * 4);

How about using MCE_MIN_ENTRIES here?

> + mce_poolsz = mce_numrecords * (1 << order);
> + mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
> + if (!mce_pool) {
> + gen_pool_destroy(tmpp);
> + goto out;
> + }
> + ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
> if (ret) {
> gen_pool_destroy(tmpp);
> goto out;


2024-02-29 06:44:16

by Naik, Avadhut

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Dynamically size space for machine check records

Hi,

On 2/28/2024 17:14, Tony Luck wrote:
> Systems with a large number of CPUs may generate a large
> number of machine check records when things go seriously
> wrong. But Linux has a fixed buffer that can only capture
> a few dozen errors.
>
> Allocate space based on the number of CPUs (with a minimum
> value based on the historical fixed buffer that could store
> 80 records).
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
>
> Discussion earlier concluded with the realization that it is
> safe to dynamically allocate the mce_evt_pool at boot time.
> So here's a patch to do that. Scaling algorithm here is a
> simple linear "4 records per possible CPU" with a minimum
> of 80 to match the legacy behavior. I'm open to other
> suggestions.
>
> Note that I threw in a "+1" to the return from ilog2() when
> calling gen_pool_create(). From reading code, and running
> some tests, it appears that the min_alloc_order argument
> needs to be large enough to allocate one of the mce_evt_llist
> structures.
>
> Some other gen_pool users in Linux may also need this "+1".
>

Somewhat confused here. Weren't we also exploring ways to avoid
duplicate records from being added to the genpool? Has something
changed in that regard?

> arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++------
> 1 file changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
> index fbe8b61c3413..a1f0a8f29cf5 100644
> --- a/arch/x86/kernel/cpu/mce/genpool.c
> +++ b/arch/x86/kernel/cpu/mce/genpool.c
> @@ -16,14 +16,13 @@
> * used to save error information organized in a lock-less list.
> *
> * This memory pool is only to be used to save MCE records in MCE context.
> - * MCE events are rare, so a fixed size memory pool should be enough. Use
> - * 2 pages to save MCE events for now (~80 MCE records at most).
> + * MCE events are rare, so a fixed size memory pool should be enough.
> + * Allocate on a sliding scale based on number of CPUs.
> */
> -#define MCE_POOLSZ (2 * PAGE_SIZE)
> +#define MCE_MIN_ENTRIES 80
>
> static struct gen_pool *mce_evt_pool;
> static LLIST_HEAD(mce_event_llist);
> -static char gen_pool_buf[MCE_POOLSZ];
>
> /*
> * Compare the record "t" with each of the records on list "l" to see if
> @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
>
> static int mce_gen_pool_create(void)
> {
> + int mce_numrecords, mce_poolsz;
> struct gen_pool *tmpp;
> int ret = -ENOMEM;
> + void *mce_pool;
> + int order;
>
> - tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
> + order = ilog2(sizeof(struct mce_evt_llist)) + 1;
> + tmpp = gen_pool_create(order, -1);
> if (!tmpp)
> goto out;
>
> - ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
> + mce_numrecords = max(80, num_possible_cpus() * 4);
> + mce_poolsz = mce_numrecords * (1 << order);
> + mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);

To err on the side of caution, wouldn't kzalloc() be a safer choice here?

--
Thanks,
Avadhut Naik

2024-02-29 08:41:13

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Dynamically size space for machine check records

On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote:
> Somewhat confused here. Weren't we also exploring ways to avoid
> duplicate records from being added to the genpool? Has something
> changed in that regard?

You can always send patches proposing how *you* think this duplicate
elimination should look like and we can talk. :)

I don't think anyone would mind it if done properly but first you'd need
a real-life use case. As in, do we log sooo many duplicates such that
we'd want to dedup?

Hmmm.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-29 15:55:26

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Dynamically size space for machine check records

On 2/28/2024 8:56 PM, Sohil Mehta wrote:
> A few other nits.
>
> On 2/28/2024 3:14 PM, Tony Luck wrote:
>> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
>> index fbe8b61c3413..a1f0a8f29cf5 100644
>> --- a/arch/x86/kernel/cpu/mce/genpool.c
>> +++ b/arch/x86/kernel/cpu/mce/genpool.c
>> @@ -16,14 +16,13 @@
>> * used to save error information organized in a lock-less list.
>> *
>> * This memory pool is only to be used to save MCE records in MCE context.
>> - * MCE events are rare, so a fixed size memory pool should be enough. Use
>> - * 2 pages to save MCE events for now (~80 MCE records at most).
>> + * MCE events are rare, so a fixed size memory pool should be enough.
>> + * Allocate on a sliding scale based on number of CPUs.
>> */
>> -#define MCE_POOLSZ (2 * PAGE_SIZE)
>> +#define MCE_MIN_ENTRIES 80
>>
>> static struct gen_pool *mce_evt_pool;
>> static LLIST_HEAD(mce_event_llist);
>> -static char gen_pool_buf[MCE_POOLSZ];
>>
>> /*
>> * Compare the record "t" with each of the records on list "l" to see if
>> @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
>>
>> static int mce_gen_pool_create(void)
>> {
>> + int mce_numrecords, mce_poolsz;
>
> Should order be also declared in this line? That way we can have all the
> uninitialized 'int's together.
>
>> struct gen_pool *tmpp;
>> int ret = -ENOMEM;
>> + void *mce_pool;
>> + int order;
>>
>> - tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
>> + order = ilog2(sizeof(struct mce_evt_llist)) + 1;
>
> I didn't exactly understand why a +1 is needed here. Do you have a
> pointer to somewhere to help understand this?
>
> Also, I think, a comment on top might be useful since this isn't obvious.
>

Would order_base_2() work here? It automatically rounds up to the next power of two.

Thanks,
Yazen


2024-02-29 17:22:20

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Dynamically size space for machine check records

On Thu, Feb 29, 2024 at 10:49:18AM -0500, Yazen Ghannam wrote:
> On 2/28/2024 8:56 PM, Sohil Mehta wrote:
> > > + order = ilog2(sizeof(struct mce_evt_llist)) + 1;
> >
> > I didn't exactly understand why a +1 is needed here. Do you have a
> > pointer to somewhere to help understand this?
> >
> > Also, I think, a comment on top might be useful since this isn't obvious.
> >
>
> Would order_base_2() work here? It automatically rounds up to the next power.

Yes! That's exactly what is needed here. Thanks!

-Tony

2024-02-29 17:22:33

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Dynamically size space for machine check records

On Wed, Feb 28, 2024 at 05:56:26PM -0800, Sohil Mehta wrote:
> A few other nits.
>
> On 2/28/2024 3:14 PM, Tony Luck wrote:
> > diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
> > index fbe8b61c3413..a1f0a8f29cf5 100644
> > --- a/arch/x86/kernel/cpu/mce/genpool.c
> > +++ b/arch/x86/kernel/cpu/mce/genpool.c
> > @@ -16,14 +16,13 @@
> > * used to save error information organized in a lock-less list.
> > *
> > * This memory pool is only to be used to save MCE records in MCE context.
> > - * MCE events are rare, so a fixed size memory pool should be enough. Use
> > - * 2 pages to save MCE events for now (~80 MCE records at most).
> > + * MCE events are rare, so a fixed size memory pool should be enough.
> > + * Allocate on a sliding scale based on number of CPUs.
> > */
> > -#define MCE_POOLSZ (2 * PAGE_SIZE)
> > +#define MCE_MIN_ENTRIES 80
> >
> > static struct gen_pool *mce_evt_pool;
> > static LLIST_HEAD(mce_event_llist);
> > -static char gen_pool_buf[MCE_POOLSZ];
> >
> > /*
> > * Compare the record "t" with each of the records on list "l" to see if
> > @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
> >
> > static int mce_gen_pool_create(void)
> > {
> > + int mce_numrecords, mce_poolsz;
>
> Should order be also declared in this line? That way we can have all the
> uninitialized 'int's together.

Sure. Even with the addition of "order" the line is still short enough.

> > struct gen_pool *tmpp;
> > int ret = -ENOMEM;
> > + void *mce_pool;
> > + int order;
> >
> > - tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
> > + order = ilog2(sizeof(struct mce_evt_llist)) + 1;
>
> I didn't exactly understand why a +1 is needed here. Do you have a
> pointer to somewhere to help understand this?
>
> Also, I think, a comment on top might be useful since this isn't obvious.

gen_pool works in power-of-two blocks. The user must pick a specific
size with the gen_pool_create() call. Internally the gen_pool code uses
a bitmap to track which blocks in the pool are allocated/free.

Looking at this specific case, sizeof(struct mce_evt_llist) is 136. So
the original version of this code picks order 7 to allocate in 128 byte
units. But this means that every allocation of a mce_evt_llist will take
two 128-byte blocks.

Net result is that the comment at the top of arch/x86/kernel/cpu/mce/genpool.c
that two pages are enough for ~80 records was wrong when written. At
that point struct mce_evt_llist was below 128, so order was 6, and each
allocation took two blocks. So two pages = 8192 bytes divided by (2 * 64)
results in 64 possible allocations.

But over time Intel and AMD added to the structure. So the current math
comes out at just 32 allocations before the pool is out of space.

Yazen provided the right answer for this. Change to use order_base_2()

> > + tmpp = gen_pool_create(order, -1);
> > if (!tmpp)
> > goto out;
> >
> > - ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
> > + mce_numrecords = max(80, num_possible_cpus() * 4);
>
> How about using MCE_MIN_ENTRIES here?

Oops. I meant to do that when I added the #define!

I've also added a "#define MCE_PER_CPU 4" instead of
a raw "4" in that expression.

-Tony

2024-02-29 17:27:06

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Dynamically size space for machine check records

On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote:
> Hi,
>
> On 2/28/2024 17:14, Tony Luck wrote:
> > Systems with a large number of CPUs may generate a large
> > number of machine check records when things go seriously
> > wrong. But Linux has a fixed buffer that can only capture
> > a few dozen errors.
> >
> > Allocate space based on the number of CPUs (with a minimum
> > value based on the historical fixed buffer that could store
> > 80 records).
> >
> > Signed-off-by: Tony Luck <[email protected]>
> > ---
> >
> > Discussion earlier concluded with the realization that it is
> > safe to dynamically allocate the mce_evt_pool at boot time.
> > So here's a patch to do that. Scaling algorithm here is a
> > simple linear "4 records per possible CPU" with a minimum
> > of 80 to match the legacy behavior. I'm open to other
> > suggestions.
> >
> > Note that I threw in a "+1" to the return from ilog2() when
> > calling gen_pool_create(). From reading code, and running
> > some tests, it appears that the min_alloc_order argument
> > needs to be large enough to allocate one of the mce_evt_llist
> > structures.
> >
> > Some other gen_pool users in Linux may also need this "+1".
> >
>
> Somewhat confused here. Weren't we also exploring ways to avoid
> duplicate records from being added to the genpool? Has something
> changed in that regard?

I'm going to cover this in the reply to Boris.

> > + mce_numrecords = max(80, num_possible_cpus() * 4);
> > + mce_poolsz = mce_numrecords * (1 << order);
> > + mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
>
> To err on the side of caution, wouldn't kzalloc() be a safer choice here?

Seems like too much caution. When mce_gen_pool_add() allocates
an entry from the pool it does:

memcpy(&node->mce, mce, sizeof(*mce));
llist_add(&node->llnode, &mce_event_llist);

between those two lines, every field in the struct mce_evt_llist
is written (including any holes in the struct mce part of the structure).

-Tony

2024-02-29 17:47:45

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Dynamically size space for machine check records

On Thu, Feb 29, 2024 at 09:39:51AM +0100, Borislav Petkov wrote:
> On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote:
> > Somewhat confused here. Weren't we also exploring ways to avoid
> > duplicate records from being added to the genpool? Has something
> > changed in that regard?
>
> You can always send patches proposing how *you* think this duplicate
> elimination should look like and we can talk. :)
>
> I don't think anyone would mind it if done properly but first you'd need
> a real-life use case. As in, do we log sooo many duplicates such that
> we'd want to dedup?

There are definitely cases where dedup will not help. If a row fails in a
DIMM there will be a flood of correctable errors with different addresses
(depending on the number of channels in the interleave scheme for a system
this may be dozens or hundreds of distinct addresses).

Same for other failures in structures like column and rank.

As to "real-life" use cases. A search on Lore for "MCE records pool
full!" only finds threads about modifications to this code. So the
general population of Linux developers isn't seeing this.

But a search in my internal e-mail box kicks up a dozen or so distinct
hits from internal validation teams in just the past year. But those
folks are super-dedicated to finding corner cases. Just this morning I
got a triumphant e-mail from someone who reproduced an issue "after 1.6
million error injections".

I'd bet that large cloud providers with systems numbering in the hundreds
of thousands see the MCE pool exhausted from time to time.

Open question is "How many error records do you need to diagnose the
cause of various floods of errors?"

I'm going to say that the current 32 (see earlier e-mail today to
Sohil https://lore.kernel.org/all/ZeC8_jzdFnkpPVPf@agluck-desk3/ )
isn't enough for big systems. It may be hard to distinguish between
the various bulk fail modes with just that many error logs.

-Tony

2024-02-29 18:39:13

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH] x86/mce: Dynamically size space for machine check records

> Wouldn't having dedup actually increase the time we spend #MC context?
> Comparing the new MCE record against each existing record in the
> genpool.

Yes, dedup would take extra time (increasing linearly with the number
of pending errors that were not filtered out by the dedup process).

> AFAIK, MCEs cannot be nested. Correct me if I am wrong here.

Can't be nested on the same CPU. But multiple CPUs may take
a local machine check simultaneously. Local machine check is
opt-in on Intel, I believe it is default on AMD.

Errors can also be signaled with CMCI.

> In a flood situation, like the one described above, that is exactly
> what may happen: An MCE coming in while the dedup mechanism is
> underway (in #MC context).

In a flood of errors it would be complicated to synchronize dedup filtering
on multiple CPUs. The trade-off between trying to get that code right,
and just allocating a few extra Kbytes of memory would seem to favor
allocating more memory.

-Tony

2024-02-29 18:46:02

by Naik, Avadhut

[permalink] [raw]
Subject: [PATCH] x86/mce: Dynamically size space for machine check records



On 2/29/2024 11:47, Tony Luck wrote:
> On Thu, Feb 29, 2024 at 09:39:51AM +0100, Borislav Petkov wrote:
>> On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote:
>>> Somewhat confused here. Weren't we also exploring ways to avoid
>>> duplicate records from being added to the genpool? Has something
>>> changed in that regard?
>>
>> You can always send patches proposing how *you* think this duplicate
>> elimination should look like and we can talk. :)
>>
>> I don't think anyone would mind it if done properly but first you'd need
>> a real-life use case. As in, do we log sooo many duplicates such that
>> we'd want to dedup?
>
> There are definitly cases where dedup will not help. If a row fails in a
> DIMM there will be a flood of correctable errors with different addresses
> (depending on number of channels in the interleave schema for a system
> this may be dozens or hundreds of distinct addresses).
>
> Same for other failures in structures like column and rank.
>

Wouldn't having dedup actually increase the time we spend #MC context?
Comparing the new MCE record against each existing record in the
genpool.

AFAIK, MCEs cannot be nested. Correct me if I am wrong here.

In a flood situation, like the one described above, that is exactly
what may happen: An MCE coming in while the dedup mechanism is
underway (in #MC context).

--
Thanks,
Avadhut Naik

2024-02-29 23:56:54

by Sohil Mehta

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: Dynamically size space for machine check records

On 2/29/2024 9:21 AM, Tony Luck wrote:

> Looking at this specific case, sizeof(struct mce_evt_llist) is 136. So
> the original version of this code picks order 7 to allocate in 128 byte
> units. But this means that every allocation of a mce_evt_llist will take
> two 128-byte blocks.
>
> Net result is that the comment at the top of arch/x86/kernel/cpu/mce/genpool.c
> that two pages are enough for ~80 records was wrong when written. At
> that point struct mce_evt_llist was below 128, so order was 6, and each
> allocation took two blocks. So two pages = 8192 bytes divided by (2 * 64)
> results in 64 possible allocations.
>

Thanks for the explanation. The part that got me is that I somehow
expected ilog2() to round-up and not round-down.

> But over time Intel and AMD added to the structure. So the current math
> comes out at just 32 allocations before the pool is out of space.
>
> Yazen provided the right answer for this. Change to use order_base_2()
>

Yes, I agree. order_base_2() is better than doing ilog2(struct size) +
1. In the rare case of the size being exactly a power of 2, we don't
need the +1.




2024-03-06 21:52:55

by Naik, Avadhut

[permalink] [raw]
Subject: [PATCH] x86/mce: Dynamically size space for machine check records

Hi,

On 2/28/2024 17:14, Tony Luck wrote:
> Systems with a large number of CPUs may generate a large
> number of machine check records when things go seriously
> wrong. But Linux has a fixed buffer that can only capture
> a few dozen errors.
>
> Allocate space based on the number of CPUs (with a minimum
> value based on the historical fixed buffer that could store
> 80 records).
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
>
> Discussion earlier concluded with the realization that it is
> safe to dynamically allocate the mce_evt_pool at boot time.
> So here's a patch to do that. Scaling algorithm here is a
> simple linear "4 records per possible CPU" with a minimum
> of 80 to match the legacy behavior. I'm open to other
> suggestions.
>
> Note that I threw in a "+1" to the return from ilog2() when
> calling gen_pool_create(). From reading code, and running
> some tests, it appears that the min_alloc_order argument
> needs to be large enough to allocate one of the mce_evt_llist
> structures.
>
> Some other gen_pool users in Linux may also need this "+1".
>
> arch/x86/kernel/cpu/mce/genpool.c | 22 ++++++++++++++++------
> 1 file changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
> index fbe8b61c3413..a1f0a8f29cf5 100644
> --- a/arch/x86/kernel/cpu/mce/genpool.c
> +++ b/arch/x86/kernel/cpu/mce/genpool.c
> @@ -16,14 +16,13 @@
> * used to save error information organized in a lock-less list.
> *
> * This memory pool is only to be used to save MCE records in MCE context.
> - * MCE events are rare, so a fixed size memory pool should be enough. Use
> - * 2 pages to save MCE events for now (~80 MCE records at most).
> + * MCE events are rare, so a fixed size memory pool should be enough.
> + * Allocate on a sliding scale based on number of CPUs.
> */
> -#define MCE_POOLSZ (2 * PAGE_SIZE)
> +#define MCE_MIN_ENTRIES 80
>
> static struct gen_pool *mce_evt_pool;
> static LLIST_HEAD(mce_event_llist);
> -static char gen_pool_buf[MCE_POOLSZ];
>
> /*
> * Compare the record "t" with each of the records on list "l" to see if
> @@ -118,14 +117,25 @@ int mce_gen_pool_add(struct mce *mce)
>
> static int mce_gen_pool_create(void)
> {
> + int mce_numrecords, mce_poolsz;
> struct gen_pool *tmpp;
> int ret = -ENOMEM;
> + void *mce_pool;
> + int order;
>
> - tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
> + order = ilog2(sizeof(struct mce_evt_llist)) + 1;
> + tmpp = gen_pool_create(order, -1);
> if (!tmpp)
> goto out;
>
> - ret = gen_pool_add(tmpp, (unsigned long)gen_pool_buf, MCE_POOLSZ, -1);
> + mce_numrecords = max(80, num_possible_cpus() * 4);

Per Boris's below suggestion, shouldn't this be:
mce_numrecords = max(80, num_possible_cpus() * 16);

>> min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);
>
> max() ofc.
>
>> There's a sane minimum and one page pro logical CPU should be fine on
>> pretty much every configuration...

4 MCE records per CPU equates to 1024 bytes, given the genpool's intrinsic
behavior you explained in the other subthread.

Apart from this, tested the patch on a couple of AMD systems. Didn't observe any
issues.

> + mce_poolsz = mce_numrecords * (1 << order);
> + mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
> + if (!mce_pool) {
> + gen_pool_destroy(tmpp);
> + goto out;
> + }
> + ret = gen_pool_add(tmpp, (unsigned long)mce_pool, mce_poolsz, -1);
> if (ret) {
> gen_pool_destroy(tmpp);
> goto out;

--
Thanks,
Avadhut Naik

2024-03-06 22:07:55

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH] x86/mce: Dynamically size space for machine check records

> > + mce_numrecords = max(80, num_possible_cpus() * 4);
>
> Per Boris's below suggestion, shouldn't this be:
> mce_numrecords = max(80, num_possible_cpus() * 16);
>
> >> min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);
> >
> > max() ofc.
> >
> >> There's a sane minimum and one page pro logical CPU should be fine on
> >> pretty much every configuration...
>
> 4 MCE records per CPU equates to 1024 bytes, considering the genpool intrinsic
> behavior you explained in the other subthread.

Picking a good number of records-per-core may be more art than science. Boris
is right that a page per CPU shouldn't cause any significant issue to systems with
many CPUs, because they should have copious amounts of memory to make a
balanced configuration. But 16 records per CPU feels way too high to me. The
theoretical limit in a single scan of machine check banks on Intel is 32 (since
Intel never has more than 32 banks). But those banks cover diverse h/w devices
and it seems improbable that all, or even most, of them would log errors at the
same time, with all CPUs on all sockets doing the same.

After I posted the version with num_possible_cpus() * 4 I began to wonder whether
"2" would be enough.

> Apart from this, tested the patch on a couple of AMD systems. Didn't observe any
> issues.

Thanks very much for testing.

-Tony

2024-03-06 23:22:06

by Naik, Avadhut

Subject: [PATCH] x86/mce: Dynamically size space for machine check records



On 3/6/2024 16:07, Luck, Tony wrote:
>>> + mce_numrecords = max(80, num_possible_cpus() * 4);
>>
>> Per Boris's below suggestion, shouldn't this be:
>> mce_numrecords = max(80, num_possible_cpus() * 16);
>>
>>>> min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);
>>>
>>> max() ofc.
>>>
>>>> There's a sane minimum and one page pro logical CPU should be fine on
>>>> pretty much every configuration...
>>
>> 4 MCE records per CPU equates to 1024 bytes, considering the genpool intrinsic
>> behavior you explained in the other subthread.
>
> Picking a good number of records-per-core may be more art than science. Boris
> is right that a page per CPU shouldn't cause any significant issue to systems with
> many CPUs, because they should have copious amounts of memory to make a
> balanced configuration. But 16 records per CPU feels way too high to me. The
> theoretical limit in a single scan of machine check banks on Intel is 32 (since
> Intel never has more than 32 banks). But those banks cover diverse h/w devices
> and it seems improbable that all, or even most, of them would log errors at the
> same time, with all CPUs on all sockets doing the same.
>
> After I posted the version with num_possible_cpus() * 4 I began to wonder whether
> "2" would be enough.
>

I was thinking along the same lines, that 16 MCE records per thread might be too
high. But since Boris made the suggestion, I thought there might be a use case
that I am unaware of. Perhaps some issue that had been debugged in the past.
Hence my earlier question about whether it should be 16 instead of 4.
I think 2 records per CPU should also be good. IIRC, the patch that I submitted
reserved space for 2 records per logical CPU in the genpool.

>> Apart from this, tested the patch on a couple of AMD systems. Didn't observe any
>> issues.
>
> Thanks very much for testing.
>
> -Tony

--
Thanks,
Avadhut Naik