2021-11-04 09:17:50

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: correct cpu_missing reporting in mce_timed_out

On Thu, Nov 04, 2021 at 03:44:31PM +0800, Zhaolong Zhang wrote:
> set cpu_missing before mce_panic() so that it prints correct msg.
>
> Signed-off-by: Zhaolong Zhang <[email protected]>
> ---
> arch/x86/kernel/cpu/mce/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 50a3e455cded..ccefe131ab55 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -903,13 +903,13 @@ static int mce_timed_out(u64 *t, const char *msg)
> if (!mca_cfg.monarch_timeout)
> goto out;
> if ((s64)*t < SPINUNIT) {
> + cpu_missing = 1;
> if (mca_cfg.tolerant <= 1) {
> if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
> pr_emerg("CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
> cpumask_pr_args(&mce_missing_cpus));
> mce_panic(msg, NULL, NULL);
> }
> - cpu_missing = 1;
> return 1;
> }
> *t -= SPINUNIT;
> --

Frankly, we might just as well kill that cpu_missing thing because we
already say that some CPUs are not responding.

And that "Some CPUs didn't answer in synchronization" is not really
telling me a whole lot.

Tony, do you see any real need to keep it?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


2021-11-04 15:52:50

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH] x86/mce: correct cpu_missing reporting in mce_timed_out

> Frankly, we might just as well kill that cpu_missing thing because we
> already say that some CPUs are not responding.

Yes. The more recent commit:

7bb39313cd62 ("x86/mce: Make mce_timed_out() identify holdout CPUs")

tries to provide the more detailed message about *which* CPUs are missing

> Tony, do you see any real need to keep it?

I think cpu_missing can be dropped.

-Tony

2021-11-04 18:02:42

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: correct cpu_missing reporting in mce_timed_out

On Thu, Nov 04, 2021 at 03:47:36PM +0000, Luck, Tony wrote:
> > Frankly, we might just as well kill that cpu_missing thing because we
> > already say that some CPUs are not responding.
>
> Yes. The more recent commit:
>
> 7bb39313cd62 ("x86/mce: Make mce_timed_out() identify holdout CPUs")
>
> tries to provide the more detailed message about *which* CPUs are missing

Exactly.

> I think cpu_missing can be dropped.

Zhaolong, you could send a patch doing that, instead.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-11-05 02:50:27

by Zhaolong Zhang

[permalink] [raw]
Subject: Re:Re: [PATCH] x86/mce: correct cpu_missing reporting in mce_timed_out


At 2021-11-05 02:02:36, "Borislav Petkov" <[email protected]> wrote:
>On Thu, Nov 04, 2021 at 03:47:36PM +0000, Luck, Tony wrote:
>> > Frankly, we might just as well kill that cpu_missing thing because we
>> > already say that some CPUs are not responding.
>>
>> Yes. The more recent commit:
>>
>> 7bb39313cd62 ("x86/mce: Make mce_timed_out() identify holdout CPUs")
>>
>> tries to provide the more detailed message about *which* CPUs are missing
>
>Exactly.
>
>> I think cpu_missing can be dropped.
>
>Zhaolong, you could send a patch doing that, instead.

Thanks for the reply. Let me see how to do it properly.

Regards,
Zhaolong

>
>Thx.
>
>--
>Regards/Gruss,
> Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette

2021-11-08 09:00:41

by Zhaolong Zhang

[permalink] [raw]
Subject: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus

move mce_missing_cpus checking into mce_panic() as well, because we don't want
to lose the cpu missing information in case mca_cfg.tolerant > 1 and there is
no_way_out.

Signed-off-by: Zhaolong Zhang <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 38 ++++++++++++++++++++--------------
1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 50a3e455cded..0bb59e68a457 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -99,7 +99,6 @@ struct mca_config mca_cfg __read_mostly = {

static DEFINE_PER_CPU(struct mce, mces_seen);
static unsigned long mce_need_notify;
-static int cpu_missing;

/*
* MCA banks polled by the period polling timer for corrected events.
@@ -253,6 +252,12 @@ static atomic_t mce_panicked;
static int fake_panic;
static atomic_t mce_fake_panicked;

+/*
+ * Track which CPUs entered the MCA broadcast synchronization and which not in
+ * order to print holdouts.
+ */
+static cpumask_t mce_missing_cpus = CPU_MASK_ALL;
+
/* Panic in progress. Enable interrupts and wait for final IPI */
static void wait_for_panic(void)
{
@@ -314,8 +319,13 @@ static void mce_panic(const char *msg, struct mce *final, char *exp)
if (!apei_err)
apei_err = apei_write_mce(final);
}
- if (cpu_missing)
- pr_emerg(HW_ERR "Some CPUs didn't answer in synchronization\n");
+ /*
+ * cpu_online_mask == &mce_missing_cpus means it is reset and no timeout happens.
+ */
+ if (!cpumask_equal(cpu_online_mask, &mce_missing_cpus) &&
+ cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
+ pr_emerg(HW_ERR "CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
+ cpumask_pr_args(&mce_missing_cpus));
if (exp)
pr_emerg(HW_ERR "Machine check: %s\n", exp);
if (!fake_panic) {
@@ -880,12 +890,6 @@ static atomic_t mce_executing;
*/
static atomic_t mce_callin;

-/*
- * Track which CPUs entered the MCA broadcast synchronization and which not in
- * order to print holdouts.
- */
-static cpumask_t mce_missing_cpus = CPU_MASK_ALL;
-
/*
* Check if a timeout waiting for other CPUs happened.
*/
@@ -904,12 +908,8 @@ static int mce_timed_out(u64 *t, const char *msg)
goto out;
if ((s64)*t < SPINUNIT) {
if (mca_cfg.tolerant <= 1) {
- if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
- pr_emerg("CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
- cpumask_pr_args(&mce_missing_cpus));
mce_panic(msg, NULL, NULL);
}
- cpu_missing = 1;
return 1;
}
*t -= SPINUNIT;
@@ -1079,8 +1079,10 @@ static int mce_end(int order)

if (!timeout)
goto reset;
- if (order < 0)
+ if (order < 0) {
+ timeout = 0;
goto reset;
+ }

/*
* Allow others to run.
@@ -1128,7 +1130,12 @@ static int mce_end(int order)
reset:
atomic_set(&global_nwo, 0);
atomic_set(&mce_callin, 0);
- cpumask_setall(&mce_missing_cpus);
+ /*
+ * Don't reset mce_missing_cpus if there is mce_timed_out() so that
+ * mce_panic() can report right thing.
+ */
+ if (!((s64)timeout < SPINUNIT))
+ cpumask_setall(&mce_missing_cpus);
barrier();

/*
@@ -2720,7 +2727,6 @@ struct dentry *mce_get_debugfs_dir(void)

static void mce_reset(void)
{
- cpu_missing = 0;
atomic_set(&mce_fake_panicked, 0);
atomic_set(&mce_executing, 0);
atomic_set(&mce_callin, 0);
--
2.27.0


2021-11-08 09:32:06

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus

On Mon, Nov 08, 2021 at 04:28:32PM +0800, Zhaolong Zhang wrote:
> move mce_missing_cpus checking into mce_panic() as well, because we don't want
> to lose the cpu missing information in case mca_cfg.tolerant > 1 and there is
> no_way_out.
>
> Signed-off-by: Zhaolong Zhang <[email protected]>
> ---
> arch/x86/kernel/cpu/mce/core.c | 38 ++++++++++++++++++++--------------
> 1 file changed, 22 insertions(+), 16 deletions(-)

I was actually expecting to see something like this:

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6ed365337a3b..30de00fe0d7a 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -99,7 +99,6 @@ struct mca_config mca_cfg __read_mostly = {

static DEFINE_PER_CPU(struct mce, mces_seen);
static unsigned long mce_need_notify;
-static int cpu_missing;

/*
* MCA banks polled by the period polling timer for corrected events.
@@ -314,8 +313,6 @@ static void mce_panic(const char *msg, struct mce *final, char *exp)
if (!apei_err)
apei_err = apei_write_mce(final);
}
- if (cpu_missing)
- pr_emerg(HW_ERR "Some CPUs didn't answer in synchronization\n");
if (exp)
pr_emerg(HW_ERR "Machine check: %s\n", exp);
if (!fake_panic) {
@@ -891,7 +888,6 @@ static int mce_timed_out(u64 *t, const char *msg)
cpumask_pr_args(&mce_missing_cpus));
mce_panic(msg, NULL, NULL);
}
- cpu_missing = 1;
return 1;
}
*t -= SPINUNIT;
@@ -2702,7 +2698,6 @@ struct dentry *mce_get_debugfs_dir(void)

static void mce_reset(void)
{
- cpu_missing = 0;
atomic_set(&mce_fake_panicked, 0);
atomic_set(&mce_executing, 0);
atomic_set(&mce_callin, 0);

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-11-08 10:31:52

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus

On Mon, Nov 08, 2021 at 06:13:04PM +0800, Zhaolong Zhang wrote:
> I was concerning that if I simply remove the cpu_missing code, we will lose the log in the
> situation where mca_cfg.tolerant > 1 and no_way_out is set afterwards.
>
> Do you think we can safely ignore that situation?

Well, how likely is to have such a situation in practice?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-11-08 10:44:41

by Zhaolong Zhang

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus

At 2021-11-08 17:31:52, "Borislav Petkov" <[email protected]> wrote:
>On Mon, Nov 08, 2021 at 04:28:32PM +0800, Zhaolong Zhang wrote:
>> move mce_missing_cpus checking into mce_panic() as well, because we don't want
>> to lose the cpu missing information in case mca_cfg.tolerant > 1 and there is
>> no_way_out.
>>
>> Signed-off-by: Zhaolong Zhang <[email protected]>
>> ---
>> arch/x86/kernel/cpu/mce/core.c | 38 ++++++++++++++++++++--------------
>> 1 file changed, 22 insertions(+), 16 deletions(-)
>
>I was actually expecting to see something like this:

Hi Boris,

I was concerning that if I simply remove the cpu_missing code, we will lose the log in the
situation where mca_cfg.tolerant > 1 and no_way_out is set afterwards.

Do you think we can safely ignore that situation?

Regards,
Zhaolong


>
>diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>index 6ed365337a3b..30de00fe0d7a 100644
>--- a/arch/x86/kernel/cpu/mce/core.c
>+++ b/arch/x86/kernel/cpu/mce/core.c
>@@ -99,7 +99,6 @@ struct mca_config mca_cfg __read_mostly = {
>
> static DEFINE_PER_CPU(struct mce, mces_seen);
> static unsigned long mce_need_notify;
>-static int cpu_missing;
>
> /*
> * MCA banks polled by the period polling timer for corrected events.
>@@ -314,8 +313,6 @@ static void mce_panic(const char *msg, struct mce *final, char *exp)
> if (!apei_err)
> apei_err = apei_write_mce(final);
> }
>- if (cpu_missing)
>- pr_emerg(HW_ERR "Some CPUs didn't answer in synchronization\n");
> if (exp)
> pr_emerg(HW_ERR "Machine check: %s\n", exp);
> if (!fake_panic) {
>@@ -891,7 +888,6 @@ static int mce_timed_out(u64 *t, const char *msg)
> cpumask_pr_args(&mce_missing_cpus));
> mce_panic(msg, NULL, NULL);
> }
>- cpu_missing = 1;
> return 1;
> }
> *t -= SPINUNIT;
>@@ -2702,7 +2698,6 @@ struct dentry *mce_get_debugfs_dir(void)
>
> static void mce_reset(void)
> {
>- cpu_missing = 0;
> atomic_set(&mce_fake_panicked, 0);
> atomic_set(&mce_executing, 0);
> atomic_set(&mce_callin, 0);
>
>--
>Regards/Gruss,
> Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette

2021-11-08 13:19:37

by Zhaolong Zhang

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus

At 2021-11-08 18:31:38, "Borislav Petkov" <[email protected]> wrote:
>On Mon, Nov 08, 2021 at 06:13:04PM +0800, Zhaolong Zhang wrote:
>> I was concerning that if I simply remove the cpu_missing code, we will lose the log in the
>> situation where mca_cfg.tolerant > 1 and no_way_out is set afterwards.
>>
>> Do you think we can safely ignore that situation?
>
>Well, how likely is to have such a situation in practice?

It is difficult to answer...
But since current code is dealing with this situation, I think I should cover it too,
although it is only a piece of log.

Regards,
Zhaolong

2021-11-09 08:31:44

by Zhaolong Zhang

[permalink] [raw]
Subject: Re: [PATCH] x86/mce: drop cpu_missing since we have more capable mce_missing_cpus

At 2021-11-08 20:47:59, "Zhaolong Zhang" <[email protected]> wrote:
>At 2021-11-08 18:31:38, "Borislav Petkov" <[email protected]> wrote:
>>On Mon, Nov 08, 2021 at 06:13:04PM +0800, Zhaolong Zhang wrote:
>>> I was concerning that if I simply remove the cpu_missing code, we will lose the log in the
>>> situation where mca_cfg.tolerant > 1 and no_way_out is set afterwards.
>>>
>>> Do you think we can safely ignore that situation?
>>
>>Well, how likely is to have such a situation in practice?
>
>It is difficult to answer...
>But since current code is dealing with this situation, I think I should cover it too,
>although it is only a piece of log.

Hi Boris,

I reconsidered the situation.
If there is a non-recoverable mce as well, just let it print that reason. No need to bring the
timeout message indeed. Because since the tolerant was set to a high level to ignore the timeout,
we can eventually ignore them.

So simply drop cpu_missing variable as you mentioned should work.

I am not sure whether it should be authored by you or suggested by you.
Anyway, I will post a new patch exactly as you suggested. Please pick it or ignore it as appropriate :)

Thanks,
Zhaolong

2021-11-09 08:36:21

by Zhaolong Zhang

[permalink] [raw]
Subject: [PATCH] x86/mce: Get rid of cpu_missing

Drop cpu_missing since we have more capable mce_missing_cpus.

Suggested-by: Borislav Petkov <[email protected]>
Signed-off-by: Zhaolong Zhang <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 5 -----
1 file changed, 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 50a3e455cded..51aefffe39f1 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -99,7 +99,6 @@ struct mca_config mca_cfg __read_mostly = {

static DEFINE_PER_CPU(struct mce, mces_seen);
static unsigned long mce_need_notify;
-static int cpu_missing;

/*
* MCA banks polled by the period polling timer for corrected events.
@@ -314,8 +313,6 @@ static void mce_panic(const char *msg, struct mce *final, char *exp)
if (!apei_err)
apei_err = apei_write_mce(final);
}
- if (cpu_missing)
- pr_emerg(HW_ERR "Some CPUs didn't answer in synchronization\n");
if (exp)
pr_emerg(HW_ERR "Machine check: %s\n", exp);
if (!fake_panic) {
@@ -909,7 +906,6 @@ static int mce_timed_out(u64 *t, const char *msg)
cpumask_pr_args(&mce_missing_cpus));
mce_panic(msg, NULL, NULL);
}
- cpu_missing = 1;
return 1;
}
*t -= SPINUNIT;
@@ -2720,7 +2716,6 @@ struct dentry *mce_get_debugfs_dir(void)

static void mce_reset(void)
{
- cpu_missing = 0;
atomic_set(&mce_fake_panicked, 0);
atomic_set(&mce_executing, 0);
atomic_set(&mce_callin, 0);
--
2.27.0