2005-01-09 04:29:45

by Zwane Mwaikambo

[permalink] [raw]
Subject: [PATCH] x86_64: Notify user of MCE events.

x86_64 uses a userspace mce utility to decode MCEs, this patch will ensure
that the user is notified of MCE events being logged too.

Signed-off-by: Zwane Mwaikambo <[email protected]>

Index: linux-2.6.10-mm1/arch/x86_64/kernel/mce.c
===================================================================
RCS file: /home/cvsroot/linux-2.6.10-mm1/arch/x86_64/kernel/mce.c,v
retrieving revision 1.1.1.1
diff -u -p -B -r1.1.1.1 mce.c
--- linux-2.6.10-mm1/arch/x86_64/kernel/mce.c 4 Jan 2005 04:03:35 -0000 1.1.1.1
+++ linux-2.6.10-mm1/arch/x86_64/kernel/mce.c 9 Jan 2005 03:15:48 -0000
@@ -31,6 +32,8 @@ static int mce_dont_init;
static int tolerant = 1;
static int banks;
static unsigned long bank[NR_BANKS] = { [0 ... NR_BANKS-1] = ~0UL };
+static unsigned long console_logged;
+static int notify_user;

/*
* Lockless MCE logging infrastructure.
@@ -68,6 +71,9 @@ void mce_log(struct mce *mce)
smp_wmb();
mcelog.entry[entry].finished = 1;
smp_wmb();
+
+ if (!test_and_set_bit(0, &console_logged))
+ notify_user = 1;
}

static void print_mce(struct mce *m)
@@ -252,6 +258,12 @@ static void mcheck_timer(void *data)
{
on_each_cpu(mcheck_check_cpu, NULL, 1, 1);
schedule_delayed_work(&mcheck_work, check_interval * HZ);
+
+ if (notify_user && console_logged) {
+ notify_user = 0;
+ clear_bit(0, &console_logged);
+ printk(KERN_EMERG "Machine check exception logged, run mcelog\n");
+ }
}





2005-01-09 16:45:09

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] x86_64: Notify user of MCE events.

Zwane Mwaikambo <[email protected]> writes:

> +
> + if (!test_and_set_bit(0, &console_logged))
> + notify_user = 1;
> }
>
> static void print_mce(struct mce *m)
> @@ -252,6 +258,12 @@ static void mcheck_timer(void *data)
> {
> on_each_cpu(mcheck_check_cpu, NULL, 1, 1);
> schedule_delayed_work(&mcheck_work, check_interval * HZ);
> +
> + if (notify_user && console_logged) {

Perhaps a comment here that the race is harmless?

> + notify_user = 0;
> + clear_bit(0, &console_logged);
> + printk(KERN_EMERG "Machine check exception logged, run mcelog\n");

I would drop the "run mcelog". It's misleading if mcelog is already
running in cron as it should.

-Andi

2005-01-09 17:10:43

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH] x86_64: Notify user of MCE events (updated)

x86_64 uses a userspace mce utility to decode MCEs, this patch will ensure
that the user is notified of MCE events being logged too.

Updated to incorporate suggestions from Andi Kleen.

Signed-off-by: Zwane Mwaikambo <[email protected]>

Index: linux-2.6.10-mm1/arch/x86_64/kernel/mce.c
===================================================================
RCS file: /home/cvsroot/linux-2.6.10-mm1/arch/x86_64/kernel/mce.c,v
retrieving revision 1.1.1.1
diff -u -p -B -r1.1.1.1 mce.c
--- linux-2.6.10-mm1/arch/x86_64/kernel/mce.c 4 Jan 2005 04:03:35 -0000 1.1.1.1
+++ linux-2.6.10-mm1/arch/x86_64/kernel/mce.c 9 Jan 2005 17:09:39 -0000
@@ -31,6 +31,8 @@ static int mce_dont_init;
static int tolerant = 1;
static int banks;
static unsigned long bank[NR_BANKS] = { [0 ... NR_BANKS-1] = ~0UL };
+static unsigned long console_logged;
+static int notify_user;

/*
* Lockless MCE logging infrastructure.
@@ -68,6 +70,9 @@ void mce_log(struct mce *mce)
smp_wmb();
mcelog.entry[entry].finished = 1;
smp_wmb();
+
+ if (!test_and_set_bit(0, &console_logged))
+ notify_user = 1;
}

static void print_mce(struct mce *m)
@@ -252,6 +257,19 @@ static void mcheck_timer(void *data)
{
on_each_cpu(mcheck_check_cpu, NULL, 1, 1);
schedule_delayed_work(&mcheck_work, check_interval * HZ);
+
+ /*
+ * It's ok to read stale data here for notify_user and
+ * console_logged as we'll simply get the updated versions
+ * on the next mcheck_timer execution and atomic operations
+ * on console_logged act as synchronization for notify_user
+ * writes.
+ */
+ if (notify_user && console_logged) {
+ notify_user = 0;
+ clear_bit(0, &console_logged);
+ printk(KERN_EMERG "Machine check exception logged\n");
+ }
}


2005-01-09 18:14:13

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] x86_64: Notify user of MCE events (updated)

Zwane Mwaikambo <[email protected]> writes:
> + */
> + if (notify_user && console_logged) {
> + notify_user = 0;
> + clear_bit(0, &console_logged);
> + printk(KERN_EMERG "Machine check exception logged\n");

Another suggestion: don't make this KERN_EMERG. Make it KERN_INFO.
Logged errors are usually correct, so there is no need for an
emergency.

Also since these are not always exceptions (but can be read from
the polling timer) I would call them "machine check events"

-Andi

2005-01-09 18:22:35

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH] x86_64: Notify user of MCE events (updated 2)

x86_64 uses a userspace mce utility to decode MCEs, this patch will ensure
that the user is notified of MCE events being logged too.

Updated to incorporate suggestions from Andi Kleen

Signed-off-by: Zwane Mwaikambo <[email protected]>

Index: linux-2.6.10-mm1/arch/x86_64/kernel/mce.c
===================================================================
RCS file: /home/cvsroot/linux-2.6.10-mm1/arch/x86_64/kernel/mce.c,v
retrieving revision 1.1.1.1
diff -u -p -B -r1.1.1.1 mce.c
--- linux-2.6.10-mm1/arch/x86_64/kernel/mce.c 4 Jan 2005 04:03:35 -0000 1.1.1.1
+++ linux-2.6.10-mm1/arch/x86_64/kernel/mce.c 9 Jan 2005 18:18:25 -0000
@@ -31,6 +31,8 @@ static int mce_dont_init;
static int tolerant = 1;
static int banks;
static unsigned long bank[NR_BANKS] = { [0 ... NR_BANKS-1] = ~0UL };
+static unsigned long console_logged;
+static int notify_user;

/*
* Lockless MCE logging infrastructure.
@@ -68,6 +70,9 @@ void mce_log(struct mce *mce)
smp_wmb();
mcelog.entry[entry].finished = 1;
smp_wmb();
+
+ if (!test_and_set_bit(0, &console_logged))
+ notify_user = 1;
}

static void print_mce(struct mce *m)
@@ -252,6 +257,19 @@ static void mcheck_timer(void *data)
{
on_each_cpu(mcheck_check_cpu, NULL, 1, 1);
schedule_delayed_work(&mcheck_work, check_interval * HZ);
+
+ /*
+ * It's ok to read stale data here for notify_user and
+ * console_logged as we'll simply get the updated versions
+ * on the next mcheck_timer execution and atomic operations
+ * on console_logged act as synchronization for notify_user
+ * writes.
+ */
+ if (notify_user && console_logged) {
+ notify_user = 0;
+ clear_bit(0, &console_logged);
+ printk(KERN_INFO "Machine check events logged\n");
+ }
}


2005-01-09 18:27:40

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH] x86_64: Notify user of MCE events (updated)

On Sun, 9 Jan 2005, Andi Kleen wrote:

> Zwane Mwaikambo <[email protected]> writes:
> > + */
> > + if (notify_user && console_logged) {
> > + notify_user = 0;
> > + clear_bit(0, &console_logged);
> > + printk(KERN_EMERG "Machine check exception logged\n");
>
> Another suggestion: don't make this KERN_EMERG. Make it KERN_INFO.
> Logged errors are usually correct, so there is no need for an
> emergency.
>
> Also since these are not always exceptions (but can be read from
> the polling timer) I would call them "machine check events"

Thanks for the comments, i've updated the patch.

2005-01-09 18:31:37

by Andreas Steinmetz

[permalink] [raw]
Subject: Re: [PATCH] x86_64: Notify user of MCE events (updated)

Andi Kleen wrote:
> Zwane Mwaikambo <[email protected]> writes:
>
>>+ */
>>+ if (notify_user && console_logged) {
>>+ notify_user = 0;
>>+ clear_bit(0, &console_logged);
>>+ printk(KERN_EMERG "Machine check exception logged\n");
>
>
> Another suggestion: don't make this KERN_EMERG. Make it KERN_INFO.
> Logged errors are usually correct, so there is no need for an
> emergency.

Just asking:
How about KERN_NOTICE? KERN_INFO is in my opinion too easily lost in the
syslog noise. Personally I'm logging KERN_INFO just to console,
KERN_NOTICE however to file.
An MCE event would suit the description "normal but significant
condition" of KERN_NOTICE as far as I can see.
--
Andreas Steinmetz SPAMmers use [email protected]

2005-01-09 18:40:09

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] x86_64: Notify user of MCE events (updated)

On Sun, Jan 09, 2005 at 07:30:56PM +0100, Andreas Steinmetz wrote:
> Just asking:
> How about KERN_NOTICE? KERN_INFO is in my opinion too easily lost in the

It's still in the mcelog if you want it.

The main idea behind the separate mcelog was to make it unobstrusive to
reduce MCE related support load. (about 20% of all users who get such a
message seem to think it's a kernel bug and ask kernel developers)
KERN_INFO is at the right level of unintrusiveness.

-Andi