2007-05-29 12:53:22

by Michal Piotrowski

Subject: [1/4] 2.6.22-rc3: known regressions

Hi all,

Here is a list of some known regressions in 2.6.22-rc3.

Feel free to add new regressions/remove fixed etc.
http://kernelnewbies.org/known_regressions



Unclassified

Subject : long freezes on thinkpad t60
References : http://lkml.org/lkml/2007/5/24/100
Submitter : Miklos Szeredi <[email protected]>
Handled-By : Ingo Molnar <[email protected]>
Status : problem is being debugged



ACPI

Subject : unable to shutdown on kernel 2.6.22-rc2
References : http://bugzilla.kernel.org/show_bug.cgi?id=8516
Submitter : Thierry Volpiatto <[email protected]>
Status : Unknown



ALSA

Subject : snd-aoa causes badness in lib/kref.c:33
References : http://bugzilla.kernel.org/show_bug.cgi?id=8513
Submitter : Ben Collins <[email protected]>
Status : Unknown



File systems

Subject : Oops in dentry_iput with 2.6.22-rc2 on AMD64
References : http://lkml.org/lkml/2007/5/22/4
Submitter : Florin Iucha <[email protected]>
Status : Unknown



Kbuild

Subject : make M=$PWD modules_install does nothing
References : http://lkml.org/lkml/2007/5/27/190
Submitter : Andrey Borzenkov <[email protected]>
Status : Unknown



Regards,
Michal

--
"Najbardziej brakowało mi twojego milczenia."
-- Andrzej Sapkowski "Coś więcej"


2007-05-29 14:23:27

by Jan Kara

Subject: Re: [1/4] 2.6.22-rc3: known regressions

Hi,

On Tue 29-05-07 14:52:53, Michal Piotrowski wrote:
> Here is a list of some known regressions in 2.6.22-rc3.
>
> Feel free to add new regressions/remove fixed etc.
> http://kernelnewbies.org/known_regressions
>
<snip>
> File systems
>
> Subject : Oops in dentry_iput with 2.6.22-rc2 on AMD64
> References : http://lkml.org/lkml/2007/5/22/4
> Submitter : Florin Iucha <[email protected]>
> Status : Unknown
Actually, the bug seems to be unreproducible and it has probably been a
1-bit flip. So I'd be reluctant to call it a regression...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2007-05-29 14:41:31

by Florin Iucha

Subject: Re: [1/4] 2.6.22-rc3: known regressions

On Tue, May 29, 2007 at 04:34:59PM +0200, Jan Kara wrote:
> On Tue 29-05-07 14:52:53, Michal Piotrowski wrote:
> > Here is a list of some known regressions in 2.6.22-rc3.
> >
> > Subject : Oops in dentry_iput with 2.6.22-rc2 on AMD64
> > References : http://lkml.org/lkml/2007/5/22/4
> > Submitter : Florin Iucha <[email protected]>
> > Status : Unknown
> Actually, the bug seems to be unreproducible and it has probably been a
> 1-bit flip. So I'd be reluctant to call it a regression...

I agree with this statement. I'll ping Michal and Jan if the oops
resurfaces.

florin

--
Bruce Schneier expects the Spanish Inquisition.
http://geekz.co.uk/schneierfacts/fact/163



2007-05-30 04:34:36

by Sam Ravnborg

Subject: Re: [1/4] 2.6.22-rc3: known regressions

On Tue, May 29, 2007 at 02:52:53PM +0200, Michal Piotrowski wrote:
> Hi all,
>
> Here is a list of some known regressions in 2.6.22-rc3.
>
>
> Kbuild
>
> Subject : make M=$PWD modules_install does nothing
> References : http://lkml.org/lkml/2007/5/27/190
> Submitter : Andrey Borzenkov <[email protected]>
> Status : Unknown
Closed - see http://lkml.org/lkml/2007/5/29/497

Sam

2007-06-03 13:03:03

by Udo A. Steinberg

Subject: Re: [1/4] 2.6.22-rc3: known regressions

On Tue, 29 May 2007 14:52:53 +0200 Michal Piotrowski (MP) wrote:

MP> Here is a list of some known regressions in 2.6.22-rc3.
MP>
MP> Feel free to add new regressions/remove fixed etc.
MP> http://kernelnewbies.org/known_regressions

Here's another 2.6.22-rc3 regression. It was ok on 2.6.21. I believe it
triggered during: echo 0 > /proc/sys/kernel/nmi_watchdog


------------[ cut here ]------------
kernel BUG at arch/i386/kernel/cpu/perfctr-watchdog.c:126!
invalid opcode: 0000 [#1]
PREEMPT
Modules linked in:
CPU: 0
EIP: 0060:[<c010cae5>] Not tainted VLI
EFLAGS: 00010286 (2.6.22-rc3 #2)
EIP is at release_evntsel_nmi+0x16/0x22
eax: 000000c1 ebx: 080f7408 ecx: c04296e0 edx: ffffff3b
esi: 00000001 edi: f69d4240 ebp: 00000002 esp: f6962f30
ds: 007b es: 007b fs: 0000 gs: 0033 ss: 0068
Process rc.M (pid: 1281, ti=f6962000 task=f706c030 task.ti=f6962000)
Stack: c010cb60 c010cda3 c0110abe 080f7408 f6962f64 f6962fa0 c042ab68 ffffffff
c01853a8 080f7408 f6962f64 f6962fa0 080f7408 00000002 c042a774 f69d4240
080f7408 c0185339 00000002 c0156d33 f6962fa0 f7fcccb4 f69d4240 fffffff7
Call Trace:
[<c010cb60>] single_msr_unreserve+0xd/0x1a
[<c010cda3>] disable_lapic_nmi_watchdog+0x2b/0x39
[<c0110abe>] proc_nmi_enabled+0xa0/0xbd
[<c01853a8>] proc_sys_write+0x6f/0x8c
[<c0185339>] proc_sys_write+0x0/0x8c
[<c0156d33>] vfs_write+0x8a/0x10c
[<c01571ef>] sys_write+0x41/0x67
[<c0103c30>] syscall_call+0x7/0xb
=======================
Code: 00 c7 04 24 f6 5d 3c c0 e8 7d e0 00 00 83 ca ff 89 d0 5a 59 c3 8b 0d 28
6e 48 c0 31 d2 85 c9 74 0e 89 c2 2b 51 18 83 fa 42 76 04 <0f> 0b eb fe 0f b3 15
38 6e 48 c0 c3 8b 0d 28 6e 48 c0 31 d2 85 EIP: [<c010cae5>]
release_evntsel_nmi+0x16/0x22 SS:ESP 0068:f6962f30

Cheers,

- Udo



2007-06-08 06:02:54

by Björn Steinbrink

Subject: [PATCH] Fix interchanged parameters to release_{evntsel,perfctr}_nmi

On 2007.06.03 15:02:46 +0200, Udo A. Steinberg wrote:
> On Tue, 29 May 2007 14:52:53 +0200 Michal Piotrowski (MP) wrote:
>
> MP> Here is a list of some known regressions in 2.6.22-rc3.
> MP>
> MP> Feel free to add new regressions/remove fixed etc.
> MP> http://kernelnewbies.org/known_regressions
>
> Here's another 2.6.22-rc3 regression. It was ok on 2.6.21. I believe it
> triggered during: echo 0 > /proc/sys/kernel/nmi_watchdog
>
>
> ------------[ cut here ]------------
> kernel BUG at arch/i386/kernel/cpu/perfctr-watchdog.c:126!
> invalid opcode: 0000 [#1]
> PREEMPT
> Modules linked in:
> CPU: 0
> EIP: 0060:[<c010cae5>] Not tainted VLI
> EFLAGS: 00010286 (2.6.22-rc3 #2)
> EIP is at release_evntsel_nmi+0x16/0x22
> eax: 000000c1 ebx: 080f7408 ecx: c04296e0 edx: ffffff3b
> esi: 00000001 edi: f69d4240 ebp: 00000002 esp: f6962f30
> ds: 007b es: 007b fs: 0000 gs: 0033 ss: 0068
> Process rc.M (pid: 1281, ti=f6962000 task=f706c030 task.ti=f6962000)
> Stack: c010cb60 c010cda3 c0110abe 080f7408 f6962f64 f6962fa0 c042ab68 ffffffff
> c01853a8 080f7408 f6962f64 f6962fa0 080f7408 00000002 c042a774 f69d4240
> 080f7408 c0185339 00000002 c0156d33 f6962fa0 f7fcccb4 f69d4240 fffffff7
> Call Trace:
> [<c010cb60>] single_msr_unreserve+0xd/0x1a
> [<c010cda3>] disable_lapic_nmi_watchdog+0x2b/0x39
> [<c0110abe>] proc_nmi_enabled+0xa0/0xbd
> [<c01853a8>] proc_sys_write+0x6f/0x8c
> [<c0185339>] proc_sys_write+0x0/0x8c
> [<c0156d33>] vfs_write+0x8a/0x10c
> [<c01571ef>] sys_write+0x41/0x67
> [<c0103c30>] syscall_call+0x7/0xb
> =======================
> Code: 00 c7 04 24 f6 5d 3c c0 e8 7d e0 00 00 83 ca ff 89 d0 5a 59 c3 8b 0d 28
> 6e 48 c0 31 d2 85 c9 74 0e 89 c2 2b 51 18 83 fa 42 76 04 <0f> 0b eb fe 0f b3 15
> 38 6e 48 c0 c3 8b 0d 28 6e 48 c0 31 d2 85 EIP: [<c010cae5>]
> release_evntsel_nmi+0x16/0x22 SS:ESP 0068:f6962f30

The culprit seems to be 09198e68501a7e34737cd9264d266f42429abcdc:
[PATCH] i386: Clean up NMI watchdog code

In two places, the parameters to release_{evntsel,perfctr}_nmi
got interchanged during the cleanup. Unfortunately, the NMI watchdog
doesn't want to be enabled on my T43 at all (or I just have no idea what
magic is required to make it happy), so this patch is untested. Could you
give it a spin?

Thanks,
Björn


From: Björn Steinbrink <[email protected]>

Fix interchanged parameters to release_{evntsel,perfctr}_nmi.

Signed-off-by: Björn Steinbrink <[email protected]>
---
diff --git a/arch/i386/kernel/cpu/perfctr-watchdog.c b/arch/i386/kernel/cpu/perfctr-watchdog.c
index 2b04c8f..e490ac2 100644
--- a/arch/i386/kernel/cpu/perfctr-watchdog.c
+++ b/arch/i386/kernel/cpu/perfctr-watchdog.c
@@ -276,8 +276,8 @@ static int single_msr_reserve(void)

 static void single_msr_unreserve(void)
 {
-	release_evntsel_nmi(wd_ops->perfctr);
-	release_perfctr_nmi(wd_ops->evntsel);
+	release_evntsel_nmi(wd_ops->evntsel);
+	release_perfctr_nmi(wd_ops->perfctr);
 }
 
 static void single_msr_rearm(struct nmi_watchdog_ctlblk *wd, unsigned nmi_hz)
@@ -475,10 +475,10 @@ static void p4_unreserve(void)
 {
 #ifdef CONFIG_SMP
 	if (smp_num_siblings > 1)
-		release_evntsel_nmi(MSR_P4_IQ_PERFCTR1);
+		release_perfctr_nmi(MSR_P4_IQ_PERFCTR1);
 #endif
-	release_evntsel_nmi(MSR_P4_IQ_PERFCTR0);
-	release_perfctr_nmi(MSR_P4_CRU_ESCR0);
+	release_evntsel_nmi(MSR_P4_CRU_ESCR0);
+	release_perfctr_nmi(MSR_P4_IQ_PERFCTR0);
 }
 
 static void p4_rearm(struct nmi_watchdog_ctlblk *wd, unsigned nmi_hz)
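
[Editor's sketch] The register dump in the oops is consistent with this diagnosis. The following is a minimal model, not the kernel's code: it assumes the P6-family MSR numbers (MSR_P6_PERFCTR0 = 0xc1, MSR_P6_EVNTSEL0 = 0x186) and a bound of 66, inferred from the "83 fa 42" (cmp $0x42) bytes in the Code: line. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: P6-family MSR numbers; the bound is inferred from the
 * "cmp $0x42" visible in the oops Code: bytes. */
#define MSR_P6_PERFCTR0      0x000000c1u
#define MSR_P6_EVNTSEL0      0x00000186u
#define NMI_MAX_COUNTER_BITS 66u

/* Hypothetical helper: map an event-select MSR to its reservation bit. */
static uint32_t evntsel_msr_to_bit(uint32_t msr)
{
	/* Unsigned subtraction: wraps around if msr is below the base. */
	return msr - MSR_P6_EVNTSEL0;
}

/* Returns true when the bound check (a BUG_ON in the kernel) would fire. */
static bool release_evntsel_would_bug(uint32_t msr)
{
	return evntsel_msr_to_bit(msr) > NMI_MAX_COUNTER_BITS;
}
```

Under these assumptions, passing the perfctr MSR (0xc1, the value in eax) to the evntsel release path yields 0xffffff3b, the value in edx, which fails the bound check. Passing the evntsel MSR yields bit 0 and passes.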

2007-06-08 06:42:32

by Andrew Morton

Subject: Re: [PATCH] Fix interchanged parameters to release_{evntsel,perfctr}_nmi

On Fri, 8 Jun 2007 08:02:44 +0200 Björn Steinbrink <[email protected]> wrote:

> Fix interchanged parameters to release_{evntsel,perfctr}_nmi.
>
> Signed-off-by: Björn Steinbrink <[email protected]>
> ---
> diff --git a/arch/i386/kernel/cpu/perfctr-watchdog.c b/arch/i386/kernel/cpu/perfctr-watchdog.c
> index 2b04c8f..e490ac2 100644
> --- a/arch/i386/kernel/cpu/perfctr-watchdog.c
> +++ b/arch/i386/kernel/cpu/perfctr-watchdog.c
> @@ -276,8 +276,8 @@ static int single_msr_reserve(void)
>
> static void single_msr_unreserve(void)
> {
> - release_evntsel_nmi(wd_ops->perfctr);
> - release_perfctr_nmi(wd_ops->evntsel);
> + release_evntsel_nmi(wd_ops->evntsel);
> + release_perfctr_nmi(wd_ops->perfctr);
> }
>
> static void single_msr_rearm(struct nmi_watchdog_ctlblk *wd, unsigned nmi_hz)
> @@ -475,10 +475,10 @@ static void p4_unreserve(void)
> {
> #ifdef CONFIG_SMP
> if (smp_num_siblings > 1)
> - release_evntsel_nmi(MSR_P4_IQ_PERFCTR1);
> + release_perfctr_nmi(MSR_P4_IQ_PERFCTR1);
> #endif
> - release_evntsel_nmi(MSR_P4_IQ_PERFCTR0);
> - release_perfctr_nmi(MSR_P4_CRU_ESCR0);
> + release_evntsel_nmi(MSR_P4_CRU_ESCR0);
> + release_perfctr_nmi(MSR_P4_IQ_PERFCTR0);
> }
>
> static void p4_rearm(struct nmi_watchdog_ctlblk *wd, unsigned nmi_hz)

Half of this (the first hunk) has been in Andi's tree for a day or two.

I shall drop Andi's patch, queue this one up and shall send this off to Linus if
nothing else happens in the next couple of days.

2007-06-08 10:58:30

by Ingo Molnar

Subject: Re: [PATCH] Fix interchanged parameters to release_{evntsel,perfctr}_nmi


* Andrew Morton <[email protected]> wrote:

> Half of this (the first hunk) has been in Andi's tree for a day or
> two.
>
> I shall drop Andi's patch, queue this one up and shall send this off
> to Linus if nothing else happens in the next couple of days.

this patch does not fix the NMI watchdog bootup lockup i can reproduce,
it still occurs in -rc4 too. Andi, could you please react to my report?
See the "2.6.22-rc3 nmi watchdog hang" thread on lkml.

Ingo

2007-06-08 18:44:49

by Björn Steinbrink

Subject: [PATCH 0/2] i386: Fix two more NMI watchdog bugs

Hi Ingo,

On 2007.06.08 12:58:08 +0200, Ingo Molnar wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > Half of this (the first hunk) has been in Andi's tree for a day or
> > two.
> >
> > I shall drop Andi's patch, queue this one up and shall send this off
> > to Linus if nothing else happens in the next couple of days.
>
> this patch does not fix the NMI watchdog bootup lockup i can reproduce,
> it still occurs in -rc4 too. Andi, could you please react to my report?
> See the "2.6.22-rc3 nmi watchdog hang" thread on lkml.

Ok, so after I figured out again how to enable the nmi watchdog, I found
a few more bugs. One is pretty clear, calling a function directly while
the wrapper should be used, causing a(nother) BUG() when the watchdog is
disabled using /proc/sys/...

The other is less clear (to me). It seems like the perfect candidate to
muck up the watchdog, but I can't get it to do that. On system boot up,
the MSRs are no longer reserved, so some other subsystem might mess with
them. The only suitable subsystem I found was oprofile though, and I
could neither get that to reproduce the hang here nor does oprofile show
up in your logs.

Anyway, both are bugs and should be fixed. Maybe we're even lucky and it
fixes your hang. *fingers crossed*

Björn

2007-06-08 18:47:12

by Björn Steinbrink

Subject: Re: [PATCH 1/2] i386: Fix NMI watchdog not reserving its MSRs

At system boot time, the NMI watchdog no longer reserved its MSRs,
allowing other subsystems to mess with them. Fix that.

Signed-off-by: Björn Steinbrink <[email protected]>
---
diff --git a/arch/i386/kernel/cpu/perfctr-watchdog.c b/arch/i386/kernel/cpu/perfctr-watchdog.c
index 2b04c8f..f0b6763 100644
--- a/arch/i386/kernel/cpu/perfctr-watchdog.c
+++ b/arch/i386/kernel/cpu/perfctr-watchdog.c
@@ -614,6 +614,12 @@ int lapic_watchdog_init(unsigned nmi_hz)
 		probe_nmi_watchdog();
 		if (!wd_ops)
 			return -1;
+
+		if (!wd_ops->reserve()) {
+			printk(KERN_ERR
+				"NMI watchdog: cannot reserve perfctrs\n");
+			return -1;
+		}
 	}
 
 	if (!(wd_ops->setup(nmi_hz))) {
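
[Editor's sketch] Why the missing reservation matters can be shown with a toy ownership bitmap. This is a simplified illustration: the names are hypothetical, and a real implementation would need an atomic test-and-set rather than the plain read-modify-write used here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping: one ownership bit per performance counter. */
static uint64_t perfctr_owner;

/* Claim a counter; returns false if another user already holds it. */
static bool reserve_counter(unsigned bit)
{
	uint64_t mask = UINT64_C(1) << bit;

	if (perfctr_owner & mask)
		return false;
	perfctr_owner |= mask;
	return true;
}

/* Give the counter back so that others may claim it. */
static void release_counter(unsigned bit)
{
	perfctr_owner &= ~(UINT64_C(1) << bit);
}
```

If the watchdog skips reserve_counter() at boot, a second user's reserve_counter() call succeeds and both end up programming the same counter, which is the situation the patch guards against.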

2007-06-08 18:50:27

by Björn Steinbrink

Subject: Re: [PATCH 2/2] i386: Use the right wrapper to disable the NMI watchdog

When disabled through /proc/sys/kernel/nmi_watchdog, the NMI watchdog
uses the stop() method directly, which does not decrement the activity
counter, leading to a BUG(). Use the wrapper function instead to fix
that.

Signed-off-by: Björn Steinbrink <[email protected]>
---
diff --git a/arch/i386/kernel/cpu/perfctr-watchdog.c b/arch/i386/kernel/cpu/perfctr-watchdog.c
index 2b04c8f..f0b6763 100644
--- a/arch/i386/kernel/cpu/perfctr-watchdog.c
+++ b/arch/i386/kernel/cpu/perfctr-watchdog.c
@@ -28,7 +28,7 @@ struct wd_ops {
 	void (*unreserve)(void);
 	int (*setup)(unsigned nmi_hz);
 	void (*rearm)(struct nmi_watchdog_ctlblk *wd, unsigned nmi_hz);
-	void (*stop)(void *);
+	void (*stop)(void);
 	unsigned perfctr;
 	unsigned evntsel;
 	u64 checkbit;
@@ -142,7 +142,7 @@ void disable_lapic_nmi_watchdog(void)
 	if (atomic_read(&nmi_active) <= 0)
 		return;
 
-	on_each_cpu(wd_ops->stop, NULL, 0, 1);
+	on_each_cpu(stop_apic_nmi_watchdog, NULL, 0, 1);
 	wd_ops->unreserve();
 
 	BUG_ON(atomic_read(&nmi_active) != 0);
@@ -255,7 +255,7 @@ static int setup_k7_watchdog(unsigned nmi_hz)
 	return 1;
 }
 
-static void single_msr_stop_watchdog(void *arg)
+static void single_msr_stop_watchdog(void)
 {
 	struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
 
@@ -442,7 +442,7 @@ static int setup_p4_watchdog(unsigned nmi_hz)
 	return 1;
 }
 
-static void stop_p4_watchdog(void *arg)
+static void stop_p4_watchdog(void)
 {
 	struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
 	wrmsr(wd->cccr_msr, 0, 0);
@@ -628,7 +634,7 @@ int lapic_watchdog_init(unsigned nmi_hz)
 void lapic_watchdog_stop(void)
 {
 	if (wd_ops)
-		wd_ops->stop(NULL);
+		wd_ops->stop();
 }
 
 unsigned lapic_adjust_nmi_hz(unsigned hz)
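
[Editor's sketch] The failure described in this patch can be modelled in userspace C. The function names are hypothetical, and a plain int stands in for the kernel's atomic nmi_active counter: calling the low-level stop method leaves the count untouched, so the "count must be zero" check after disabling fires.

```c
#include <stdbool.h>

/* Stands in for the kernel's atomic activity counter. */
static int nmi_active;

/* Low-level stop method: would only touch the hardware. */
static void stop_method(void)
{
	/* hardware writes would go here */
}

/* Wrapper: stops the hardware and drops the activity count. */
static void stop_wrapper(void)
{
	stop_method();
	nmi_active--;
}

/* Returns true when the check after disabling (nmi_active != 0) would fire. */
static bool disable_would_bug(bool use_wrapper, int cpus)
{
	nmi_active = cpus;	/* one increment per CPU at setup time */
	for (int i = 0; i < cpus; i++) {
		if (use_wrapper)
			stop_wrapper();
		else
			stop_method();
	}
	return nmi_active != 0;
}
```

In this model, stopping via the bare method on two CPUs leaves the count at 2 and trips the check; stopping via the wrapper brings it back to 0.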

2007-06-08 20:43:47

by Ingo Molnar

Subject: Re: [PATCH 0/2] i386: Fix two more NMI watchdog bugs


* Björn Steinbrink <[email protected]> wrote:

> Anyway, both are bugs and should be fixed. Maybe we're even lucky and
> it fixes your hang. *fingers crossed*

just to make it clear: the NMI watchdog was working perfectly fine on
that box (in v2.6.21 and in dozens of kernel releases before that, for
multiple years) before Andi's cleanup patch. So let's find that bug first
or revert the cleanups.

Ingo

2007-06-08 20:49:33

by Udo A. Steinberg

Subject: Re: [PATCH 0/2] i386: Fix two more NMI watchdog bugs

On Fri, 8 Jun 2007 22:43:25 +0200 Ingo Molnar (IM) wrote:

IM>
IM> * Björn Steinbrink <[email protected]> wrote:
IM>
IM> > Anyway, both are bugs and should be fixed. Maybe we're even lucky and
IM> > it fixes your hang. *fingers crossed*
IM>
IM> just to make it clear: the NMI watchdog was working perfectly fine on
IM> that box (in v2.6.21 and in dozens of kernel releases before that, for
IM> multiple years) before Andi's cleanup patch. So let's find that bug first
IM> or revert the cleanups.
IM>
IM> Ingo

None of the patches posted by Björn fix the kernel BUG at
arch/i386/kernel/cpu/perfctr-watchdog.c:126! that occurs when doing
echo 0 > /proc/sys/kernel/nmi_watchdog

Call Trace:
[<c010c429>] single_msr_unreserve+0xd/0x1a
[<c010c668>] disable_lapic_nmi_watchdog+0x27/0x35
[<c0110ac6>] proc_nmi_enabled+0xa0/0xbd
[<c018550c>] proc_sys_write+0x6f/0x8c
[<c018549d>] proc_sys_write+0x0/0x8c
[<c0156e5b>] vfs_write+0x8a/0x10c
[<c0157317>] sys_write+0x41/0x67
[<c0103c30>] syscall_call+0x7/0xb

Andi, did you have a patch for that?

Cheers,

- Udo



2007-06-08 20:58:15

by Andrew Morton

Subject: Re: [PATCH 0/2] i386: Fix two more NMI watchdog bugs

On Fri, 8 Jun 2007 22:49:11 +0200
"Udo A. Steinberg" <[email protected]> wrote:

> On Fri, 8 Jun 2007 22:43:25 +0200 Ingo Molnar (IM) wrote:
>
> IM>
> IM> * Björn Steinbrink <[email protected]> wrote:
> IM>
> IM> > Anyway, both are bugs and should be fixed. Maybe we're even lucky and
> IM> > it fixes your hang. *fingers crossed*
> IM>
> IM> just to make it clear: the NMI watchdog was working perfectly fine on
> IM> that box (in v2.6.21 and in dozens of kernel releases before that, for
> IM> multiple years) before Andi's cleanup patch. So let's find that bug first
> IM> or revert the cleanups.
> IM>
> IM> Ingo
>
> None of the patches posted by Björn fix the kernel BUG at
> arch/i386/kernel/cpu/perfctr-watchdog.c:126! that occurs when doing
> echo 0 > /proc/sys/kernel/nmi_watchdog
>
> Call Trace:
> [<c010c429>] single_msr_unreserve+0xd/0x1a
> [<c010c668>] disable_lapic_nmi_watchdog+0x27/0x35
> [<c0110ac6>] proc_nmi_enabled+0xa0/0xbd
> [<c018550c>] proc_sys_write+0x6f/0x8c
> [<c018549d>] proc_sys_write+0x0/0x8c
> [<c0156e5b>] vfs_write+0x8a/0x10c
> [<c0157317>] sys_write+0x41/0x67
> [<c0103c30>] syscall_call+0x7/0xb
>
> Andi, did you have a patch for that?
>

This?


From: Bjorn Steinbrink <[email protected]>

Fix oops triggered during: echo 0 > /proc/sys/kernel/nmi_watchdog

The culprit seems to be 09198e68501a7e34737cd9264d266f42429abcdc:
[PATCH] i386: Clean up NMI watchdog code

In two places, the parameters to release_{evntsel,perfctr}_nmi
got interchanged during the cleanup.

Fix interchanged parameters to release_{evntsel,perfctr}_nmi.

Signed-off-by: Bjorn Steinbrink <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Michal Piotrowski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/i386/kernel/cpu/perfctr-watchdog.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff -puN arch/i386/kernel/cpu/perfctr-watchdog.c~fix-interchanged-parameters-to-release_evntselperfctr_nmi arch/i386/kernel/cpu/perfctr-watchdog.c
--- a/arch/i386/kernel/cpu/perfctr-watchdog.c~fix-interchanged-parameters-to-release_evntselperfctr_nmi
+++ a/arch/i386/kernel/cpu/perfctr-watchdog.c
@@ -276,8 +276,8 @@ static int single_msr_reserve(void)

 static void single_msr_unreserve(void)
 {
-	release_evntsel_nmi(wd_ops->perfctr);
-	release_perfctr_nmi(wd_ops->evntsel);
+	release_evntsel_nmi(wd_ops->evntsel);
+	release_perfctr_nmi(wd_ops->perfctr);
 }
 
 static void single_msr_rearm(struct nmi_watchdog_ctlblk *wd, unsigned nmi_hz)
@@ -475,10 +475,10 @@ static void p4_unreserve(void)
 {
 #ifdef CONFIG_SMP
 	if (smp_num_siblings > 1)
-		release_evntsel_nmi(MSR_P4_IQ_PERFCTR1);
+		release_perfctr_nmi(MSR_P4_IQ_PERFCTR1);
 #endif
-	release_evntsel_nmi(MSR_P4_IQ_PERFCTR0);
-	release_perfctr_nmi(MSR_P4_CRU_ESCR0);
+	release_evntsel_nmi(MSR_P4_CRU_ESCR0);
+	release_perfctr_nmi(MSR_P4_IQ_PERFCTR0);
 }
 
 static void p4_rearm(struct nmi_watchdog_ctlblk *wd, unsigned nmi_hz)
_

2007-06-08 21:13:48

by Udo A. Steinberg

Subject: Re: [PATCH 0/2] i386: Fix two more NMI watchdog bugs

On Fri, 8 Jun 2007 13:57:27 -0700 Andrew Morton (AM) wrote:

AM> On Fri, 8 Jun 2007 22:49:11 +0200
AM> "Udo A. Steinberg" <[email protected]> wrote:
AM>
AM> > On Fri, 8 Jun 2007 22:43:25 +0200 Ingo Molnar (IM) wrote:
AM> >
AM> > IM>
AM> > IM> * Björn Steinbrink <[email protected]> wrote:
AM> > IM>
AM> > IM> > Anyway, both are bugs and should be fixed. Maybe we're even lucky
AM> > IM> > and it fixes your hang. *fingers crossed*
AM> > IM>
AM> > IM> just to make it clear: the NMI watchdog was working perfectly fine on
AM> > IM> that box (in v2.6.21 and in dozens of kernel releases before that,
AM> > IM> for multiple years) before Andi's cleanup patch. So let's find that
AM> > IM> bug first or revert the cleanups.
AM> > IM>
AM> > IM> Ingo
AM> >
AM> > None of the patches posted by Björn fix the kernel BUG at
AM> > arch/i386/kernel/cpu/perfctr-watchdog.c:126! that occurs when doing
AM> > echo 0 > /proc/sys/kernel/nmi_watchdog
AM> >
AM> > Call Trace:
AM> > [<c010c429>] single_msr_unreserve+0xd/0x1a
AM> > [<c010c668>] disable_lapic_nmi_watchdog+0x27/0x35
AM> > [<c0110ac6>] proc_nmi_enabled+0xa0/0xbd
AM> > [<c018550c>] proc_sys_write+0x6f/0x8c
AM> > [<c018549d>] proc_sys_write+0x0/0x8c
AM> > [<c0156e5b>] vfs_write+0x8a/0x10c
AM> > [<c0157317>] sys_write+0x41/0x67
AM> > [<c0103c30>] syscall_call+0x7/0xb
AM> >
AM> > Andi, did you have a patch for that?
AM> >
AM>
AM> This?
AM>
AM>
AM> From: Bjorn Steinbrink <[email protected]>
AM>
AM> Fix oops triggered during: echo 0 > /proc/sys/kernel/nmi_watchdog
AM>
AM> The culprit seems to be 09198e68501a7e34737cd9264d266f42429abcdc:

This alone does not help, but in combination with the other two patches
the problem no longer occurs.

Thanks,

- Udo



2007-06-08 22:30:26

by Andi Kleen

Subject: Re: [PATCH 0/2] i386: Fix two more NMI watchdog bugs


> None of the patches posted by Björn fix the kernel BUG at
> arch/i386/kernel/cpu/perfctr-watchdog.c:126! that occurs when doing
> echo 0 > /proc/sys/kernel/nmi_watchdog
>
> Call Trace:
> [<c010c429>] single_msr_unreserve+0xd/0x1a
> [<c010c668>] disable_lapic_nmi_watchdog+0x27/0x35
> [<c0110ac6>] proc_nmi_enabled+0xa0/0xbd
> [<c018550c>] proc_sys_write+0x6f/0x8c
> [<c018549d>] proc_sys_write+0x0/0x8c
> [<c0156e5b>] vfs_write+0x8a/0x10c
> [<c0157317>] sys_write+0x41/0x67
> [<c0103c30>] syscall_call+0x7/0xb
>
> Andi, did you have a patch for that?

ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/disable-watchdog

-Andi

2007-06-09 02:27:24

by Björn Steinbrink

Subject: [PATCH] i386: Fix the K7 NMI watchdog checkbit

On 2007.06.08 22:43:25 +0200, Ingo Molnar wrote:
>
> * Björn Steinbrink <[email protected]> wrote:
>
> > Anyway, both are bugs and should be fixed. Maybe we're even lucky and
> > it fixes your hang. *fingers crossed*
>
> just to make it clear: the NMI watchdog was working perfectly fine on
> that box (in v2.6.21 and in dozens of kernel releases before that, for
> multiple years) before Andi's cleanup patch. So let's find that bug first
> or revert the cleanups.

Might have been pure luck. ;-) The culprit seems to be commit
b7471c6da94d30d3deadc55986cc38d1ff57f9ca (from Sep 2006), which
introduced the check bit to figure out if a NMI was generated by the
watchdog timer. While the performance counter register on K7 is 64 bits
wide, the upper 16 bits are reserved and thus using bit 63 as the check
bit is wrong. A quick check using /dev/cpu/0/msr shows that
here, the upper 16 bits are zero all the time, chances are that this is
not deterministic and you got a 1 in bit 63 due to some random change.

Björn



The performance counters on K7 are only 48 bits wide, so using bit 63 to
check if the counter overflowed is wrong. Let's use bit 47 instead.

Signed-off-by: Björn Steinbrink <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Andi Kleen <[email protected]>
---
diff --git a/arch/i386/kernel/cpu/perfctr-watchdog.c b/arch/i386/kernel/cpu/perfctr-watchdog.c
index 2b04c8f..82c6967 100644
--- a/arch/i386/kernel/cpu/perfctr-watchdog.c
+++ b/arch/i386/kernel/cpu/perfctr-watchdog.c
@@ -294,7 +294,7 @@ static struct wd_ops k7_wd_ops = {
 	.stop = single_msr_stop_watchdog,
 	.perfctr = MSR_K7_PERFCTR0,
 	.evntsel = MSR_K7_EVNTSEL0,
-	.checkbit = 1ULL<<63,
+	.checkbit = 1ULL<<47,
 };
 
 /* Intel Model 6 (PPro+,P2,P3,P-M,Core1) */
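
[Editor's sketch] The effect of the check bit can be modelled as follows. The sketch assumes the counter is loaded with a negative value and counts up, so the check bit stays set until overflow. It also assumes the upper 16 bits always read as zero, as observed via /dev/cpu/0/msr in the message. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* 48 implemented counter bits, as stated in the patch description. */
#define K7_COUNTER_MASK ((UINT64_C(1) << 48) - 1)

/* Model of reading the counter: the upper 16 bits read as zero. */
static uint64_t read_counter(uint64_t raw)
{
	return raw & K7_COUNTER_MASK;
}

/* Assumed check: a clear check bit means the counter overflowed,
 * i.e. the NMI is attributed to the watchdog. */
static bool looks_like_watchdog_nmi(uint64_t raw, uint64_t checkbit)
{
	return (read_counter(raw) & checkbit) == 0;
}
```

In this model a counter that is still running (for example one loaded with -1000) has bit 47 set, so a bit-47 check correctly reports "not the watchdog". A bit-63 check never sees a set bit and attributes every NMI to the watchdog.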

2007-06-09 02:34:11

by Björn Steinbrink

Subject: Re: [PATCH] i386: Fix the K7 NMI watchdog checkbit

On 2007.06.09 04:27:10 +0200, Björn Steinbrink wrote:
> On 2007.06.08 22:43:25 +0200, Ingo Molnar wrote:
> >
> > * Björn Steinbrink <[email protected]> wrote:
> >
> > > Anyway, both are bugs and should be fixed. Maybe we're even lucky and
> > > it fixes your hang. *fingers crossed*
> >
> > just to make it clear: the NMI watchdog was working perfectly fine on
> > that box (in v2.6.21 and in dozens of kernel releases before that, for
> > multiple years) before Andi's cleanup patch. So let's find that bug first
> > or revert the cleanups.
>
> Might have been pure luck. ;-) The culprit seems to be commit
> b7471c6da94d30d3deadc55986cc38d1ff57f9ca (from Sep 2006), which
> introduced the check bit to figure out if a NMI was generated by the
> watchdog timer. While the performance counter register on K7 is 64 bits
> wide, the upper 16 bits are reserved and thus using bit 63 as the check
> bit is wrong. A quick check using /dev/cpu/0/msr shows that
> here, the upper 16 bits are zero all the time, chances are that this is
> not deterministic and you got a 1 in bit 63 due to some random change.

Hrmpf... Should've read the AMD docs first, not some random website. The
upper bits are "read as zero", so while that was another bug fix, it's
unlikely to help in your case. :-(

Björn