2008-12-21 14:23:55

by Tim Blechmann

[permalink] [raw]
Subject: 2.6.28-rc9: oprofile regression

hello all,

i am experiencing an issue, similar to the one reported in
http://lkml.org/lkml/2008/10/30/319.

each time, i start the oprofile daemon, one NMI per cpu is executed. the
cpu is an intel core2 duo t7400, x86_64 architecture.

root@thinkpad:~# opcontrol -s
Using 2.6+ OProfile kernel interface.
Using log file /var/lib/oprofile/samples/oprofiled.log
Daemon started.
Profiler running.

tim@thinkpad:~/kernel/linux-2.6$ cat /proc/interrupts |grep NMI
NMI: 3 3 Non-maskable interrupts

root@thinkpad:~# opcontrol -h
Stopping profiling.
Killing daemon.

root@thinkpad:~# opcontrol -s
Using 2.6+ OProfile kernel interface.
Using log file /var/lib/oprofile/samples/oprofiled.log
Daemon started.
Profiler running.

tim@thinkpad:~/kernel/linux-2.6$ cat /proc/interrupts |grep NMI
NMI: 4 4 Non-maskable interrupts

oprofile samples are not collected.

commit 7c64ade53a6f977d73f16243865c42ceae999aea, which solved the issue
in the 2.6.28-rc2 regression, is already included in the 2.6.28-rc9
kernel, i am currently running. oprofile works fine with kernel 2.6.27
on the same machine.

best, tim

--
[email protected]
http://tim.klingt.org

Nothing exists until or unless it is observed. An artist is making
something exist by observing it. And his hope for other people is that
they will also make it exist by observing it. I call it 'creative
observation.' Creative viewing.
William S. Burroughs


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part

2008-12-21 21:23:30

by Tim Blechmann

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

> i am experiencing an issue, similar to the one reported in
> http://lkml.org/lkml/2008/10/30/319.

bisecting showed, that commit b99170288421c79f0c2efa8b33e26e65f4bb7fb8
(oprofile: Implement Intel architectural perfmon support) caused the
problem.
oddly, the newly introduced api is not used, since the model struct is
set during the ppro_init call ...

best, tim

--
[email protected]
http://tim.klingt.org

Desperation is the raw material of drastic change. Only those who can
leave behind everything they have ever believed in can hope to escape.
William S. Burroughs


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part
Subject: Re: 2.6.28-rc9: oprofile regression

On 21.12.08 22:23:05, Tim Blechmann wrote:
> > i am experiencing an issue, similar to the one reported in
> > http://lkml.org/lkml/2008/10/30/319.
>
> bisecting showed, that commit b99170288421c79f0c2efa8b33e26e65f4bb7fb8
> (oprofile: Implement Intel architectural perfmon support) caused the
> problem.
> oddly, the newly introduced api is not used, since the model struct is
> set during the ppro_init call ...

Tim,

I just tested rc9 on a system I have available and I could not
reproduce this.

cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
stepping : 2

Does fix 7c64ade53a6f977d73f16243865c42ceae999aea work on your system?

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]

2008-12-22 14:55:47

by Tim Blechmann

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

>>> i am experiencing an issue, similar to the one reported in
>>> http://lkml.org/lkml/2008/10/30/319.
>>
>> bisecting showed, that commit b99170288421c79f0c2efa8b33e26e65f4bb7fb8
>> (oprofile: Implement Intel architectural perfmon support) caused the
>> problem.
>> oddly, the newly introduced api is not used, since the model struct is
>> set during the ppro_init call ...
>
> Tim,
>
> I just tested rc9 on a system I have available and I could not
> reproduce this.
>
> cpu family : 6
> model : 15
> model name : Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
> stepping : 2
>
> Does fix 7c64ade53a6f977d73f16243865c42ceae999aea work on your system?

7c64ade53a6f977d73f16243865c42ceae999aea has been included in rc9, but still
i experience this issue ...

the only difference to my cpu is the stepping (i have stepping 6) ... also, in
case it matters, i am running a 64bit kernel ...

tim

---------------------------------------------------------------
diese nachricht wurde ueber klingt.org gesendet
this message was sent through klingt.org's webmail.

2008-12-26 02:42:20

by Andi Kleen

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

Tim Blechmann wrote:
>> i am experiencing an issue, similar to the one reported in
>> http://lkml.org/lkml/2008/10/30/319.
>
> bisecting showed, that commit b99170288421c79f0c2efa8b33e26e65f4bb7fb8
> (oprofile: Implement Intel architectural perfmon support) caused the
> problem.
> oddly, the newly introduced api is not used, since the model struct is
> set during the ppro_init call ...

We're still investigating the problem. Thanks for the report.

-Andi

2009-01-02 11:04:52

by Tim Blechmann

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

On Fri, 2008-12-26 at 03:42 +0100, Andi Kleen wrote:
> Tim Blechmann wrote:
> >> i am experiencing an issue, similar to the one reported in
> >> http://lkml.org/lkml/2008/10/30/319.
> >
> > bisecting showed, that commit b99170288421c79f0c2efa8b33e26e65f4bb7fb8
> > (oprofile: Implement Intel architectural perfmon support) caused the
> > problem.
> > oddly, the newly introduced api is not used, since the model struct is
> > set during the ppro_init call ...
>
> We're still investigating the problem. Thanks for the report.

btw, this issue still exists in tip/oprofile ... not sure, whether this
may be related, but i am running the machine in 64-bit mode ...

best, tim

--
[email protected]
http://tim.klingt.org

I had nothing to offer anybody except my own confusion
Jack Kerouac


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part

2009-01-14 17:10:50

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

Tim,

On Fri, 2 Jan 2009, Tim Blechmann wrote:
> On Fri, 2008-12-26 at 03:42 +0100, Andi Kleen wrote:
> > Tim Blechmann wrote:
> > >> i am experiencing an issue, similar to the one reported in
> > >> http://lkml.org/lkml/2008/10/30/319.
> > >
> > > bisecting showed, that commit b99170288421c79f0c2efa8b33e26e65f4bb7fb8
> > > (oprofile: Implement Intel architectural perfmon support) caused the
> > > problem.
> > > oddly, the newly introduced api is not used, since the model struct is
> > > set during the ppro_init call ...
> >
> > We're still investigating the problem. Thanks for the report.
>
> btw, this issue still exists in tip/oprofile ... not sure, whether this
> may be related, but i am running the machine in 64-bit mode ...

can you please apply the patch below and provide the output ?

That's one of the subtle differences to the 2.6.27 code, where the
counter width is fixed to 32bit, which is correct anyway as the
counter MSRs can only write the lower 32bits and sign extend bit 31
according to intel documentation.

There are more subtle changes, but this is the most obvious one.

Thanks,

tglx
---
arch/x86/oprofile/op_model_ppro.c | 2 ++
1 file changed, 2 insertions(+)

Index: linux-2.6/arch/x86/oprofile/op_model_ppro.c
===================================================================
--- linux-2.6.orig/arch/x86/oprofile/op_model_ppro.c
+++ linux-2.6/arch/x86/oprofile/op_model_ppro.c
@@ -80,6 +80,8 @@ static void ppro_setup_ctrs(struct op_ms
eax.full = cpuid_eax(0xa);
if (counter_width < eax.split.bit_width)
counter_width = eax.split.bit_width;
+
+ printk(KERN_INFO "ppro counter_width: %d\n", counter_width);
}

/* clear all counters */

2009-01-14 18:20:29

by Tim Blechmann

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

On Wed, 2009-01-14 at 18:10 +0100, Thomas Gleixner wrote:
> On Fri, 2 Jan 2009, Tim Blechmann wrote:
> > On Fri, 2008-12-26 at 03:42 +0100, Andi Kleen wrote:
> > > Tim Blechmann wrote:
> > > >> i am experiencing an issue, similar to the one reported in
> > > >> http://lkml.org/lkml/2008/10/30/319.
> > > >
> > > > bisecting showed, that commit b99170288421c79f0c2efa8b33e26e65f4bb7fb8
> > > > (oprofile: Implement Intel architectural perfmon support) caused the
> > > > problem.
> > > > oddly, the newly introduced api is not used, since the model struct is
> > > > set during the ppro_init call ...
> > >
> > > We're still investigating the problem. Thanks for the report.
> >
> > btw, this issue still exists in tip/oprofile ... not sure, whether this
> > may be related, but i am running the machine in 64-bit mode ...
>
> can you please apply the patch below and provide the output ?

[29030.863352] oprofile: using NMI interrupt.
[29051.826778] ppro counter_width: 40
[29051.826783] ppro counter_width: 40


> That's one of the subtle differences to the 2.6.27 code, where the
> counter width is fixed to 32bit, which is correct anyway as the
> counter MSRs can only write the lower 32bits and sign extend bit 31
> according to intel documentation.

this code (line 81/82), changes counter_width from 32 to 40.

if (counter_width < eax.split.bit_width)
counter_width = eax.split.bit_width;

however when removing these lines, and thus keeping the value 32 for
counter_width, doesn't change the behavior, only one NMI per cpu.

best, tim

--
[email protected]
http://tim.klingt.org

It is better to make a piece of music than to perform one, better to
perform one than to listen to one, better to listen to one than to
misuse it as a means of distraction, entertainment, or acquisition of
'culture'.
John Cage


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part

2009-01-15 08:46:45

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

Tim,

On Wed, 14 Jan 2009, Tim Blechmann wrote:
> this code (line 81/82), changes counter_width from 32 to 40.
>
> if (counter_width < eax.split.bit_width)
> counter_width = eax.split.bit_width;
>
> however when removing these lines, and thus keeping the value 32 for
> counter_width, doesn't change the behavior, only one NMI per cpu.

It would only help, when the reported bit_width would be bogus. We
know that you get at least one NMI, so lets look at the results we get
there.

Thanks,

tglx
---
arch/x86/oprofile/op_model_ppro.c | 4 ++++
1 file changed, 4 insertions(+)

Index: linux-2.6/arch/x86/oprofile/op_model_ppro.c
===================================================================
--- linux-2.6.orig/arch/x86/oprofile/op_model_ppro.c
+++ linux-2.6/arch/x86/oprofile/op_model_ppro.c
@@ -126,11 +126,15 @@ static int ppro_check_ctrs(struct pt_reg
u64 val;
int i;

+ printk(KERN_INFO "ppro_check_ctrs\n");
+
for (i = 0 ; i < num_counters; ++i) {
if (!reset_value[i])
continue;
rdmsrl(msrs->counters[i].addr, val);
+ printk(KERN_INFO "cntr %d: 0x%016lx\n", i, (unsigned long) val);
if (CTR_OVERFLOWED(val)) {
+ printk(KERN_INFO "cntr %d overflowed\n", i);
oprofile_add_sample(regs, i);
wrmsrl(msrs->counters[i].addr, -reset_value[i]);
}

2009-01-15 09:14:58

by Tim Blechmann

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

On Thu, 2009-01-15 at 09:46 +0100, Thomas Gleixner wrote:
> Tim,
>
> On Wed, 14 Jan 2009, Tim Blechmann wrote:
> > this code (line 81/82), changes counter_width from 32 to 40.
> >
> > if (counter_width < eax.split.bit_width)
> > counter_width = eax.split.bit_width;
> >
> > however when removing these lines, and thus keeping the value 32 for
> > counter_width, doesn't change the behavior, only one NMI per cpu.
>
> It would only help, when the reported bit_width would be bogus. We
> know that you get at least one NMI, so lets look at the results we get
> there.

it seems, that ppro_check_ctrs is never called:
[ 982.238639] oprofile: using NMI interrupt.

hth, tim

--
[email protected]
http://tim.klingt.org

Nothing exists until or unless it is observed. An artist is making
something exist by observing it. And his hope for other people is that
they will also make it exist by observing it. I call it 'creative
observation.' Creative viewing.
William S. Burroughs


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part

2009-01-15 21:05:01

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

On Thu, 15 Jan 2009, Tim Blechmann wrote:
> On Thu, 2009-01-15 at 09:46 +0100, Thomas Gleixner wrote:
> > Tim,
> >
> > On Wed, 14 Jan 2009, Tim Blechmann wrote:
> > > this code (line 81/82), changes counter_width from 32 to 40.
> > >
> > > if (counter_width < eax.split.bit_width)
> > > counter_width = eax.split.bit_width;
> > >
> > > however when removing these lines, and thus keeping the value 32 for
> > > counter_width, doesn't change the behavior, only one NMI per cpu.
> >
> > It would only help, when the reported bit_width would be bogus. We
> > know that you get at least one NMI, so lets look at the results we get
> > there.
>
> it seems, that ppro_check_ctrs is never called:
> [ 982.238639] oprofile: using NMI interrupt.
>
> hth, tim

Hmm, that confuses the hell out of me. Can you try the patch below,
which restores the original code of writing the counter registers ?

Thanks,

tglx
---
Subject: x86-oprofile-debug2.patch
From: Thomas Gleixner <[email protected]>
Date: Thu, 15 Jan 2009 21:25:55 +0100

Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/oprofile/op_model_ppro.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/x86/oprofile/op_model_ppro.c
===================================================================
--- linux-2.6.orig/arch/x86/oprofile/op_model_ppro.c
+++ linux-2.6/arch/x86/oprofile/op_model_ppro.c
@@ -101,9 +101,9 @@ static void ppro_setup_ctrs(struct op_ms
/* enable active counters */
for (i = 0; i < num_counters; ++i) {
if ((counter_config[i].enabled) && (CTR_IS_RESERVED(msrs, i))) {
- reset_value[i] = counter_config[i].count;
+ reset_value[i] = (1ULL << 32) - counter_config[i].count;

- wrmsrl(msrs->counters[i].addr, -reset_value[i]);
+ wrmsr(msrs->counters[i].addr, (u32) reset_value[i], 0);

CTRL_READ(low, high, msrs, i);
CTRL_CLEAR(low);
@@ -132,7 +132,7 @@ static int ppro_check_ctrs(struct pt_reg
rdmsrl(msrs->counters[i].addr, val);
if (CTR_OVERFLOWED(val)) {
oprofile_add_sample(regs, i);
- wrmsrl(msrs->counters[i].addr, -reset_value[i]);
+ wrmsr(msrs->counters[i].addr, (u32) reset_value[i], 0);
}
}

2009-01-16 00:53:32

by Tim Blechmann

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

On Thu, 2009-01-15 at 21:37 +0100, Thomas Gleixner wrote:
> On Thu, 15 Jan 2009, Tim Blechmann wrote:
> > On Thu, 2009-01-15 at 09:46 +0100, Thomas Gleixner wrote:
> > > Tim,
> > >
> > > On Wed, 14 Jan 2009, Tim Blechmann wrote:
> > > > this code (line 81/82), changes counter_width from 32 to 40.
> > > >
> > > > if (counter_width < eax.split.bit_width)
> > > > counter_width = eax.split.bit_width;
> > > >
> > > > however when removing these lines, and thus keeping the value 32 for
> > > > counter_width, doesn't change the behavior, only one NMI per cpu.
> > >
> > > It would only help, when the reported bit_width would be bogus. We
> > > know that you get at least one NMI, so lets look at the results we get
> > > there.
> >
> > it seems, that ppro_check_ctrs is never called:
> > [ 982.238639] oprofile: using NMI interrupt.
> >
> > hth, tim
>
> Hmm, that confuses the hell out of me. Can you try the patch below,
> which restores the original code of writing the counter registers ?

no difference ... i must admit, i already spent some time, trying to
revert the specific changes of this commit ... i gave up to wait for
some help from someone, who actually knows how the code is working ...

possibly it makes sense to compile a full kernel of the first bad kommit
and introduce the different changes one by one ... i will try to find
some time for this during the next few days ...

but if you have more patches, that i can test, please let me know ...

cheers, tim

--
[email protected]
http://tim.klingt.org

Which is more musical, a truck passing by a factory or a truck passing
by a music school?
John Cage


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part

2009-01-16 09:00:17

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

On Fri, 16 Jan 2009, Tim Blechmann wrote:
> > Hmm, that confuses the hell out of me. Can you try the patch below,
> > which restores the original code of writing the counter registers ?
>
> no difference ... i must admit, i already spent some time, trying to
> revert the specific changes of this commit ... i gave up to wait for
> some help from someone, who actually knows how the code is working ...
>
> possibly it makes sense to compile a full kernel of the first bad kommit
> and introduce the different changes one by one ... i will try to find
> some time for this during the next few days ...
>
> but if you have more patches, that i can test, please let me know ...

Here is another one, which also reverts the counter reset code and
disables the cpuid(0xa) call. Can you verify that counters are set up
? Can you please provide me your .config ? What's your kernel command
line ?

Thanks,

tglx
---
Subject: x86-oprofile-debug2.patch
From: Thomas Gleixner <[email protected]>
Date: Thu, 15 Jan 2009 21:25:55 +0100

Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/oprofile/op_model_ppro.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/arch/x86/oprofile/op_model_ppro.c
===================================================================
--- linux-2.6.orig/arch/x86/oprofile/op_model_ppro.c
+++ linux-2.6/arch/x86/oprofile/op_model_ppro.c
@@ -75,7 +75,7 @@ static void ppro_setup_ctrs(struct op_ms
return;
}

- if (cpu_has_arch_perfmon) {
+ if (0 && cpu_has_arch_perfmon) {
union cpuid10_eax eax;
eax.full = cpuid_eax(0xa);
if (counter_width < eax.split.bit_width)
@@ -95,15 +95,15 @@ static void ppro_setup_ctrs(struct op_ms
for (i = 0; i < num_counters; ++i) {
if (unlikely(!CTR_IS_RESERVED(msrs, i)))
continue;
- wrmsrl(msrs->counters[i].addr, -1LL);
+ wrmsr(msrs->counters[i].addr, (u32) 0xFFFFFFFF, 0);
}

/* enable active counters */
for (i = 0; i < num_counters; ++i) {
if ((counter_config[i].enabled) && (CTR_IS_RESERVED(msrs, i))) {
- reset_value[i] = counter_config[i].count;
+ reset_value[i] = (1ULL << 32) - counter_config[i].count;

- wrmsrl(msrs->counters[i].addr, -reset_value[i]);
+ wrmsr(msrs->counters[i].addr, (u32) reset_value[i], 0);

CTRL_READ(low, high, msrs, i);
CTRL_CLEAR(low);
@@ -132,7 +132,7 @@ static int ppro_check_ctrs(struct pt_reg
rdmsrl(msrs->counters[i].addr, val);
if (CTR_OVERFLOWED(val)) {
oprofile_add_sample(regs, i);
- wrmsrl(msrs->counters[i].addr, -reset_value[i]);
+ wrmsr(msrs->counters[i].addr, (u32) reset_value[i], 0);
}
}

2009-01-16 11:35:38

by Tim Blechmann

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

On Fri, 2009-01-16 at 09:59 +0100, Thomas Gleixner wrote:
> On Fri, 16 Jan 2009, Tim Blechmann wrote:
> > > Hmm, that confuses the hell out of me. Can you try the patch
> below,
> > > which restores the original code of writing the counter
> registers ?
> >
> > no difference ... i must admit, i already spent some time, trying to
> > revert the specific changes of this commit ... i gave up to wait for
> > some help from someone, who actually knows how the code is
> working ...
> >
> > possibly it makes sense to compile a full kernel of the first bad
> kommit
> > and introduce the different changes one by one ... i will try to
> find
> > some time for this during the next few days ...
> >
> > but if you have more patches, that i can test, please let me
> know ...
>
> Here is another one, which also reverts the counter reset code and
> disables the cpuid(0xa) call.

i applied the patch, no difference

> Can you verify that counters are set up
> ?

sorry for my ignorance, but how?

> Can you please provide me your .config ?

attached

> What's your kernel command
> line ?

nothing fancy, just the ubuntu default:
root=UUID=7b58a11b-972a-4a67-909a-b8e48cb29da5 ro quiet splash

--
[email protected]
http://tim.klingt.org

Just what the hell is the experimental tradition?
Morton Feldman


Attachments:
.config (95.10 kB)
signature.asc (197.00 B)
This is a digitally signed message part
Download all attachments

2009-01-16 15:53:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

On Fri, 16 Jan 2009, Tim Blechmann wrote:
> nothing fancy, just the ubuntu default:
> root=UUID=7b58a11b-972a-4a67-909a-b8e48cb29da5 ro quiet splash

Hmm, ubuntu. question: which gcc version ?

Thanks,

tglx

2009-01-16 16:02:50

by Tim Blechmann

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

On Fri, 2009-01-16 at 16:52 +0100, Thomas Gleixner wrote:
> On Fri, 16 Jan 2009, Tim Blechmann wrote:
> > nothing fancy, just the ubuntu default:
> > root=UUID=7b58a11b-972a-4a67-909a-b8e48cb29da5 ro quiet splash
>
> Hmm, ubuntu. question: which gcc version ?

~$ gcc -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
4.3.2-1ubuntu11' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
--enable-mpfr --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.3.2 (Ubuntu 4.3.2-1ubuntu11)


--
[email protected]
http://tim.klingt.org

The aim of education is the knowledge, not of facts, but of values
William S. Burroughs


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part

2009-01-17 13:33:26

by Tim Blechmann

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

hm, it seems a bit more complicated than i originally thought.

trying to reconstruct commit b99170288421c79f0c2efa8b33e26e65f4bb7fb8, i
found, that resetting counter_width in ppro_setup_ctrs, causes the
issue. reverting part of the patch, resolved the issue:

--- a/arch/x86/oprofile/op_model_ppro.c
+++ b/arch/x86/oprofile/op_model_ppro.c
@@ -79,8 +79,8 @@ static void ppro_setup_ctrs(struct op_msrs const * const msrs)
if (cpu_has_arch_perfmon) {
union cpuid10_eax eax;
eax.full = cpuid_eax(0xa);
- if (counter_width < eax.split.bit_width)
- counter_width = eax.split.bit_width;
+// if (counter_width < eax.split.bit_width)
+// counter_width = eax.split.bit_width;
}

/* clear all counters */


however, trying to apply this patch to 2.6.28, the behavior is the same
as before (one NMI) ... so possibly, it is a combination of two bugs,
with similar symptoms ...

hth, tim

--
[email protected]
http://tim.klingt.org

Your mind will answer most questions if you learn to relax and wait
for the answer.
William S. Burroughs


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part
Subject: Re: 2.6.28-rc9: oprofile regression

On 17.01.09 14:32:49, Tim Blechmann wrote:
> hm, it seems a bit more complicated than i originally thought.
>
> trying to reconstruct commit b99170288421c79f0c2efa8b33e26e65f4bb7fb8, i
> found, that resetting counter_width in ppro_setup_ctrs, causes the
> issue. reverting part of the patch, resolved the issue:
>
> --- a/arch/x86/oprofile/op_model_ppro.c
> +++ b/arch/x86/oprofile/op_model_ppro.c
> @@ -79,8 +79,8 @@ static void ppro_setup_ctrs(struct op_msrs const * const msrs)
> if (cpu_has_arch_perfmon) {
> union cpuid10_eax eax;
> eax.full = cpuid_eax(0xa);
> - if (counter_width < eax.split.bit_width)
> - counter_width = eax.split.bit_width;
> +// if (counter_width < eax.split.bit_width)
> +// counter_width = eax.split.bit_width;
> }
>
> /* clear all counters */
>
>
> however, trying to apply this patch to 2.6.28, the behavior is the same
> as before (one NMI) ... so possibly, it is a combination of two bugs,
> with similar symptoms ...

Tim, could you revert 7c64ade53a6f977d73f16243865c42ceae999aea too?

If this not helps, last chance is
59512900baab03c5629f2ff5efad1d5d4e682ece, but this seems to be save.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]

2009-01-17 15:10:24

by Tim Blechmann

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

> > however, trying to apply this patch to 2.6.28, the behavior is the same
> > as before (one NMI) ... so possibly, it is a combination of two bugs,
> > with similar symptoms ...
>
> Tim, could you revert 7c64ade53a6f977d73f16243865c42ceae999aea too?
>
> If this not helps, last chance is
> 59512900baab03c5629f2ff5efad1d5d4e682ece, but this seems to be save.

i tried to revert both commits, however the behavior doesn't seem to
change. will try to apply the working patch to the child commits, maybe
i can find something interesting ...

best, tim

btw, i am not very familiar with kernel programming, but is it safe to
have `static u64 *reset_value' uninitialized, or should it be
initialized to NULL?

--
[email protected]
http://tim.klingt.org

I must say I find television very educational. The minute somebody
turns it on, I go to the library and read a good book.
Groucho Marx


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part
Subject: Re: 2.6.28-rc9: oprofile regression

On 17.01.09 16:09:23, Tim Blechmann wrote:
> > > however, trying to apply this patch to 2.6.28, the behavior is the same
> > > as before (one NMI) ... so possibly, it is a combination of two bugs,
> > > with similar symptoms ...
> >
> > Tim, could you revert 7c64ade53a6f977d73f16243865c42ceae999aea too?
> >
> > If this not helps, last chance is
> > 59512900baab03c5629f2ff5efad1d5d4e682ece, but this seems to be save.
>
> i tried to revert both commits, however the behavior doesn't seem to
> change. will try to apply the working patch to the child commits, maybe
> i can find something interesting ...

Hmm, strange. Actually 7c64ade53a6f977d73f16243865c42ceae999aea fixed
a similiar bug, see here:
http://bugzilla.kernel.org/show_bug.cgi?id=11908

Your patch with 2.6.28, does:

grep NMI /proc/interrupts

returns exactly 1 NMI per core or some more?

>
> best, tim
>
> btw, i am not very familiar with kernel programming, but is it safe to
> have `static u64 *reset_value' uninitialized, or should it be
> initialized to NULL?

External and static variables should be gaaranteed to be initialized
to zero. Only local variables are uninitialized.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]

2009-01-17 16:41:18

by Tim Blechmann

[permalink] [raw]
Subject: Re: 2.6.28-rc9: oprofile regression

> Hmm, strange. Actually 7c64ade53a6f977d73f16243865c42ceae999aea fixed
> a similiar bug, see here:
> http://bugzilla.kernel.org/show_bug.cgi?id=11908
>
> Your patch with 2.6.28, does:
>
> grep NMI /proc/interrupts
>
> returns exactly 1 NMI per core or some more?

that bug is similar, but it reports different numbers of NMIs for the
cores.
i get exactly one NMI per core, for each time, i start oprofile:

NMI: 0
opcontrol -s
(wait)
opcontrol --deinit
NMI: 1
opcontrol -s
(wait)
opcontrol --deinit
NMI: 2

hth, tim

--
[email protected]
http://tim.klingt.org

Which is more musical, a truck passing by a factory or a truck passing
by a music school?
John Cage


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part