Hello,
I'm considering enabling CR4.PCE by default on x86-64/i386. Currently it's 0,
which means RDPMC doesn't work. On x86-64 PMC 0 is always programmed
to be a cycle counter, so it would be useful to be able to access
this for measuring instructions. That's especially useful because RDTSC
does not necessarily count cycles in the current P state (already
the case on Intel CPUs, and AMD's future direction also seems to be
to decouple it from cycles). The drawback is that it stops during idle, but
that shouldn't be a big issue for normal measuring. It's not useful
as a real timer anyway.
On Pentium 4 it also has the advantage that, unlike RDTSC, it's not
serializing, so it should be much faster.
The kernel change would be to always set CR4.PCE to allow RDPMC
in ring 3.
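For illustration only (this is not part of any patch), here is a minimal Python sketch of the two CR4 flags discussed in this thread. The bit positions (TSD is bit 2, PCE is bit 8) are the architectural ones; the helper names and the starting register value are made up for the example, and the real change would be made in ring 0 by the kernel.

```python
# Bit positions of the two CR4 flags discussed in this thread.
CR4_TSD = 1 << 2   # Time Stamp Disable: if set, RDTSC faults outside ring 0
CR4_PCE = 1 << 8   # Performance-Monitoring Counter Enable: if set, RDPMC works in ring 3


def with_user_rdpmc(cr4: int) -> int:
    """Return the CR4 value with PCE set (hypothetical helper, not a kernel API)."""
    return cr4 | CR4_PCE


def user_rdpmc_allowed(cr4: int) -> bool:
    """True if RDPMC would be permitted outside ring 0 for this CR4 value."""
    return bool(cr4 & CR4_PCE)


def user_rdtsc_allowed(cr4: int) -> bool:
    """True if RDTSC would be permitted outside ring 0 for this CR4 value."""
    return not (cr4 & CR4_TSD)


cr4 = 0x0  # illustrative starting value, not a real register dump
print(user_rdpmc_allowed(cr4), user_rdpmc_allowed(with_user_rdpmc(cr4)))
```

Setting PCE leaves TSD untouched, so the RDPMC proposal and the question of disabling user-space RDTSC are independent switches.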
It would actually be a good idea to disable RDTSC in ring 3 too
(because user space usually doesn't have enough information to make
good use of it and gets it wrong), but I fear that will break
too many applications right now.
Any comments on this?
-Andi
Andi Kleen writes:
> Hallo,
>
> I'm considering to enable CR4.PCE by default on x86-64/i386. Currently it's 0
> which means RDPMC doesn't work. On x86-64 PMC 0 is always programmed
> to be a cycle counter, so it would be useful to be able to access
> this for measuring instructions. That's especially useful because RDTSC
> does not necessarily count cycles in the current P state (already
> the case on Intel CPUs and AMD's future direction seems to also
> to decouple it from cycles) Drawback is that it stops during idle, but
> that shouldn't be a big issue for normal measuring. It's not useful
> as a real timer anyways.
>
> On Pentium 4 it also has the advantage that unlike RDTSC it's not
> serializing so should be much faster.
>
> The kernel change would be to always set CR4.PCE to allow RDPMC
> in ring 3.
>
> It would be actually a good idea to disable RDTSC in ring 3 too
> (because user space usually doesn't have enough information to make
> good use of it and gets it wrong), but I fear that will break
> too many applications right now.
PMC0 stops being a cycle counter as soon as any real driver
(not the NMI watchdog) takes over the hardware, such as oprofile,
perfmon2, or perfctr. So user-space cannot rely on the semantics
of PMC0. I have no objection to globally enabling CR4.PCE.
Disabling user-space RDTSC (setting CR4.TSD) seems evil and pointless.
At least some users of it (the perfctr library and I hope eventually
also perfmon2) do use it in an SMP-safe manner (through special
user/kernel protocols).
/Mikael
On Tuesday 29 November 2005 09:15, Andi Kleen wrote:
> Hallo,
>
> I'm considering to enable CR4.PCE by default on x86-64/i386. Currently it's
> 0 which means RDPMC doesn't work. On x86-64 PMC 0 is always programmed to
> be a cycle counter, so it would be useful to be able to access
> this for measuring instructions. That's especially useful because RDTSC
> does not necessarily count cycles in the current P state (already
> the case on Intel CPUs and AMD's future direction seems to also
> to decouple it from cycles) Drawback is that it stops during idle, but
> that shouldn't be a big issue for normal measuring. It's not useful
> as a real timer anyways.
>
> On Pentium 4 it also has the advantage that unlike RDTSC it's not
> serializing so should be much faster.
>
> The kernel change would be to always set CR4.PCE to allow RDPMC
> in ring 3.
>
You might also ping Stephane Eranian and the folks that are working on
defining a common performance measurement interface in the kernel
over on [email protected] and see what they think.
I'll cc them on this reply.
> It would be actually a good idea to disable RDTSC in ring 3 too
> (because user space usually doesn't have enough information to make
> good use of it and gets it wrong), but I fear that will break
> too many applications right now.
>
FWIW, I agree here. We lock down the power state and still use RDTSC for
some timing things.
> Any comments on this?
>
> -Andi
>
--
Ray Bryant
AMD Performance Labs Austin, Tx
512-602-0038 (o) 512-507-7807 (c)
>
> > It would be actually a good idea to disable RDTSC in ring 3 too
> > (because user space usually doesn't have enough information to make
> > good use of it and gets it wrong), but I fear that will break
> > too many applications right now.
> >
>
> FWIW, I agree here. We lock down the power state and still use RDTSC for
> some timing things.
IMHO you should replace that with RDPMC 0 once I have made the change.
-Andi
> PMC0 stops being a cycle counter as soon as any real driver
> (not the NMI watchdog) takes over the hardware, such as oprofile,
> perfmon2, or perfctr. So user-space cannot rely on the semantics
They're wrong then. oprofile shouldn't disable the NMI
watchdog. If it does it's broken and needs to be fixed.
> Disabling user-space RDTSC (setting CR4.TSD) seems evil and pointless.
> At least some users of it (the perfctr library and I hope eventually
> also perfmon2) do use it in an SMP-safe manner (through special
> user/kernel protocols).
How do you handle P state changes? I don't think it can be safely
used in user space.
-Andi
Andi,
On Tue, Nov 29, 2005 at 10:56:31AM -0600, Ray Bryant wrote:
> >
> > I'm considering to enable CR4.PCE by default on x86-64/i386. Currently it's
> > 0 which means RDPMC doesn't work. On x86-64 PMC 0 is always programmed to
> > be a cycle counter, so it would be useful to be able to access
> > this for measuring instructions. That's especially useful because RDTSC
> > does not necessarily count cycles in the current P state (already
> > the case on Intel CPUs and AMD's future direction seems to also
> > to decouple it from cycles) Drawback is that it stops during idle, but
> > that shouldn't be a big issue for normal measuring. It's not useful
> > as a real timer anyways.
> >
Where did you see that PMC0 (PERSEL0/PERFCTR0) can only be programmed
to count cpu cycles (i.e. cpu_clk_unhalted)? As far as I can tell from
the documentation, the 4 counters are symmetrical and can measure
any event that the processor offers.
If you look at the perfmon x86-64 patch, you will see that under certain
circumstances, we enable CR4.pce. The main motivation is to allow reading
counters at ring 3 with RDPMC. By default, CR4.pce must be cleared for
security reasons (covert channel). Perfmon only allows RDPMC for
self-monitoring threads or upon special request.
--
-Stephane
> Where did you see that PMC0 (PERSEL0/PERFCTR0) can only be programmed
> to count cpu cycles (i.e. cpu_clk_unhalted)? As far as I can tell from
> the documentation, the 4 counters are symmetrical and can measure
> any event that the processor offers.
The Linux NMI watchdog does that.
All other perfctr users are supposed to keep their fingers away
from the watchdog (it looks like oprofile doesn't but not for much
longer ...)
I think it's also a useful convention - RDTSC is becoming more and more
useless, and you cannot expect user applications that just want to
measure some cycles to rely on ever-changing, unstable, or nonexistent
performance counter APIs.
> If you look at the perfmon x86-64 patch, you will see that under certain
> circumstances, we enable CR4.pce. The main motivation is to allow reading
> counters at ring 3 with RDPMC. By default, CR4.pce must be cleared for
> security reasons (covert channel). Perfmon only allows RDPMC for
The same covert channel would exist with RDTSC so that argument is bogus.
-Andi
Andi Kleen wrote:
> I think it's also a useful convention - RDTSC is becoming more and more
> useless, and you cannot expect user applications that just want to
> measure some cycles to rely on ever-changing, unstable, or nonexistent
> performance counter APIs.
Users are even more unhappy with ever-changing ABIs -- such as the
kernel taking away RDTSC.
RDTSC+perfctr [Pettersson] still is the fastest way for user-mode code
to count something that is highly correlated with both "billable"
CPU time and "code quality" for a fixed task. With a little care
RDTSC is close enough to monotonic that I find it very useful.
Please don't take away user-mode RDTSC.
--
On Tue, Nov 29, 2005 at 10:29:47AM -0800, John Reiser wrote:
> Andi Kleen wrote:
> > I think it's also a useful convention - RDTSC is becoming more and more
> > useless, and you cannot expect user applications that just want to
> > measure some cycles to rely on ever-changing, unstable, or nonexistent
> > performance counter APIs.
>
> Users are even more unhappy with ever-changing ABIs -- such as the
> kernel taking away RDTSC.
Nobody is talking about taking it away. But it's becoming
more and more useless because there are many situations
where it does unexpected things.
(It's not synchronized across CPUs;
on modern Intel CPUs it always measures the fastest P state even
though you might be running slower; on other CPUs, when
you want to measure time, it actually changes with P states; etc.)
The performance counter has a much clearer definition - it always
counts the cycles executed by the CPU, and it doesn't even pretend
to be a usable timer.
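To make the P state point concrete, here is a small worked example in Python. It assumes a TSC that ticks at a fixed rate equal to the fastest P state while the core actually runs slower; the frequencies and the function name are hypothetical and chosen only to keep the arithmetic easy.

```python
def core_cycles_from_tsc(tsc_ticks: int, tsc_hz: float, core_hz: float) -> float:
    """Convert a TSC delta into cycles the core really executed.

    Assumes the TSC ticks at a constant tsc_hz while the core ran at a
    constant core_hz for the whole interval (an idealized model).
    """
    return tsc_ticks * core_hz / tsc_hz


# Hypothetical numbers: TSC at 2.4 GHz, core throttled to 1.2 GHz.
tsc_delta = 2400
real_cycles = core_cycles_from_tsc(tsc_delta, tsc_hz=2.4e9, core_hz=1.2e9)
print(f"TSC delta {tsc_delta} corresponds to {real_cycles:.0f} core cycles")
```

Under this model the raw TSC delta overstates the executed cycles by the ratio of the two frequencies, which user space generally cannot know.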
>
> RDTSC+perfctr [Pettersson] still is the fastest way for user-mode code
> to count something that is highly correlated with both "billable"
> CPU time and "code quality" for a fixed task. With a little care
Actually it's wrong - at least on Intel CPUs RDPMC is faster
than RDTSC because it doesn't synchronize.
> RDTSC is close enough to monotonic that I find it very useful.
You tested on a very limited set of platforms and setups then.
So far you were either lucky or just didn't notice the problems yet.
About the only reasonable usage was for custom hacks to measure
cycles, but with all the ongoing changes in its definition
I believe these users will be happier with rdpmc 0 once
it's enabled (and oprofile and other users have been taught
to keep their fingers away).
People who use it directly for timing, not measurement, are just wrong and
misguided.
-Andi
On Tue, 2005-11-29 at 10:29 -0800, John Reiser wrote:
> Andi Kleen wrote:
> > I think it's also a useful convention - RDTSC is becoming more and more
> > useless, and you cannot expect user applications that just want to
> > measure some cycles to rely on ever-changing, unstable, or nonexistent
> > performance counter APIs.
>
> Users are even more unhappy with ever-changing ABIs -- such as the
> kernel taking away RDTSC.
>
> RDTSC+perfctr [Pettersson] still is the fastest way for user-mode code
> to count something that is highly correlated with both "billable"
> CPU time and "code quality" for a fixed task. With a little care
> RDTSC is close enough to monotonic that I find it very useful.
> Please don't take away user-mode RDTSC.
>
The kernel didn't take this away, the hardware vendors did. For years,
although it was risky on paper, it was perfectly usable as a cheap
high-res timer. I agree that it's unfortunate.
Lee
On Tue, 2005-11-29 at 19:13 +0100, Andi Kleen wrote:
> > Where did you see that PMC0 (PERSEL0/PERFCTR0) can only be programmed
> > to count cpu cycles (i.e. cpu_clk_unhalted)? As far as I can tell from
> > the documentation, the 4 counters are symmetrical and can measure
> > any event that the processor offers.
>
> Linux NMI watchdog does that.
>
> All other perfctr users are supposed to keep their fingers away
> from the watchdog (it looks like oprofile doesn't but not for much
> longer ...)
Why? Hardcoding PMC 0 to be a cycle counter seems to be a waste of a
perfectly usable performance counter. What if I want to profile four
things, none of them requiring a cycle count?
--
Nicholas Miell <[email protected]>
On Tue, Nov 29, 2005 at 01:43:11PM -0800, Nicholas Miell wrote:
> On Tue, 2005-11-29 at 19:13 +0100, Andi Kleen wrote:
> > > Where did you see that PMC0 (PERSEL0/PERFCTR0) can only be programmed
> > > to count cpu cycles (i.e. cpu_clk_unhalted)? As far as I can tell from
> > > the documentation, the 4 counters are symmetrical and can measure
> > > any event that the processor offers.
> >
> > Linux NMI watchdog does that.
> >
> > All other perfctr users are supposed to keep their fingers away
> > from the watchdog (it looks like oprofile doesn't but not for much
> > longer ...)
>
> Why? Hardcoding PMC 0 to be a cycle counter seems to be a waste of a
> perfectly usable performance counter. What if I want to profile four
> things, none of them requiring a cycle count?
Then you won't be able to anymore. Providing a full replacement for RDTSC is
more important.
-Andi
Andi,
On Tue, Nov 29, 2005 at 10:52:07PM +0100, Andi Kleen wrote:
> On Tue, Nov 29, 2005 at 01:43:11PM -0800, Nicholas Miell wrote:
> > On Tue, 2005-11-29 at 19:13 +0100, Andi Kleen wrote:
> > > > Where did you see that PMC0 (PERSEL0/PERFCTR0) can only be programmed
> > > > to count cpu cycles (i.e. cpu_clk_unhalted)? As far as I can tell from
> > > > the documentation, the 4 counters are symmetrical and can measure
> > > > any event that the processor offers.
> > >
> > > Linux NMI watchdog does that.
> > >
> > > All other perfctr users are supposed to keep their fingers away
> > > from the watchdog (it looks like oprofile doesn't but not for much
> > > longer ...)
> >
> > Why? Hardcoding PMC 0 to be a cycle counter seems to be a waste of a
> > perfectly usable performance counter. What if I want to profile four
> > things, none of them requiring a cycle count?
>
On AMD you only have 4 counters. That's not a lot for some measurements.
The other thing is that PERFCTR0 is not like the TSC. It can count cycles
but it only implements 47 bits. At a high clock rate, this can wrap
around fairly rapidly. It all depends on what is the intended usage model.
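As a back-of-the-envelope check on the wrap-around concern, the following Python sketch computes how long a counter of the stated width takes to wrap and how to subtract two readings safely. The 47-bit width is taken from the text above; the clock rates and function names are assumptions made for the example.

```python
COUNTER_BITS = 47                      # width as stated above; treat as an assumption
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def seconds_until_wrap(clock_hz: float, bits: int = COUNTER_BITS) -> float:
    """Time for a counter that increments once per cycle to wrap around."""
    return (1 << bits) / clock_hz


def cycles_elapsed(start: int, end: int) -> int:
    """Difference between two readings, correct across at most one wrap."""
    return (end - start) & COUNTER_MASK


for ghz in (2.0, 3.0):                 # example clock rates, chosen arbitrarily
    hours = seconds_until_wrap(ghz * 1e9) / 3600
    print(f"{ghz} GHz: wraps after about {hours:.1f} hours")
```

Masking the difference handles a single wrap between two reads, so short measurements stay correct; anything that must span multiple wraps needs help from an overflow interrupt or a wider software counter.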
Suppose you would have a "stable" performance monitoring interface.
One could just use that interface to measure time only when needed.
--
-Stephane
On Tue, 2005-11-29 at 22:52 +0100, Andi Kleen wrote:
> On Tue, Nov 29, 2005 at 01:43:11PM -0800, Nicholas Miell wrote:
> > On Tue, 2005-11-29 at 19:13 +0100, Andi Kleen wrote:
> > > > Where did you see that PMC0 (PERSEL0/PERFCTR0) can only be programmed
> > > > to count cpu cycles (i.e. cpu_clk_unhalted)? As far as I can tell from
> > > > the documentation, the 4 counters are symmetrical and can measure
> > > > any event that the processor offers.
> > >
> > > Linux NMI watchdog does that.
> > >
> > > All other perfctr users are supposed to keep their fingers away
> > > from the watchdog (it looks like oprofile doesn't but not for much
> > > longer ...)
> >
> > Why? Hardcoding PMC 0 to be a cycle counter seems to be a waste of a
> > perfectly usable performance counter. What if I want to profile four
> > things, none of them requiring a cycle count?
>
> You won't then anymore. Providing a full replacement for RDTSC is
> more important.
>
I think you really need to come up with a better justification than "I
think this will be useful" for a permanent user-space ABI change.
What problem are you trying to solve, why is that a problem, how does
making PMC0 always be a cycle counter solve that, what makes you think
that future CPUs will have the same type of cycle counter that behaves
the same way as the current cycle counters, etc.
AFAICT, the problem you're trying to solve is two-fold:
a) RDTSC is serializing and RDPMC isn't.
Which is nicely solved by RDTSCP.
and
b) RDTSC isn't well defined.
Well, RDPMC isn't defined at all. You're assuming that future processor
revisions will have the same or substantially similar PerfCtrs as
current processors, and nothing guarantees that at all.
--
Nicholas Miell <[email protected]>
> I think you really need to come up with a better justification than "I
> think this will be useful" for a permanent user-space ABI change.
There's no user space ABI change involved, at least not from
the kernel side. Hardware is breaking some assumptions people
made though (they actually never worked fully, but these days they
break more clearly) and this is a best effort to adapt.
To give a bad analogy, RDTSC usage in recent years is
like explicit spinning wait loops for delays in earlier
times. They tended to work on some subset of computers,
but were always bad and caused problems, and people eventually learned
it was better to use operating system services for this.
The kernel will probably not disable RDTSC outright,
but will make it clear in documentation that it's a bad
idea to use directly and laugh at everybody who runs
into problems with it.
oprofile usage might change slightly though, although only
for a small subset of its users. There can't be very many
of them using multiple performance counters anyways because
at least in the last 0.9 release I tried it didn't even work.
> What problem are you trying to solve, why is that a problem, how does
> making PMC0 always be a cycle counter solve that, what makes you think
Read the original messages in the thread. They explain it all.
> that future CPUs will have the same type of cycle counter that behaves
> the same way as the current cycle counters, etc.
>
> AFAICT, the problem you're trying to solve is two-fold:
>
> a) RDTSC is serializing and RDPMC isn't.
>
> Which is nicely solved by RDTSCP.
No, you got that totally wrong. Please read the RDTSCP specification again.
> and
> b) RDTSC isn't well defined.
It's well defined - but in a way that makes it useless for cycle
measurements these days.
>
> Well, RDPMC isn't defined at all. You're assuming that future processor
> revisions will have the same or substantially similar PerfCtrs as
> current processors, and nothing guarantees that at all.
Point, but I guess it is reasonable to assume that future x86 CPUs
will have cycle counter perfctrs. I cannot imagine anybody dropping
such a basic facility.
-Andi
On Tue, Nov 29, 2005 at 02:19:15PM -0800, Stephane Eranian wrote:
> Andi,
>
> On Tue, Nov 29, 2005 at 10:52:07PM +0100, Andi Kleen wrote:
> > On Tue, Nov 29, 2005 at 01:43:11PM -0800, Nicholas Miell wrote:
> > > On Tue, 2005-11-29 at 19:13 +0100, Andi Kleen wrote:
> > > > > Where did you see that PMC0 (PERSEL0/PERFCTR0) can only be programmed
> > > > > to count cpu cycles (i.e. cpu_clk_unhalted)? As far as I can tell from
> > > > > the documentation, the 4 counters are symmetrical and can measure
> > > > > any event that the processor offers.
> > > >
> > > > Linux NMI watchdog does that.
> > > >
> > > > All other perfctr users are supposed to keep their fingers away
> > > > from the watchdog (it looks like oprofile doesn't but not for much
> > > > longer ...)
> > >
> > > Why? Hardcoding PMC 0 to be a cycle counter seems to be a waste of a
> > > perfectly usable performance counter. What if I want to profile four
> > > things, none of them requiring a cycle count?
> >
>
> On AMD you only have 4 counters. That's not a lot for some measurements.
Disabling the NMI watchdog for that is out of the question. It's an important
debugging device and without it kernel bug reports are much worse.
It has increased the quality of x86-64 bug reports considerably over
the years and I'm unwilling to give that up.
I didn't realize until now that oprofile did this, but I definitely
plan to fix it.
> The other thing is that PERFCTR0 is not like the TSC. It can count cycles
> but it only implements 47 bits. At a high clock rate, this can wrap
> around fairly rapidly. It all depends on what is the intended usage model.
TSC also doesn't count cycles in many circumstances (different frequency
depending on P states or not synchronized over CPUs, even running
at completely different frequencies etc.)
> Suppose you would have a "stable" performance monitoring interface.
Then you don't use TSC because it's not stable (or conversely
too stable for many performance measurements because it doesn't
follow the P states)
> One could just use that interface to measure time only when needed.
Good debugging infrastructure has priority IMHO - and the NMI watchdog
is important. You will need to live with three counters.
That said, there is one alternative - some of the newer chipsets
have external watchdogs that could also be used (using the ACPI WDOG
table). If someone writes a nice NMI driver for these, then on systems
with a working WDOG it could replace the perfctr-based timeout and free
the perfctr. That would need some code to allocate and deallocate
perfctrs though.
The older IOAPIC watchdog is no alternative because it runs too often (at HZ)
and has too much overhead.
-Andi
On Tue, 2005-11-29 at 23:43 +0100, Andi Kleen wrote:
> > I think you really need to come up with a better justification than "I
> > think this will be useful" for a permanent user-space ABI change.
>
> There's no user space ABI change involved, at least not from
> the kernel side. Hardware is breaking some assumptions people
> made though (they actually never worked fully, but these days they
> break more clearly) and this is a best effort to adapt.
Yes there is, you're allowing userspace usage of RDPMC and you're
guaranteeing that PMC0 will always be a cycle counter. The RDPMC usage
is benign (assuming you make a note somewhere that future versions of
Linux might disable both RDPMC and RDTSC(P) to prevent timing-channel
attacks), but that "PMC0 is a cycle counter" guarantee will probably
come back to haunt you.
> To give an bad analogy RDTSC usage in the last years is
> like explicit spinning wait loops for delays in the earlier
> times. They tended to work on some subset of computers,
> but were always bad and caused problems and people eventually learned
> it was better to use operating system services for this.
And you are now suggesting people should use RDPMC instead of OS
services?
> The kernel will probably not disable RDTSC outright,
> but will make it clear in documentation that it's a bad
> idea to use directly and laugh at everybody who runs
> into problems with it.
>
> oprofile usage might change slightly though, although only
> for a small subset of its users. There can't be very many
> of them using multiple performance counters anyways because
> at least in the last 0.9 release I tried it didn't even work.
>
> > What problem are you trying to solve, why is that a problem, how does
> > making PMC0 always be a cycle counter solve that, what makes you think
>
> Read the original messages in the thread. They explain it all.
>
> > that future CPUs will have the same type of cycle counter that behaves
> > the same way as the current cycle counters, etc.
> >
> > AFAICT, the problem you're trying to solve is two-fold:
> >
> > a) RDTSC is serializing and RDPMC isn't.
> >
> > Which is nicely solved by RDTSCP.
>
> No, you got that totally wrong. Please read the RDTSCP specification again.
Whoops, thanks.
> > and
> > b) RDTSC isn't well defined.
>
> It's well defined - but in a way that makes it useless for cycle
> measurements these days.
>
> >
> > Well, RDPMC isn't defined at all. You're assuming that future processor
> > revisions will have the same or substantially similar PerfCtrs as
> > current processors, and nothing guarantees that at all.
>
> Point, but I guess it is reasonable to assume that future x86 CPUs
> will have cycle counter perfctrs. I cannot imagine anybody dropping
> such a basic facility.
I don't think that's a reasonable assumption at all.
The definition of all these performance events for the Opteron is hidden
away in a very terse chart in the BIOS/Kernel manual.
That chart contains incompatible variations for pre-B, B, and C revision
processors and (among other strange things) includes instructions for
the monitoring of segment register loads to the HS register.
Everything is telling me that this is not something AMD intends to keep
stable and it isn't even something they're interested in documenting
very well at all.
My problem with this is basically that you are abusing something not
intended as an RDTSC-like interface to replace RDTSC, you're using a
scarce resource that could probably be put to better use to do it,
there's no guarantee that this abuse will continue to work in future
processor revisions, and this is a userspace-visible ABI change that
will have to be maintained indefinitely.
--
Nicholas Miell <[email protected]>
On Tue, Nov 29, 2005 at 11:43:47PM +0100, Andi Kleen wrote:
> > Well, RDPMC isn't defined at all. You're assuming that future processor
> > revisions will have the same or substantially similar PerfCtrs as
> > current processors, and nothing guarantees that at all.
>
> Point, but I guess it is reasonable to assume that future x86 CPUs
> will have cycle counter perfctrs. I cannot imagine anybody dropping
> such a basic facility.
It need not necessarily remain configurable to be in PMC 0, however.
Nor do the PMC MSRs have to remain fixed.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Tue, Nov 29, 2005 at 03:02:18PM -0800, Nicholas Miell wrote:
> On Tue, 2005-11-29 at 23:43 +0100, Andi Kleen wrote:
> > > I think you really need to come up with a better justification than "I
> > > think this will be useful" for a permanent user-space ABI change.
> >
> > There's no user space ABI change involved, at least not from
> > the kernel side. Hardware is breaking some assumptions people
> > made though (they actually never worked fully, but these days they
> > break more clearly) and this is a best effort to adapt.
>
> Yes there is, you're allowing userspace usage of RDPMC and you're
> guaranteeing that PMC0 will always be a cycle counter. The RDPMC usage
Well, it doesn't change any existing ABIs.
> is benign (assuming you make a note somewhere that future versions of
> Linux might disable both RDPMC and RDTSC(P) to prevent timing-channel
> attacks), but that "PMC0 is a cycle counter" guarantee will probably
> come back to haunt you.
There are no plans that I know of to disable them.
Also even if someone decided to disable them they could always
trap and emulate them.
>
> > To give an bad analogy RDTSC usage in the last years is
> > like explicit spinning wait loops for delays in the earlier
> > times. They tended to work on some subset of computers,
> > but were always bad and caused problems and people eventually learned
> > it was better to use operating system services for this.
>
> And you are now suggesting people should use RDPMC instead of OS
> services?
For any kind of timers they should use the OS service
(gettimeofday/clock_gettime). The OS will go to extraordinary
lengths to make it as fast as possible; when it's slow,
that's because it's not possible to do it faster accurately
(that's the case right now, modulo one possible optimization).
For cycle counting where they previously used RDTSC they should
use RDPMC 0 now.
> That chart contains incompatible variations for pre-B, B, and C revision
> processors and (among other strange things) includes instructions for
> the monitoring of segment register loads to the HS register.
>
> Everything is telling me that this is not something AMD intends to keep
> stable and it isn't even something they're interested in documenting
> very well at all.
There are obscure performance counters and then there are basic
fundamental performance counters. That particular counter hasn't
changed since the K7 days (and K6 didn't have performance counters)
Intel also always had an equivalent one. Unless they go to clockless
CPUs or something I think it's likely they will keep a counter like
this.
Also we'll let them know that we would like them to keep such a counter
around.
You're right that many other performance counters are not so stable.
-Andi
> It need not necessarily remain configurable to be in PMC 0 however.
> Nor do the PMC MSRs have to remain fixed.
If all else fails the kernel can always trap and emulate it. But I hope this will
never be needed.
-Andi
On Wed, 2005-11-30 at 00:17 +0100, Andi Kleen wrote:
> On Tue, Nov 29, 2005 at 03:02:18PM -0800, Nicholas Miell wrote:
> > On Tue, 2005-11-29 at 23:43 +0100, Andi Kleen wrote:
> > > To give an bad analogy RDTSC usage in the last years is
> > > like explicit spinning wait loops for delays in the earlier
> > > times. They tended to work on some subset of computers,
> > > but were always bad and caused problems and people eventually learned
> > > it was better to use operating system services for this.
> >
> > And you are now suggesting people should use RDPMC instead of OS
> > services?
>
> For any kind of timers they should use the OS service
> (gettimeofday/clock_gettime). The OS will go to extraordinary
> means to make it as fast as possible, but when it's slow
> then because it's not possible to do it faster accurately
> (that's the case right now modulo one possible optimization)
>
> For cycle counting where they previously used RDTSC they should
> use RDPMC 0 now.
Well, if that's all you want them to use RDPMC 0 for, why not just make
PMCs programmable from userspace?
> > That chart contains incompatible variations for pre-B, B, and C revision
> > processors and (among other strange things) includes instructions for
> > the monitoring of segment register loads to the HS register.
> >
> > Everything is telling me that this is not something AMD intends to keep
> > stable and it isn't even something they're interested in documenting
> > very well at all.
>
> There are obscure performance counters and then there are basic
> fundamental performance counters. That particular counter hasn't
> changed since the K7 days (and K6 didn't have performance counters)
Sounds like the gambler's fallacy to me, but I'll take your word for it
for now.
--
Nicholas Miell <[email protected]>
Andi Kleen wrote:
> To give an bad analogy RDTSC usage in the last years is
> like explicit spinning wait loops for delays in the earlier
> times. They tended to work on some subset of computers,
> but were always bad and caused problems and people eventually learned
> it was better to use operating system services for this.
>
> The kernel will probably not disable RDTSC outright,
> but will make it clear in documentation that it's a bad
> idea to use directly and laugh at everybody who runs
> into problems with it.
I happen to have an old program which uses RDTSC frequently for timing
purposes. That seemed like a good idea at the time, but I guess it
should be updated. The question is, what replacement is there? I don't
want to have to use a syscall every 50 instructions or so. Feel free to
laugh, but suggesting a workable replacement might be more helpful.
Bernd
> Well, if that's all you want them to use RDPMC 0 for, why not just make
> PMCs programmable from userspace?
First, we need to have a cycle counter PMC anyway for the NMI watchdog.
So it might as well be used for other purposes.
And using virtual performance counters adds a large cost to the
context switch for changing the MSRs around. An always-running counter
avoids this problem.
-Andi
> I happen to have an old program which uses RDTSC frequently for timing
> purposes. That seemed like a good idea at the time, but I guess it
> should be updated. The question is, what replacement is there? I don't
> want to have to use a syscall every 50 instructions or so. Feel free to
> laugh, but suggesting a workable replacement might be more helpful.
Well, you're asking me to write all the points from the documentation
in advance. Not tonight (or look up the full thread in the l-k archives,
I think I covered most)
But right now gettimeofday or clock_gettime(CLOCK_MONOTONIC) is
probably your best option. It tries to be reasonably fast depending
on hardware capabilities (and when you measure on P4,
even basic RDTSC is quite slow). At least on x86-64 gettimeofday
isn't a real system call, but actually often stays in ring 3.
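A minimal sketch of interval timing with CLOCK_MONOTONIC, written in Python for brevity; a C program would call clock_gettime() directly. The helper name and the workload are placeholders invented for the example.

```python
import time


def elapsed_ns(fn) -> int:
    """Run fn() and return the elapsed CLOCK_MONOTONIC time in nanoseconds."""
    start = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    fn()
    end = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    return end - start


# Placeholder workload; substitute the code you actually want to time.
print(elapsed_ns(lambda: sum(range(100_000))), "ns")
```

This measures wall-clock time on a monotonic clock rather than CPU cycles, so it suits timing but not cycle counting.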
If you want to measure instructions in cycles in the future you can probably
use RDPMC 0, but that's not implemented yet. It won't be a replacement
for a timer for other purposes.
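A rough sketch of what reading PMC 0 from ring 3 could look like. It assumes the proposal in this thread is in effect: CR4.PCE is set and the kernel has programmed counter 0 to count cycles. Neither is guaranteed; without CR4.PCE the instruction faults (typically delivered as SIGSEGV on Linux). The names `read_pmc` and `pmc_combine` are illustrative only.

```c
#include <stdint.h>

/* Combine the EDX:EAX halves that RDPMC returns into one 64-bit value. */
uint64_t pmc_combine(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/*
 * Read performance counter `index` (selected via ECX) from user space.
 * ASSUMPTION: the kernel has set CR4.PCE and programmed this counter;
 * otherwise the instruction faults and the process is killed.
 */
uint64_t read_pmc(uint32_t index)
{
    uint32_t lo, hi;

    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(index));
    return pmc_combine(hi, lo);
}
#endif
```

Under those assumptions a measurement would be `read_pmc(0)` before and after the code under test, with the process pinned to one CPU so both reads hit the same counter.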
-Andi
On Wed, Nov 30, 2005 at 12:39:20AM +0100, Andi Kleen wrote:
> > Well, if that's all you want them to use RDPMC 0 for, why not just make
> > PMCs programmable from userspace?
>
> First we need to have a cycle counter PMC anyways for the NMI watchdog.
> So it can as well be used for other purposes.
But the watchdog doesn't require it to be in the same place on all
machines. Exporting it turns an internal requirement into an ABI
point.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Tuesday 29 November 2005 17:39, Andi Kleen wrote:
> > Well, if that's all you want them to use RDPMC 0 for, why not just make
> > PMCs programmable from userspace?
>
> First we need to have a cycle counter PMC anyways for the NMI watchdog.
> So it can as well be used for other purposes.
>
Yes, but that assumes having the NMI watchdog around is more important to you
than having 4 performance counters available. I'm perfectly willing to
have the NMI watchdog around by default, since it seems to be useful in most
cases. But if my measurement study needs 4 PMC's to do its job and I am
willing to forgo use of the NMI watchdog for that period of time, why
shouldn't I be allowed to do that? We have few enough PMCs anyway, I just
don't like the idea of giving one up forever. I'd much prefer to make that
decision myself rather than have it enforced by the OS. Let me make the
tradeoffs about which is the most valuable in my particular situation,
please.
Furthermore, if I know what I am doing, I can still make use of the RDTSC to
do timing. Yes, I have to pin the process to the current cpu and yes I have
to force the power state to a known setup, and oh yeah, we have to turn off
the halt in the idle loop, etc., etc., but after doing that it works quite nicely,
thank you. And you've got a tremendous education problem to get people not
to use the RDTSC and lots of existing programs that have to be modified.
So, sure, allow RDPMC from ring 3. For people who want to use RDPMC and
performance counter 0 for timing, let them do that.
The real issue here is that there needs to be a defined kernel interface to
allow user programs to allocate and use PMCs and that is part of what this
whole discussion on the perfctr-devel mailing list has been about. Let's
get that defined and then if a user requests PMC0 and insists on using it,
then perhaps the NMI watchdog will simply have to go away until PMC0 is
available to the kernel again.
I'm also not sure what the performance tradeoff between RDTSC/P and RDPMC is
across processor implementations, now and future. That's something that
needs to be looked at.
> And using virtual performance counters adds a large cost to the
> context switch for changing the MSRs around. An always running counter
> avoids this problem.
>
> -Andi
--
Ray Bryant
AMD Performance Labs Austin, Tx
512-602-0038 (o) 512-507-7807 (c)
On Wed, Nov 30, 2005 at 10:56:26AM +1100, David Gibson wrote:
> On Wed, Nov 30, 2005 at 12:39:20AM +0100, Andi Kleen wrote:
> > > Well, if that's all you want them to use RDPMC 0 for, why not just make
> > > PMCs programmable from userspace?
> >
> > First we need to have a cycle counter PMC anyways for the NMI watchdog.
> > So it can as well be used for other purposes.
>
> But the watchdog doesn't require it to be in the same place on all
> machines. Exporting it turns an internal requirement into an ABI
> point.
Yes, but the only alternative I know of would be to let user space
continue getting broken all the time with RDTSC. RDPMC is a nice
alternative.
-Andi
On Tue, Nov 29, 2005 at 06:50:49PM -0600, Ray Bryant wrote:
> On Tuesday 29 November 2005 17:39, Andi Kleen wrote:
> > > Well, if that's all you want them to use RDPMC 0 for, why not just make
> > > PMCs programmable from userspace?
> >
> > First we need to have a cycle counter PMC anyways for the NMI watchdog.
> > So it can as well be used for other purposes.
> >
>
> Yes, but that assumes having the NMI watchdog around is more important to you
> than having 4 performance counters available. I'm perfectly willing to
Stable kernels are important to everyone. And we can only
get stable kernels if the bug reports are good. And that needs
the NMI watchdog.
Also, look at it differently: with the default ticking counter exported you
have a nice, relatively reliable (you need to pin to CPUs) way to measure
cycles of instructions now. Previously that wasn't possible, or rather
required one to jump through all kinds of hoops to get right. My opinion
is that RDTSC is becoming less and less usable for this kind of stuff,
but since micro measurements are important for many cases it's important
to offer a nice facility for them.
> have the NMI watchdog around by default, since it seems to be useful in most
> cases. But if my measurement study needs 4 PMC's to do its job and I am
> willing to forgo use of the NMI watchdog for that period of time, why
> shouldn't I be allowed to do that? We have few enough PMCs anyway, I just
> don't like the idea of giving one up forever. I'd much prefer to make that
You don't give it up - it's just dedicated to counting cycles, free
for everybody to use.
-Andi
On Wed, Nov 30, 2005 at 01:34:33AM +0100, Andi Kleen wrote:
> On Wed, Nov 30, 2005 at 10:56:26AM +1100, David Gibson wrote:
> > On Wed, Nov 30, 2005 at 12:39:20AM +0100, Andi Kleen wrote:
> > > > Well, if that's all you want them to use RDPMC 0 for, why not just make
> > > > PMCs programmable from userspace?
> > >
> > > First we need to have a cycle counter PMC anyways for the NMI watchdog.
> > > So it can as well be used for other purposes.
> >
> > But the watchdog doesn't require it to be in the same place on all
> > machines. Exporting it turns an internal requirement into an ABI
> > point.
>
> Yes, but the only alternative I know of would be to let user space
> continue getting broken all the time with RDTSC. RDPMC is a nice
> alternative.
It just seems to me unwise to make an ABI commitment to something
that's not guaranteed by the architecture, perverse though it might
seem for the chip designers to take it away. CPU designers have been
known to do some fairly perverse things from time to time..
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
> It just seems to me unwise to make an ABI commitment to something
> that's not guaranteed by the architecture, perverse though it might
> seem for the chip designers to take it away. CPU designers have been
> known to do some fairly perverse things from time to time..
What would you prefer? Letting them continue to use RDTSC,
which is less and less a usable cycle count? Telling the users
it's a bad idea without offering a credible alternative
for cycle counting?
And if they break it the kernel can always trap it, so there's
at least a safety net.
One alternative would be to make a function in the vsyscall page, but
that would add at least an indirect function call which can be quite
slow and I can just hear people complaining about that.
Also there are only two slots in the vsyscall page left right now and I
already planned them for other things (although it would be possible to go
for a full vDSO, just a lot of overhead)
-Andi
Hi Andi,
On Tue, 29 Nov 2005, Andi Kleen wrote:
> I'm considering to enable CR4.PCE by default on x86-64/i386. Currently it's 0
> which means RDPMC doesn't work. On x86-64 PMC 0 is always programmed
> to be a cycle counter, so it would be useful to be able to access
> this for measuring instructions. That's especially useful because RDTSC
> does not necessarily count cycles in the current P state (already
> the case on Intel CPUs and AMD's future direction seems to also
> to decouple it from cycles) Drawback is that it stops during idle, but
> that shouldn't be a big issue for normal measuring. It's not useful
> as a real timer anyways.
Some processor implementations don't have a performance counter which
ticks during the idle thread either.
> Any comments on this?
I think that this should be best left to a profiling tool to configure and
not a general kernel facility. I also have very little faith in processor
vendors not doing to performance counters what was done to the TSC.
Zwane
> > Any comments on this?
>
> I think that this should be best left to a profiling tool to configure and
> not a general kernel facility. I also have very little faith in processor
> vendors not doing to performance counters what was done to the TSC.
The use case would be small code snippets to benchmark some code
or instructions. That normally can't be done with an external
tool and exec'ing something would be quite awkward. Also
I would like to allow this for normal users.
The performance counters definitely share some properties with TSC already -
they definitely won't be synced (because they don't tick in C states etc.)
so if you change CPUs they won't be monotone.
But I doubt we'll ever see them running at a different frequency than
the current P state, which is the big problem RDTSC has now and that's
why I'm looking for a replacement. That it's faster on P4 is just
a bonus.
However it looks like everybody except me hates the idea :/ Or perhaps
a lot of the opposition was just against the reasonable position
that profiling should not disturb the NMI watchdogs. I guess not
everybody values debuggable kernel dumps.
-Andi
Nicholas,
On Tue, Nov 29, 2005 at 03:29:26PM -0800, Nicholas Miell wrote:
> On Wed, 2005-11-30 at 00:17 +0100, Andi Kleen wrote:
> > On Tue, Nov 29, 2005 at 03:02:18PM -0800, Nicholas Miell wrote:
> > > On Tue, 2005-11-29 at 23:43 +0100, Andi Kleen wrote:
> > > > To give an bad analogy RDTSC usage in the last years is
> > > > like explicit spinning wait loops for delays in the earlier
> > > > times. They tended to work on some subset of computers,
> > > > but were always bad and caused problems and people eventually learned
> > > > it was better to use operating system services for this.
> > >
> > > And you are now suggesting people should use RDPMC instead of OS
> > > services?
> >
> > For any kind of timers they should use the OS service
> > (gettimeofday/clock_gettime). The OS will go to extraordinary
> > means to make it as fast as possible, but when it's slow
> > then because it's not possible to do it faster accurately
> > (that's the case right now modulo one possible optimization)
> >
> > For cycle counting where they previously used RDTSC they should
> > use RDPMC 0 now.
>
> Well, if that's all you want them to use RDPMC 0 for, why not just make
> PMCs programmable from userspace?
>
Simply because writing a PERFSEL register (i.e. an MSR) is a privileged operation.
On Tue, 2005-11-29 at 23:38 -0800, Stephane Eranian wrote:
> Nicholas,
>
> On Tue, Nov 29, 2005 at 03:29:26PM -0800, Nicholas Miell wrote:
> > On Wed, 2005-11-30 at 00:17 +0100, Andi Kleen wrote:
> > > On Tue, Nov 29, 2005 at 03:02:18PM -0800, Nicholas Miell wrote:
> > > > On Tue, 2005-11-29 at 23:43 +0100, Andi Kleen wrote:
> > > > > To give an bad analogy RDTSC usage in the last years is
> > > > > like explicit spinning wait loops for delays in the earlier
> > > > > times. They tended to work on some subset of computers,
> > > > > but were always bad and caused problems and people eventually learned
> > > > > it was better to use operating system services for this.
> > > >
> > > > And you are now suggesting people should use RDPMC instead of OS
> > > > services?
> > >
> > > For any kind of timers they should use the OS service
> > > (gettimeofday/clock_gettime). The OS will go to extraordinary
> > > means to make it as fast as possible, but when it's slow
> > > then because it's not possible to do it faster accurately
> > > (that's the case right now modulo one possible optimization)
> > >
> > > For cycle counting where they previously used RDTSC they should
> > > use RDPMC 0 now.
> >
> > Well, if that's all you want them to use RDPMC 0 for, why not just make
> > PMCs programmable from userspace?
> >
> Simply because writing a PERFSEL register (i.e. an MSR) is a privileged operation.
Well, yeah. Design an API (or just steal from Solaris) and expose that
to userspace.
--
Nicholas Miell
Nicholas,
On Wed, Nov 30, 2005 at 12:22:42AM -0800, Nicholas Miell wrote:
> On Tue, 2005-11-29 at 23:38 -0800, Stephane Eranian wrote:
> > Nicholas,
> >
> > On Tue, Nov 29, 2005 at 03:29:26PM -0800, Nicholas Miell wrote:
> > > On Wed, 2005-11-30 at 00:17 +0100, Andi Kleen wrote:
> > > > On Tue, Nov 29, 2005 at 03:02:18PM -0800, Nicholas Miell wrote:
> > > > > On Tue, 2005-11-29 at 23:43 +0100, Andi Kleen wrote:
> > > > > > To give an bad analogy RDTSC usage in the last years is
> > > > > > like explicit spinning wait loops for delays in the earlier
> > > > > > times. They tended to work on some subset of computers,
> > > > > > but were always bad and caused problems and people eventually learned
> > > > > > it was better to use operating system services for this.
> > > > >
> > > > > And you are now suggesting people should use RDPMC instead of OS
> > > > > services?
> > > >
> > > > For any kind of timers they should use the OS service
> > > > (gettimeofday/clock_gettime). The OS will go to extraordinary
> > > > means to make it as fast as possible, but when it's slow
> > > > then because it's not possible to do it faster accurately
> > > > (that's the case right now modulo one possible optimization)
> > > >
> > > > For cycle counting where they previously used RDTSC they should
> > > > use RDPMC 0 now.
> > >
> > > Well, if that's all you want them to use RDPMC 0 for, why not just make
> > > PMCs programmable from userspace?
> > >
> > Simply because writing a PERFSEL register (i.e. an MSR) is a privileged operation.
>
> Well, yeah. Design an API (or just steal from Solaris) and expose that
> to userspace.
>
And that is exactly what I am doing with the perfmon2 interface that I
am working on. It does expose the performance counters to users via
a system call. With it you can certainly program a PERFSEL to count
cycles and you can use RDPMC from ring 3 to read the current value.
--
-Stephane
Andi,
On Tue, Nov 29, 2005 at 11:51:55PM +0100, Andi Kleen wrote:
> On Tue, Nov 29, 2005 at 02:19:15PM -0800, Stephane Eranian wrote:
> >
> > On AMD you only have 4 counters. That's not a lot for some measurements.
>
> Disabling the NMI watchdog for that is out of question. It's a important
> debugging device and without it kernel bug reports are much worse.
> It increased the quality of x86-64 bug reports over the years
> considerably and I'm unwilling to give that up.
>
So if I understand correctly, the kernel would program a counter
to count elapsed cycles while executing at ring 0 and ring 3. The watchdog
works by polling on the counter and after a certain delta is reached it
triggers an NMI interrupt which, in turn, causes a kernel crash and the
(bug) report. Is that the correct behavior?
> > but it does only implement 47bits. At a high clock rate, this can wrap
> > around fairly rapidly. It all depends on what is the intended usage model.
>
> TSC also doesn't count cycles in many circumstances (different frequency
> depending on P states or not synchronized over CPUs, even running
> at completely different frequencies etc.)
>
Well, the big difference here is that once the counter reaches 2^{47}, it goes
back to zero, i.e., it wraps around silently by default. If you are polling
it, you will suddenly see a huge delta and think that a long period of time
has elapsed when in fact it is just one cycle. The TSC may not count all cycles,
but the user sees the counter as continuously increasing.
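To make the wrap-around concern concrete, here is a small sketch of a delta computation that tolerates a single wrap of a narrow counter. The 47-bit width is taken from the message above; the function names are invented for illustration, and the approach only works if the counter is read at least once per wrap period.

```c
#include <stdint.h>

/* Mask selecting the low `width` bits; width >= 64 means the full word. */
uint64_t counter_mask(unsigned width)
{
    return width >= 64 ? UINT64_MAX : ((1ULL << width) - 1);
}

/*
 * Events elapsed between two raw reads of a `width`-bit up-counter,
 * assuming the counter wrapped at most once between the reads.
 */
uint64_t counter_delta(uint64_t prev, uint64_t now, unsigned width)
{
    return (now - prev) & counter_mask(width);
}
```

A naive `now - prev` on the raw 64-bit values would report a huge number across the wrap; masking to the counter width gives the small true delta instead.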
Also are you sure that the PERFCTR0/PERFSEL0 are not affected when going
into lower power state? I know from experience that on IA-64, for instance,
the counters are seriously affected.
> Good debugging infrastructure has priority imho - and NMI watchdog
> is important. You will need to live with three counters.
>
As Ray mentioned, it all depends on what the user/sysadmin is after.
Some people may be okay with disabling NMI in favor of more counters.
Obviously other people are not.
> This means there is one alternative - some of the newer chipsets
> have external watchdogs that could be also used (using the ACPI WDOG
> table). If someone writes a nice NMI driver for these then on system
> with working WDOG it could replace the perfctr based timeout and free
> the perfctr. That would need some code to allocate and deallocate
> perfctrs though.
>
I discussed the idea of an arbitration layer to access performance
counters with David Gibson a couple of months ago. I do believe this is
an important mechanism to have and I would like to restart the discussion
on this topic. You are providing us another reason why this could
be useful.
--
-Stephane
On Wed, Nov 30, 2005 at 08:01:59AM -0800, Stephane Eranian wrote:
> Andi,
>
> On Tue, Nov 29, 2005 at 11:51:55PM +0100, Andi Kleen wrote:
> > On Tue, Nov 29, 2005 at 02:19:15PM -0800, Stephane Eranian wrote:
> > >
> > > On AMD you only have 4 counters. That's not a lot for some measurements.
> >
> > Disabling the NMI watchdog for that is out of question. It's a important
> > debugging device and without it kernel bug reports are much worse.
> > It increased the quality of x86-64 bug reports over the years
> > considerably and I'm unwilling to give that up.
> >
>
> So if I understand correctly, the kernel would program a counter
It always did that on x86-64. On i386 it's an option, but
off by default because it breaks on some broken BIOS.
[imho it should be actually made default there too with
the bad systems just blacklisted.]
> to count elapsed cycles while executing a ring 0 and ring 3. The watchdog
> works by polling on the counter and after a certain delta is reached it
> triggers an NMI interrupt which, in turn, causes a kernel crash and the
> (bug) report. Is that the correct behavior?
The watchdog is driven by the performance counter (this means
it has a varying frequency, but that's not a big issue for the watchdog).
It underflows every second in the fastest case, or very slowly
if the machine is idle. Every time it underflows, the handler checks whether
the per-CPU timer has been ticking, and if it hasn't for some time
it triggers an oops.
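The check described above can be sketched roughly as follows. This is a user-space model, not the kernel's code: the struct, the function name `wd_check`, and the `limit` parameter are all hypothetical, and the real handler runs in NMI context with per-CPU data.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state; names are illustrative, not kernel symbols. */
struct wd_state {
    uint64_t last_timer_ticks; /* timer tick count seen at the previous NMI */
    unsigned stalled_nmis;     /* consecutive NMIs with no timer progress */
};

/*
 * Called once per (simulated) counter-overflow NMI. Returns true when the
 * timer tick count has not advanced for `limit` consecutive NMIs, which is
 * the point at which the watchdog would report a lockup.
 */
bool wd_check(struct wd_state *wd, uint64_t timer_ticks, unsigned limit)
{
    if (timer_ticks != wd->last_timer_ticks) {
        wd->last_timer_ticks = timer_ticks;
        wd->stalled_nmis = 0;
        return false;
    }
    wd->stalled_nmis++;
    return wd->stalled_nmis >= limit;
}
```

Because the NMI arrives even with ordinary interrupts disabled, a CPU stuck with interrupts off stops advancing its timer ticks but still reaches this check.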
In theory other sources could be used, but there aren't any generic
ones. At some point the local APIC timer was used, but that
was disabled because it ticked at HZ and caused too much overhead.
If the motherboard has a usable watchdog timer in the chipset
I wouldn't be completely opposed to using that too, but the problem is
that many chipsets don't have them or only have broken ones, and it's fairly
important that this works reliably to get better bug reports.
> > > but it does only implement 47bits. At a high clock rate, this can wrap
> > > around fairly rapidly. It all depends on what is the intended usage model.
> >
> > TSC also doesn't count cycles in many circumstances (different frequency
> > depending on P states or not synchronized over CPUs, even running
> > at completely different frequencies etc.)
> >
>
> Well the big difference here is that once the counter reaches 2^{47}, it goes
> back to zero, i.e., it wraps around silently by default. If you are polling
> it, you will suddenly see a huge delta and think that a long period of time
> has elapsed when in fact it is just one cycle. The TSC may not count all cycles,
> but the user sees the counter as continuously increasing.
The obvious solution would be to set an underflow interrupt at 2^46 or so
and then reset the counters. For that you would need to count down though.
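One common way to get an interrupt after a fixed number of events on an up-counting register is to preload it with the negated period, so that it wraps to zero after exactly that many increments. The sketch below shows only the arithmetic; the function names are hypothetical, the 47-bit width is the figure quoted in this thread, and writing the value to the hardware (a privileged MSR write) is deliberately left out.

```c
#include <stdint.h>

/*
 * Value to preload into an up-counting `width`-bit register so that it
 * wraps to zero (and can raise its interrupt) after `period` events.
 */
uint64_t pmc_reload_value(uint64_t period, unsigned width)
{
    uint64_t mask = width >= 64 ? UINT64_MAX : ((1ULL << width) - 1);

    return ((uint64_t)0 - period) & mask;
}

/*
 * Events remaining before the wrap, given a raw read of the counter.
 * A raw value of 0 yields 0, i.e. the counter has just wrapped.
 */
uint64_t pmc_events_until_wrap(uint64_t raw, unsigned width)
{
    uint64_t mask = width >= 64 ? UINT64_MAX : ((1ULL << width) - 1);

    return ((uint64_t)0 - raw) & mask;
}
```

With a period of 2^46 on a 47-bit counter the preload value is also 2^46, which matches the "interrupt at 2^46 or so" idea: the handler reloads the counter and adds the period to a software total.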
>
> Also are you sure that the PERFCTR0/PERFSEL0 are not affected when going
> into lower power state? I know by experience that one IA-64, for instance,
> the counters are seriously affected.
They stop ticking in idle. Yes, that's ok if you just want to measure
cycles because there are no cycles in idle.
It's not ok for timing (wall clock time) purposes, but it's also not
intended for that. If you want time, use gettimeofday.
They will also clock slower if the CPU is in a P state (runs with lower
frequency), but for measurements that's also wanted and expected, I believe.
E.g. with RDTSC on Intel right now, if you are in a lower P state you will
get wrong results.
Basically it's a good cycle timer for instruction measurements and
nothing more.
Not ticking in idle actually helps with that because it makes
it totally clear to everybody that it's not a wall clock :)
> As Ray mentioned, it all depends on what the user/sysadmin is after.
> Some people may be okay with disabling NMI in favor of more counters.
> Obviously other people are not.
I cannot stop them from hacking the kernel, but I don't think
I will make it easy for them to do this in a stock kernel
(or at least not until they provide a reliable alternative watchdog
time source).
> > This means there is one alternative - some of the newer chipsets
> > have external watchdogs that could be also used (using the ACPI WDOG
> > table). If someone writes a nice NMI driver for these then on system
> > with working WDOG it could replace the perfctr based timeout and free
> > the perfctr. That would need some code to allocate and deallocate
> > perfctrs though.
> >
> I discussed the idea of an arbitration layer to access performance
> counters with David Gibson a couple of months ago. I do believe this is
> an important mechanism to have and I would like to restart the discussion
> on this topic. You are providing us another reason why this could
> be useful.
There have been many attempts at that over the years. The one
from IBM a few years back, done for dprobes, actually didn't look that bad.
There were a few others.
At least for the NMI watchdog it is not 100% needed - the
other code can just be changed to leave perfctr 0 alone.
-Andi
On Wed, 30 Nov 2005, Andi Kleen wrote:
> The performance counters definitely share some properties with TSC already -
> they definitely won't be synced (because they don't tick in C states etc.)
> so if you change CPUs they won't be monotone.
>
> But I doubt we'll ever see them running at a different frequency than
> the current P state, which is the big problem RDTSC has now and that's
> why I'm looking for a replacement. That it's faster on P4 is just
> a bonus.
>
> However it looks like everybody except me hates the idea :/ Or perhaps
> a lot of the opposition was just against the reasonable position
> that profiling should not disturb the NMI watchdogs. I guess not
> everybody values debuggable kernel dumps.
I agree that the NMI watchdog is very important, but my main objection is
trying to provide a stable interface for this. I would rather see a
separate tool do it (as cumbersome as it may be) even if it meant that
the external tool simply did what you intend on doing in the kernel.
Zwane
> I agree that the NMI watchdog is very important, but my main objection is
> trying to provide a stable interface for this. I would rather see a
> separate tool do it (as cumbersome as it may be) even if it meant that
> the external tool simply did what you intend on doing in the kernel.
But why an external tool when the nmi watchdog needs this anyways?
-Andi
On Thu, 1 Dec 2005, Andi Kleen wrote:
> > I agree that the NMI watchdog is very important, but my main objection is
> > trying to provide a stable interface for this. I would rather see a
> > separate tool do it (as cumbersome as it may be) even if it meant that
> > the external tool simply did what you intend on doing in the kernel.
>
> But why an external tool when the nmi watchdog needs this anyways?
We'll probably have the NMI watchdog ticking at a lower frequency anyway,
and it isn't always enabled. Although I recall you want the NMI watchdog on by
default on i386 too (which I don't have a problem with). The reason for
the external tool is that it allows more flexibility for profiling
tools; that is, more PMCs can be used if needed by the user.
Andi,
On Wed, Nov 30, 2005 at 05:23:15PM +0100, Andi Kleen wrote:
> > to count elapsed cycles while executing a ring 0 and ring 3. The watchdog
> > works by polling on the counter and after a certain delta is reached it
> > triggers an NMI interrupt which, in turn, causes a kernel crash and the
> > (bug) report. Is that the correct behavior?
>
> The watchdog is driven by the performance counter (this means
> it has varying frequency, but that's not a big issue for the watchdog)
>
> It underflows every second in the fastest case or very slowly
> (if the machine is idle). Every time it underflows it checks if
> the per CPU timer has been ticking, and if it hasn't for some time
> it triggers an oops.
How is the checking for underflows done? Polling?
>
> The obvious solution would be to set an underflow interrupt at 2^46 or so
> and then reset the counters. For that you would need to count down though.
>
> >
> > Also are you sure that the PERFCTR0/PERFSEL0 are not affected when going
> > into lower power state? I know by experience that one IA-64, for instance,
> > the counters are seriously affected.
>
> They stop ticking in idle. Yes, that's ok if you just want to measure
> cycles because there are no cycles in idle.
>
Ok that makes sense.
> It's not ok for timing (wall clock time) purposes, but it's also not
> intended for that. If you want time use gettimeofday
>
I agree.
> They will also clock slower if the CPU is in a P state (runs with lower
> frequency), but for measurements that's also wanted and expected I believe.
> e.g with RDTSC on Intel right now if you are in a lower P state you will
> get wrong results.
>
> Basically it's a good cycle timer for instruction measurements and
> nothing more.
Yes, it requires pinning and also that nothing else can run
on the processor. This is very restrictive and I wonder if
this is really usable?
> Not ticking in idle actually helps with that because it makes
> it totally clear to everybody that it's not a wall clock :)
>
Yes, except that this is silently done. There needs to be a
STRONG warning about this somewhere.
> > As Ray mentioned, it all depends on what the user/sysadmin is after.
> > Some people may be okay with disabling NMI in favor of more counters.
> > Obviously other people are not.
>
> I cannot stop them from hacking the kernel, but I don't think
> I will make it easy for them to do this in a stock kernel
> (or at least not until they provide an reliable alternative watchdog
> time source)
>
--
-Stephane
On Thu, Dec 01, 2005 at 03:41:50PM -0800, Stephane Eranian wrote:
> On Wed, Nov 30, 2005 at 05:23:15PM +0100, Andi Kleen wrote:
> > > to count elapsed cycles while executing a ring 0 and ring 3. The watchdog
> > > works by polling on the counter and after a certain delta is reached it
> > > triggers an NMI interrupt which, in turn, causes a kernel crash and the
> > > (bug) report. Is that the correct behavior?
> >
> > The watchdog is driven by the performance counter (this means
> > it has varying frequency, but that's not a big issue for the watchdog)
> >
> > It underflows every second in the fastest case or very slowly
> > (if the machine is idle). Every time it underflows it checks if
> > the per CPU timer has been ticking, and if it hasn't for some time
> > it triggers an oops.
>
> How is the checking for underflows done? Polling?
There is a bit in the perfctr MSRs to cause an interrupt if it underflows.
That is programmed to be an NMI.
-Andi
Andi,
On Fri, Dec 02, 2005 at 01:07:37AM +0100, Andi Kleen wrote:
> On Thu, Dec 01, 2005 at 03:41:50PM -0800, Stephane Eranian wrote:
> > On Wed, Nov 30, 2005 at 05:23:15PM +0100, Andi Kleen wrote:
> > > > to count elapsed cycles while executing a ring 0 and ring 3. The watchdog
> > > > works by polling on the counter and after a certain delta is reached it
> > > > triggers an NMI interrupt which, in turn, causes a kernel crash and the
> > > > (bug) report. Is that the correct behavior?
> > >
> > > The watchdog is driven by the performance counter (this means
> > > it has varying frequency, but that's not a big issue for the watchdog)
> > >
> > > It underflows every second in the fastest case or very slowly
> > > (if the machine is idle). Every time it underflows it checks if
> > > the per CPU timer has been ticking, and if it hasn't for some time
> > > it triggers an oops.
> >
> > How is the checking for underflows done? Polling?
>
> There is a bit in the perfctr MSRs to cause an interrupt if it underflows.
> That is programmed to be an NMI.
>
But the interrupt is programmed for all counters. So by forcing the PMU
interrupt to use the NMI vector, any perfmon interface would have to use
this interrupt as well. That will break the whole thing because in many
places we rely on the PMU interrupt being off.
--
-Stephane
On Thu, Dec 01, 2005 at 11:09:31PM -0800, Stephane Eranian wrote:
> But the interrupt is programmed for all counters. So by forcing the PMU
> interrupt to use the NMI vector then any perfmon interface would have to use
That is how it works yes. oprofile also uses NMIs.
> this interrupt as well. That will break the whole thing because in many
> places we rely on PMU interrupt being off.
If you rely on that then your subsystem is not usable on x86/x86-64.
-Andi