2006-03-20 12:24:56

by bert hubert

[permalink] [raw]
Subject: gettimeofday order of magnitude slower with pmtimer, which is default

Hi everybody,

For my open source nameserver (http://www.powerdns.com) I need to do quite a
number of gettimeofday calls. I've pared it down to almost the bare minimum
of 1 gettimeofday per packet sent and received. With these calls, I can make
stats like http://ds9a.nl/tmp/rrd/ which my users need so they can verify
the proper performance of the nameserver.

Yesterday, together with Zwane, I discovered each gettimeofday call costs me
4 usec on some boxes and almost nothing on others. We did a fruitless chase
for vsyscall/sysenter happening but the problem turned out to be
CONFIG_X86_PM_TIMER.

This problem has been discussed before
http://www.ussg.iu.edu/hypermail/linux/kernel/0411.1/2135.html

Not only is the pm timer slow by design, it also needs to be read multiple
times to work around a bug in certain hardware.

What is new is that this option is now dependent on CONFIG_EMBEDDED. Unless
you select this option, the PM Timer will always be used.

Would a patch removing the link to EMBEDDED and adding a warning that while
this timer is of high quality, it is slow, be welcome?

Thanks.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services


2006-03-20 14:50:50

by Andreas Mohr

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

Hi,

On Mon, Mar 20, 2006 at 01:24:49PM +0100, bert hubert wrote:
> Yesterday, together with Zwane, I discovered each gettimeofday call costs me
> 4 usec on some boxes and almost nothing on others. We did a fruitless chase
> for vsyscall/sysenter happening but the problem turned out to be
> CONFIG_X86_PM_TIMER.
>
> This problem has been discussed before
> http://www.ussg.iu.edu/hypermail/linux/kernel/0411.1/2135.html
>
> Not only is the pm timer slow by design, it also needs to be read multiple
> times to work around a bug in certain hardware.
I've been realizing this, too, when looking at some oprofile logs.
pmtmr reading uses almost 3% CPU time (e.g. P3/700) when idle, and it's
similarly problematic when not idle.

I think it's crazy to do a safe tripled readout (with *very* expensive I/O!)
of the PM timer unconditionally on *all* systems when only a
(albeit not that small) subset of systems is affected by buggy (un-latched)
PM timers.
I want to improve things there; I can see three ways to do it:
a) maintain a blacklist (or probably better a whitelist) of systems that
are (not) affected
b) detect long-time timer accuracy, then switch to fast readout if timer
is verified to be accurate (no white/blacklist needed this way)
c) give up on PM timer completely

Any comments on which way and how this could/should be done?

> Would a patch removing the link to EMBEDDED and adding a warning that while
> this timer is of high quality, it is slow, be welcome?
I don't want to comment on the first part, but adding a note to the Kconfig
text would be very useful, I think.

Andreas Mohr

2006-03-20 15:40:27

by Con Kolivas

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

On Tuesday 21 March 2006 01:50, Andreas Mohr wrote:
> Hi,
>
> On Mon, Mar 20, 2006 at 01:24:49PM +0100, bert hubert wrote:
> > Yesterday, together with Zwane, I discovered each gettimeofday call costs
> > me 4 usec on some boxes and almost nothing on others. We did a fruitless
> > chase for vsyscall/sysenter happening but the problem turned out to be
> > CONFIG_X86_PM_TIMER.
> >
> > This problem has been discussed before
> > http://www.ussg.iu.edu/hypermail/linux/kernel/0411.1/2135.html
> >
> > Not only is the pm timer slow by design, it also needs to be read
> > multiple times to work around a bug in certain hardware.
>
> I've been realizing this, too, when looking at some oprofile logs.
> pmtmr reading uses almost 3% CPU time (e.g. P3/700) when idle, and it's
> similarly problematic when not idle.
>
> I think it's crazy to do a safe tripled readout (with *very* expensive
> I/O!) of the PM timer unconditionally on *all* systems when only a
> (albeit not that small) subset of systems is affected by buggy (un-latched)
> PM timers.
> I want to improve things there; I can see three ways to do it:
> a) maintain a blacklist (or probably better a whitelist) of systems that
> are (not) affected
> b) detect long-time timer accuracy, then switch to fast readout if timer
> is verified to be accurate (no white/blacklist needed this way)
> c) give up on PM timer completely
>
> Any comments on which way and how this could/should be done?

The pm timer is very fast when the timer is latched and that strange loop uses
hardly any cpu time. The same can't be said about the unlatched timer case
where absurd amounts of cpu seem the norm. You have a catch 22 situation if
you depend on the accuracy of the pm timer only to end up wasting time trying
to actually use that timer.

Currently if you compile in pmtimer it is used by default. Perhaps when it's
detected that the timer is unlatched it should actually be used as a last
resort when all other timers have failed or it has been specified by
bootparam rather than the default timer. I figured having separate functions
for latched and unlatched timers would make more sense so that hardware with
good timers aren't unfairly disadvantaged by reading the time 2 more times
than necessary.

Cheers,
Con

2006-03-21 01:27:06

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

Con Kolivas <[email protected]> writes:

> On Tuesday 21 March 2006 01:50, Andreas Mohr wrote:
>
>> I think it's crazy to do a safe tripled readout (with *very* expensive
>> I/O!) of the PM timer unconditionally on *all* systems when only a
>> (albeit not that small) subset of systems is affected by buggy (un-latched)
>> PM timers.
>> I want to improve things there; I can see three ways to do it:
>> a) maintain a blacklist (or probably better a whitelist) of systems that
>> are (not) affected
>> b) detect long-time timer accuracy, then switch to fast readout if timer
>> is verified to be accurate (no white/blacklist needed this way)
>> c) give up on PM timer completely
>>
>> Any comments on which way and how this could/should be done?
>
> The pm timer is very fast when the timer is latched and that strange loop uses
> hardly any cpu time. The same can't be said about the unlatched timer case
> where absurd amounts of cpu seem the norm. You have a catch 22 situation if
> you depend on the accuracy of the pm timer only to end up wasting time trying
> to actually use that timer.

Actually, pmtmr not seems very fast, rather it's slow like usual I/O port.

Simple userland test

rdtsc(l1, h1);
val = inl(0x808); /* 0x808 is PMTMR address */
asm volatile("lfence");
rdtsc(l2, h2);

is about 1us.

And the following is test of gettimeofday(). Probably, we need a patch. Umm....

2.6.16 (pmtmr)
Simple gettimeofday: 3.6532 microseconds
Simple gettimeofday: 3.6502 microseconds
Simple gettimeofday: 3.6522 microseconds
Simple gettimeofday: 3.6486 microseconds
Simple gettimeofday: 3.6539 microseconds

2.6.16+patch (pmtmr)
Simple gettimeofday: 1.4582 microseconds
Simple gettimeofday: 1.4593 microseconds
Simple gettimeofday: 1.4671 microseconds
Simple gettimeofday: 1.4650 microseconds

2.6.16 (tsc)
Simple gettimeofday: 0.4037 microseconds
Simple gettimeofday: 0.4037 microseconds
Simple gettimeofday: 0.4040 microseconds
Simple gettimeofday: 0.4037 microseconds
Simple gettimeofday: 0.4038 microseconds
--
OGAWA Hirofumi <[email protected]>


Signed-off-by: OGAWA Hirofumi <[email protected]>
---

arch/i386/kernel/timers/timer_pm.c | 74 ++++++++++++++++++++++++++++++-------
1 file changed, 60 insertions(+), 14 deletions(-)

diff -puN arch/i386/kernel/timers/timer_pm.c~pm-kill-workaround arch/i386/kernel/timers/timer_pm.c
--- linux-2.6/arch/i386/kernel/timers/timer_pm.c~pm-kill-workaround 2006-03-21 04:20:27.000000000 +0900
+++ linux-2.6-hirofumi/arch/i386/kernel/timers/timer_pm.c 2006-03-21 04:31:48.000000000 +0900
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/device.h>
#include <linux/init.h>
+#include <linux/pci.h>
#include <asm/types.h>
#include <asm/timer.h>
#include <asm/smp.h>
@@ -35,6 +36,7 @@
* in arch/i386/acpi/boot.c */
u32 pmtmr_ioport = 0;

+static int pmtmr_need_workaround __read_mostly = 1;

/* value of the Power timer at last timer interrupt */
static u32 offset_tick;
@@ -45,24 +47,68 @@ static seqlock_t monotonic_lock = SEQLOC

#define ACPI_PM_MASK 0xFFFFFF /* limit it to 24 bits */

+#ifdef CONFIG_PCI
+/*
+ * PIIX4 Errata:
+ *
+ * The power management timer may return improper result when read.
+ * Although the timer value settles properly after incrementing,
+ * while incrementing there is a 3 ns window every 69.8 ns where the
+ * timer value is indeterminate (a 4.2% chance that the data will be
+ * incorrect when read). As a result, the ACPI free running count up
+ * timer specification is violated due to erroneous reads.
+ */
+static int __init pmtmr_bug_check(void)
+{
+ struct pci_dev *dev;
+ int pmtmr_has_bug = 0;
+ u8 rev;
+
+ dev = pci_get_device(PCI_VENDOR_ID_INTEL,
+ PCI_DEVICE_ID_INTEL_82371AB_3, NULL);
+ if (dev) {
+ pci_read_config_byte(dev, PCI_REVISION_ID, &rev);
+ /* the bug has been fixed in PIIX4M */
+ if (rev < 3)
+ pmtmr_has_bug = 1;
+ pci_dev_put(dev);
+ }
+
+ if (pmtmr_has_bug)
+ printk(KERN_INFO
+ "*** Found PM-Timer Bug on this chip. For workarounds a bug, this timer\n"
+ "*** source is slow. Use other timer source (clock=).\n");
+ else
+ pmtmr_need_workaround = 0;
+
+ return 0;
+}
+device_initcall(pmtmr_bug_check);
+#endif
+
/*helper function to safely read acpi pm timesource*/
static inline u32 read_pmtmr(void)
{
- u32 v1=0,v2=0,v3=0;
- /* It has been reported that because of various broken
- * chipsets (ICH4, PIIX4 and PIIX4E) where the ACPI PM time
- * source is not latched, so you must read it multiple
- * times to insure a safe value is read.
- */
- do {
- v1 = inl(pmtmr_ioport);
- v2 = inl(pmtmr_ioport);
- v3 = inl(pmtmr_ioport);
- } while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
- || (v3 > v1 && v3 < v2));
+ if (unlikely(pmtmr_need_workaround)) {
+ u32 v1, v2, v3;
+
+ /* It has been reported that because of various broken
+ * chipsets (ICH4, PIIX4 and PIIX4E) where the ACPI PM time
+ * source is not latched, so you must read it multiple
+ * times to insure a safe value is read.
+ */
+ do {
+ v1 = inl(pmtmr_ioport);
+ v2 = inl(pmtmr_ioport);
+ v3 = inl(pmtmr_ioport);
+ } while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
+ || (v3 > v1 && v3 < v2));
+
+ /* mask the output to 24 bits */
+ return v2 & ACPI_PM_MASK;
+ }

- /* mask the output to 24 bits */
- return v2 & ACPI_PM_MASK;
+ return inl(pmtmr_ioport) & ACPI_PM_MASK;
}


_

2006-03-21 01:49:54

by Con Kolivas

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

Quoting OGAWA Hirofumi <[email protected]>:

> Con Kolivas <[email protected]> writes:
> > The pm timer is very fast when the timer is latched and that strange loop
> uses
> > hardly any cpu time. The same can't be said about the unlatched timer case
>
> > where absurd amounts of cpu seem the norm. You have a catch 22 situation if
>
> > you depend on the accuracy of the pm timer only to end up wasting time
> trying
> > to actually use that timer.
>
> Actually, pmtmr not seems very fast, rather it's slow like usual I/O port.

What I mean is that I've seen profiles of the worst (from Andi) showing up to 5%
cpu time on some workloads! That's a heck of a lot slower than when it's
latched.

> is about 1us.
>
> And the following is test of gettimeofday(). Probably, we need a patch.
> Umm....
>
> 2.6.16 (pmtmr)
> Simple gettimeofday: 3.6532 microseconds


> 2.6.16+patch (pmtmr)
> Simple gettimeofday: 1.4582 microseconds

Looks well worth it

> 2.6.16 (tsc)
> Simple gettimeofday: 0.4037 microseconds


> Signed-off-by: OGAWA Hirofumi <[email protected]>

Thanks!

> + if (unlikely(pmtmr_need_workaround)) {

I would not put this in an unlikely because on the machines where
pmtmr_need_workaround is true this will always be true.

Cheers,
Con

2006-03-21 02:59:38

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

[email protected] writes:

> Quoting OGAWA Hirofumi <[email protected]>:
>
> What I mean is that I've seen profiles of the worst (from Andi) showing up to 5%
> cpu time on some workloads! That's a heck of a lot slower than when it's
> latched.

I see. Thanks.

>> + if (unlikely(pmtmr_need_workaround)) {
>
> I would not put this in an unlikely because on the machines where
> pmtmr_need_workaround is true this will always be true.

Yes. However, if machines uses buggy chip, I guessed TSC/PIT would be
more proper as time source. But probably you are right, timer_pit.c
seems more slow usually (it uses many I/O port).

I'll remove unlikely(), and also will remove "Use other timer source"
from warning.

BTW, this patch is still quick hack.

At least, we would need to check the ICH4 which says in comment.
However, I couldn't find the PM-Timer Errata in ICH4 spec update.

Do you/anyone know about a ICH4 error?
--
OGAWA Hirofumi <[email protected]>

2006-03-21 03:09:28

by Con Kolivas

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

On Tue, 21 Mar 2006 01:59 pm, OGAWA Hirofumi wrote:
> Yes. However, if machines uses buggy chip, I guessed TSC/PIT would be
> more proper as time source.

Oh yes but there has been an epidemic of timer problems (fast/slow, lost ticks
etc) lately meaning the pm timer is being relied upon more and more.

> But probably you are right, timer_pit.c
> seems more slow usually (it uses many I/O port).
>
> I'll remove unlikely(), and also will remove "Use other timer source"
> from warning.

Suggesting another timer source is ok in the warning I believe given massive
amounts of wasted cpu.

> BTW, this patch is still quick hack.

Understood. Perhaps having an indirect function call set to either
good_pmtmr() or bad_pmtmr() after checking would be preferable to a variable
that is checked on each function call despite never changing.

> At least, we would need to check the ICH4 which says in comment.
> However, I couldn't find the PM-Timer Errata in ICH4 spec update.
>
> Do you/anyone know about a ICH4 error?

Not personally but my ICH4 pm timer seems to work very well whereas Andi's
apparently similar chipset exhibits terrible problems.

Cheers,
Con

2006-03-21 05:33:09

by Albert Cahalan

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

bert hubert writes:

> Not only is the pm timer slow by design, it also needs to be read
> multiple times to work around a bug in certain hardware.
>
> What is new is that this option is now dependent on CONFIG_EMBEDDED.
> Unless you select this option, the PM Timer will always be used.

Good. Fast but horribly buggy is NOT acceptable as a default.
Go ahead and tweak the settings on YOUR system. Run Gentoo.

I wasted a lot of time trying to figure out why my clock ran
about 1.414 (yes, square root of two) times faster than proper.
It turns out that a normal dual-core AMD box has TSCs that run
at randomly varying and fully independant rates. Before I found
a post to linux-kernel by an AMD employee explaining this, my
thought was that I was probably missing clock ticks from the
BIOS going into SMM mode. That was the problem the previous time
I had clock problems. (so I reduced HZ to 100, etc.)

I'm sure lots of people would never figure out why their clock
is running 41.4% faster than normal. They'd live with it, using
a physical clock mounted on the wall, or switch to a different OS.

2006-03-21 08:54:00

by Andreas Mohr

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

Hi,

On Tue, Mar 21, 2006 at 02:09:50PM +1100, Con Kolivas wrote:
> On Tue, 21 Mar 2006 01:59 pm, OGAWA Hirofumi wrote:
> > Yes. However, if machines uses buggy chip, I guessed TSC/PIT would be
> > more proper as time source.
>
> Oh yes but there has been an epidemic of timer problems (fast/slow, lost ticks
> etc) lately meaning the pm timer is being relied upon more and more.

I think it's reasonable to question whether to use unlikely or not,
but IMHO omitting unlikely here will not reward well-behaving systems and
not punish buggy systems, and this doesn't seem quite right from an
evolutionary point of view (especially since I'd guess the number of buggy
chipsets is much lower than the number of working chipsets possibly
also from vendors other than Intel - famous last words...).
OTOH by not using the likely/unlikely branch prediction skew we may gain
higher combined (buggy systems *plus* good systems) performance.
But since I/O access time (especially in tripled form!) is **MUCH** higher
than any branching delay judging from my profiling experiments, I'd vote
for keeping the unlikely anyway.


BTW, my original patch (not published) evaluated timer performance over e.g.
10000 cycles and then (if positive) switched a *function pointer* over to the
"good" function.
So the buggy/non-buggy check inside the function could also be done via
function pointers, but this seems much more costly since it wastes kernel
image space for two functions with two stack frame setups etc. pp.

Thanks go to OGAWA for actually writing the patch with a surprisingly easy
buggy chipset check! (PCI revision etc.)

> > I'll remove unlikely(), and also will remove "Use other timer source"
> > from warning.
>
> Suggesting another timer source is ok in the warning I believe given massive
> amounts of wasted cpu.

Yes, please keep the warning (and make sure there's a part mentioning
"very costly for certain buggy Intel chipsets" or so).

> > BTW, this patch is still quick hack.
>
> Understood. Perhaps having an indirect function call set to either
> good_pmtmr() or bad_pmtmr() after checking would be preferable to a variable
> that is checked on each function call despite never changing.

Oh, there you even mention it ;) However due to my reasons given above
(and the fact that invoking a function pointer should be similarly expensive
to a conditional) I don't think it's useful.

> > At least, we would need to check the ICH4 which says in comment.
> > However, I couldn't find the PM-Timer Errata in ICH4 spec update.
> >
> > Do you/anyone know about a ICH4 error?
>
> Not personally but my ICH4 pm timer seems to work very well whereas Andi's
> apparently similar chipset exhibits terrible problems.

Are you referring to dynticks trouble with my P4HT ("ICH5/ICH5R"!!) here?
I don't think there were any problems due to non-latchedness, jus the usual
batch of slowness issues.

00:00.0 Host bridge: Intel Corp. 82865G/PE/P DRAM Controller/Host-Hub Interface (rev 02)
00:02.0 VGA compatible controller: Intel Corp. 82865G Integrated Graphics Device (rev 02)
00:1d.0 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI #3 (rev 02)
00:1d.7 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corp. 82801 PCI Bridge (rev c2)
00:1f.0 ISA bridge: Intel Corp. 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corp. 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02)
00:1f.2 IDE interface: Intel Corp. 82801EB (ICH5) SATA Controller (rev 02)
00:1f.3 SMBus: Intel Corp. 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller (rev 02)

Andreas Mohr

--
No programming skills!? Why not help translate many Linux applications!
https://launchpad.ubuntu.com/rosetta

2006-03-21 09:06:35

by Arjan van de Ven

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

On Tue, 2006-03-21 at 09:53 +0100, Andreas Mohr wrote:
> Hi,
>
> On Tue, Mar 21, 2006 at 02:09:50PM +1100, Con Kolivas wrote:
> > On Tue, 21 Mar 2006 01:59 pm, OGAWA Hirofumi wrote:
> > > Yes. However, if machines uses buggy chip, I guessed TSC/PIT would be
> > > more proper as time source.
> >
> > Oh yes but there has been an epidemic of timer problems (fast/slow, lost ticks
> > etc) lately meaning the pm timer is being relied upon more and more.
>
> I think it's reasonable to question whether to use unlikely or not,
> but IMHO omitting unlikely here will not reward well-behaving systems and
> not punish buggy systems, and this doesn't seem quite right from an
> evolutionary point of view

rdtsc is not reliable for any SMP system or any system doing frequency
scaling or C3 state power saving states.

(exception is newest generation processors where that appears to be
changing now)


You can say "but it appears to work on my SMP system".. but are they
still synced after 200 days of uptime? or are they skewed by then by too
much.

2006-03-21 11:59:17

by Con Kolivas

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

On Tuesday 21 March 2006 19:53, Andreas Mohr wrote:
> (and the fact that invoking a function pointer should be similarly
> expensive to a conditional) I don't think it's useful.

Is

*blah();

as expensive as

if (conditional)
blah();

I don't know the answer. I just know cmp is expensive. Comments?

Cheers,
Con

2006-03-21 12:04:51

by Arjan van de Ven

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

On Tue, 2006-03-21 at 22:58 +1100, Con Kolivas wrote:
> On Tuesday 21 March 2006 19:53, Andreas Mohr wrote:
> > (and the fact that invoking a function pointer should be similarly
> > expensive to a conditional) I don't think it's useful.
>
> Is
>
> *blah();
>
> as expensive as
>
> if (conditional)
> blah();
>
> I don't know the answer. I just know cmp is expensive. Comments?

function pointer is usually MORE expensive.
for if() the processor has a change to predict the branch right, while
call <register> (which is what function pointer calls end up being) are
basically always mispredicted unless you have a really really fancy
branch predictor...

2006-03-21 12:08:54

by Con Kolivas

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

On Tuesday 21 March 2006 23:04, Arjan van de Ven wrote:
> On Tue, 2006-03-21 at 22:58 +1100, Con Kolivas wrote:
> > On Tuesday 21 March 2006 19:53, Andreas Mohr wrote:
> > > (and the fact that invoking a function pointer should be similarly
> > > expensive to a conditional) I don't think it's useful.
> >
> > Is
> >
> > *blah();
> >
> > as expensive as
> >
> > if (conditional)
> > blah();
> >
> > I don't know the answer. I just know cmp is expensive. Comments?
>
> function pointer is usually MORE expensive.
> for if() the processor has a change to predict the branch right, while
> call <register> (which is what function pointer calls end up being) are
> basically always mispredicted unless you have a really really fancy
> branch predictor...

Thanks! That's something I've been trying to find good info on.

Cheers,
Con

2006-03-21 19:23:14

by john stultz

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

On Tue, 2006-03-21 at 10:26 +0900, OGAWA Hirofumi wrote:
> Con Kolivas <[email protected]> writes:
>
> > On Tuesday 21 March 2006 01:50, Andreas Mohr wrote:
> >
> >> I think it's crazy to do a safe tripled readout (with *very* expensive
> >> I/O!) of the PM timer unconditionally on *all* systems when only a
> >> (albeit not that small) subset of systems is affected by buggy (un-latched)
> >> PM timers.
> >> I want to improve things there; I can see three ways to do it:
> >> a) maintain a blacklist (or probably better a whitelist) of systems that
> >> are (not) affected
> >> b) detect long-time timer accuracy, then switch to fast readout if timer
> >> is verified to be accurate (no white/blacklist needed this way)
> >> c) give up on PM timer completely
> >>
> >> Any comments on which way and how this could/should be done?
> >
> > The pm timer is very fast when the timer is latched and that strange loop uses
> > hardly any cpu time. The same can't be said about the unlatched timer case
> > where absurd amounts of cpu seem the norm. You have a catch 22 situation if
> > you depend on the accuracy of the pm timer only to end up wasting time trying
> > to actually use that timer.
[snip]
> And the following is test of gettimeofday(). Probably, we need a patch. Umm....
>
> 2.6.16 (pmtmr)
> Simple gettimeofday: 3.6532 microseconds
> Simple gettimeofday: 3.6502 microseconds
> Simple gettimeofday: 3.6522 microseconds
> Simple gettimeofday: 3.6486 microseconds
> Simple gettimeofday: 3.6539 microseconds
>
> 2.6.16+patch (pmtmr)
> Simple gettimeofday: 1.4582 microseconds
> Simple gettimeofday: 1.4593 microseconds
> Simple gettimeofday: 1.4671 microseconds
> Simple gettimeofday: 1.4650 microseconds
>
> 2.6.16 (tsc)
> Simple gettimeofday: 0.4037 microseconds
> Simple gettimeofday: 0.4037 microseconds
> Simple gettimeofday: 0.4040 microseconds
> Simple gettimeofday: 0.4037 microseconds
> Simple gettimeofday: 0.4038 microseconds
> --
> OGAWA Hirofumi <[email protected]>
>
>
> Signed-off-by: OGAWA Hirofumi <[email protected]>

In my TOD rework I've dropped the triple read, figuring if a problem
arose we could blacklist the specific box. This patch covers that, so it
looks like a good idea to me.

I've not tested it myself, but if you feel good about it, please send it
to Andrew.

thanks
-john

Acked-by: John Stultz <[email protected]>

> ---
>
> arch/i386/kernel/timers/timer_pm.c | 74 ++++++++++++++++++++++++++++++-------
> 1 file changed, 60 insertions(+), 14 deletions(-)
>
> diff -puN arch/i386/kernel/timers/timer_pm.c~pm-kill-workaround arch/i386/kernel/timers/timer_pm.c
> --- linux-2.6/arch/i386/kernel/timers/timer_pm.c~pm-kill-workaround 2006-03-21 04:20:27.000000000 +0900
> +++ linux-2.6-hirofumi/arch/i386/kernel/timers/timer_pm.c 2006-03-21 04:31:48.000000000 +0900
> @@ -15,6 +15,7 @@
> #include <linux/module.h>
> #include <linux/device.h>
> #include <linux/init.h>
> +#include <linux/pci.h>
> #include <asm/types.h>
> #include <asm/timer.h>
> #include <asm/smp.h>
> @@ -35,6 +36,7 @@
> * in arch/i386/acpi/boot.c */
> u32 pmtmr_ioport = 0;
>
> +static int pmtmr_need_workaround __read_mostly = 1;
>
> /* value of the Power timer at last timer interrupt */
> static u32 offset_tick;
> @@ -45,24 +47,68 @@ static seqlock_t monotonic_lock = SEQLOC
>
> #define ACPI_PM_MASK 0xFFFFFF /* limit it to 24 bits */
>
> +#ifdef CONFIG_PCI
> +/*
> + * PIIX4 Errata:
> + *
> + * The power management timer may return improper result when read.
> + * Although the timer value settles properly after incrementing,
> + * while incrementing there is a 3 ns window every 69.8 ns where the
> + * timer value is indeterminate (a 4.2% chance that the data will be
> + * incorrect when read). As a result, the ACPI free running count up
> + * timer specification is violated due to erroneous reads.
> + */
> +static int __init pmtmr_bug_check(void)
> +{
> + struct pci_dev *dev;
> + int pmtmr_has_bug = 0;
> + u8 rev;
> +
> + dev = pci_get_device(PCI_VENDOR_ID_INTEL,
> + PCI_DEVICE_ID_INTEL_82371AB_3, NULL);
> + if (dev) {
> + pci_read_config_byte(dev, PCI_REVISION_ID, &rev);
> + /* the bug has been fixed in PIIX4M */
> + if (rev < 3)
> + pmtmr_has_bug = 1;
> + pci_dev_put(dev);
> + }
> +
> + if (pmtmr_has_bug)
> + printk(KERN_INFO
> + "*** Found PM-Timer Bug on this chip. For workarounds a bug, this timer\n"
> + "*** source is slow. Use other timer source (clock=).\n");
> + else
> + pmtmr_need_workaround = 0;
> +
> + return 0;
> +}
> +device_initcall(pmtmr_bug_check);
> +#endif
> +
> /*helper function to safely read acpi pm timesource*/
> static inline u32 read_pmtmr(void)
> {
> - u32 v1=0,v2=0,v3=0;
> - /* It has been reported that because of various broken
> - * chipsets (ICH4, PIIX4 and PIIX4E) where the ACPI PM time
> - * source is not latched, so you must read it multiple
> - * times to insure a safe value is read.
> - */
> - do {
> - v1 = inl(pmtmr_ioport);
> - v2 = inl(pmtmr_ioport);
> - v3 = inl(pmtmr_ioport);
> - } while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
> - || (v3 > v1 && v3 < v2));
> + if (unlikely(pmtmr_need_workaround)) {
> + u32 v1, v2, v3;
> +
> + /* It has been reported that because of various broken
> + * chipsets (ICH4, PIIX4 and PIIX4E) where the ACPI PM time
> + * source is not latched, so you must read it multiple
> + * times to insure a safe value is read.
> + */
> + do {
> + v1 = inl(pmtmr_ioport);
> + v2 = inl(pmtmr_ioport);
> + v3 = inl(pmtmr_ioport);
> + } while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
> + || (v3 > v1 && v3 < v2));
> +
> + /* mask the output to 24 bits */
> + return v2 & ACPI_PM_MASK;
> + }
>
> - /* mask the output to 24 bits */
> - return v2 & ACPI_PM_MASK;
> + return inl(pmtmr_ioport) & ACPI_PM_MASK;
> }
>
>
> _
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2006-03-21 19:34:46

by john stultz

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

On Mon, 2006-03-20 at 13:24 +0100, bert hubert wrote:
> Yesterday, together with Zwane, I discovered each gettimeofday call costs me
> 4 usec on some boxes and almost nothing on others. We did a fruitless chase
> for vsyscall/sysenter happening but the problem turned out to be
> CONFIG_X86_PM_TIMER.
>
> This problem has been discussed before
> http://www.ussg.iu.edu/hypermail/linux/kernel/0411.1/2135.html
>
> Not only is the pm timer slow by design, it also needs to be read multiple
> times to work around a bug in certain hardware.
>
> What is new is that this option is now dependent on CONFIG_EMBEDDED. Unless
> you select this option, the PM Timer will always be used.
>
> Would a patch removing the link to EMBEDDED and adding a warning that while
> this timer is of high quality, it is slow, be welcome?

I think Ogawa's patch is the right solution for dropping the triple
read, which should help a good bit.

If you *really* are sure the TSC is usable on your system, you can
override it w/ clock=tsc. Warning users that the clock is slow without
giving them a way to know if the TSC is usable will likely just cause
more problem reports. And hey, its better then using the PIT :)

thanks
-john

2006-03-21 21:19:32

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

john stultz <[email protected]> writes:

> In my TOD rework I've dropped the triple read, figuring if a problem
> arose we could blacklist the specific box. This patch covers that, so it
> looks like a good idea to me.
>
> I've not tested it myself, but if you feel good about it, please send it
> to Andrew.

Current patch is the following. If I'm missing something, or you have
some comment, please tell me. (Since I don't have ICH4, ICH4 detection
is untested)

Thanks.
--
OGAWA Hirofumi <[email protected]>



Current timer_pm.c reads I/O port triple times, in order to avoid the
bug of chipset. But I/O port is slow.

2.6.16 (pmtmr)
Simple gettimeofday: 3.6532 microseconds

2.6.16+patch (pmtmr)
Simple gettimeofday: 1.4582 microseconds

[if chip is buggy, probably it will be 7us or more in 4.2% of probability.]


This patch adds blacklist of buggy chip, and if chip is not buggy,
this uses fast normal version instead of slow workaround version.

If chip is buggy, warnings "pmtmr is slow". But sounds like there is
gray zone. I found the PIIX4 errata, but I couldn't find the ICH4
errata. But some motherboard seems to have problem.

So, if found a ICH4, also warnings, and uses a workaround version. If
user's ICH4 is good actually, user can specify the "pmtmr_good" boot
parameter to use fast version.


Acked-by: John Stultz <[email protected]>
Signed-off-by: OGAWA Hirofumi <[email protected]>
---

arch/i386/kernel/timers/timer_pm.c | 99 +++++++++++++++++++++++++++++++------
1 file changed, 84 insertions(+), 15 deletions(-)

diff -puN arch/i386/kernel/timers/timer_pm.c~pm-kill-workaround arch/i386/kernel/timers/timer_pm.c
--- linux-2.6/arch/i386/kernel/timers/timer_pm.c~pm-kill-workaround 2006-03-22 06:09:15.000000000 +0900
+++ linux-2.6-hirofumi/arch/i386/kernel/timers/timer_pm.c 2006-03-22 06:09:23.000000000 +0900
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/device.h>
#include <linux/init.h>
+#include <linux/pci.h>
#include <asm/types.h>
#include <asm/timer.h>
#include <asm/smp.h>
@@ -35,7 +36,6 @@
* in arch/i386/acpi/boot.c */
u32 pmtmr_ioport = 0;

-
/* value of the Power timer at last timer interrupt */
static u32 offset_tick;
static u32 offset_delay;
@@ -45,24 +45,93 @@ static seqlock_t monotonic_lock = SEQLOC

#define ACPI_PM_MASK 0xFFFFFF /* limit it to 24 bits */

+static int pmtmr_need_workaround __read_mostly = 1;
+
+static int __init pmtr_good_setup(char *__str)
+{
+ pmtmr_need_workaround = 0;
+ return 0;
+}
+__setup("pmtmr_good", pmtr_good_setup);
+
+#ifdef CONFIG_PCI
+/*
+ * PIIX4 Errata:
+ *
+ * The power management timer may return improper result when read.
+ * Although the timer value settles properly after incrementing,
+ * while incrementing there is a 3 ns window every 69.8 ns where the
+ * timer value is indeterminate (a 4.2% chance that the data will be
+ * incorrect when read). As a result, the ACPI free running count up
+ * timer specification is violated due to erroneous reads.
+ */
+static int __init pmtmr_bug_check(void)
+{
+ const struct pci_device_id gray_list[] = {
+ /* these chipsets may have bug. */
+ { PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801DB_0) },
+ { },
+ };
+ struct pci_dev *dev;
+ int pmtmr_has_bug = 0;
+ u8 rev;
+
+ if (!pmtmr_need_workaround)
+ return 0;
+
+ dev = pci_get_device(PCI_VENDOR_ID_INTEL,
+ PCI_DEVICE_ID_INTEL_82371AB_3, NULL);
+ if (dev) {
+ pci_read_config_byte(dev, PCI_REVISION_ID, &rev);
+ /* the bug has been fixed in PIIX4M */
+ if (rev < 3) {
+ printk(KERN_WARNING
+ "* Found PM-Timer Bug on this chip. For workarounds a bug, this timer\n"
+ "* source is slow. Other timer source may be proper (clock=)\n");
+ pmtmr_has_bug = 1;
+ }
+ pci_dev_put(dev);
+ }
+
+ if (pci_dev_present(gray_list)) {
+ printk(KERN_WARNING
+ "* This chipset may have PM-Timer Bug, For workarounds a bug,\n"
+ "* this timer source is slow. If you are sure, please use \"pmtmr_good\"\n"
+ "* for disabling the workaround\n");
+ pmtmr_has_bug = 1;
+ }
+
+ if (!pmtmr_has_bug)
+ pmtmr_need_workaround = 0;
+
+ return 0;
+}
+device_initcall(pmtmr_bug_check);
+#endif
+
/*helper function to safely read acpi pm timesource*/
static inline u32 read_pmtmr(void)
{
- u32 v1=0,v2=0,v3=0;
- /* It has been reported that because of various broken
- * chipsets (ICH4, PIIX4 and PIIX4E) where the ACPI PM time
- * source is not latched, so you must read it multiple
- * times to insure a safe value is read.
- */
- do {
- v1 = inl(pmtmr_ioport);
- v2 = inl(pmtmr_ioport);
- v3 = inl(pmtmr_ioport);
- } while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
- || (v3 > v1 && v3 < v2));
+ if (pmtmr_need_workaround) {
+ u32 v1, v2, v3;
+
+ /* It has been reported that because of various broken
+ * chipsets (ICH4, PIIX4 and PIIX4E) where the ACPI PM time
+ * source is not latched, so you must read it multiple
+ * times to insure a safe value is read.
+ */
+ do {
+ v1 = inl(pmtmr_ioport);
+ v2 = inl(pmtmr_ioport);
+ v3 = inl(pmtmr_ioport);
+ } while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
+ || (v3 > v1 && v3 < v2));
+
+ /* mask the output to 24 bits */
+ return v2 & ACPI_PM_MASK;
+ }

- /* mask the output to 24 bits */
- return v2 & ACPI_PM_MASK;
+ return inl(pmtmr_ioport) & ACPI_PM_MASK;
}


_

2006-03-22 00:21:04

by Con Kolivas

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

On Wed, 22 Mar 2006 08:19 am, OGAWA Hirofumi wrote:
> john stultz <[email protected]> writes:
> > In my TOD rework I've dropped the triple read, figuring if a problem
> > arose we could blacklist the specific box. This patch covers that, so it
> > looks like a good idea to me.
> >
> > I've not tested it myself, but if you feel good about it, please send it
> > to Andrew.
>
> Current patch is the following. If I'm missing something, or you have
> some comment, please tell me. (Since I don't have ICH4, ICH4 detection
> is untested)
>
> Thanks.

Looks good. Just some minor grammar comments

+ * The power management timer may return improper result when read.

Change to "may return an improper result" or "may return improper results"

+ printk(KERN_WARNING
+ "* Found PM-Timer Bug on this chip. For
workarounds a bug, this timer\n"
+ "* source is slow. Other timer source may be
proper (clock=)\n");

Change "Other timer source may be proper" to "Consider trying other timer
sources"


+ "* This chipset may have PM-Timer Bug, For workarounds
a bug,\n"
+ "* this timer source is slow. If you are sure, please
use \"pmtmr_good\"\n"
+ "* for disabling the workaround\n");


Change "For workarounds a bug" to "Due to workarounds for a bug"
Change "If you are sure" to "If you are sure your timer does not have this
bug"
Change "for disabling the workaround" to "to disable the workaround"

Cheers,
Con

2006-03-22 18:49:36

by OGAWA Hirofumi

[permalink] [raw]
Subject: [PATCH] PM-Timer: doesn't use workaround if chipset is not buggy

Con Kolivas <[email protected]> writes:

> Looks good. Just some minor grammar comments

Thanks a lot for doing it. Fixed.

And finally, I added "if (cur_timer != &timer_pmtmr)" to pmtmr_bug_check()
to avoid incorrect warning.
--
OGAWA Hirofumi <[email protected]>


Current timer_pm.c reads I/O port triple times, in order to avoid the
bug of chipset. But I/O port is slow.

2.6.16 (pmtmr)
Simple gettimeofday: 3.6532 microseconds

2.6.16+patch (pmtmr)
Simple gettimeofday: 1.4582 microseconds

[if chip is buggy, probably it will be 7us or more in 4.2% of probability.]


This patch adds blacklist of buggy chip, and if chip is not buggy,
this uses fast normal version instead of slow workaround version.

If chip is buggy, warnings "pmtmr is slow". But sounds like there is
gray zone. I found the PIIX4 errata, but I couldn't find the ICH4
errata. But some motherboard seems to have problem.

So, if found a ICH4, also warnings, and uses a workaround version. If
user's ICH4 is good actually, user can specify the "pmtmr_good" boot
parameter to use fast version.


Acked-by: John Stultz <[email protected]>
Signed-off-by: OGAWA Hirofumi <[email protected]>
---

arch/i386/kernel/timers/timer_pm.c | 98 +++++++++++++++++++++++++++++++------
1 file changed, 84 insertions(+), 14 deletions(-)

diff -puN arch/i386/kernel/timers/timer_pm.c~pm-kill-workaround arch/i386/kernel/timers/timer_pm.c
--- linux-2.6/arch/i386/kernel/timers/timer_pm.c~pm-kill-workaround 2006-03-23 01:29:42.000000000 +0900
+++ linux-2.6-hirofumi/arch/i386/kernel/timers/timer_pm.c 2006-03-23 01:51:01.000000000 +0900
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/device.h>
#include <linux/init.h>
+#include <linux/pci.h>
#include <asm/types.h>
#include <asm/timer.h>
#include <asm/smp.h>
@@ -45,24 +46,31 @@ static seqlock_t monotonic_lock = SEQLOC

#define ACPI_PM_MASK 0xFFFFFF /* limit it to 24 bits */

+static int pmtmr_need_workaround __read_mostly = 1;
+
/*helper function to safely read acpi pm timesource*/
static inline u32 read_pmtmr(void)
{
- u32 v1=0,v2=0,v3=0;
- /* It has been reported that because of various broken
- * chipsets (ICH4, PIIX4 and PIIX4E) where the ACPI PM time
- * source is not latched, so you must read it multiple
- * times to insure a safe value is read.
- */
- do {
- v1 = inl(pmtmr_ioport);
- v2 = inl(pmtmr_ioport);
- v3 = inl(pmtmr_ioport);
- } while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
- || (v3 > v1 && v3 < v2));
+ if (pmtmr_need_workaround) {
+ u32 v1, v2, v3;
+
+ /* It has been reported that because of various broken
+ * chipsets (ICH4, PIIX4 and PIIX4E) where the ACPI PM time
+ * source is not latched, so you must read it multiple
+ * times to insure a safe value is read.
+ */
+ do {
+ v1 = inl(pmtmr_ioport);
+ v2 = inl(pmtmr_ioport);
+ v3 = inl(pmtmr_ioport);
+ } while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
+ || (v3 > v1 && v3 < v2));
+
+ /* mask the output to 24 bits */
+ return v2 & ACPI_PM_MASK;
+ }

- /* mask the output to 24 bits */
- return v2 & ACPI_PM_MASK;
+ return inl(pmtmr_ioport) & ACPI_PM_MASK;
}


@@ -263,6 +271,68 @@ struct init_timer_opts __initdata timer_
.opts = &timer_pmtmr,
};

+#ifdef CONFIG_PCI
+/*
+ * PIIX4 Errata:
+ *
+ * The power management timer may return improper results when read.
+ * Although the timer value settles properly after incrementing,
+ * while incrementing there is a 3 ns window every 69.8 ns where the
+ * timer value is indeterminate (a 4.2% chance that the data will be
+ * incorrect when read). As a result, the ACPI free running count up
+ * timer specification is violated due to erroneous reads.
+ */
+static int __init pmtmr_bug_check(void)
+{
+ static struct pci_device_id gray_list[] __initdata = {
+ /* these chipsets may have bug. */
+ { PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801DB_0) },
+ { },
+ };
+ struct pci_dev *dev;
+ int pmtmr_has_bug = 0;
+ u8 rev;
+
+ if (cur_timer != &timer_pmtmr || !pmtmr_need_workaround)
+ return 0;
+
+ dev = pci_get_device(PCI_VENDOR_ID_INTEL,
+ PCI_DEVICE_ID_INTEL_82371AB_3, NULL);
+ if (dev) {
+ pci_read_config_byte(dev, PCI_REVISION_ID, &rev);
+ /* the bug has been fixed in PIIX4M */
+ if (rev < 3) {
+ printk(KERN_WARNING
+ "* Found PM-Timer Bug on this chipset. Due to workarounds for a bug,\n"
+ "* this time source is slow. Consider trying other time sources (clock=)\n");
+ pmtmr_has_bug = 1;
+ }
+ pci_dev_put(dev);
+ }
+
+ if (pci_dev_present(gray_list)) {
+ printk(KERN_WARNING
+ "* This chipset may have PM-Timer Bug. Due to workarounds for a bug,\n"
+ "* this time source is slow. If you are sure your timer does not have\n"
+ "* this bug, please use \"pmtmr_good\" to disable the workaround\n");
+ pmtmr_has_bug = 1;
+ }
+
+ if (!pmtmr_has_bug)
+ pmtmr_need_workaround = 0;
+
+ return 0;
+}
+device_initcall(pmtmr_bug_check);
+#endif
+
+static int __init pmtr_good_setup(char *__str)
+{
+ pmtmr_need_workaround = 0;
+ return 0;
+}
+__setup("pmtmr_good", pmtr_good_setup);
+
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Dominik Brodowski <[email protected]>");
MODULE_DESCRIPTION("Power Management Timer (PMTMR) as primary timing source for x86");
_

2006-03-22 19:12:22

by Avi Kivity

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

OGAWA Hirofumi wrote:
> john stultz <[email protected]> writes:
>
>
>> In my TOD rework I've dropped the triple read, figuring if a problem
>> arose we could blacklist the specific box. This patch covers that, so it
>> looks like a good idea to me.
>>
>> I've not tested it myself, but if you feel good about it, please send it
>> to Andrew.
>>
>
> Current patch is the following. If I'm missing something, or you have
> some comment, please tell me. (Since I don't have ICH4, ICH4 detection
> is untested)
>
Doesn't it make sense to mark the port as user accessible in the I/O
permissions bitmap and export it as a vsyscall? that would save the
syscall overhead.


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2006-03-22 19:54:42

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

Avi Kivity <[email protected]> writes:

> OGAWA Hirofumi wrote:
>> Current patch is the following. If I'm missing something, or you have
>> some comment, please tell me. (Since I don't have ICH4, ICH4 detection
>> is untested)
>>
> Doesn't it make sense to mark the port as user accessible in the I/O
> permissions bitmap and export it as a vsyscall? that would save the
> syscall overhead.

Umm.. I don't know. In userland, can we make a stable gettimeofday()
from only PMTMR?
--
OGAWA Hirofumi <[email protected]>

2006-03-22 20:05:38

by john stultz

[permalink] [raw]
Subject: Re: gettimeofday order of magnitude slower with pmtimer, which is default

On Wed, 2006-03-22 at 21:12 +0200, Avi Kivity wrote:
> OGAWA Hirofumi wrote:
> > john stultz <[email protected]> writes:
> >
> >
> >> In my TOD rework I've dropped the triple read, figuring if a problem
> >> arose we could blacklist the specific box. This patch covers that, so it
> >> looks like a good idea to me.
> >>
> >> I've not tested it myself, but if you feel good about it, please send it
> >> to Andrew.
> >>
> >
> > Current patch is the following. If I'm missing something, or you have
> > some comment, please tell me. (Since I don't have ICH4, ICH4 detection
> > is untested)
> >
> Doesn't it make sense to mark the port as user accessible in the I/O
> permissions bitmap and export it as a vsyscall? that would save the
> syscall overhead.

i386 doesn't yet have a vsyscall gtod. I had some code around earlier
for it, but it still needed a good bit of work before it would be ready
for mainline. If you're interested in working on this, I'd be happy to
send them to you.

thanks
-john

2006-03-22 21:50:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] PM-Timer: doesn't use workaround if chipset is not buggy

OGAWA Hirofumi <[email protected]> wrote:
>
>
> ...
>
> This patch adds blacklist of buggy chip, and if chip is not buggy,
> this uses fast normal version instead of slow workaround version.
>
> If chip is buggy, warnings "pmtmr is slow". But sounds like there is
> gray zone. I found the PIIX4 errata, but I couldn't find the ICH4
> errata. But some motherboard seems to have problem.
>
> So, if found a ICH4, also warnings, and uses a workaround version. If
> user's ICH4 is good actually, user can specify the "pmtmr_good" boot
> parameter to use fast version.
>
> ...
>
> +static int __init pmtmr_bug_check(void)
> +{
> + static struct pci_device_id gray_list[] __initdata = {
> + /* these chipsets may have bug. */
> + { PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801DB_0) },
> + { },
> + };
>
> ...
>
> + dev = pci_get_device(PCI_VENDOR_ID_INTEL,
> + PCI_DEVICE_ID_INTEL_82371AB_3, NULL);
>
> ...
>
> +device_initcall(pmtmr_bug_check);

Can this code use the PCI quirk infrastructure?

2006-03-23 07:31:12

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH] PM-Timer: doesn't use workaround if chipset is not buggy

Andrew Morton <[email protected]> writes:

> OGAWA Hirofumi <[email protected]> wrote:
>>
>> + dev = pci_get_device(PCI_VENDOR_ID_INTEL,
>> + PCI_DEVICE_ID_INTEL_82371AB_3, NULL);
>>
>> ...
>>
>> +device_initcall(pmtmr_bug_check);
>
> Can this code use the PCI quirk infrastructure?

Yes. However, since we need to check there is _not_ those chipsets,
it's tricky. Probably it's a following or similar code, um.. IMHO it
seems not simple enough. Idea?


static void __devinit blacklist_check()
{
has_bug = 1;
}
DECLARE_PCI_FIXUP_EARLY(blacklist);

static void __devinit graylist_check()
{
has_bug = 2;
}
DECLARE_PCI_FIXUP_EARLY(graylist);

static int __init pmtmr_bug_check()
{
if (!has_bug)
pmtmr_need_workaround = 0;
else
...
}
late_initcall(pmtmr_bug_check);
--
OGAWA Hirofumi <[email protected]>

2006-03-23 07:53:03

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] PM-Timer: doesn't use workaround if chipset is not buggy

OGAWA Hirofumi <[email protected]> wrote:
>
> Andrew Morton <[email protected]> writes:
>
> > OGAWA Hirofumi <[email protected]> wrote:
> >>
> >> + dev = pci_get_device(PCI_VENDOR_ID_INTEL,
> >> + PCI_DEVICE_ID_INTEL_82371AB_3, NULL);
> >>
> >> ...
> >>
> >> +device_initcall(pmtmr_bug_check);
> >
> > Can this code use the PCI quirk infrastructure?
>
> Yes. However, since we need to check there is _not_ those chipsets,

Oh. Probably not worth bothering with then.

2006-03-23 17:04:17

by Andreas Mohr

[permalink] [raw]
Subject: Re: [PATCH] PM-Timer: doesn't use workaround if chipset is not buggy

Hi,

On Thu, Mar 23, 2006 at 03:49:18AM +0900, OGAWA Hirofumi wrote:
> This patch adds blacklist of buggy chip, and if chip is not buggy,
> this uses fast normal version instead of slow workaround version.

Nice!
Testing on ICH5 (P4HT) hasn't shown any problems so far.

> If chip is buggy, warnings "pmtmr is slow". But sounds like there is
> gray zone. I found the PIIX4 errata, but I couldn't find the ICH4
> errata. But some motherboard seems to have problem.

Same here, the ICH4 trace is nowhere to be found.

Should I do a public request for chipset testing?
(I wrote a small test app here that would hopefully identify a buggy
chipset).

Data that I have collected from internet snippets (mostly Intel errata
documents):
Affected (PCI ID / rev):
- ICH4???
- PIIX4 A0 (0x7113 / 00?), A1 (0x7113 / 00?), B0 (0x7113 / 01?)
- PIIX4E A0 (0x7113 / 02?)
Probably fixed (PCI ID / rev):
- PIIX4M A0 (0x7113 / 03?)

My Toshiba Satellite 4280 seems to have non-buggy PIIX4M
(since it's PCI rev. 03), haven't had time to test reliability yet, though.

> So, if found a ICH4, also warnings, and uses a workaround version. If
> user's ICH4 is good actually, user can specify the "pmtmr_good" boot
> parameter to use fast version.

Good, that's how it should be done as long as it's not entirely known which
chipsets (both Intel and possibly also non-Intel!) are buggy.

Also, I think that since triple-reading wastes 3% of system time, it would
be very nice to be able to eventually write a software workaround for buggy
systems, too.
This could perhaps be done by keeping the *old* value in a global yet
thread-safe variable and checking whether the new value is safely related
to it if within short jiffies distance (but since jiffies are probably
calculated from pmtmr readings... :-P).

Andreas Mohr

2006-03-23 18:21:51

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH] PM-Timer: doesn't use workaround if chipset is not buggy

Andreas Mohr <[email protected]> writes:

> Should I do a public request for chipset testing?
> (I wrote a small test app here that would hopefully identify a buggy
> chipset).

I think almost ICH4 is not buggy. But probably current approach is safe.
So, I added "pmtmr_good" to disable the workaround instead.

I posted probably similar one for mainly ICH4 users.
http://marc.theaimsgroup.com/?l=linux-kernel&m=114297656924494&w=2

> Data that I have collected from internet snippets (mostly Intel errata
> documents):
> Affected (PCI ID / rev):
> - ICH4???
> - PIIX4 A0 (0x7113 / 00?), A1 (0x7113 / 00?), B0 (0x7113 / 01?)
> - PIIX4E A0 (0x7113 / 02?)
> Probably fixed (PCI ID / rev):
> - PIIX4M A0 (0x7113 / 03?)
>
> My Toshiba Satellite 4280 seems to have non-buggy PIIX4M
> (since it's PCI rev. 03), haven't had time to test reliability yet, though.

I tested PIIX4E (yes, really buggy), ICH7, VT88237. And ICH6 was
reported as sane.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2006-03-25 12:00:28

by bert hubert

[permalink] [raw]
Subject: Re: [PATCH] PM-Timer: doesn't use workaround if chipset is not buggy

> This patch adds blacklist of buggy chip, and if chip is not buggy,
> this uses fast normal version instead of slow workaround version.

I can confirm that this patch solves the problem I originally complained
about.

Timings of 10,000,000 gettimeofday calls on my pentium 4 3GHz (which I won
at OLS, ha!):

tsc
---
real 0m2.504s
user 0m0.700s
sys 0m1.804s

2.6.16-mainline
---------------
real 0m36.973s
user 0m1.440s
sys 0m34.130s

2.6.16-ogawa
------------
real 0m13.697s
user 0m1.712s
sys 0m11.117s

For reference, baseline 2.6.16 on my Athlon64 3200+
---------------------------------------------------
real 0m1.994s
user 0m0.990s
sys 0m0.990s

I'm still going to recommend all 'power' pdns_recursor users on single CPU
and not using frequency scaling to boot with clock=tsc - saves
2.2usec/packet. Or run on an opteron of course.

Thanks for the work everybody!

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2006-03-30 11:53:23

by Andreas Mohr

[permalink] [raw]
Subject: Re: [PATCH] PM-Timer: doesn't use workaround if chipset is not buggy

Hi,

On Fri, Mar 24, 2006 at 03:21:39AM +0900, OGAWA Hirofumi wrote:
> Andreas Mohr <[email protected]> writes:
>
> > Should I do a public request for chipset testing?
> > (I wrote a small test app here that would hopefully identify a buggy
> > chipset).
>
> I think almost ICH4 is not buggy. But probably current approach is safe.
> So, I added "pmtmr_good" to disable the workaround instead.
>
> I posted probably similar one for mainly ICH4 users.
> http://marc.theaimsgroup.com/?l=linux-kernel&m=114297656924494&w=2

IMHO this request was much too non-verbose (both Subject *and* introduction).
Interested parties wouldn't even know that they should be interested in it ;)


> > Data that I have collected from internet snippets (mostly Intel errata
> > documents):
> > Affected (PCI ID / rev):
> > - ICH4???
> > - PIIX4 A0 (0x7113 / 00?), A1 (0x7113 / 00?), B0 (0x7113 / 01?)
> > - PIIX4E A0 (0x7113 / 02?)
> > Probably fixed (PCI ID / rev):
> > - PIIX4M A0 (0x7113 / 03?)
> >
> > My Toshiba Satellite 4280 seems to have non-buggy PIIX4M
> > (since it's PCI rev. 03), haven't had time to test reliability yet, though.
>
> I tested PIIX4E (yes, really buggy), ICH7, VT88237. And ICH6 was
> reported as sane.

What further steps should now be taken for this patch to be included
in a sufficiently official kernel in some form?

I'm asking now since as some kind of weird Christmas present this one
has found its way under my desk rather very accidentally:

00:00.0 Host bridge: Intel Corporation 440LX/EX - 82443LX/EX Host bridge (rev 0
3)
00:01.0 PCI bridge: Intel Corporation 440LX/EX - 82443LX/EX AGP bridge (rev 03)
00:04.0 ISA bridge: Intel Corporation 82371AB PIIX4 ISA (rev 01)
00:04.1 IDE interface: Intel Corporation 82371AB PIIX4 IDE (rev 01)
00:04.2 USB Controller: Intel Corporation 82371AB PIIX4 USB (rev 01)
00:04.3 Bridge: Intel Corporation 82371AB PIIX4 ACPI (rev 01)
00:06.0 SCSI storage controller: Adaptec AIC-7880U
00:0a.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 7
4)
00:0c.0 VGA compatible controller: ATI Technologies Inc 264VT4 [Mach64 VT4] (re
v 3a)


So this means I have ICH5, PIIX4M and PIIX4 rev. 01 (most likely buggy)
and some non-Intel chipsets. Ideal conditions for testing.

Andreas Mohr

2006-03-30 15:37:15

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH] PM-Timer: doesn't use workaround if chipset is not buggy

Andreas Mohr <[email protected]> writes:

> What further steps should now be taken for this patch to be included
> in a sufficiently official kernel in some form?

This patch was included into Linus's tree. Thanks.
--
OGAWA Hirofumi <[email protected]>

2006-03-30 16:02:55

by Andreas Mohr

[permalink] [raw]
Subject: Re: [PATCH] PM-Timer: doesn't use workaround if chipset is not buggy

Hi,

On Fri, Mar 31, 2006 at 12:37:06AM +0900, OGAWA Hirofumi wrote:
> Andreas Mohr <[email protected]> writes:
>
> > What further steps should now be taken for this patch to be included
> > in a sufficiently official kernel in some form?
>
> This patch was included into Linus's tree. Thanks.

Sorry, I totally hadn't expected this to have gone in already, otherwise
I'd certainly have verified that. But indeed the patch seemed quite ok
already...

Oh well, seems Denis Vlasenko (coincidentally quite well-known to me)
now is not the only one having issues with patch status ;)

Andreas Mohr

--
No programming skills!? Why not help translate many Linux applications!
https://launchpad.ubuntu.com/rosetta
(or alternatively buy nicely packaged Linux distros/OSS software to help
support Linux developers creating shiny new things for you?)