2002-08-08 02:04:24

by john stultz

[permalink] [raw]
Subject: [PATCH] tsc-disable_B9

Marcelo,

Here is my latest revision of tsc-disable. It includes CodingStyle
changes suggested by Christoph and George Anzinger, as well moving some
of the changes around to accommodate my cyclone-timer patch (which I'll
post in a second).

As I said before: This patch enables a workaround for multi-node NUMA
systems that are experiencing gettimeofday returning "old" time values.
Because these systems are frequently driven by different crystals, the
CPUs vary slightly in frequency causing the TSCs to drift apart. Thus it
is possible for gettimeofday to return time values behind time values
already seen on another cpu. This patch allows people compiling w/
'Multi-node NUMA support' to pass "notsc" or "bad-tsc" as boot
parameters. "notsc" disables rdtsc calls, and forces the kernel to use
the PIT for gettimeofday calucluations (as normally expected w/ i386
compiled kernels). While "bad-tsc" forces the kernel to use the PIT for
gettimeofday, but does not disable TSC access.

Comments, suggestions and flames welcome.

thanks
-john

diff -Nru a/Documentation/Configure.help b/Documentation/Configure.help
--- a/Documentation/Configure.help Wed Aug 7 17:26:31 2002
+++ b/Documentation/Configure.help Wed Aug 7 17:26:31 2002
@@ -230,7 +230,21 @@
network and embedded applications. For more information see the
Axis Communication site, <http://developer.axis.com/>.

-Multiquad support for NUMA systems
+Multi-node support for NUMA systems
+CONFIG_X86_NUMA
+ This option is used for getting Linux to run on a NUMA multi-node box.
+ Because multi-node systems suffer from unsynced TSCs, as well as TSC
+ drift, which can cause gettimeofday to return non-monotonic values,
+ this option will turn off the CONFIG_X86_TSC optimization. This
+ allows you to then specify "bad-tsc" as a boot option to force all nodes
+ to use the PIT for gettimeofday.
+
+ Note: This does not disable access to the unsynced TSCs from userspace,
+ thus applications using the rdtsc instruction for timing may have
+ issues. To disable userspace access, instead of "bad-tsc" use the
+ "notsc" boot option.
+
+Multiquad support for NUMAQ systems
CONFIG_MULTIQUAD
This option is used for getting Linux to run on a (IBM/Sequent) NUMA
multiquad box. This changes the way that processors are bootstrapped,
diff -Nru a/arch/i386/config.in b/arch/i386/config.in
--- a/arch/i386/config.in Wed Aug 7 17:26:31 2002
+++ b/arch/i386/config.in Wed Aug 7 17:26:31 2002
@@ -79,20 +79,20 @@
define_int CONFIG_X86_L1_CACHE_SHIFT 5
define_bool CONFIG_X86_USE_STRING_486 y
define_bool CONFIG_X86_ALIGNMENT_16 y
- define_bool CONFIG_X86_TSC y
+ define_bool CONFIG_X86_HAS_TSC y
define_bool CONFIG_X86_PPRO_FENCE y
fi
if [ "$CONFIG_M586MMX" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 5
define_bool CONFIG_X86_USE_STRING_486 y
define_bool CONFIG_X86_ALIGNMENT_16 y
- define_bool CONFIG_X86_TSC y
+ define_bool CONFIG_X86_HAS_TSC y
define_bool CONFIG_X86_GOOD_APIC y
define_bool CONFIG_X86_PPRO_FENCE y
fi
if [ "$CONFIG_M686" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 5
- define_bool CONFIG_X86_TSC y
+ define_bool CONFIG_X86_HAS_TSC y
define_bool CONFIG_X86_GOOD_APIC y
define_bool CONFIG_X86_PGE y
define_bool CONFIG_X86_USE_PPRO_CHECKSUM y
@@ -100,14 +100,14 @@
fi
if [ "$CONFIG_MPENTIUMIII" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 5
- define_bool CONFIG_X86_TSC y
+ define_bool CONFIG_X86_HAS_TSC y
define_bool CONFIG_X86_GOOD_APIC y
define_bool CONFIG_X86_PGE y
define_bool CONFIG_X86_USE_PPRO_CHECKSUM y
fi
if [ "$CONFIG_MPENTIUM4" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 7
- define_bool CONFIG_X86_TSC y
+ define_bool CONFIG_X86_HAS_TSC y
define_bool CONFIG_X86_GOOD_APIC y
define_bool CONFIG_X86_PGE y
define_bool CONFIG_X86_USE_PPRO_CHECKSUM y
@@ -115,12 +115,12 @@
if [ "$CONFIG_MK6" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 5
define_bool CONFIG_X86_ALIGNMENT_16 y
- define_bool CONFIG_X86_TSC y
+ define_bool CONFIG_X86_HAS_TSC y
define_bool CONFIG_X86_USE_PPRO_CHECKSUM y
fi
if [ "$CONFIG_MK7" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 6
- define_bool CONFIG_X86_TSC y
+ define_bool CONFIG_X86_HAS_TSC y
define_bool CONFIG_X86_GOOD_APIC y
define_bool CONFIG_X86_USE_3DNOW y
define_bool CONFIG_X86_PGE y
@@ -133,14 +133,14 @@
fi
if [ "$CONFIG_MCYRIXIII" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 5
- define_bool CONFIG_X86_TSC y
+ define_bool CONFIG_X86_HAS_TSC y
define_bool CONFIG_X86_ALIGNMENT_16 y
define_bool CONFIG_X86_USE_3DNOW y
define_bool CONFIG_X86_USE_PPRO_CHECKSUM y
fi
if [ "$CONFIG_MCRUSOE" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 5
- define_bool CONFIG_X86_TSC y
+ define_bool CONFIG_X86_HAS_TSC y
fi
if [ "$CONFIG_MWINCHIPC6" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 5
@@ -151,14 +151,14 @@
if [ "$CONFIG_MWINCHIP2" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 5
define_bool CONFIG_X86_ALIGNMENT_16 y
- define_bool CONFIG_X86_TSC y
+ define_bool CONFIG_X86_HAS_TSC y
define_bool CONFIG_X86_USE_PPRO_CHECKSUM y
define_bool CONFIG_X86_OOSTORE y
fi
if [ "$CONFIG_MWINCHIP3D" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 5
define_bool CONFIG_X86_ALIGNMENT_16 y
- define_bool CONFIG_X86_TSC y
+ define_bool CONFIG_X86_HAS_TSC y
define_bool CONFIG_X86_USE_PPRO_CHECKSUM y
define_bool CONFIG_X86_OOSTORE y
fi
@@ -202,7 +202,14 @@
define_bool CONFIG_X86_IO_APIC y
fi
else
- bool 'Multiquad NUMA system' CONFIG_MULTIQUAD
+ bool 'Multi-node NUMA system support' CONFIG_X86_NUMA
+ if [ "$CONFIG_X86_NUMA" = "y" ]; then
+ bool ' Multiquad (IBM/Sequent) NUMAQ support' CONFIG_MULTIQUAD
+ fi
+fi
+
+if [ "$CONFIG_X86_NUMA" != "y" -a "$CONFIG_X86_HAS_TSC" = "y" ]; then
+ define_bool CONFIG_X86_TSC y
fi

bool 'ISA bus support' CONFIG_ISA
diff -Nru a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c
--- a/arch/i386/kernel/time.c Wed Aug 7 17:26:31 2002
+++ b/arch/i386/kernel/time.c Wed Aug 7 17:26:31 2002
@@ -69,6 +69,9 @@
/* Number of usecs that the last interrupt was delayed */
static int delay_at_last_interrupt;

+/*boot option flag to keep gettimeofday from using do_fastgettimeoffset */
+static int bad_tsc __initdata = 0;
+
static unsigned long last_tsc_low; /* lsb 32 bits of Time Stamp Counter */

/* Cached *multiplier* to convert TSC counts to microseconds.
@@ -634,6 +637,16 @@
return 0;
}

+/* bad-tsc boot option: Used to keep do_gettimeofday from
+ * using do_fast_gettimeoffset()
+ */
+static int __init bad_tsc_setup(char *str)
+{
+ bad_tsc = 1;
+ return 1;
+}
+__setup("bad-tsc", bad_tsc_setup);
+
void __init time_init(void)
{
extern int x86_udelay_tsc;
@@ -672,16 +685,17 @@
unsigned long tsc_quotient = calibrate_tsc();
if (tsc_quotient) {
fast_gettimeoffset_quotient = tsc_quotient;
- use_tsc = 1;
/*
* We could be more selective here I suspect
* and just enable this for the next intel chips ?
*/
- x86_udelay_tsc = 1;
+ if(!bad_tsc){
+ use_tsc = 1;
+ x86_udelay_tsc = 1;
#ifndef do_gettimeoffset
- do_gettimeoffset = do_fast_gettimeoffset;
+ do_gettimeoffset = do_fast_gettimeoffset;
#endif
-
+ }
/* report CPU clock rate in Hz.
* The formula is (10^6 * 2^32) / (2^32 * 1 / (clocks/us)) =
* clock/second. Our precision is about 100 ppm.


2002-08-08 02:26:37

by john stultz

[permalink] [raw]
Subject: [PATCH] cyclone-timer_A9

Marcelo,
This patch (which applies on top of my tsc-disable_B9 patch as well as
James Cleverdon's summit patch), is a performance improvement for
multi-CEC IBM x440 systems which suffer from drifting TSCs. Rather then
forcing do_gettimeofday to call do_slow_gettimeoffset and access the PIT
(as my tsc-disable patch does), passing "cyclone" as a boot option will
make do_gettimeofday use a 100Mhz performance counter found in the
Summit chipset.

This patch does not effect/correct userspace access to the unsynced TSC
registers via the rdtsc call.

Comments, flames, etc welcome.

thanks
-john

diff -Nru a/Documentation/Configure.help b/Documentation/Configure.help
--- a/Documentation/Configure.help Wed Aug 7 19:00:21 2002
+++ b/Documentation/Configure.help Wed Aug 7 19:00:21 2002
@@ -252,6 +252,14 @@
You will need a new lynxer.elf file to flash your firmware with - send
email to [email protected]

+IBM x440 Summit support
+CONFIG_X86_SUMMIT_NUMA
+ This option enables support for the IBM x440 and related multi-CEC
+ systems based on the Summit chipset. This options allows you to pass
+ "cyclone" as a boot option to make use of a performance counter on
+ the Cyclone chipset for calculating do_gettimeofday, greatly
+ improving performance when compared to the PIT based method.
+
IO-APIC support on uniprocessors
CONFIG_X86_UP_IOAPIC
An IO-APIC (I/O Advanced Programmable Interrupt Controller) is an
diff -Nru a/arch/i386/config.in b/arch/i386/config.in
--- a/arch/i386/config.in Wed Aug 7 19:00:20 2002
+++ b/arch/i386/config.in Wed Aug 7 19:00:20 2002
@@ -205,6 +205,7 @@
bool 'Multi-node NUMA system support' CONFIG_X86_NUMA
if [ "$CONFIG_X86_NUMA" = "y" ]; then
bool ' Multiquad (IBM/Sequent) NUMAQ support' CONFIG_MULTIQUAD
+ bool ' IBM x440 Summit support' CONFIG_X86_SUMMIT_NUMA
fi
fi

diff -Nru a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c
--- a/arch/i386/kernel/time.c Wed Aug 7 19:00:20 2002
+++ b/arch/i386/kernel/time.c Wed Aug 7 19:00:21 2002
@@ -256,6 +256,138 @@

static unsigned long (*do_gettimeoffset)(void) = do_slow_gettimeoffset;

+
+
+#ifdef CONFIG_X86_SUMMIT_NUMA
+
+#define CYCLONE_CBAR_ADDR 0xFEB00CD0
+#define CYCLONE_PMCC_OFFSET 0x51A0
+#define CYCLONE_MPMC_OFFSET 0x51D0
+#define CYCLONE_MPCS_OFFSET 0x51A8
+#define CYCLONE_TIMER_FREQ 100000000
+
+static int use_cyclone __initdata = 0;
+static int __init cyclone_setup(char *str)
+{
+ bad_tsc = 1;
+ use_cyclone = 1;
+ return 1;
+}
+__setup("cyclone", cyclone_setup);
+
+
+static u32* cyclone_timer; /*Cyclone MPMC0 register*/
+static u32 last_cyclone_timer;
+
+static inline void mark_timeoffset_cyclone(void)
+{
+ int count;
+
+ /*quickly read the cyclone timer*/
+ if(cyclone_timer)
+ last_cyclone_timer = cyclone_timer[0];
+
+ /*calculate delay_at_last_interrupt*/
+ spin_lock(&i8253_lock);
+ outb_p(0x00, 0x43); /* latch the count ASAP */
+
+ count = inb_p(0x40); /* read the latched count */
+ count |= inb(0x40) << 8;
+ spin_unlock(&i8253_lock);
+
+ count = ((LATCH-1) - count) * TICK_SIZE;
+ delay_at_last_interrupt = (count + LATCH/2) / LATCH;
+}
+
+static unsigned long do_gettimeoffset_cyclone(void)
+{
+ u32 offset;
+
+ if(!cyclone_timer)
+ return delay_at_last_interrupt;
+
+ /* Read the cyclone timer */
+ offset = cyclone_timer[0];
+
+ /* .. relative to previous jiffy*/
+ offset = offset - last_cyclone_timer;
+
+ /*convert cyclone ticks to microseconds*/
+ offset = offset/100; /*XXX slow, can we speed this up?*/
+
+ /* our adjusted time offset in microseconds */
+ return delay_at_last_interrupt + offset;
+}
+
+static void init_cyclone_clock(void)
+{
+ u32* reg;
+ u32 base; /*saved cyclone base address*/
+ u32 pageaddr; /*page that contains cyclone_timer register*/
+ u32 offset; /*offset from pageaddr to cyclone_timer register*/
+
+ printk("Summit chipset: Starting Cyclone Clock.\n");
+
+ /*find base address*/
+ pageaddr = (CYCLONE_CBAR_ADDR)&PAGE_MASK;
+ offset = (CYCLONE_CBAR_ADDR)&(~PAGE_MASK);
+ set_fixmap_nocache(FIX_CYCLONE_TIMER, pageaddr);
+ reg = (u32*)(fix_to_virt(FIX_CYCLONE_TIMER) + offset);
+ if(!reg){
+ printk("Summit chipset: Could not find valid CBAR register.\n");
+ return;
+ }
+ base = *reg;
+ if(!base){
+ printk("Summit chipset: Could not find valid CBAR value.\n");
+ return;
+ }
+
+ /*setup PMCC*/
+ pageaddr = (base + CYCLONE_PMCC_OFFSET)&PAGE_MASK;
+ offset = (base + CYCLONE_PMCC_OFFSET)&(~PAGE_MASK);
+ set_fixmap_nocache(FIX_CYCLONE_TIMER, pageaddr);
+ reg = (u32*)(fix_to_virt(FIX_CYCLONE_TIMER) + offset);
+ if(!reg){
+ printk("Summit chipset: Could not find valid PMCC register.\n");
+ return;
+ }
+ reg[0] = 0x00000001;
+
+ /*setup MPCS*/
+ pageaddr = (base + CYCLONE_MPCS_OFFSET)&PAGE_MASK;
+ offset = (base + CYCLONE_MPCS_OFFSET)&(~PAGE_MASK);
+ set_fixmap_nocache(FIX_CYCLONE_TIMER, pageaddr);
+ reg = (u32*)(fix_to_virt(FIX_CYCLONE_TIMER) + offset);
+ if(!reg){
+ printk("Summit chipset: Could not find valid MPCS register.\n");
+ return;
+ }
+ reg[0] = 0x00000001;
+
+ /*map in cyclone_timer*/
+ pageaddr = (base + CYCLONE_MPMC_OFFSET)&PAGE_MASK;
+ offset = (base + CYCLONE_MPMC_OFFSET)&(~PAGE_MASK);
+ set_fixmap_nocache(FIX_CYCLONE_TIMER, pageaddr);
+ cyclone_timer = (u32*)(fix_to_virt(FIX_CYCLONE_TIMER) + offset);
+ if(!cyclone_timer){
+ printk("Summit chipset: Could not find valid MPMC register.\n");
+ return;
+ }
+
+ /* Everything looks good, so set do_gettimeoffset*/
+ do_gettimeoffset = do_gettimeoffset_cyclone;
+}
+
+#else /*CONFIG_X86_SUMMIT_NUMA*/
+
+#define use_cyclone 0
+static void mark_timeoffset_cyclone(void) {}
+static unsigned long do_gettimeoffset_cyclone(void) {return 0;}
+static void init_cyclone_clock(void) {}
+
+#endif /*CONFIG_X86_SUMMIT_NUMA*/
+
#else

#define do_gettimeoffset() do_fast_gettimeoffset()
@@ -481,8 +613,7 @@
*/
write_lock(&xtime_lock);

- if (use_tsc)
- {
+ if (use_tsc) {
/*
* It is important that these two operations happen almost at
* the same time. We do the RDTSC stuff first, since it's
@@ -508,8 +639,11 @@

count = ((LATCH-1) - count) * TICK_SIZE;
delay_at_last_interrupt = (count + LATCH/2) / LATCH;
- }
-
+ } else {
+ if(use_cyclone)
+ mark_timeoffset_cyclone();
+ }
+
do_timer_interrupt(irq, NULL, regs);

write_unlock(&xtime_lock);
@@ -709,6 +843,9 @@
}
}
}
+
+ if((!use_tsc) && use_cyclone)
+ init_cyclone_clock();

#ifdef CONFIG_VISWS
printk("Starting Cobalt Timer system clock\n");
diff -Nru a/include/asm-i386/fixmap.h b/include/asm-i386/fixmap.h
--- a/include/asm-i386/fixmap.h Wed Aug 7 19:00:20 2002
+++ b/include/asm-i386/fixmap.h Wed Aug 7 19:00:20 2002
@@ -61,6 +61,9 @@
FIX_LI_PCIA, /* Lithium PCI Bridge A */
FIX_LI_PCIB, /* Lithium PCI Bridge B */
#endif
+#ifdef CONFIG_X86_SUMMIT_NUMA
+ FIX_CYCLONE_TIMER, /*cyclone timer register*/
+#endif
#ifdef CONFIG_HIGHMEM
FIX_KMAP_BEGIN, /* reserved pte's for temporary kernel mappings */
FIX_KMAP_END = FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1,


2002-08-08 11:54:24

by Alan

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Thu, 2002-08-08 at 02:53, john stultz wrote:
> already seen on another cpu. This patch allows people compiling w/
> 'Multi-node NUMA support' to pass "notsc" or "bad-tsc" as boot
> parameters. "notsc" disables rdtsc calls, and forces the kernel to use
> the PIT for gettimeofday calucluations (as normally expected w/ i386
> compiled kernels). While "bad-tsc" forces the kernel to use the PIT for
> gettimeofday, but does not disable TSC access.

I've done a version of this for -ac which uses the NUMA autodetect, and
also provides the hook so I can disable tsc on split on x86 smp with
variable multipliers some time. The only comment I really have is please
use "badtsc" not "bad-tsc" - to match "notsc"

This is how I did it - barring the code that is -ac specific to automatically flip
the mode when NUMA hw is detected


diff -u --exclude-from /usr/src/exclude --new-file --recursive linux.20pre1/include/asm-i386/processor.h linux.20pre1-ac2/include/asm-i386/processor.h
--- linux.20pre1/include/asm-i386/processor.h 2002-08-06 15:40:34.000000000 +0100
+++ linux.20pre1-ac2/include/asm-i386/processor.h 2002-08-07 15:23:39.000000000 +0100
@@ -96,7 +96,7 @@

extern void identify_cpu(struct cpuinfo_x86 *);
extern void print_cpu_info(struct cpuinfo_x86 *);
-extern void dodgy_tsc(void);
+extern int dodgy_tsc(void);

/*
* EFLAGS bits
--- linux.vanilla/arch/i386/kernel/time.c 2002-02-25 19:37:53.000000000 +0000
+++ linux.20pre1-ac2/arch/i386/kernel/time.c 2002-08-08 12:56:48.000000000 +0100
@@ -666,9 +666,7 @@
* moaned if you have the only one in the world - you fix it!
*/

- dodgy_tsc();
-
- if (cpu_has_tsc) {
+ if (!dodgy_tsc()) {
unsigned long tsc_quotient = calibrate_tsc();
if (tsc_quotient) {
fast_gettimeoffset_quotient = tsc_quotient;
diff -u --exclude-from /usr/src/exclude --new-file --recursive linux.20pre1/arch/i386/kernel/setup.c linux.20pre1-ac2/arch/i386/kernel/setup.c
--- linux.20pre1/arch/i386/kernel/setup.c 2002-08-06 15:41:42.000000000 +0100
+++ linux.20pre1-ac2/arch/i386/kernel/setup.c 2002-08-07 15:23:29.000000000 +0100
@@ -1193,19 +1126,51 @@
}
__setup("cachesize=", cachesize_setup);

-
#ifndef CONFIG_X86_TSC
+
static int tsc_disable __initdata = 0;

-static int __init tsc_setup(char *str)
+void disable_tsc(void)
+{
+ if(tsc_disable == 0)
+ {
+ printk(KERN_INFO "Disabling use of TSC for time counting.\n");
+ tsc_disable = 1;
+ }
+}
+
+/* Disable TSC on processor and also for get time of day */
+
+static int __init notsc_setup(char *str)
+{
+ tsc_disable = 2;
+ return 1;
+}
+
+__setup("notsc", notsc_setup);
+
+/* Allow TSC use but don't use it for gettimeofday */
+
+static int __init badtsc_setup(char *str)
{
tsc_disable = 1;
return 1;
}

-__setup("notsc", tsc_setup);
+__setup("badtsc", badtsc_setup);
+
+#else
+
+#define tsc_disable 0
+
+void disable_tsc(void)
+{
+ panic("Time stamp counter required by this kernel, but not supported by the hardware.\n");
+}
+
#endif

+
static int __init get_model_name(struct cpuinfo_x86 *c)
{
unsigned int *v;
@@ -2843,10 +2762,8 @@
*/

/* TSC disabled? */
-#ifndef CONFIG_X86_TSC
- if ( tsc_disable )
+ if ( tsc_disable > 1 )
clear_bit(X86_FEATURE_TSC, &c->x86_capability);
-#endif

/* HT disabled? */
if (disable_x86_ht)
@@ -2906,13 +2823,23 @@
* Perform early boot up checks for a valid TSC. See arch/i386/kernel/time.c
*/

-void __init dodgy_tsc(void)
+int __init dodgy_tsc(void)
{
get_cpu_vendor(&boot_cpu_data);

if ( boot_cpu_data.x86_vendor == X86_VENDOR_CYRIX ||
boot_cpu_data.x86_vendor == X86_VENDOR_NSC )
init_cyrix(&boot_cpu_data);
+
+ /* Is it disabled ? */
+ if(tsc_disable)
+ return 1;
+
+ /* Does it exist ? */
+ if(!cpu_has_tsc)
+ return 1;
+
+ return 0;
}


@@ -3088,14 +3021,12 @@

if (cpu_has_vme || cpu_has_tsc || cpu_has_de)
clear_in_cr4(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);
-#ifndef CONFIG_X86_TSC
- if (tsc_disable && cpu_has_tsc) {
+ if (tsc_disable > 1 && cpu_has_tsc) {
printk(KERN_NOTICE "Disabling TSC...\n");
/**** FIX-HPA: DOES THIS REALLY BELONG HERE? ****/
clear_bit(X86_FEATURE_TSC, boot_cpu_data.x86_capability);
set_in_cr4(X86_CR4_TSD);
}
-#endif

__asm__ __volatile__("lgdt %0": "=m" (gdt_descr));
__asm__ __volatile__("lidt %0": "=m" (idt_descr));

2002-08-08 11:55:17

by Alan

[permalink] [raw]
Subject: Re: [PATCH] cyclone-timer_A9

On Thu, 2002-08-08 at 03:15, john stultz wrote:
> Marcelo,
> This patch (which applies on top of my tsc-disable_B9 patch as well as
> James Cleverdon's summit patch), is a performance improvement for
> multi-CEC IBM x440 systems which suffer from drifting TSCs. Rather then
> forcing do_gettimeofday to call do_slow_gettimeoffset and access the PIT
> (as my tsc-disable patch does), passing "cyclone" as a boot option will
> make do_gettimeofday use a 100Mhz performance counter found in the
> Summit chipset.

Why not probe for the summit chipset ?

2002-08-09 02:22:47

by john stultz

[permalink] [raw]
Subject: Re: [PATCH] cyclone-timer_A9

On Thu, 2002-08-08 at 06:18, Alan Cox wrote:
> On Thu, 2002-08-08 at 03:15, john stultz wrote:
> > Marcelo,
> > This patch (which applies on top of my tsc-disable_B9 patch as well as
> > James Cleverdon's summit patch), is a performance improvement for
> > multi-CEC IBM x440 systems which suffer from drifting TSCs. Rather then
> > forcing do_gettimeofday to call do_slow_gettimeoffset and access the PIT
> > (as my tsc-disable patch does), passing "cyclone" as a boot option will
> > make do_gettimeofday use a 100Mhz performance counter found in the
> > Summit chipset.
>
> Why not probe for the summit chipset ?

Yea, its on my list. Once Jame's patch settles down I plan to piggy back
off that.

Thanks!
-john



2002-08-09 02:43:04

by john stultz

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Thu, 2002-08-08 at 06:17, Alan Cox wrote:
> On Thu, 2002-08-08 at 02:53, john stultz wrote:
> > already seen on another cpu. This patch allows people compiling w/
> > 'Multi-node NUMA support' to pass "notsc" or "bad-tsc" as boot
> > parameters. "notsc" disables rdtsc calls, and forces the kernel to use
> > the PIT for gettimeofday calucluations (as normally expected w/ i386
> > compiled kernels). While "bad-tsc" forces the kernel to use the PIT for
> > gettimeofday, but does not disable TSC access.
>
> I've done a version of this for -ac which uses the NUMA autodetect, and
> also provides the hook so I can disable tsc on split on x86 smp with
> variable multipliers some time. The only comment I really have is please

Not sure I followed that, do you mean per-cpu TSC management for
gettimeofday?

> use "badtsc" not "bad-tsc" - to match "notsc"

Sounds good. Changed in my tree.


> This is how I did it - barring the code that is -ac specific to automatically flip
> the mode when NUMA hw is detected

Interesting. See my comments below. Auto-detection has been on my list
for the last week, so I'm eagerly awaiting a peek at the next ac release
:)


> diff -u --exclude-from /usr/src/exclude --new-file --recursive linux.20pre1/include/asm-i386/processor.h linux.20pre1-ac2/include/asm-i386/processor.h
> --- linux.20pre1/include/asm-i386/processor.h 2002-08-06 15:40:34.000000000 +0100
> +++ linux.20pre1-ac2/include/asm-i386/processor.h 2002-08-07 15:23:39.000000000 +0100
> @@ -96,7 +96,7 @@
>
> extern void identify_cpu(struct cpuinfo_x86 *);
> extern void print_cpu_info(struct cpuinfo_x86 *);
> -extern void dodgy_tsc(void);
> +extern int dodgy_tsc(void);
>
> /*
> * EFLAGS bits
> --- linux.vanilla/arch/i386/kernel/time.c 2002-02-25 19:37:53.000000000 +0000
> +++ linux.20pre1-ac2/arch/i386/kernel/time.c 2002-08-08 12:56:48.000000000 +0100
> @@ -666,9 +666,7 @@
> * moaned if you have the only one in the world - you fix it!
> */
>
> - dodgy_tsc();
> -
> - if (cpu_has_tsc) {
> + if (!dodgy_tsc()) {
> unsigned long tsc_quotient = calibrate_tsc();
> if (tsc_quotient) {
> fast_gettimeoffset_quotient = tsc_quotient;


Hmmm. Putting some of the checks into dodgy_tsc does clean some of this
up, maybe a new name is needed for it? Also by not executing this block
of code when we have TSCs that are not synced, we don't calculate
cpu_khz, which is used by the O(1) sched in 2.5 for cpu affinity. True
the TSCs could be wack (and really, we're only using one of them), but
can't we still trust them enough for this rough estimate?


> diff -u --exclude-from /usr/src/exclude --new-file --recursive linux.20pre1/arch/i386/kernel/setup.c linux.20pre1-ac2/arch/i386/kernel/setup.c
> --- linux.20pre1/arch/i386/kernel/setup.c 2002-08-06 15:41:42.000000000 +0100
> +++ linux.20pre1-ac2/arch/i386/kernel/setup.c 2002-08-07 15:23:29.000000000 +0100
> @@ -1193,19 +1126,51 @@
> }
> __setup("cachesize=", cachesize_setup);
>
> -
> #ifndef CONFIG_X86_TSC
> +
> static int tsc_disable __initdata = 0;
>
> -static int __init tsc_setup(char *str)
> +void disable_tsc(void)
> +{
> + if(tsc_disable == 0)
> + {
> + printk(KERN_INFO "Disabling use of TSC for time counting.\n");
> + tsc_disable = 1;
> + }
> +}
> +
> +/* Disable TSC on processor and also for get time of day */
> +
> +static int __init notsc_setup(char *str)
> +{
> + tsc_disable = 2;
> + return 1;
> +}
> +
> +__setup("notsc", notsc_setup);

Overloading the tsc_disable variable does let us have one less variable
around, but makes the code a little less intuitive. Looking over the
patch I get mixed up which value means what. Maybe just an enum might
help?


> +/* Allow TSC use but don't use it for gettimeofday */
> +
> +static int __init badtsc_setup(char *str)
> {
> tsc_disable = 1;
> return 1;
> }
>
> -__setup("notsc", tsc_setup);
> +__setup("badtsc", badtsc_setup);
> +
> +#else
> +
> +#define tsc_disable 0
> +
> +void disable_tsc(void)
> +{
> + panic("Time stamp counter required by this kernel, but not supported by the hardware.\n");
> +}
> +
> #endif

Hmm, I kind of like this. I'm thinking there should also be some sort of
error when the user passes bad/notsc to an optimized the kernel, rather
then just ignoring the boot option. I'll add that to my tree.


Thanks for all the feedback! Can't wait to see the next -ac.
-john


2002-08-09 07:54:07

by Alan

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Fri, 2002-08-09 at 03:30, john stultz wrote:
> Not sure I followed that, do you mean per-cpu TSC management for
> gettimeofday?

We have some x86 setups where people plug say a 300MHhz and a 450MHz
celeron into the same board. This works because they are same FSB,
different multiplier (works and intel certify being two different
things)

Needless to say tsc does not work well on such boxes. Thats why I don't
trust the tsc at all in such cases. Since you'll have the nice cyclone
timer for the Summit it seems best not to trust it, and on the summit to
use the cyclone for udelay as well ?

I agree dodgy_tsc needs to change name. Perhaps we actually want

int tsc = select_tsc();

switch(tsc)
{
case TSC_CYCLONE:
case TSC_PROCESSOR:
case TSC_NONE:
..
}

Alan

2002-08-09 17:57:55

by john stultz

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Fri, 2002-08-09 at 02:17, Alan Cox wrote:
> On Fri, 2002-08-09 at 03:30, john stultz wrote:
> > Not sure I followed that, do you mean per-cpu TSC management for
> > gettimeofday?
>
> We have some x86 setups where people plug say a 300MHhz and a 450MHz
> celeron into the same board. This works because they are same FSB,
> different multiplier (works and intel certify being two different
> things)

Oh yes, with the old NUMAQ hardware here, one can mix nodes of different
speed cpus. Once I get a chance, I'm going to begin working on this
issue for 2.5. My plan right now is to keep per-cpu last_tsc_low and
fast_gettimeoffset_quotient values, then round robin the timer
interrupt.


> Needless to say tsc does not work well on such boxes. Thats why I don't
> trust the tsc at all in such cases. Since you'll have the nice cyclone
> timer for the Summit it seems best not to trust it, and on the summit to
> use the cyclone for udelay as well ?
>
> I agree dodgy_tsc needs to change name. Perhaps we actually want
>
> int tsc = select_tsc();
>
> switch(tsc)
> {
> case TSC_CYCLONE:
> case TSC_PROCESSOR:
> case TSC_NONE:
> ..
> }

Sounds good. I'll re-work my patch and resubmit.

thanks!
-john

2002-08-09 18:47:03

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

john stultz wrote:
>
> On Fri, 2002-08-09 at 02:17, Alan Cox wrote:
> > On Fri, 2002-08-09 at 03:30, john stultz wrote:
> > > Not sure I followed that, do you mean per-cpu TSC management for
> > > gettimeofday?
> >
> > We have some x86 setups where people plug say a 300MHhz and a 450MHz
> > celeron into the same board. This works because they are same FSB,
> > different multiplier (works and intel certify being two different
> > things)
>
> Oh yes, with the old NUMAQ hardware here, one can mix nodes of different
> speed cpus. Once I get a chance, I'm going to begin working on this
> issue for 2.5. My plan right now is to keep per-cpu last_tsc_low and
> fast_gettimeoffset_quotient values, then round robin the timer
> interrupt.
>
An interesting approach, however, could you take a look at
the high-res-timers patch (see signature). In that code (in
the TSC version), we use the TSC to update jiffies and
_sub_jiffie (which is TSC counts into the next jiffie). We
also want to be able to "grab" a new TSC and figure the time
quickly, without updating either jiffies or _sub_jiffie.
Your approach would, I think, mean that both jiffies and
_sub_jiffie would be per cpu values, not impossible, but,
well, hard.

On the other hand, the high-res-timers patch also allows one
to use the ACPI pm timer, and ignore TSC completely :)

-g
>
> > Needless to say tsc does not work well on such boxes. Thats why I don't
> > trust the tsc at all in such cases. Since you'll have the nice cyclone
> > timer for the Summit it seems best not to trust it, and on the summit to
> > use the cyclone for udelay as well ?
> >
> > I agree dodgy_tsc needs to change name. Perhaps we actually want
> >
> > int tsc = select_tsc();
> >
> > switch(tsc)
> > {
> > case TSC_CYCLONE:
> > case TSC_PROCESSOR:
> > case TSC_NONE:
> > ..
> > }
>
> Sounds good. I'll re-work my patch and resubmit.
>
> thanks!
> -john
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-08-09 21:09:10

by john stultz

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Fri, 2002-08-09 at 11:49, george anzinger wrote:

> An interesting approach, however, could you take a look at
> the high-res-timers patch (see signature). In that code (in

I skimmed the patch, and I like alot of the cleanups and tweaks you have
make to arch/i386/kernel/time.c. As the number of time sources are
growing with the TSC, PIT, and now the cyclone counter and ACPI timer
(also, doesn't MS have some sort of high res performance counter
standard they're pushing for in hardware?), I'm thinking of rearranging
and renaming some of the code to better allow these changes. The current
do_slow/fast_gettimeoffset plus in-function 'if(use_tsc)' switches are
getting to be messy with the new additions.

However, this may be a 2.5 issue, so before I take my 2.4 fixes and port
them to a new infrastructure, I'd like it if someone chimed in and said
"do it" or "skip it" for 2.4.


> the TSC version), we use the TSC to update jiffies and
> _sub_jiffie (which is TSC counts into the next jiffie). We
> also want to be able to "grab" a new TSC and figure the time
> quickly, without updating either jiffies or _sub_jiffie.
> Your approach would, I think, mean that both jiffies and
> _sub_jiffie would be per cpu values, not impossible, but,
> well, hard.

I need to look over you code a bit more. It looks interesting though.


> On the other hand, the high-res-timers patch also allows one
> to use the ACPI pm timer, and ignore TSC completely :)

Yea, having another source of time is always nice, as long as it works
reliably. After all this time related work I'm not sure if I'm going to
start wearing 3-4 watches or just screw it and wear none. :)

thank for the pointer and feedback.
-john



2002-08-11 18:51:21

by Alan

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Fri, 2002-08-09 at 18:46, john stultz wrote:
> Oh yes, with the old NUMAQ hardware here, one can mix nodes of different
> speed cpus. Once I get a chance, I'm going to begin working on this
> issue for 2.5. My plan right now is to keep per-cpu last_tsc_low and
> fast_gettimeoffset_quotient values, then round robin the timer
> interrupt.

I wouldn't worry too much about it. With a pre-emptible kernel in 2,5
the tsc using code can jump processors and get disturbed. The right
answer is not to use tsc for time clocks unless we have no other good
option. Stuff like HPET will help no end

Alan

2002-08-13 01:20:54

by James Cleverdon

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Friday 09 August 2002 01:58 pm, john stultz wrote:
[ Snip! ]

> (also, doesn't MS have some sort of high res performance counter
> standard they're pushing for in hardware?), ...

Yeah, I think that's called DIG64. They had a reasonably nice HW timer setup,
but not so nice for NUMA synchronization as the one we built into the
[[Deleted by IP Police]] chipset. That had snapshot and delta adjust logic,
to synchronize nodes to one interconnect bus clock cycle. Too bad that
project was canned....

--
James Cleverdon
IBM xSeries Linux Solutions
{jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot com

2002-08-16 01:11:32

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Thu, Aug 08, 2002 at 07:30:46PM -0700, john stultz wrote:
> On Thu, 2002-08-08 at 06:17, Alan Cox wrote:
> > On Thu, 2002-08-08 at 02:53, john stultz wrote:
> > > already seen on another cpu. This patch allows people compiling w/
> > > 'Multi-node NUMA support' to pass "notsc" or "bad-tsc" as boot
> > > parameters. "notsc" disables rdtsc calls, and forces the kernel to use
> > > the PIT for gettimeofday calucluations (as normally expected w/ i386
> > > compiled kernels). While "bad-tsc" forces the kernel to use the PIT for
> > > gettimeofday, but does not disable TSC access.
> >
> > I've done a version of this for -ac which uses the NUMA autodetect, and
> > also provides the hook so I can disable tsc on split on x86 smp with
> > variable multipliers some time. The only comment I really have is please
>
> Not sure I followed that, do you mean per-cpu TSC management for
> gettimeofday?
>
> > use "badtsc" not "bad-tsc" - to match "notsc"
>
> Sounds good. Changed in my tree.

sorry but I don't see the point of badtsc only in kernel.

If the TSC is bad that will be in particular bad from userspace where
there's no hope to know what CPU you're running on.

If the TSC is bad userspace is the one that must not be allowed to read
the tsc at all, the tsc in userspace could only have a function of a
random number generator.

So I suggest to drop the badtsc and to only have the notsc that disables
the tsc globally for both kernel and userspace. That's what I'm doing in
my tree for the above reasons at least.

> > This is how I did it - barring the code that is -ac specific to automatically flip
> > the mode when NUMA hw is detected
>
> Interesting. See my comments below. Auto-detection has been on my list
> for the last week, so I'm eagerly awaiting a peek at the next ac release
> :)

How do you detect the NUMA hw? That would be a nice addition so the
numa-Q users won't be required to add notsc to the append lilo line.

My current preferred status for this patch is here:

http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19rc5aa1/94_numaq-tsc-4

I'm fine to drop the #ifdefs and to allow notsc to be significant
always if you like, however also requiring a numa compilation like now
should be fine, no? At least unless you can detect the badtsc
dynamically also in kernels with numa disabled, which isn't the case
right now I guess.

But still the "allowing badtsc in userspace" in the last patches looks
obviously wrong to me, the "emulate tsc in inst fault" was better, but
even that sounds not the best, all programs should have a fallback to
use gettimeofday so it's better to get the instruction fault immediatly
rather that running slow because of the flood of exceptions that happens
silenty.

Andrea

2002-08-16 11:14:00

by Alan

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Thu, 2002-08-15 at 17:56, Andrea Arcangeli wrote:
> sorry but I don't see the point of badtsc only in kernel.
>
> If the TSC is bad that will be in particular bad from userspace where
> there's no hope to know what CPU you're running on.

You can still do meaningful measurements for things like profiling
because the cpu hop is statistically uninteresting.

> How do you detect the NUMA hw? That would be a nice addition so the
> numa-Q users won't be required to add notsc to the append lilo line.

The current Summit patches from IBM (the ones James did and merged in
-ac) do detection for Numa-Q and Summit. Those are pending for Marcelo
right now so if you have a better way I'd like to know.


Alan

2002-08-16 13:15:58

by Mikael Pettersson

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

Alan Cox writes:
> On Thu, 2002-08-15 at 17:56, Andrea Arcangeli wrote:
> > sorry but I don't see the point of badtsc only in kernel.
> >
> > If the TSC is bad that will be in particular bad from userspace where
> > there's no hope to know what CPU you're running on.
>
> You can still do meaningful measurements for things like profiling
> because the cpu hop is statistically uninteresting.

There are kernel extensions around that handle process migration across
CPUs while providing virtualised per-process TSC counts, and for which
the TSCs do NOT need to be in perfect sync.

Disabling user-space RDTSC just because the TSCs aren't in sync is stupid.

/Mikael

2002-08-21 13:07:06

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Fri, Aug 16, 2002 at 03:19:31PM +0200, Mikael Pettersson wrote:
> Alan Cox writes:
> > On Thu, 2002-08-15 at 17:56, Andrea Arcangeli wrote:
> > > sorry but I don't see the point of badtsc only in kernel.
> > >
> > > If the TSC is bad that will be in particular bad from userspace where
> > > there's no hope to know what CPU you're running on.
> >
> > You can still do meaningful measurements for things like profiling
> > because the cpu hop is statistically uninteresting.
>
> There are kernel extensions around that handle process migration across
> CPUs while providing virtualised per-process TSC counts, and for which
> the TSCs do NOT need to be in perfect sync.

you mean trapping the instruction fault and emulating the tsc with a
gettimeofday? In such case you should run gettimeofday in the first
place. that's the whole point. rdtsc means rdtsc, not a software
emulation of it. If you don't want to bind your userspace to a certain
system that has tsc in sync, you should use gettimeofday since the first
place so it'll run on non x86 too.

Infact returning an instruction fault is probably one of the best way to
allow a program to autodetect if the tsc is useful. Otherwise if you
left the tsc enabled, it'll be hard for userspace to guess if the tsc
are in sync (you could with cpu binding but only if you have
privilegies, normal programs wouldn't be capable of probing that).

> Disabling user-space RDTSC just because the TSCs aren't in sync is stupid.

what you mean as non stupid has to be to still disable it (so figure out
how much disabling it is stupid) and to trap the instruction fault and
software emulate the tsc so that userspace will think the TSC aren't
disabled while they're are infact disabled.

However here the point is that the TSC was left _enabled_ (not disabled
and emulated as you are advocating) despite it was not in sync. That
cannot make any sense, except if you use the tsc as a random number
generator.

Andrea

2002-08-21 14:05:43

by Alan

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Wed, 2002-08-21 at 14:12, Andrea Arcangeli wrote:
> However here the point is that the TSC was left _enabled_ (not disabled
> and emulated as you are advocating) despite it was not in sync. That
> cannot make any sense, except if you use the tsc as a random number
> generator.

Totally untrue. When you are doing profiling the data is very useful
because CPU switches are trivial to filter and statistically rare

2002-08-21 14:28:07

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Wed, Aug 21, 2002 at 03:10:24PM +0100, Alan Cox wrote:
> On Wed, 2002-08-21 at 14:12, Andrea Arcangeli wrote:
> > However here the point is that the TSC was left _enabled_ (not disabled
> > and emulated as you are advocating) despite it was not in sync. That
> > cannot make any sense, except if you use the tsc as a random number
> > generator.
>
> Totally untrue. When you are doing profiling the data is very useful
> because CPU switches are trivial to filter and statistically rare

You can use statistic to guess what are the measurements that you've to
filter agreed, however this mostly applies to hot cache measurements
(likely to return many times the same result if there's no error) and in
particular where you are allowed to waste cpu time in repeating many
times the same workload before you say "ok this is the right measure".

Filtering out the errors is not feasible in all the cases, even if you
are allowed to repeat the same measurement multiple times. And it also
depends if you have a 3-way pipe communication in background.

And you'll keep to silenty break in all other applications that uses
rdtsc to know how much time has passed, that can hardly understand if
they can use the tsc or not unless we tell it to them from the kernel.

If you want to use the tsc in a controlled manner using statistic for
just profiling of a certain piece of code where it is easy to
demonstrate the absolute most frequent result is the correct one (as
said it's not always the case, when you do measurements out of hot cpu
cache the masurements vary significantly all the time, you are very well
aware about it), at the very least it should be a privilegied sysctl or
a non privilegied prctl (prctl however requires a slowdown of
swithc_to). Infact if we add a branch to switch_to we could even
automatically allow it for processes that are bind to a single cpu (or
to a single NUMA node adding some more internal knowledge about the numa
topology).

But silenty breaking apps and not allowing in any way to apps to learn
if the tsc is returning random or if it's returning something
significant (I understand that's the way you did it in -ac) is a no-way
by default IMHO.

Andrea

2002-08-21 14:57:13

by Alan

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Wed, 2002-08-21 at 15:33, Andrea Arcangeli wrote:
> But silenty breaking apps and not allowing in any way to apps to learn
> if the tsc is returning random or if it's returning something
> significant (I understand that's the way you did it in -ac) is a no-way
> by default IMHO.

All such apps and libraries are already broken have been silently broken
since about 1999 and will continue to be broken. Thats been true since
speedstep cpus appeared if not before.

Alan

2002-08-21 16:07:59

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Wed, Aug 21, 2002 at 04:01:55PM +0100, Alan Cox wrote:
> On Wed, 2002-08-21 at 15:33, Andrea Arcangeli wrote:
> > But silenty breaking apps and not allowing in any way to apps to learn
> > if the tsc is returning random or if it's returning something
> > significant (I understand that's the way you did it in -ac) is a no-way
> > by default IMHO.
>
> All such apps and libraries are already broken have been silently broken
> since about 1999 and will continue to be broken. Thats been true since
> speedstep cpus appeared if not before.

certainly fair enough argument in theory, but in practice you're not
going to risk running those apps in a laptop or in general with any
power management that will decrease the frequency of the cpu anytime.
Also the change in frequency wouldn't generate non monothone results,
still the app may malfunction but going backwards or with an huge error
is more likely to be erronic for the tsc users than just the decrease of
frequency.

Furthmore the speedstep right now today can crash any laptop that boots
at reduced mhz and that switches to higher mhz at runtime, that change
of the tsc frequency simply make udelay run faster, and it'll break
drivers easily. I suspect there's even an unfixable race condition in
the speedstep hardware since it's not the kernel asking for the change
of frequency (at least when not using ACPI), so the change of frequency
when you plugin power may happen right before the start of udelay, we
may have irq disabled and the udelay will take less without a chance to
recalbirate delays.

Returning to our tsc issue, these "broken apps since 1999" may now
run run silenty on these numa machines that obviously cannot provide any
significant info via the tsc to userspace, and there's no way to know
that your app isn't breaking because of numa, unless you disable the tsc
to userspace.

Feel free to argue further but that won't change the fact you will never
know if your apps is getting confused or not if you don't disable the
tsc and you don't let it get the instruction fault. it is just
impossible with -ac (hmm, ok you could objdump all the binaries and libs
and hope there's no self modyfing code :)

And following your argument that these apps have been silenty broken
since 1999, if there's no broken app out there, nobody will ever get the
instruction fault. If there's any app broken out there we probably like
to know.

Infact following your argument of "broken apps since 1999" we should
disable the tsc on every machine out there, not only on the numa,
lefting a backdoor via prctl or sysctl to allow the profiling. And we
should bug at boot if we notice the cpu isn't at max frequency to avoid
the speedstep instability (but that's another issue).

Andrea

2002-08-21 16:20:42

by Alan

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

> certainly fair enough argument in theory, but in practice you're not
> going to risk running those apps in a laptop or in general with any
> power management that will decrease the frequency of the cpu anytime.

Any PIII with speedstep, any Athlon and PIV.

> Furthmore the speedstep right now today can crash any laptop that boots
> at reduced mhz and that switches to higher mhz at runtime, that change

Actually the reduced loops in the kernel seem to work fine

> of the tsc frequency simply make udelay run faster, and it'll break
> drivers easily. I suspect there's even an unfixable race condition in
> the speedstep hardware since it's not the kernel asking for the change

Fixed in the -ac tree for the non APM triggered case because we use
cpufreq code

> significant info via the tsc to userspace, and there's no way to know
> that your app isn't breaking because of numa, unless you disable the tsc
> to userspace.

And you can test that with notsc. Oh and you might also want the code
that makes notsc on a tsc only kernel print a warning btw. badtsc lets
you say "I have a brain cell" notsc lets you select "clueless app
checking mode"


As to the other issue. As soon as we have plug in tsc handling that can
use ACPI, x86-64, summit and other timer sources I'm very keen to put
the tsc based timer in as a fallback before the PIT, and to do chip
sanity checks on it (matching clockmul, not spudstop etc)

2002-08-21 17:12:00

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Wed, Aug 21, 2002 at 05:25:35PM +0100, Alan Cox wrote:
> > certainly fair enough argument in theory, but in practice you're not
> > going to risk running those apps in a laptop or in general with any
> > power management that will decrease the frequency of the cpu anytime.
>
> Any PIII with speedstep, any Athlon and PIV.
>
> > Furthmore the speedstep right now today can crash any laptop that boots
> > at reduced mhz and that switches to higher mhz at runtime, that change
>
> Actually the reduced loops in the kernel seem to work fine
>
> > of the tsc frequency simply make udelay run faster, and it'll break
> > drivers easily. I suspect there's even an unfixable race condition in
> > the speedstep hardware since it's not the kernel asking for the change
>
> Fixed in the -ac tree for the non APM triggered case because we use
> cpufreq code

if the reduced loops is supposed to work fine what was there left to fix?

>
> > significant info via the tsc to userspace, and there's no way to know
> > that your app isn't breaking because of numa, unless you disable the tsc
> > to userspace.
>
> And you can test that with notsc. Oh and you might also want the code
> that makes notsc on a tsc only kernel print a warning btw. badtsc lets
> you say "I have a brain cell" notsc lets you select "clueless app
> checking mode"

what's wrong with a sysctl to specify you have a brain cell? I'm not
advocating not to let you specify you have a brain cell, I'm only don't
see why we should assume it when we very know apps will break silenty,
and that it will be not noticeable except with subtle breakage after
some runtime in -ac.

Andrea

2002-08-21 17:30:31

by Alan

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Wed, 2002-08-21 at 18:17, Andrea Arcangeli wrote:
> > Fixed in the -ac tree for the non APM triggered case because we use
> > cpufreq code
>
> if the reduced loops is supposed to work fine what was there left to fix?

Support for doing it right in case someone finds a marginal component
where it doesnt. Incidentally the APM case is it appears fixable.

> > And you can test that with notsc. Oh and you might also want the code
> > that makes notsc on a tsc only kernel print a warning btw. badtsc lets
> > you say "I have a brain cell" notsc lets you select "clueless app
> > checking mode"
>
> what's wrong with a sysctl to specify you have a brain cell? I'm not
> advocating not to let you specify you have a brain cell, I'm only don't
> see why we should assume it when we very know apps will break silenty,
> and that it will be not noticeable except with subtle breakage after
> some runtime in -ac.

The default behaviours are unchanged. badtsc is simply an extra mode. If
anything the new tree is far better because "notsc" on a 686 kernel
actually warns people it is being ignored. Thats the important bit.
Badtsc is just a handy thing.

2002-08-26 18:08:13

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

Hi!

> > > But silenty breaking apps and not allowing in any way to apps to learn
> > > if the tsc is returning random or if it's returning something
> > > significant (I understand that's the way you did it in -ac) is a no-way
> > > by default IMHO.
> >
> > All such apps and libraries are already broken have been silently broken
> > since about 1999 and will continue to be broken. Thats been true since
> > speedstep cpus appeared if not before.
>
> certainly fair enough argument in theory, but in practice you're not
> going to risk running those apps in a laptop or in general with any

So those apps are broken. They don't work on pretty common hardware.

NUMA-Q is just another kind of hardware where broken applications
break. What's the problem?

> And following your argument that these apps have been silenty broken
> since 1999, if there's no broken app out there, nobody will ever get the
> instruction fault. If there's any app broken out there we probably like

No. rdtsc is still usefull if you are clever and statistically filter
out. Also rdtsc provides you number of cycles, so if you want to know
how many cycles mov %eax,%ebx takes, you can do that even on
speedstep. Anything that correlates rdtsc to real time is broken, however.

Pavel
--
I'm [email protected]. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [email protected]

2002-08-26 18:47:20

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

>> And following your argument that these apps have been silenty broken
>> since 1999, if there's no broken app out there, nobody will ever get the
>> instruction fault. If there's any app broken out there we probably like
>
> No. rdtsc is still usefull if you are clever and statistically filter
> out. Also rdtsc provides you number of cycles, so if you want to know
> how many cycles mov %eax,%ebx takes, you can do that even on
> speedstep. Anything that correlates rdtsc to real time is broken, however.

It's not correlating it to real time that's the problem. It's getting resceduled
inbetween calls that hurts. Take your example.

rdtsc
mov %eax,%ebx
<- get rescheduled here
rdtsc

Broken. May even take negative "time".

M.

2002-08-26 18:56:39

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

Hi!

> >> And following your argument that these apps have been silenty broken
> >> since 1999, if there's no broken app out there, nobody will ever get the
> >> instruction fault. If there's any app broken out there we probably like
> >
> > No. rdtsc is still usefull if you are clever and statistically filter
> > out. Also rdtsc provides you number of cycles, so if you want to know
> > how many cycles mov %eax,%ebx takes, you can do that even on
> > speedstep. Anything that correlates rdtsc to real time is broken, however.
>
> It's not correlating it to real time that's the problem. It's getting resceduled
> inbetween calls that hurts. Take your example.
>
> rdtsc
> mov %eax,%ebx
> <- get rescheduled here
> rdtsc
>
> Broken. May even take negative "time".

Yep, you need to do 10000 tries and then choose the most common one.

But I believe andrea was talking about apps that corelate rdtsc to real time.

2002-08-26 19:00:54

by Alan

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Mon, 2002-08-26 at 19:45, Martin J. Bligh wrote:
> It's not correlating it to real time that's the problem. It's getting resceduled
> inbetween calls that hurts. Take your example.
>
> rdtsc
> mov %eax,%ebx
> <- get rescheduled here
> rdtsc
>
> Broken. May even take negative "time".

Statistically irrelevant. When you have 100,000 samples all the
pre-emption ones drop into the dud sample filter with IRQ disturbance
and so on.


2002-08-26 19:13:54

by Richard B. Johnson

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Mon, 26 Aug 2002, Martin J. Bligh wrote:

> >> And following your argument that these apps have been silenty broken
> >> since 1999, if there's no broken app out there, nobody will ever get the
> >> instruction fault. If there's any app broken out there we probably like
> >
> > No. rdtsc is still usefull if you are clever and statistically filter
> > out. Also rdtsc provides you number of cycles, so if you want to know
> > how many cycles mov %eax,%ebx takes, you can do that even on
> > speedstep. Anything that correlates rdtsc to real time is broken, however.
>
> It's not correlating it to real time that's the problem. It's getting resceduled
> inbetween calls that hurts. Take your example.
>
> rdtsc
> mov %eax,%ebx
> <- get rescheduled here
> rdtsc
>
> Broken. May even take negative "time".
>
> M.

The CPU counters are synchronized on SMP machines. How can you
ever get negative time? Even GHz machines take several months
to wrap the count.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.

2002-08-26 19:38:11

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

Hi!
> > >> And following your argument that these apps have been silenty broken
> > >> since 1999, if there's no broken app out there, nobody will ever get the
> > >> instruction fault. If there's any app broken out there we probably like
> > >
> > > No. rdtsc is still usefull if you are clever and statistically filter
> > > out. Also rdtsc provides you number of cycles, so if you want to know
> > > how many cycles mov %eax,%ebx takes, you can do that even on
> > > speedstep. Anything that correlates rdtsc to real time is broken, however.
> >
> > It's not correlating it to real time that's the problem. It's getting resceduled
> > inbetween calls that hurts. Take your example.
> >
> > rdtsc
> > mov %eax,%ebx
> > <- get rescheduled here
> > rdtsc
> >
> > Broken. May even take negative "time".
> >
> > M.
>
> The CPU counters are synchronized on SMP machines. How can you
> ever get negative time? Even GHz machines take several months
> to wrap the count.

This thread was about numa machines that do not keep tsc synchronized.

2002-08-26 20:18:56

by Alan

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Mon, 2002-08-26 at 20:18, Richard B. Johnson wrote:

> The CPU counters are synchronized on SMP machines. How can you
> ever get negative time? Even GHz machines take several months
> to wrap the count.

On a typical intel chipset PC maybe, on anything fancy - nope. Not even
on a 440BX board with mixed speed processors

2002-08-26 20:20:52

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

>> It's not correlating it to real time that's the problem. It's getting resceduled
>> inbetween calls that hurts. Take your example.
>>
>> rdtsc
>> mov %eax,%ebx
>> <- get rescheduled here
>> rdtsc
>>
>> Broken. May even take negative "time".
>
> Statistically irrelevant. When you have 100,000 samples all the
> pre-emption ones drop into the dud sample filter with IRQ disturbance
> and so on.

OK, so let's take a better example. People are (I think) mainly using rdtsc
to provide a monotonically increasing counter for database "transaction
order" stamping. I think that's really the case we're worried about.

M.

2002-08-26 22:44:20

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] tsc-disable_B9

On Mon, Aug 26, 2002 at 11:45:36AM -0700, Martin J. Bligh wrote:
> >> And following your argument that these apps have been silenty broken
> >> since 1999, if there's no broken app out there, nobody will ever get the
> >> instruction fault. If there's any app broken out there we probably like
> >
> > No. rdtsc is still usefull if you are clever and statistically filter
> > out. Also rdtsc provides you number of cycles, so if you want to know
> > how many cycles mov %eax,%ebx takes, you can do that even on
> > speedstep. Anything that correlates rdtsc to real time is broken, however.
>
> It's not correlating it to real time that's the problem. It's getting resceduled
> inbetween calls that hurts. Take your example.
>
> rdtsc
> mov %eax,%ebx
> <- get rescheduled here
> rdtsc
>
> Broken. May even take negative "time".

you need to save %edx too, then it would be perfectly safe on a
synchronized TSC hardware (as far as the reschedule doesn't take more
than 2^64 ticks).

Andrea