LinuxLists.cc - [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

Aniruddha M Marathe wrote:
>
> Here is a comparison of results of 2.5.54 with mm2 and 2.5.54.

I'm sorry, but all you are doing with these tests is discrediting
lmbench, AIM9, tiobench and unixbench. There really is nothing in
these patches which can account for the changes which you are observing.

Possibly, it is all caused by cache colouring effects - the physical
addresses at which critical kernel and userspace text and data
happen to end up.

I'd suggest that you look for more complex tests. There's a decent
list at http://lbs.sourceforge.net/, but even those are rather microscopic.

If you have time, things like the osdl dbt1 test, http://osdb.sourceforge.net/
and the commercial benchmarks would be more interesting.

Or cook up some of your own: it's not hard. Just think of some time-consuming
operation which we perform on a daily basis and measure it. Script
the startup and shutdown of X11 applications. rsync. sendmail. cvs.

Mixed workloads are interesting and real world: run tiobench or dbench
or qsbench or whatever while trying to do something else, see how long
"something else" takes.

It is these sorts of things which will find areas of weakness which
can be addressed in this phase of kernel development.

The teeny little microbenchmarks are telling us that the rmap overhead
hurts, that the uninlining of copy_*_user may have been a bad idea, that
the addition of AIO has cost a little and that the complexity which
yielded large improvements in readv(), writev() and SMP throughput were
not free. All of this is already known.

2003-01-03 09:43:31

by David Miller

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

On Fri, 2003-01-03 at 01:33, Andrew Morton wrote:
> I'm sorry, but all you are doing with these tests is discrediting
> lmbench, AIM9, tiobench and unixbench.
...
> Possibly, it is all caused by cache colouring effects - the physical
> addresses at which critical kernel and userspace text and data
> happen to end up.
...
> The teeny little microbenchmarks are telling us that the rmap overhead
> hurts, that the uninlining of copy_*_user may have been a bad idea, that
> the addition of AIO has cost a little and that the complexity which
> yielded large improvements in readv(), writev() and SMP throughput were
> not free. All of this is already known.

I think if anything, you are stating the true value of the
microbenchmarks. They are showing us how the kernel is getting
more and more complex, causing basic operations to take longer
and longer. That's bad. :-)

Last time I brought up an issue like this (a "nobody but weirdos use
feature which is costing us cycles everywhere"), it got redone until
it did cost nothing for people who don't use the feature. See the
whole security layer fiasco for example.

I truly wish I could config out AIO for example, the overhead is just
stupid. I know that if some thought is put into it, the cost could
be consumed completely.

People who don't see the true value of researching even minor jitters
in lmbench results (and fixing the causes or backing out the guilty
patch) aren't kernel developers in my opinion. :-)

2003-01-03 10:14:32

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

"David S. Miller" wrote:
>
> On Fri, 2003-01-03 at 01:33, Andrew Morton wrote:
> > I'm sorry, but all you are doing with these tests is discrediting
> > lmbench, AIM9, tiobench and unixbench.
> ...
> > Possibly, it is all caused by cache colouring effects - the physical
> > addresses at which critical kernel and userspace text and data
> > happen to end up.
> ...
> > The teeny little microbenchmarks are telling us that the rmap overhead
> > hurts, that the uninlining of copy_*_user may have been a bad idea, that
> > the addition of AIO has cost a little and that the complexity which
> > yielded large improvements in readv(), writev() and SMP throughput were
> > not free. All of this is already known.
>
> I think if anything, you are stating the true value of the
> microbenchmarks. They are showing us how the kernel is getting
> more and more complex, causing basic operations to take longer
> and longer. That's bad. :-)

Yup. But these things are already known about.

> Last time I brought up an issue like this (a "nobody but weirdos use
> feature which is costing us cycles everywhere"), it got redone until
> it did cost nothing for people who don't use the feature. See the
> whole security layer fiasco for example.

There would be some small benefit in disabling the per-cpu-pages
pools on uniprocessor, and probably the deferred lru-addition queues.

That's fairly simple to do but I didn't do it because it would mean
that SMP and UP are running significantly different codepaths. Benching
this is on my todo list somewhere.

> I truly wish I could config out AIO for example, the overhead is just
> stupid. I know that if some thought is put into it, the cost could
> be consumed completely.

hm. Its cost in filesystem/VFS land is quite small. I assume you're
referring to networking here?

> People who don't see the true value of researching even minor jitters
> in lmbench results (and fixing the causes or backing out the guilty
> patch) aren't kernel developers in my opinion. :-)

But the statistically significant differences _are_ researched, and are
well understood.

We should't lose sight of large optimisations which happen to not be
covered by these tests. eg: SMP scalability.

To cite an extreme case, the readv/writev changes sped up O_SYNC and
O_DIRECT writev() by up to 300x and buffered writev() by 3x. But it cost
us a few percent on write(fd, buf, 1).

quad:/usr/src> grep -r writev lmbench
quad:/usr/src> grep -r writev aim9
quad:/usr/src> grep -r writev tiobench
quad:/usr/src> grep -r writev unixbench-4.1.0-971022
quad:/usr/src>

The big, big one here is the reverse map. I still don't believe that
its benefit has been shown to exceed its speed and space costs.

2003-01-03 18:31:54

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

Andrew Morton <[email protected]> writes:
>
> The teeny little microbenchmarks are telling us that the rmap overhead
> hurts, that the uninlining of copy_*_user may have been a bad idea, that
> the addition of AIO has cost a little and that the complexity which
> yielded large improvements in readv(), writev() and SMP throughput were
> not free. All of this is already known.

If you mean the signal speed regressions they caused - I fixed
that on x86-64 by inlining 1,2,4,8,10(used by signal fpu frame),16.
But it should not use the stupud rep ; ..., of the old ersio but direct
unrolled moves.

x86-64 version in include/asm-x86_64/uaccess.h, could be ported
to i386 given that movqs need to be replaced by two movls.

-Andi

P.S.: regarding recent lmbench slow downs: I'm a bit
worried about the two wrmsrs which are in the i386 context switch
in load_esp0 for sysenter now. Last time I benchmarked WRMSRs on
Athlon they were really slow and knowing the P4 it is probably
even slower there. Imho it would be better to undo that patch
and use Linus' original trampoline stack.

2003-01-03 21:24:08

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

Andi Kleen wrote:
>
> Andrew Morton <[email protected]> writes:
> >
> > The teeny little microbenchmarks are telling us that the rmap overhead
> > hurts, that the uninlining of copy_*_user may have been a bad idea, that
> > the addition of AIO has cost a little and that the complexity which
> > yielded large improvements in readv(), writev() and SMP throughput were
> > not free. All of this is already known.
>
> If you mean the signal speed regressions they caused - I fixed
> that on x86-64 by inlining 1,2,4,8,10(used by signal fpu frame),16.
> But it should not use the stupud rep ; ..., of the old ersio but direct
> unrolled moves.

Yes, that would help a bit. We should do that for ia32. It's a little
worrisome that the return value from such a copy_*_user() implementation
will be incorrect - it is supposed to return the number of uncopied bytes.
Probably doesn't matter.

Most of the optimisation opportunities wrt signal delivery were soaked up
by replacing the copy_*_user() calls with put_user() and friends.

We could speed up signals heaps by re-lazying the fpu state storage in
some manner.

> x86-64 version in include/asm-x86_64/uaccess.h, could be ported
> to i386 given that movqs need to be replaced by two movls.
>
> -Andi
>
> P.S.: regarding recent lmbench slow downs: I'm a bit
> worried about the two wrmsrs which are in the i386 context switch
> in load_esp0 for sysenter now. Last time I benchmarked WRMSRs on
> Athlon they were really slow and knowing the P4 it is probably
> even slower there. Imho it would be better to undo that patch
> and use Linus' original trampoline stack.

hm. How slow? Any numbers on that?

2003-01-05 00:52:43

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

Andi Kleen wrote:
>
> P.S.: regarding recent lmbench slow downs: I'm a bit
> worried about the two wrmsrs which are in the i386 context switch
> in load_esp0 for sysenter now. Last time I benchmarked WRMSRs on
> Athlon they were really slow and knowing the P4 it is probably
> even slower there. Imho it would be better to undo that patch
> and use Linus' original trampoline stack.
>

Looks like you're right. The indications are that this change
has slowed context switches by ~5% on a PIII. The backout patch
against 2.5.54 is below. Testing on a P4 would be useful.

2.5.54, stock:

lmbench:

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
i686-linu Linux 2.5.54 3 16 44 18 47 20 77

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
i686-linu Linux 2.5.54 3 16 26 65 78 231

AIM9:

3 tcp_test 10.00 2767 276.70000 24903.00 TCP/IP Messages/second
4 udp_test 10.00 4658 465.80000 46580.00 UDP/IP DataGrams/second
5 fifo_test 10.01 6507 650.04995 65005.00 FIFO Messages/second
6 dgram_pipe 10.00 11228 1122.80000 112280.00 DataGram Pipe Messages/second
7 pipe_cpy 10.00 15463 1546.30000 154630.00 Pipe Messages/second

pollbench:
pollbench 1 100 5000
result with handles 1 processes 100 loops 5000:time 9.609487 sec.
pollbench 2 100 2000
result with handles 2 processes 100 loops 2000:time 4.016496 sec.
pollbench 5 100 2000
result with handles 5 processes 100 loops 2000:time 4.917921 sec.

2.5.54, with the below backout patch:

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
i686-linu Linux 2.5.54 3 14 47 18 50 20 61

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
i686-linu Linux 2.5.54 3 15 25 64 69 231

3 tcp_test 10.00 2908 290.80000 26172.00 TCP/IP Messages/second
4 udp_test 10.00 4971 497.10000 49710.00 UDP/IP DataGrams/second
5 fifo_test 10.01 6642 663.53646 66353.65 FIFO Messages/second
6 dgram_pipe 10.00 11516 1151.60000 115160.00 DataGram Pipe Messages/second
7 pipe_cpy 10.00 15930 1593.00000 159300.00 Pipe Messages/second

pollbench 1 100 5000
result with handles 1 processes 100 loops 5000:time 9.106732 sec.
pollbench 2 100 2000
result with handles 2 processes 100 loops 2000:time 3.853814 sec.
pollbench 5 100 2000
result with handles 5 processes 100 loops 2000:time 4.533519 sec.

arch/i386/kernel/cpu/common.c | 2 +-
arch/i386/kernel/process.c | 2 +-
arch/i386/kernel/sysenter.c | 34 ++++++++++++++++++++++++++++++----
arch/i386/kernel/vm86.c | 4 +---
include/asm-i386/cpufeature.h | 3 ---
include/asm-i386/msr.h | 4 ----
include/asm-i386/processor.h | 16 ----------------
7 files changed, 33 insertions(+), 32 deletions(-)

--- 25/arch/i386/kernel/cpu/common.c~bad3 Fri Jan 3 16:36:48 2003
+++ 25-akpm/arch/i386/kernel/cpu/common.c Fri Jan 3 16:36:48 2003
@@ -484,7 +484,7 @@ void __init cpu_init (void)
BUG();
enter_lazy_tlb(&init_mm, current, cpu);

- load_esp0(t, thread->esp0);
+ t->esp0 = thread->esp0;
set_tss_desc(cpu,t);
cpu_gdt_table[cpu][GDT_ENTRY_TSS].b &= 0xfffffdff;
load_TR_desc();
--- 25/arch/i386/kernel/process.c~bad3 Fri Jan 3 16:36:48 2003
+++ 25-akpm/arch/i386/kernel/process.c Fri Jan 3 16:36:48 2003
@@ -437,7 +437,7 @@ void __switch_to(struct task_struct *pre
/*
* Reload esp0, LDT and the page table pointer:
*/
- load_esp0(tss, next->esp0);
+ tss->esp0 = next->esp0;

/*
* Load the per-thread Thread-Local Storage descriptor.
--- 25/arch/i386/kernel/sysenter.c~bad3 Fri Jan 3 16:36:48 2003
+++ 25-akpm/arch/i386/kernel/sysenter.c Fri Jan 3 16:36:48 2003
@@ -34,14 +34,40 @@ struct fake_sep_struct {
unsigned char stack[0];
} __attribute__((aligned(8192)));

+static struct fake_sep_struct *alloc_sep_thread(int cpu)
+{
+ struct fake_sep_struct *entry;
+
+ entry = (struct fake_sep_struct *) __get_free_pages(GFP_ATOMIC, 1);
+ if (!entry)
+ return NULL;
+
+ memset(entry, 0, PAGE_SIZE<<1);
+ entry->thread.task = &entry->task;
+ entry->task.thread_info = &entry->thread;
+ entry->thread.preempt_count = 1;
+ entry->thread.cpu = cpu;
+
+ return entry;
+}
+
static void __init enable_sep_cpu(void *info)
{
int cpu = get_cpu();
- struct tss_struct *tss = init_tss + cpu;
+ struct fake_sep_struct *sep = alloc_sep_thread(cpu);
+ unsigned long *esp0_ptr = &(init_tss + cpu)->esp0;
+ unsigned long rel32;
+
+ rel32 = (unsigned long) sysenter_entry - (unsigned long) (sep->trampoline+11);
+
+ *(short *) (sep->trampoline+0) = 0x258b; /* movl xxxxx,%esp */
+ *(long **) (sep->trampoline+2) = esp0_ptr;
+ *(char *) (sep->trampoline+6) = 0xe9; /* jmp rl32 */
+ *(long *) (sep->trampoline+7) = rel32;

- wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
- wrmsr(MSR_IA32_SYSENTER_ESP, tss->esp0, 0);
- wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry, 0);
+ wrmsr(0x174, __KERNEL_CS, 0); /* SYSENTER_CS_MSR */
+ wrmsr(0x175, PAGE_SIZE*2 + (unsigned long) sep, 0); /* SYSENTER_ESP_MSR */
+ wrmsr(0x176, (unsigned long) &sep->trampoline, 0); /* SYSENTER_EIP_MSR */

printk("Enabling SEP on CPU %d\n", cpu);
put_cpu();
--- 25/arch/i386/kernel/vm86.c~bad3 Fri Jan 3 16:36:48 2003
+++ 25-akpm/arch/i386/kernel/vm86.c Fri Jan 3 16:36:48 2003
@@ -113,8 +113,7 @@ struct pt_regs * save_v86_state(struct k
do_exit(SIGSEGV);
}
tss = init_tss + smp_processor_id();
- current->thread.esp0 = current->thread.saved_esp0;
- load_esp0(tss, current->thread.esp0);
+ tss->esp0 = current->thread.esp0 = current->thread.saved_esp0;
current->thread.saved_esp0 = 0;
loadsegment(fs, current->thread.saved_fs);
loadsegment(gs, current->thread.saved_gs);
@@ -290,7 +289,6 @@ static void do_sys_vm86(struct kernel_vm

tss = init_tss + smp_processor_id();
tss->esp0 = tsk->thread.esp0 = (unsigned long) &info->VM86_TSS_ESP0;
- disable_sysenter();

tsk->thread.screen_bitmap = info->screen_bitmap;
if (info->flags & VM86_SCREEN_BITMAP)
--- 25/include/asm-i386/cpufeature.h~bad3 Fri Jan 3 16:36:48 2003
+++ 25-akpm/include/asm-i386/cpufeature.h Fri Jan 3 16:36:48 2003
@@ -7,8 +7,6 @@
#ifndef __ASM_I386_CPUFEATURE_H
#define __ASM_I386_CPUFEATURE_H

-#include <linux/bitops.h>
-
#define NCAPINTS 4 /* Currently we have 4 32-bit words worth of info */

/* Intel-defined CPU features, CPUID level 0x00000001, word 0 */
@@ -77,7 +75,6 @@
#define cpu_has_pge boot_cpu_has(X86_FEATURE_PGE)
#define cpu_has_sse2 boot_cpu_has(X86_FEATURE_XMM2)
#define cpu_has_apic boot_cpu_has(X86_FEATURE_APIC)
-#define cpu_has_sep boot_cpu_has(X86_FEATURE_SEP)
#define cpu_has_mtrr boot_cpu_has(X86_FEATURE_MTRR)
#define cpu_has_mmx boot_cpu_has(X86_FEATURE_MMX)
#define cpu_has_fxsr boot_cpu_has(X86_FEATURE_FXSR)
--- 25/include/asm-i386/msr.h~bad3 Fri Jan 3 16:36:48 2003
+++ 25-akpm/include/asm-i386/msr.h Fri Jan 3 16:36:48 2003
@@ -53,10 +53,6 @@

#define MSR_IA32_BBL_CR_CTL 0x119

-#define MSR_IA32_SYSENTER_CS 0x174
-#define MSR_IA32_SYSENTER_ESP 0x175
-#define MSR_IA32_SYSENTER_EIP 0x176
-
#define MSR_IA32_MCG_CAP 0x179
#define MSR_IA32_MCG_STATUS 0x17a
#define MSR_IA32_MCG_CTL 0x17b
--- 25/include/asm-i386/processor.h~bad3 Fri Jan 3 16:36:48 2003
+++ 25-akpm/include/asm-i386/processor.h Fri Jan 3 16:36:48 2003
@@ -14,7 +14,6 @@
#include <asm/types.h>
#include <asm/sigcontext.h>
#include <asm/cpufeature.h>
-#include <asm/msr.h>
#include <linux/cache.h>
#include <linux/config.h>
#include <linux/threads.h>
@@ -411,21 +410,6 @@ struct thread_struct {
.io_bitmap = { [ 0 ... IO_BITMAP_SIZE ] = ~0 }, \
}

-static inline void load_esp0(struct tss_struct *tss, unsigned long esp0)
-{
- tss->esp0 = esp0;
- if (cpu_has_sep) {
- wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
- wrmsr(MSR_IA32_SYSENTER_ESP, esp0, 0);
- }
-}
-
-static inline void disable_sysenter(void)
-{
- if (cpu_has_sep)
- wrmsr(MSR_IA32_SYSENTER_CS, 0, 0);
-}
-
#define start_thread(regs, new_eip, new_esp) do { \
__asm__("movl %0,%%fs ; movl %0,%%gs": :"r" (0)); \
set_fs(USER_DS); \

_

2003-01-05 03:32:15

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

On Sat, 4 Jan 2003, Andrew Morton wrote:
>
> Looks like you're right. The indications are that this change
> has slowed context switches by ~5% on a PIII. The backout patch
> against 2.5.54 is below. Testing on a P4 would be useful.

Hmm.. The backup patch doesn't handle single-stepping correctly: the
eflags cleanup singlestep patch later in the sysenter sequence _depends_
on the stack (and thus thread) being right on the very first in-kernel
instruction.

That (along with benchmarking of system call numbers - the stack switch at
system call run-time ends up being quite expensive on a P4) was what made
me decide to do the traditional "write MSR in schedule" approach, even
though I agree that it would be much nicer to not have to rewrite that
stupid MSR all the time.

It doesn't show up on lmbench (insufficient precision), but your AIM9
numbers are quite interesting. Are they stable?

Linus

2003-01-05 03:45:49

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

Linus Torvalds wrote:
>
> On Sat, 4 Jan 2003, Andrew Morton wrote:
> >
> > Looks like you're right. The indications are that this change
> > has slowed context switches by ~5% on a PIII. The backout patch
> > against 2.5.54 is below. Testing on a P4 would be useful.
>
> Hmm.. The backup patch doesn't handle single-stepping correctly: the
> eflags cleanup singlestep patch later in the sysenter sequence _depends_
> on the stack (and thus thread) being right on the very first in-kernel
> instruction.

Well that's just a straight `patch -R' of the patch which added the wrmsr's.

> That (along with benchmarking of system call numbers - the stack switch at
> system call run-time ends up being quite expensive on a P4) was what made
> me decide to do the traditional "write MSR in schedule" approach, even
> though I agree that it would be much nicer to not have to rewrite that
> stupid MSR all the time.
>
> It doesn't show up on lmbench (insufficient precision), but your AIM9
> numbers are quite interesting. Are they stable?
>

Seem to be, but more work is needed, including oprofiling. Andi is doing
some P4 testing at present.

2003-01-05 03:48:38

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

On Sat, 4 Jan 2003, Linus Torvalds wrote:
>
> It doesn't show up on lmbench (insufficient precision), but your AIM9
> numbers are quite interesting. Are they stable?

Btw, which checking whether the numbers are stable it is also interesting
to see stability across reboots etc, since for the scheduling latency in
particular it can easily depend on location of the binaries in physical
memory etc, since that matters for cache accesses (I think the L1 D$ on a
PIII is 4-way associative, I'm not sure - it makes it _reasonably_ good at
avoiding cache conflicts, but they can still happen and easily account for
a 5% fluctuation. I don't remember what the L1 I$ situation is).

And with a fairly persistent page cache, whatever cache situation there is
tends to largely stay the same, so just re-running the benchmark may not
change much, at least for the I$ situation.

You can see this effect quite clearly in lmbench: while the 2p/0k context
switch numbers tend to be fairly stable (almost zero likelyhood of any
cache conflicts), the others often fluctuate more even with the same
kernel (ie for me the 2p/16kB numbers fluctuate between 3 and 6 usecs).

D$ conflicts are largely easier to see (because they usually _will_ change
when you re-run the benchmark, so they show up as fluctuations), but the
I$ effects in particular can be quite persistent because (a) the kernel
code will always be at the same place and (b) the user code tends to be
sticky in the same place due to the page cache.

I'm convinced the I$ effects are one major issue why we sometimes see
largish fluctuations on some ubenchmarks between kernels when nothing has
really changed.

Linus

2003-01-05 03:49:40

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

On Sat, 4 Jan 2003, Andrew Morton wrote:
> >
> > Hmm.. The backup patch doesn't handle single-stepping correctly: the
> > eflags cleanup singlestep patch later in the sysenter sequence _depends_
> > on the stack (and thus thread) being right on the very first in-kernel
> > instruction.
>
> Well that's just a straight `patch -R' of the patch which added the wrmsr's.

Yes, but the breakage comes laterr when a subsequent patch in the
2.5.53->54 stuff started depending on the stack location being stable even
on the first instruction.

> > It doesn't show up on lmbench (insufficient precision), but your AIM9
> > numbers are quite interesting. Are they stable?
>
> Seem to be, but more work is needed, including oprofiling. Andi is doing
> some P4 testing at present.

Ok.

Linus

2003-01-05 09:10:23

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

Linus Torvalds wrote:
>
> ...
> It doesn't show up on lmbench (insufficient precision), but your AIM9
> numbers are quite interesting. Are they stable?

OK, a closer look. This is on a dual 1.7G P4, with HT disabled (involuntarily,
grr.) Looks like an 8-10% hit on context-switch intensive stuff.

2.5.54+BK
=========

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
i686-linu Linux 2.5.54 3 4 11 6 48 12 53

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn

tbench 32: (85k switches/sec)

Throughput 114.633 MB/sec (NB=143.291 MB/sec 1146.33 MBit/sec)
Throughput 114.157 MB/sec (NB=142.696 MB/sec 1141.57 MBit/sec)
Throughput 115.095 MB/sec (NB=143.869 MB/sec 1150.95 MBit/sec)

pollbench 1 100 5000 (118k switches/sec)
result with handles 1 processes 100 loops 5000:time 8.371942 sec.
result with handles 1 processes 100 loops 5000:time 8.381814 sec.
result with handles 1 processes 100 loops 5000:time 8.367576 sec.
pollbench 2 100 2000 (105k switches/sec)
result with handles 2 processes 100 loops 2000:time 3.694412 sec.
result with handles 2 processes 100 loops 2000:time 3.672226 sec.
result with handles 2 processes 100 loops 2000:time 3.657455 sec.
pollbench 5 100 2000 (79k switches/sec)
result with handles 5 processes 100 loops 2000:time 4.564727 sec.
result with handles 5 processes 100 loops 2000:time 4.783192 sec.
result with handles 5 processes 100 loops 2000:time 4.561067 sec.

2.5.54+BK+broken-wrmsr-backout-patch:
=====================================

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
i686-linu Linux 2.5.54 3 4 11 6 48 12 53
i686-linu Linux 2.5.54 1 3 8 4 40 10 51

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
i686-linu Linux 2.5.54 3 14 22 26 30 57
i686-linu Linux 2.5.54 1 12 28 22 32 58

tbench 32:

Throughput 121.701 MB/sec (NB=152.126 MB/sec 1217.01 MBit/sec)
Throughput 124.958 MB/sec (NB=156.197 MB/sec 1249.58 MBit/sec)
Throughput 124.086 MB/sec (NB=155.107 MB/sec 1240.86 MBit/sec)

pollbench 1 100 5000
result with handles 1 processes 100 loops 5000:time 7.306432 sec.
result with handles 1 processes 100 loops 5000:time 7.352913 sec.
result with handles 1 processes 100 loops 5000:time 7.337019 sec.
pollbench 2 100 2000
result with handles 2 processes 100 loops 2000:time 3.184550 sec.
result with handles 2 processes 100 loops 2000:time 3.251854 sec.
result with handles 2 processes 100 loops 2000:time 3.209147 sec.
pollbench 5 100 2000
result with handles 5 processes 100 loops 2000:time 4.135773 sec.
result with handles 5 processes 100 loops 2000:time 4.117304 sec.
result with handles 5 processes 100 loops 2000:time 4.119047 sec.

The tbench changes should probably be ignored. After profiling tbench
I can say that this thoughput difference is _not_ due to the task switcher
change (__switch_to is only 1%). I left the numbers here to show what
the effect of simply relinking and rebooting the kernel can be.

BTW, the pollbench numbers are not stunningly better than the 500MHz PIII:
pollbench 1 100 5000
result with handles 1 processes 100 loops 5000:time 9.609487 sec.
pollbench 2 100 2000
result with handles 2 processes 100 loops 2000:time 4.016496 sec.
pollbench 5 100 2000
result with handles 5 processes 100 loops 2000:time 4.917921 sec.

I didn't profile the P4. John has promised P4 oprofile support for
next week, which will be nice.

I did profile Manfred's pollbench on the PIII, uniprocessor build. Note
that there is only a 5% throughput difference on this machine. It's all
in __switch_to(). Here the PIII is doing 70k switches/sec.

2.5.54+BK:

c012abbc 534 2.69888 buffered_rmqueue
c0116714 617 3.11837 __wake_up_common
c010a606 635 3.20934 restore_all
c014b038 745 3.76529 do_poll
c013d4dc 757 3.82594 fget
c014551c 766 3.87142 pipe_write
c010a5c4 1249 6.31254 system_call
c014b0f0 1273 6.43384 sys_poll
c01090a4 1775 8.97099 __switch_to
c0116484 1922 9.71394 schedule

2.5.54+BK+backout-patch:

c012abbc 768 3.1024 buffered_rmqueue
c0116714 790 3.19127 __wake_up_common
c010a5e6 809 3.26803 restore_all
c013d4dc 918 3.70834 fget
c014551c 936 3.78105 pipe_write
c014b038 977 3.94668 do_poll
c01090a4 1070 4.32236 __switch_to
c014b0f0 1606 6.48758 sys_poll
c010a5a4 1678 6.77843 system_call
c0116484 2542 10.2686 schedule

2003-01-05 09:58:29

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

Linus Torvalds <[email protected]> writes:

> On Sat, 4 Jan 2003, Andrew Morton wrote:
> > >
> > > Hmm.. The backup patch doesn't handle single-stepping correctly: the
> > > eflags cleanup singlestep patch later in the sysenter sequence _depends_
> > > on the stack (and thus thread) being right on the very first in-kernel
> > > instruction.
> >
> > Well that's just a straight `patch -R' of the patch which added the wrmsr's.
>
> Yes, but the breakage comes laterr when a subsequent patch in the
> 2.5.53->54 stuff started depending on the stack location being stable even
> on the first instruction.

Regarding the EFLAGS handling: why can't you just do
a pushfl in the vsyscall page before pushing the 6th arg on the stack
and a popfl afterwards.

Then the syscall entry in kernel code could just do

pushfl $fixed_eflags
popfl

The first popl for the 6th arg in the vsyscall page wouldn't be traced
then, but I doubt that is a problem.

Would add a few cycles to the entry path, but then it is better than
having slow context switch.

This would also eliminate the random IOPL problem Luca noticed.
BTW I think I have the same issue on x86-64 with SYSCALL (random IOPL
in kernel), but so far nothing broke, so it is probably not a big
problem.

>
> > > It doesn't show up on lmbench (insufficient precision), but your AIM9
> > > numbers are quite interesting. Are they stable?
> >
> > Seem to be, but more work is needed, including oprofiling. Andi is doing
> > some P4 testing at present.
>
> Ok.

Here are the numbers from a Dual 2.4Ghz Xeon. The first is plain
2.5.54, the second is with the WRMSR-in-switch-to patch backed out.
Also 2.4.18-aa for co

Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw

oenone Linux 2.5.54 2.410 3.5600 6.0300 3.9900 34.8 8.59000 43.7
oenone Linux 2.5.54 1.270 2.3300 4.7700 2.5100 29.5 4.16000 39.2

If that is true the slowdown would be nearly 50% for the 2p case.
That looks a bit much, I wonder if lmbench is very accurate here
(do we have some other context switch benchmark to double check?)
but all numbers show a significant slowdown.

-Andi

2003-01-05 18:49:26

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

On 5 Jan 2003, Andi Kleen wrote:
>
> Regarding the EFLAGS handling: why can't you just do
> a pushfl in the vsyscall page before pushing the 6th arg on the stack
> and a popfl afterwards.

I did that originally, but timings from Jamie convinced me that it's
actually a quite noticeable overhead for the system call path.

You should realize that the 5-9% slowdown in schedule (which I don't like)
comes with a 360% speedup on a P4 in simple system call handling (which I
_do_ like). My P4 does a system call in 428 cycles as opposed to 1568
cycles according to my benchmarks.

And part of the reason for the huge speedup is that the vsyscall/sysenter
path is actually pretty much the fastest possible. Yes, it would have been
faster just from using sysenter/sysexit, but not by 360%. The other
speedups come from not reloading segment registers multiple times
(noticeable on a PIII, not a P4) and from avoiding things liek the flags
pushing.

NOTE! We could trivially speed up the task switching by making
"load_esp0()" a bit smarter. Right now it actually re-writes _both_
SYSENTER_CS and SYSENTER_ESP on a taskswitch, and that's because a process
that was in vm86 mode will have cleared SYSENTER_CS (so that sysenter will
cause a GP fault inside vm86 mode).

Now, that SYSENTER_CS thing is very rare indeed, and by keeping track of
what the previous value was (ie just caching the SYSENTER_CS value in the
thread_struct), we could get rid of it with a conditional jump instead.
Want to try it?

> This would also eliminate the random IOPL problem Luca noticed.

Nope, it wouldn't. A "popfl" in user mode does nothing for iopl. You have
to have the popfl in kernel mode.

Linus

2003-01-05 23:38:03

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

On Sun, Jan 05, 2003 at 07:51:44PM +0100, Linus Torvalds wrote:
>
> On 5 Jan 2003, Andi Kleen wrote:
> >
> > Regarding the EFLAGS handling: why can't you just do
> > a pushfl in the vsyscall page before pushing the 6th arg on the stack
> > and a popfl afterwards.
>
> I did that originally, but timings from Jamie convinced me that it's
> actually a quite noticeable overhead for the system call path.
>
> You should realize that the 5-9% slowdown in schedule (which I don't like)
> comes with a 360% speedup on a P4 in simple system call handling (which I
> _do_ like). My P4 does a system call in 428 cycles as opposed to 1568
> cycles according to my benchmarks.

According to my benchmarks the slowdown on context switch is a lot
more than 5-9% on P4:

Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw

with wrmsr Linux 2.5.54 2.410 3.5600 6.0300 3.9900 34.8 8.59000 43.7
no wrmsr Linux 2.5.54 1.270 2.3300 4.7700 2.5100 29.5 4.16000 39.2

That looks more like between 10%-51%

[Note I don't trust the numbers completely, the slowdown looks a bit too
extreme especially for the 16p case. But it is clear that it is a lot
slower]

I haven't benchmarked pushfl/popfl, but I cannot imagine it being that
slow to offset that. I agree that syscalls are a slightly hotter path than the
context switch, but hurting one for the other that much looks a bit
misbalanced.

>
> And part of the reason for the huge speedup is that the vsyscall/sysenter
> path is actually pretty much the fastest possible. Yes, it would have been

I can think of some things to speed it up more. e.g. replace all the
push / pop in SAVE/RESTORE_ALL with sub $frame,%esp ; movl %reg,offset(%esp)
and movl offset(%esp),%reg ; addl $frame,%esp. This way the CPU has
no dependencies between all the load/store options unlike push/pop.

(that is what all the code optimization guides recommend and gcc / icc
do too for saving/restoring of lots of registers)

Perhaps that would offset a pushfl / popfl in kernel mode, may be worth
a try.

-Andi

P.S.: For me it is actually good if the i386 context switch is slow.
On x86-64 I have some ugly wrmsrs in the context switch for the
64bit segment base rewriting too and slowing down i386 like this will
make the 64bit kernel look better compared to 32bit ;););)

2003-01-06 00:50:37

by H. Peter Anvin

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

Followup to: <[email protected]>
By author: Linus Torvalds <[email protected]>
In newsgroup: linux.dev.kernel
>
> Now, that SYSENTER_CS thing is very rare indeed, and by keeping track of
> what the previous value was (ie just caching the SYSENTER_CS value in the
> thread_struct), we could get rid of it with a conditional jump instead.
> Want to try it?
>

This seems like the first thing to do.

Dealing with the SYSENTER_ESP issue is a lot tricker. It seems that
it can be done with a magic EIP range test in the #DB handler, the
range is the part that finds the top of the real kernel stack and
pushfl's to it; the #DB handler if it receives a trap in this region
could emulate this piece of code (including pushing the pre-exception
flags onto the stack) and then invoke the instruction immediately
after the pushf..

Yes, it's ugly, but it should be relatively straightforward, and since
this particular chunk is assembly code by necessity it shouldn't be
brittle.

The other variant (which I have suggested) is to simply state "TF set
in user space is not honoured." This would require a system call to
set TF -> 1. That way the kernel already has the TF state for all
processes.

Again, it's ugly.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2003-01-06 01:30:50

[permalink] [raw]

Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)

On Mon, 6 Jan 2003, Andi Kleen wrote:
>
> Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
> ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
>
> with wrmsr Linux 2.5.54 2.410 3.5600 6.0300 3.9900 34.8 8.59000 43.7
> no wrmsr Linux 2.5.54 1.270 2.3300 4.7700 2.5100 29.5 4.16000 39.2
>
> That looks more like between 10%-51%

The lmbench numbers for context switch overhead vary _way_ too much to say
anything at all based on two runs. By all logic the worst-affected case
should be the 2p/0K case, since the overhead of the wrmsr should be 100%
constant.

The numbers by Mikael seem to say that the MSR writes are 800 cycles each
(!) on a P4, so avoiding the CS write would make the overhead about half
of what it is now (at the cost of making it conditional).

800 cycles in the context switch path is still nasty, I agree.

> I haven't benchmarked pushfl/popfl, but I cannot imagine it being that
> slow to offset that. I agree that syscalls are a slightly hotter path than the
> context switch, but hurting one for the other that much looks a bit
> misbalanced.

Note that pushf/popf is a totally different thing, and has nothing to do
witht he MSR save.

For pushf/popf, the tradeoff is very clear: you have to either do the
pushf/popf in the system call path, or you have to do it in the context
switch path. They are equally expensive in both, but we do a hell of a lot
more system calls, so it's _obviously_ better to do the pushf/popf in the
context switch.

The WRMSR thing is much less obvious. Unlike the pushf/popf, the code
isn't the same, you have two different cases:

(a) single static wrmsr / CPU

Change ESP at system call entry and extra jump to common code

(b) wrmsr each context switch

System call entry is free.

The thing that makes me like (b) _despite_ the slowdown of context
switching is that the esp reload it made a difference of tens of cycles in
my testing of system calls (which is surprising, I admit - it might be
more than the esp reload, it might be some interaction with jumps right
after a switch to CPL0 causing badness for the pipeline).

Also, (b) allows us to simplify the TF handling (which is _not_ the eflags
issue: the eflags issue is about IOPL) and means that there are no issues
with NMI's and fake temporary stacks.

> I can think of some things to speed it up more. e.g. replace all the
> push / pop in SAVE/RESTORE_ALL with sub $frame,%esp ; movl %reg,offset(%esp)
> and movl offset(%esp),%reg ; addl $frame,%esp. This way the CPU has
> no dependencies between all the load/store options unlike push/pop.

Last I remember, that only made a difference on Athlons, and Intel CPU's
have special-case logic for pushes/pops where they follow the value of the
stack pointer down the chain and break the dependencies (this may also be
why reloading %esp caused such a problem for the pipeline - if I remember
correctly the ESP-following can only handle simple add/sub chains).

> (that is what all the code optimization guides recommend and gcc / icc
> do too for saving/restoring of lots of registers)

It causes bigger code, which is why I don't particularly like it.
Benchmarks might be interesting.

Linus

2003-01-06 01:57:20