Date: Tue, 6 Apr 2004 21:25:49 +0200
From: Ingo Molnar <mingo@elte.hu>
To: Andrea Arcangeli <andrea@suse.de>
Cc: Eric Whiting <ewhiting@amis.com>, akpm@osdl.org,
       linux-kernel@vger.kernel.org
Subject: Re: -mmX 4G patches feedback [numbers: how much performance impact]
Message-ID: <20040406192549.GA14869@elte.hu>
References: <40718B2A.967D9467@amis.com> <20040405174616.GH2234@dualathlon.random> <4071D11B.1FEFD20A@amis.com> <20040405221641.GN2234@dualathlon.random> <20040406115539.GA31465@elte.hu> <20040406155925.GW2234@dualathlon.random>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20040406155925.GW2234@dualathlon.random>
User-Agent: Mutt/1.4.1i
Sender: linux-kernel-owner@vger.kernel.org


* Andrea Arcangeli <andrea@suse.de> wrote:

> I will use the HINT to measure the slowdown on HZ=1000. It's an
> optimal benchmark simulating userspace load at various cache sizes and
> it's somewhat realistic.

here are the INT results from the HINT benchmark (best of 3 runs):

 1000Hz, 3:1, PAE:    25513978.295333 net QUIPs
 1000Hz, 4:4, PAE:    25515998.582834 net QUIPs

i.e. the two kernels are equal in performance. (the noise of the
benchmark was ~0.5%, so this 0.01% win of 4:4 is a draw.) This is not
unexpected: the benchmark is too noisy to notice the 0.22% maximum
possible 4:4 hit.
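
a quick way to re-check that claim from the two numbers above, as a minimal sketch (the variable names are made up for illustration; only the two QUIPs values and the ~0.5% noise figure come from the measurement):

```python
# HINT INT results quoted above (net QUIPs, best of 3 runs)
quips_3_1 = 25513978.295333   # 1000Hz, 3:1, PAE
quips_4_4 = 25515998.582834   # 1000Hz, 4:4, PAE

# relative difference of 4:4 versus 3:1, in percent
delta_pct = (quips_4_4 - quips_3_1) / quips_3_1 * 100.0

# run-to-run noise of the benchmark as quoted above (~0.5%)
noise_pct = 0.5

# the difference only means something if it exceeds the noise
within_noise = abs(delta_pct) < noise_pct

print(f"4:4 vs 3:1: {delta_pct:+.3f}%  (noise ~{noise_pct}%)  draw: {within_noise}")
```

the difference comes out well below one hundredth of a percent, i.e. far inside the noise band.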

> Also note that the slowdown for app calling heavily syscalls is 30%
> not 5-10%, [...]

you are right that it's not 5-10%, it's more like 5-15%. It's not 30%,
except in the mentioned case of heavily threaded MySQL benchmark, and in
microbenchmarks. (the microbenchmark case is understandable, 4:4 adds +3
usecs on PAE and +1 usec on non-PAE.)

i've just re-measured a couple of workloads that are very kernel and
syscall intensive, to get a feel for the worst-case:

 apache tested via 'ab':      5% slowdown
 dbench:                     10% slowdown
 tbench:                     16% slowdown

these would be the ones where i'd expect to see the biggest slowdown:
they are dominated by kernel overhead and do a lot of small syscalls.
(all these tests fully saturated the CPU.)

you should also consider that while 4:4 does introduce extra TLB
flushes, it also removes the TLB flush at context-switch. So for
context-switch intensive workloads the 4:4 overhead will be smaller. (in
some rare and atypical cases it might even be a speedup - e.g. NFS
servers driven by knfsd.) This is why e.g. lat_ctx is 4.15 with 3:1, and
it's 4.85 with 4:4, a ~17% slowdown only - in contrast to lat_syscall
null, which is 0.7 usecs in the 3:1 case vs. 3.9 usecs in the 4:4 case.
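
to make the contrast between the two lmbench numbers explicit, here is a small sketch of the arithmetic (the helper name slowdown_pct is hypothetical; the four latencies are the ones quoted above, and the lat_ctx values are used in whatever unit lat_ctx reported them):

```python
def slowdown_pct(base, new):
    """Relative increase of `new` over `base`, in percent."""
    return (new - base) / base * 100.0

# lat_ctx results quoted above (context-switch latency)
lat_ctx_3_1, lat_ctx_4_4 = 4.15, 4.85

# lat_syscall null results quoted above, in usecs
lat_sys_3_1, lat_sys_4_4 = 0.7, 3.9

ctx_pct = slowdown_pct(lat_ctx_3_1, lat_ctx_4_4)
syscall_pct = slowdown_pct(lat_sys_3_1, lat_sys_4_4)

# absolute per-syscall cost added by 4:4 on this (PAE) setup
added_usecs = lat_sys_4_4 - lat_sys_3_1

print(f"lat_ctx slowdown:          {ctx_pct:.1f}%")
print(f"lat_syscall null slowdown: {syscall_pct:.0f}%")
print(f"added per null syscall:    {added_usecs:.1f} usecs")
```

the context-switch case works out to roughly 17%, while the null syscall becomes several times slower; the absolute difference of about 3 usecs per syscall matches the PAE figure mentioned earlier.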

But judging by your present attitude i'm sure you'll be able to find
worse performing testcases and will use them as the typical slowdown
number to quote from that point on ;) Good luck in your search.

here's the 4:4 overhead for some other workloads:

 kernel compilation (30% kernel overhead):      2% slowdown
 pure userspace code:                           0% slowdown
 
anyway, i can only repeat what i said last year in the announcement
email of the 4:4 feature:

   the typical cost of 4G/4G on typical x86 servers is +3 usecs of
   syscall latency (this is in addition to the ~1 usec null syscall
   latency). Depending on the workload this can cause a typical
   measurable wall-clock overhead from 0% to 30%, for typical
   application workloads (DB workload, networking workload, etc.).
   Isolated microbenchmarks can show a bigger slowdown as well - due to
   the syscall latency increase.

so it's not like there's a cat in the bag.
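
as a rough back-of-the-envelope model of that statement (a sketch only: it assumes every syscall costs a fixed +3 usecs and that the CPU is saturated, it ignores interrupts and any TLB refill effects, and the syscall rates in the loop are made-up illustration values, not measurements):

```python
# extra syscall latency from the quoted announcement (typical x86 server)
EXTRA_USECS_PER_SYSCALL = 3.0

def estimated_overhead_pct(syscalls_per_sec, extra_usecs=EXTRA_USECS_PER_SYSCALL):
    """Extra CPU time spent per second of baseline runtime, in percent.

    Simplifying assumption: a fixed extra cost per syscall, nothing else.
    """
    return syscalls_per_sec * extra_usecs * 1e-6 * 100.0

# hypothetical syscall rates, from pure userspace code to syscall-heavy load
for rate in (0, 10_000, 50_000, 100_000):
    print(f"{rate:>7} syscalls/sec -> ~{estimated_overhead_pct(rate):.0f}% overhead")
```

under these assumptions a workload that does almost no syscalls sees ~0%, and it takes on the order of 100,000 syscalls per second to reach the 30% end of the range.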
 
the cost of 4:4, just like the cost of any other kernel feature that
impacts performance (like e.g. PAE, highmem or swapping) should be
considered in light of the actual workload. 4:4 is definitely not an
'always good' feature - i never claimed it was. It is an enabler feature
for very large RAM systems, and it gives 3.98 GB of VM to userspace. It
is a slowdown for anything that doesn't need these features.

But for pure userspace code (which started this discussion), where
userspace overhead dominates by far, the cost is negligible even with
1000Hz.

	Ingo