Hi Ingo,
We have Alex Nixon doing some profiling of Xen kernels, comparing
current pvops Xen and native with the last "official" Xen kernel
2.6.18.8-xen.
One obvious difference is that the kernbench context switch rate is way
up, from about 30k to 110k. Also, the user time went up from about 375s
to 390s - and that's comparing pvops native to 2.6.18.8-xen (pvops Xen
was more or less identical).
I wonder if the user time increase is related to the context switch
rate - either because the actual context switch time itself is accounted
to the process, or because of secondary things like cache and TLB misses.
Or perhaps the new scheduler accounts for things differently?
Anyway, I'm wondering:
* is the increased context switch rate expected?
* what tunables are there so we can try and make them have
comparable context switch rates?
This is an issue because the Xen/pvops kernel is showing a fairly large
overall performance regression, and the context switches are specifically
slow compared to the old Xen kernel, and the high switch rate is
presumably compounding the problem. It would be nice to have some knobs
to turn to see what the underlying performance characteristics are.
Thanks,
J
On Wed, 2008-07-16 at 11:54 -0700, Jeremy Fitzhardinge wrote:
> Hi Ingo,
>
> We have Alex Nixon doing some profiling of Xen kernels, comparing
> current pvops Xen and native with the last "official" Xen kernel
> 2.6.18.8-xen.
>
> One obvious difference is that the kernbench context switch rate is way
> up, from about 30k to 110k. Also, the user time went up from about 375s
> to 390s - and that's comparing pvops native to 2.6.18.8-xen (pvops Xen
> was more or less identical).
>
> I wonder if the user time increase is related to the context switch
> rate - either because the actual context switch time itself is accounted
> to the process, or because of secondary things like cache and TLB misses.
> Or perhaps the new scheduler accounts for things differently?
>
> Anyway, I'm wondering:
>
> * is the increased context switch rate expected?
> * what tunables are there so we can try and make them have
> comparable context switch rates?
>
> This is an issue because the Xen/pvops kernel is showing a fairly large
> overall performance regression, and the context switches are specifically
> slow compared to the old Xen kernel, and the high switch rate is
> presumably compounding the problem. It would be nice to have some knobs
> to turn to see what the underlying performance characteristics are.
Is this specific to Xen? A native kernel doesn't do more than ~3k
cs/s with make -j3 on my dual core.
Peter Zijlstra wrote:
> Is this specific to Xen? A native kernel doesn't do more than ~3k
> cs/s with make -j3 on my dual core.
>
No, it doesn't seem to be. A CONFIG_PARAVIRT kernel running on bare
hardware shows the same context switch rate. Merely turning
CONFIG_PARAVIRT on should have no effect on context switch rate (though,
Alex, it would be worth double-checking, just to be sure).
J
Yeah I've checked - the number of context switches seems to be around
60k regardless of whether CONFIG_PARAVIRT is switched on, and regardless
of whether it's running in domu or native (-j4 on dual core)
-----Original Message-----
From: Jeremy Fitzhardinge [mailto:[email protected]]
Sent: 17 July 2008 16:03
To: Peter Zijlstra
Cc: Ingo Molnar; Linux Kernel Mailing List; Alex Nixon (Intern); Ian
Campbell
Subject: Re: Large increase in context switch rate
Peter Zijlstra wrote:
> Is this specific to Xen? A native kernel doesn't do more than ~3k
> cs/s with make -j3 on my dual core.
>
No, it doesn't seem to be. A CONFIG_PARAVIRT kernel running on bare
hardware shows the same context switch rate. Merely turning
CONFIG_PARAVIRT on should have no effect on context switch rate (though,
Alex, it would be worth double-checking, just to be sure).
J
On Thu, 2008-07-17 at 16:14 +0100, Alex Nixon (Intern) wrote:
> Yeah I've checked - the number of context switches seems to be around
> 60k regardless of whether CONFIG_PARAVIRT is switched on, and regardless
> of whether it's running in domu or native (-j4 on dual core)
I'm seeing:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 1148916 120340 472040 0 0 0 1912 2225 3662 78 22 0 0 0
4 0 0 1130192 120556 489844 0 0 0 1974 2220 3461 77 22 0 0 0
doing make -j4
This is on x86_64 SMP PREEMPT HZ=1000 !PARAVIRT
(Linus' tree as of somewhere earlier today)
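The cs column there is a per-second rate, taken from the cumulative "ctxt"
counter in /proc/stat; a minimal sketch (illustration only, not vmstat's
actual implementation) of sampling the same counter:

/* Print the system-wide context-switch rate once per second, the same
 * way vmstat derives its "cs" column: by sampling the cumulative
 * "ctxt" line in /proc/stat. */
#include <stdio.h>
#include <unistd.h>

static unsigned long long read_ctxt(void)
{
	FILE *f = fopen("/proc/stat", "r");
	char line[256];
	unsigned long long ctxt = 0;

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "ctxt %llu", &ctxt) == 1)
			break;
	fclose(f);
	return ctxt;
}

int main(void)
{
	unsigned long long before, after;

	for (;;) {
		before = read_ctxt();
		sleep(1);
		after = read_ctxt();
		printf("%llu cs/sec\n", after - before);
	}
	return 0;
}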
Alex Nixon (Intern) wrote:
> Yeah I've checked - the number of context switches seems to be around
> 60k regardless of whether CONFIG_PARAVIRT is switched on, and regardless
> of whether it's running in domu or native (-j4 on dual core)
>
OK, just to be sure we're talking about the same thing, is kernbench
displaying the context switch *rate*, or the total number of context
switches during the build?
Peter is looking at vmstat, which is showing cs/sec.
J
I'm talking about total number of context switches - kernbench gets it
from
time -f "%c" make -j 4
Dividing through by the elapsed time gives me a rate of around 250/sec
(vs Peter's 3000), but I've set CONFIG_HZ=100 (vs Peter's 1000), so they
don't wildly conflict.
Well spotted :-)
- Alex
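P.S. For anyone reproducing the measurement: the counts that time(1) prints
come from the child's rusage, so a minimal C sketch along the following lines
(illustration only, not kernbench's actual harness - the name cswatch is made
up) gives both the total and the rate:

/* cswatch: run a command and report its context switches, the same
 * counters time(1) reports; the totals come from getrusage(RUSAGE_CHILDREN)
 * after the child (and everything it waited for) has finished. */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
	struct timeval t0, t1;
	struct rusage ru;
	double elapsed;
	long total;
	int status;
	pid_t pid;

	if (argc < 2) {
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}

	gettimeofday(&t0, NULL);
	pid = fork();
	if (pid == 0) {
		execvp(argv[1], &argv[1]);	/* e.g. ./cswatch make -j4 */
		_exit(127);
	}
	waitpid(pid, &status, 0);
	gettimeofday(&t1, NULL);

	getrusage(RUSAGE_CHILDREN, &ru);

	elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	total = ru.ru_nvcsw + ru.ru_nivcsw;
	printf("voluntary %ld, involuntary %ld, total %ld (%.0f cs/sec)\n",
	       ru.ru_nvcsw, ru.ru_nivcsw, total, total / elapsed);
	return 0;
}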
-----Original Message-----
From: Jeremy Fitzhardinge [mailto:[email protected]]
Sent: 17 July 2008 16:45
To: Alex Nixon (Intern)
Cc: Peter Zijlstra; Ingo Molnar; Linux Kernel Mailing List; Ian Campbell
Subject: Re: Large increase in context switch rate
Alex Nixon (Intern) wrote:
> Yeah I've checked - the number of context switches seems to be around
> 60k regardless of whether CONFIG_PARAVIRT is switched on, and regardless
> of whether it's running in domu or native (-j4 on dual core)
>
OK, just to be sure we're talking about the same thing, is kernbench
displaying the context switch *rate*, or the total number of context
switches during the build?
Peter is looking at vmstat, which is showing cs/sec.
J
(Don't top-post.)
Alex Nixon (Intern) wrote:
> I'm talking about total number of context switches - kernbench gets it
> from
>
> time -f "%c" make -j 4
>
> Dividing through by the elapsed time gives me a rate of around 250/sec
> (vs Peter's 3000), but I've set CONFIG_HZ=100 (vs Peter's 1000), so they
> don't wildly conflict.
>
> Well spotted :-)
>
OK, but that still doesn't account for the relatively large increase
from 2.6.18 -> 2.6.26. You're using HZ=100 in both cases, I presume.
The other variable is NOHZ and highres timers. You could try turning
those off in 2.6.26. Also, CONFIG_PREEMPT could well make a
difference. 2.6.18-xen doesn't support CONFIG_PREEMPT at all, but
pvops(-xen) does.
J
Jeremy Fitzhardinge <[email protected]> writes:
>
> OK, but that still doesn't account for the relatively large increase
> from 2.6.18 -> 2.6.26. You're using HZ=100 in both cases, I presume.
>
> The other variable is NOHZ and highres timers. You could try turning
> those off in 2.6.26. Also, CONFIG_PREEMPT could well make a
> difference. 2.6.18-xen doesn't support CONFIG_PREEMPT at all, but
> pvops(-xen) does.
If it's that easily reproducible you could just bisect it?
-Andi
> -----Original Message-----
> From: Andi Kleen [mailto:[email protected]]
> Sent: 17 July 2008 22:43
> To: Jeremy Fitzhardinge
> Cc: Alex Nixon (Intern); Peter Zijlstra; Ingo Molnar; Linux Kernel
> Mailing List; Ian Campbell
> Subject: Re: Large increase in context switch rate
>
> Jeremy Fitzhardinge <[email protected]> writes:
> >
> > OK, but that still doesn't account for the relatively large increase
> > from 2.6.18 -> 2.6.26. You're using HZ=100 in both cases, I presume.
> >
> > The other variable is NOHZ and highres timers. You could try turning
> > those off in 2.6.26. Also, CONFIG_PREEMPT could well make a
> > difference. 2.6.18-xen doesn't support CONFIG_PREEMPT at all, but
> > pvops(-xen) does.
>
> If it's that easily reproducible you could just bisect it?
>
> -Andi
I've bisected the majority of the increase down to some time between
2.6.18 and 2.6.19 - kernbench results are:
Kernel version    Elapsed    User    System    Context Switches
V2.6.18           232        427     28        32713
V2.6.19           232        429     28        52812
NOHZ and highres timers are disabled wherever available, HZ=100, and
CONFIG_PREEMPT_VOLUNTARY=y
Will continue further on Monday.
- Alex
> -----Original Message-----
> From: Andi Kleen [mailto:[email protected]]
> Sent: 17 July 2008 22:43
> To: Jeremy Fitzhardinge
> Cc: Alex Nixon (Intern); Peter Zijlstra; Ingo Molnar; Linux
> Kernel Mailing List; Ian Campbell
> Subject: Re: Large increase in context switch rate
>
> Jeremy Fitzhardinge <[email protected]> writes:
> >
> > OK, but that still doesn't account for the relatively large increase
> > from 2.6.18 -> 2.6.26. You're using HZ=100 in both cases,
> I presume.
> >
> > The other variable is NOHZ and highres timers. You could
> try turning
> > those off in 2.6.26. Also, CONFIG_PREEMPT could well make a
> > difference. 2.6.18-xen doesn't support CONFIG_PREEMPT at all, but
> > pvops(-xen) does.
>
> If it's that easily reproducible you could just bisect it?
>
> -Andi
>
I've bisected down to commit ba52de123d454b57369f291348266d86f4b35070 -
[PATCH] inode-diet. Before that kernbench consistently reports about
35k context switches (total), and after that commit about 53k. The
benchmarks are being run on a tmpfs. I've verified the results on a
different machine, albeit with an almost identical setup (the same
kernels and debian distro, kernbench version, and benchmarking a build
of the same source).
Seems to be a mystery why that patch is (seemingly) the culprit...anyone
have any ideas? Maybe there's some other variable I'm not keeping
constant?
- Alex
> Seems to be a mystery why that patch is (seemingly) the culprit...anyone
Yes looks weird.
> have any ideas? Maybe there's some other variable I'm not keeping
> constant?
Did you verify it was really that patch by unapplying/reapplying it?
Sometimes there are mistakes in the bisect process.
-Andi
On Wednesday 23 July 2008 19:34, Alex Nixon (Intern) wrote:
> > -----Original Message-----
> > From: Andi Kleen [mailto:[email protected]]
> > Sent: 17 July 2008 22:43
> > To: Jeremy Fitzhardinge
> > Cc: Alex Nixon (Intern); Peter Zijlstra; Ingo Molnar; Linux
> > Kernel Mailing List; Ian Campbell
> > Subject: Re: Large increase in context switch rate
> >
> > Jeremy Fitzhardinge <[email protected]> writes:
> > > OK, but that still doesn't account for the relatively large increase
> > > from 2.6.18 -> 2.6.26. You're using HZ=100 in both cases, I presume.
> > >
> > > The other variable is NOHZ and highres timers. You could try turning
> > > those off in 2.6.26. Also, CONFIG_PREEMPT could well make a
> > > difference. 2.6.18-xen doesn't support CONFIG_PREEMPT at all, but
> > > pvops(-xen) does.
> >
> > If it's that easily reproducible you could just bisect it?
> >
> > -Andi
>
> I've bisected down to commit ba52de123d454b57369f291348266d86f4b35070 -
> [PATCH] inode-diet. Before that kernbench consistently reports about
> 35k context switches (total), and after that commit about 53k. The
> benchmarks are being run on a tmpfs. I've verified the results on a
> different machine, albeit with an almost identical setup (the same
> kernels and debian distro, kernbench version, and benchmarking a build
> of the same source).
>
> Seems to be a mystery why that patch is (seemingly) the culprit...anyone
> have any ideas? Maybe there's some other variable I'm not keeping
> constant?
Weird. It could possibly be triggering some different userspace behaviour
if blocksize reporting has changed anywhere. strace might help there.
Interesting if you could post the top results of profile=schedule for
a kernel with and without the patch.
On Fri, 2008-07-25 at 12:31 +0100, Alex Nixon wrote:
> >> I've bisected down to commit ba52de123d454b57369f291348266d86f4b35070 -
> >> [PATCH] inode-diet. Before that kernbench consistently reports about
> >> 35k context switches (total), and after that commit about 53k. The
> >> benchmarks are being run on a tmpfs. I've verified the results on a
> >> different machine, albeit with an almost identical setup (the same
> >> kernels and debian distro, kernbench version, and benchmarking a build
> >> of the same source).
> >>
> >> Seems to be a mystery why that patch is (seemingly) the culprit...
>
> The relevant changeset had caused the blocksize to default to 1024 (as opposed
> to 4096) - as a result there was a large increase in the time spent waiting on
> pipes.
>
> Instead of re-adding the line taken out of fs/pipe.c by Theodore, I opted
> to change the default block size for pseudo-filesystems to PAGE_SIZE, to
> avoid making pipe.c inconsistent with Theodore's new approach.
>
> The performance penalty from these extra context switches is fairly small, but
> is magnified when virtualization is involved, hence the desire to keep it lower
> if possible.
>
>
> From 4b568a72fc42b52279507eb4d1339e0637ae719a Mon Sep 17 00:00:00 2001
> From: Alex Nixon <t_alexn@alexn-desktop.(none)>
> Date: Fri, 25 Jul 2008 11:26:44 +0100
> Subject: [PATCH] VFS: increase pseudo-filesystem block size to PAGE_SIZE.
>
> Changeset ba52de123d454b57369f291348266d86f4b35070 caused the block size used
> by pseudo-filesystems to decrease from PAGE_SIZE to 1024 leading to a doubling
> of the number of context switches during a kernbench run.
Cool - makes sense. I'd ack it, but I know less than nothing about this
code, so I won't...
Still, good hunting on your part!
> Signed-off-by: Alex Nixon <[email protected]>
> ---
> fs/libfs.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index baeb71e..1add676 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -216,8 +216,8 @@ int get_sb_pseudo(struct file_system_type *fs_type, char *name,
>
> s->s_flags = MS_NOUSER;
> s->s_maxbytes = ~0ULL;
> - s->s_blocksize = 1024;
> - s->s_blocksize_bits = 10;
> + s->s_blocksize = PAGE_SIZE;
> + s->s_blocksize_bits = PAGE_SHIFT;
> s->s_magic = magic;
> s->s_op = ops ? ops : &simple_super_operations;
> s->s_time_gran = 1;
>> I've bisected down to commit ba52de123d454b57369f291348266d86f4b35070 -
>> [PATCH] inode-diet. Before that kernbench consistently reports about
>> 35k context switches (total), and after that commit about 53k. The
>> benchmarks are being run on a tmpfs. I've verified the results on a
>> different machine, albeit with an almost identical setup (the same
>> kernels and debian distro, kernbench version, and benchmarking a build
>> of the same source).
>>
>> Seems to be a mystery why that patch is (seemingly) the culprit...
The relevant changeset had caused the blocksize to default to 1024 (as opposed
to 4096) - as a result there was a large increase in the time spent waiting on
pipes.
Instead of re-adding the line taken out of fs/pipe.c by Theodore, I opted
to change the default block size for pseudo-filesystems to PAGE_SIZE, to
avoid making pipe.c inconsistent with Theodore's new approach.
The performance penalty from these extra context switches is fairly small, but
is magnified when virtualization is involved, hence the desire to keep it lower
if possible.
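For context on the mechanism (as I understand it): the superblock block size
shows up in st_blksize, which stdio and similar buffering code use to size
their writes, so dropping it from 4096 to 1024 means more, smaller writes
into each pipe and correspondingly more reader/writer wakeups. A minimal
sketch (illustration only, not part of the patch) showing what the kernel
advertises for a pipe:

/* Illustration only: print the block size the kernel reports for a pipe.
 * For pipefs this traces back to the superblock set up in get_sb_pseudo(). */
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
	int fds[2];
	struct stat st;

	if (pipe(fds) < 0 || fstat(fds[0], &st) < 0) {
		perror("pipe/fstat");
		return 1;
	}
	printf("pipe st_blksize = %ld\n", (long)st.st_blksize);
	return 0;
}

On an affected kernel this should print 1024; with the patch below (or on
2.6.18) it should print PAGE_SIZE, i.e. 4096 on x86.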
From 4b568a72fc42b52279507eb4d1339e0637ae719a Mon Sep 17 00:00:00 2001
From: Alex Nixon <t_alexn@alexn-desktop.(none)>
Date: Fri, 25 Jul 2008 11:26:44 +0100
Subject: [PATCH] VFS: increase pseudo-filesystem block size to PAGE_SIZE.
Changeset ba52de123d454b57369f291348266d86f4b35070 caused the block size used
by pseudo-filesystems to decrease from PAGE_SIZE to 1024 leading to a doubling
of the number of context switches during a kernbench run.
Signed-off-by: Alex Nixon <[email protected]>
---
fs/libfs.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/libfs.c b/fs/libfs.c
index baeb71e..1add676 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -216,8 +216,8 @@ int get_sb_pseudo(struct file_system_type *fs_type, char *name,
s->s_flags = MS_NOUSER;
s->s_maxbytes = ~0ULL;
- s->s_blocksize = 1024;
- s->s_blocksize_bits = 10;
+ s->s_blocksize = PAGE_SIZE;
+ s->s_blocksize_bits = PAGE_SHIFT;
s->s_magic = magic;
s->s_op = ops ? ops : &simple_super_operations;
s->s_time_gran = 1;
--
1.5.4.3
Alex Nixon wrote:
> The relevant changeset had caused the blocksize to default to 1024 (as opposed
> to 4096) - as a result there was a large increase in the time spent waiting on
> pipes.
>
Good work!
> Instead of re-adding the line taken out of fs/pipe.c by Theodore, I opted
> to change the default block size for pseudo-filesystems to PAGE_SIZE, to
> avoid making pipe.c inconsistent with Theodore's new approach.
>
> The performance penalty from these extra context switches is fairly small, but
> is magnified when virtualization is involved, hence the desire to keep it lower
> if possible.
>
>
> From 4b568a72fc42b52279507eb4d1339e0637ae719a Mon Sep 17 00:00:00 2001
> From: Alex Nixon <t_alexn@alexn-desktop.(none)>
> Date: Fri, 25 Jul 2008 11:26:44 +0100
> Subject: [PATCH] VFS: increase pseudo-filesystem block size to PAGE_SIZE.
>
> Changeset ba52de123d454b57369f291348266d86f4b35070 caused the block size used
> by pseudo-filesystems to decrease from PAGE_SIZE to 1024 leading to a doubling
> of the number of context switches during a kernbench run.
>
Probably worth explicitly noting the effect on pipe buffer size.
> Signed-off-by: Alex Nixon <[email protected]>
> ---
> fs/libfs.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index baeb71e..1add676 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -216,8 +216,8 @@ int get_sb_pseudo(struct file_system_type *fs_type, char *name,
>
> s->s_flags = MS_NOUSER;
> s->s_maxbytes = ~0ULL;
> - s->s_blocksize = 1024;
> - s->s_blocksize_bits = 10;
> + s->s_blocksize = PAGE_SIZE;
> + s->s_blocksize_bits = PAGE_SHIFT;
> s->s_magic = magic;
> s->s_op = ops ? ops : &simple_super_operations;
> s->s_time_gran = 1;
>
J