2013-09-10 04:03:04

by Dave Chinner

Subject: [performance regression, bisected] scheduler: should_we_balance() kills filesystem performance

Hi folks,

I just updated my performance test VM to the current 3.12-git
tree after the XFS dev branch was merged. The first test I ran
which was a 16-way concurrent fsmark test to create lots of files
gave me a number about 30% lower than I expected - ~180k files/s
when I was expecting somewhere around 250k files/s.

I did a bisect, and the bisect landed on this commit:

commit 23f0d2093c789e612185180c468fa09063834e87
Author: Joonsoo Kim <[email protected]>
Date: Tue Aug 6 17:36:42 2013 +0900

sched: Factor out code to should_we_balance()

Now checking whether this cpu is appropriate to balance or not
is embedded into update_sg_lb_stats() and this checking has no direct
relationship to this function. There is not enough reason to place
this checking at update_sg_lb_stats(), except saving one iteration
for sched_group_cpus.
....

Now, I couldn't revert that patch by itself, but I reverted the
series of about 10 scheduler patches it belongs to from a current
TOT and the regression went away. Hence I'm pretty confident that
this is the patch causing the issue, as I've verified it in more
than one way and the difference between "good" and "bad" was
significantly greater than the variance of the test (1.5-2 stddev
difference).

In more detail:

                  v4 filesystem   v5 filesystem
3.11+xfsdev:      220k files/s    225k files/s
3.12-git          180k files/s    185k files/s
3.12-git-revert   245k files/s    247k files/s

The test vm is a 16p/16GB RAM VM, with a sparse 100TB filesystem
image sitting on a 4-way RAID0 SSD array formatted with XFS and the
image file is accessed by virtio+direct IO. The fsmark command line
is:

time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 \
-d /mnt/scratch/0 -d /mnt/scratch/1 \
-d /mnt/scratch/2 -d /mnt/scratch/3 \
-d /mnt/scratch/4 -d /mnt/scratch/5 \
-d /mnt/scratch/6 -d /mnt/scratch/7 \
-d /mnt/scratch/8 -d /mnt/scratch/9 \
-d /mnt/scratch/10 -d /mnt/scratch/11 \
-d /mnt/scratch/12 -d /mnt/scratch/13 \
-d /mnt/scratch/14 -d /mnt/scratch/15 \
| tee >(stats --trim-outliers | tail -1 1>&2)

The workload on XFS runs close to being CPU bound - the effect of
the above patch was that a lot of idle time was left in the
system. The workload consumed the same amount of user and system
CPU, just the instantaneous CPU usage was reduced by 20-30% and the
elapsed time was increased by 20-30%.

Cheers,

Dave.
--
Dave Chinner
[email protected]


2013-09-10 04:47:44

by Joonsoo Kim

Subject: Re: [performance regression, bisected] scheduler: should_we_balance() kills filesystem performance

On Tue, Sep 10, 2013 at 02:02:54PM +1000, Dave Chinner wrote:
> Hi folks,
>
> I just updated my performance test VM to the current 3.12-git
> tree after the XFS dev branch was merged. The first test I ran
> which was a 16-way concurrent fsmark test to create lots of files
> gave me a number about 30% lower than I expected - ~180k files/s
> when I was expecting somewhere around 250k files/s.
>
> I did a bisect, and the bisect landed on this commit:
>
> commit 23f0d2093c789e612185180c468fa09063834e87
> Author: Joonsoo Kim <[email protected]>
> Date: Tue Aug 6 17:36:42 2013 +0900
>
> sched: Factor out code to should_we_balance()
>
> Now checking whether this cpu is appropriate to balance or not
> is embedded into update_sg_lb_stats() and this checking has no direct
> relationship to this function. There is not enough reason to place
> this checking at update_sg_lb_stats(), except saving one iteration
> for sched_group_cpus.
> ....
>
> Now, I couldn't revert that patch by itself, but I reverted the
> series of about 10 scheduler patches it belongs to from a current
> TOT and the regression went away. Hence I'm pretty confident that
> this is the patch causing the issue, as I've verified it in more
> than one way and the difference between "good" and "bad" was
> significantly greater than the variance of the test (1.5-2 stddev
> difference).
>
> In more detail:
>
>                   v4 filesystem   v5 filesystem
> 3.11+xfsdev:      220k files/s    225k files/s
> 3.12-git          180k files/s    185k files/s
> 3.12-git-revert   245k files/s    247k files/s
>
> The test vm is a 16p/16GB RAM VM, with a sparse 100TB filesystem
> image sitting on a 4-way RAID0 SSD array formatted with XFS and the
> image file is accessed by virtio+direct IO. The fsmark command line
> is:
>
> time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 \
> -d /mnt/scratch/0 -d /mnt/scratch/1 \
> -d /mnt/scratch/2 -d /mnt/scratch/3 \
> -d /mnt/scratch/4 -d /mnt/scratch/5 \
> -d /mnt/scratch/6 -d /mnt/scratch/7 \
> -d /mnt/scratch/8 -d /mnt/scratch/9 \
> -d /mnt/scratch/10 -d /mnt/scratch/11 \
> -d /mnt/scratch/12 -d /mnt/scratch/13 \
> -d /mnt/scratch/14 -d /mnt/scratch/15 \
> | tee >(stats --trim-outliers | tail -1 1>&2)
>
> The workload on XFS runs close to being CPU bound - the effect of
> the above patch was that a lot of idle time was left in the
> system. The workload consumed the same amount of user and system
> CPU, just the instantaneous CPU usage was reduced by 20-30% and the
> elapsed time was increased by 20-30%.

Hello, Dave.

I looked at this patch again and found one mistake.
If we find that we are the appropriate cpu for balancing,
should_we_balance() should return 1, but the current code doesn't do
so. This corresponds with your observation that a lot of idle time is
left.
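
To make the effect of the inverted test easier to see, here is a
minimal, self-contained userspace sketch of the decision - just an
illustration I'm adding for clarity, not kernel code, and the cpu
count and idle pattern are made up. As the comment in the diff below
says, the first idle cpu of the group (or the group's first cpu, if
none are idle) is the one that should do the balancing:

#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS 16

/* First idle cpu of the group, or the group's first cpu if none are
 * idle, is the one that should do the load balancing. */
static int pick_balance_cpu(const bool *idle)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (idle[cpu])
			return cpu;
	return 0;
}

int main(void)
{
	bool idle[NR_CPUS] = { [3] = true, [7] = true };
	int balance_cpu = pick_balance_cpu(idle);
	int this_cpu;

	for (this_cpu = 0; this_cpu < NR_CPUS; this_cpu++) {
		int buggy = (balance_cpu != this_cpu);	/* current code */
		int fixed = (balance_cpu == this_cpu);	/* patched code */

		printf("cpu %2d: should_we_balance buggy=%d fixed=%d\n",
		       this_cpu, buggy, fixed);
	}
	return 0;
}

With the '!=' test the one cpu that is supposed to balance (cpu 3
here) is the only one that declines, while every cpu that should skip
it proceeds - exactly backwards, which fits the idle time you are
seeing. With '==' only the designated cpu proceeds.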

Could you re-test your benchmark with the patch below?

Thanks.

------------------->8-------------------------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f0a5e6..9b3fe1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5151,7 +5151,7 @@ static int should_we_balance(struct lb_env *env)
* First idle cpu or the first cpu(busiest) in this sched group
* is eligible for doing load balancing at this and above domains.
*/
- return balance_cpu != env->dst_cpu;
+ return balance_cpu == env->dst_cpu;
}

/*

2013-09-10 06:15:28

by Dave Chinner

Subject: Re: [performance regression, bisected] scheduler: should_we_balance() kills filesystem performance

On Tue, Sep 10, 2013 at 01:47:59PM +0900, Joonsoo Kim wrote:
> On Tue, Sep 10, 2013 at 02:02:54PM +1000, Dave Chinner wrote:
> > Hi folks,
> >
> > I just updated my performance test VM to the current 3.12-git
> > tree after the XFS dev branch was merged. The first test I ran
> > which was a 16-way concurrent fsmark test to create lots of files
> > gave me a number about 30% lower than I expected - ~180k files/s
> > when I was expecting somewhere around 250k files/s.
> >
> > I did a bisect, and the bisect landed on this commit:
> >
> > commit 23f0d2093c789e612185180c468fa09063834e87
> > Author: Joonsoo Kim <[email protected]>
> > Date: Tue Aug 6 17:36:42 2013 +0900
> >
> > sched: Factor out code to should_we_balance()
.....
> >
> >                   v4 filesystem   v5 filesystem
> > 3.11+xfsdev:      220k files/s    225k files/s
> > 3.12-git          180k files/s    185k files/s
> > 3.12-git-revert   245k files/s    247k files/s
> >
> > The test vm is a 16p/16GB RAM VM, with a sparse 100TB filesystem
> > image sitting on a 4-way RAID0 SSD array formatted with XFS and the
> > image file is accessed by virtio+direct IO. The fsmark command line
> > is:
> >
> > time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 \
> > -d /mnt/scratch/0 -d /mnt/scratch/1 \
> > -d /mnt/scratch/2 -d /mnt/scratch/3 \
> > -d /mnt/scratch/4 -d /mnt/scratch/5 \
> > -d /mnt/scratch/6 -d /mnt/scratch/7 \
> > -d /mnt/scratch/8 -d /mnt/scratch/9 \
> > -d /mnt/scratch/10 -d /mnt/scratch/11 \
> > -d /mnt/scratch/12 -d /mnt/scratch/13 \
> > -d /mnt/scratch/14 -d /mnt/scratch/15 \
> > | tee >(stats --trim-outliers | tail -1 1>&2)
> >
> > The workload on XFS runs close to being CPU bound - the effect of
> > the above patch was that a lot of idle time was left in the
> > system. The workload consumed the same amount of user and system
> > CPU, just the instantaneous CPU usage was reduced by 20-30% and the
> > elapsed time was increased by 20-30%.
>
> Hello, Dave.
>
> I looked at this patch again and found one mistake.
> If we find that we are the appropriate cpu for balancing,
> should_we_balance() should return 1, but the current code doesn't do
> so. This corresponds with your observation that a lot of idle time is
> left.
>
> Could you re-test your benchmark with the patch below?

Sure. It looks like your patch fixes the problem:

                  v4 filesystem   v5 filesystem
3.11+xfsdev:      220k files/s    225k files/s
3.12-git          180k files/s    185k files/s
3.12-git-revert   245k files/s    247k files/s
3.12-git-fix      249k files/s    248k files/s

Thanks for the quick turnaround :)

Tested-by: Dave Chinner <[email protected]>

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-09-10 06:54:34

by Joonsoo Kim

Subject: Re: [performance regression, bisected] scheduler: should_we_balance() kills filesystem performance

On Tue, Sep 10, 2013 at 04:15:20PM +1000, Dave Chinner wrote:
> On Tue, Sep 10, 2013 at 01:47:59PM +0900, Joonsoo Kim wrote:
> > On Tue, Sep 10, 2013 at 02:02:54PM +1000, Dave Chinner wrote:
> > > Hi folks,
> > >
> > > I just updated my performance test VM to the current 3.12-git
> > > tree after the XFS dev branch was merged. The first test I ran
> > > which was a 16-way concurrent fsmark test to create lots of files
> > > gave me a number about 30% lower than I expected - ~180k files/s
> > > when I was expecting somewhere around 250k files/s.
> > >
> > > I did a bisect, and the bisect landed on this commit:
> > >
> > > commit 23f0d2093c789e612185180c468fa09063834e87
> > > Author: Joonsoo Kim <[email protected]>
> > > Date: Tue Aug 6 17:36:42 2013 +0900
> > >
> > > sched: Factor out code to should_we_balance()
> .....
> > >
> > >                   v4 filesystem   v5 filesystem
> > > 3.11+xfsdev:      220k files/s    225k files/s
> > > 3.12-git          180k files/s    185k files/s
> > > 3.12-git-revert   245k files/s    247k files/s
> > >
> > > The test vm is a 16p/16GB RAM VM, with a sparse 100TB filesystem
> > > image sitting on a 4-way RAID0 SSD array formatted with XFS and the
> > > image file is accessed by virtio+direct IO. The fsmark command line
> > > is:
> > >
> > > time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 \
> > > -d /mnt/scratch/0 -d /mnt/scratch/1 \
> > > -d /mnt/scratch/2 -d /mnt/scratch/3 \
> > > -d /mnt/scratch/4 -d /mnt/scratch/5 \
> > > -d /mnt/scratch/6 -d /mnt/scratch/7 \
> > > -d /mnt/scratch/8 -d /mnt/scratch/9 \
> > > -d /mnt/scratch/10 -d /mnt/scratch/11 \
> > > -d /mnt/scratch/12 -d /mnt/scratch/13 \
> > > -d /mnt/scratch/14 -d /mnt/scratch/15 \
> > > | tee >(stats --trim-outliers | tail -1 1>&2)
> > >
> > > The workload on XFS runs close to being CPU bound - the effect of
> > > the above patch was that a lot of idle time was left in the
> > > system. The workload consumed the same amount of user and system
> > > CPU, just the instantaneous CPU usage was reduced by 20-30% and the
> > > elapsed time was increased by 20-30%.
> >
> > Hello, Dave.
> >
> > I looked at this patch again and found one mistake.
> > If we find that we are the appropriate cpu for balancing,
> > should_we_balance() should return 1, but the current code doesn't do
> > so. This corresponds with your observation that a lot of idle time is
> > left.
> >
> > Could you re-test your benchmark with the patch below?
>
> Sure. It looks like your patch fixes the problem:
>
>                   v4 filesystem   v5 filesystem
> 3.11+xfsdev:      220k files/s    225k files/s
> 3.12-git          180k files/s    185k files/s
> 3.12-git-revert   245k files/s    247k files/s
> 3.12-git-fix      249k files/s    248k files/s
>
> Thanks for the quick turnaround :)
>
> Tested-by: Dave Chinner <[email protected]>
>

Thanks for the quick turnaround, too. :)

Hello, Ingo.

I've attached the formatted patch with proper SOBs and a commit
message. Please merge it to fix the above problem.

Thanks.

--------------------->8-------------------------
From cf8ca492c2206e72d91ca0a1f6c59c5436132c61 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <[email protected]>
Date: Tue, 10 Sep 2013 15:28:10 +0900
Subject: [PATCH] sched: return 1 if this cpu is proper for balancing in
should_we_balance()

Commit 23f0d20 ('sched: Factor out code to should_we_balance()')
introduced the should_we_balance() function. This function should
return 1 if this cpu is appropriate for balancing, but the current
code doesn't do so: when this cpu is the appropriate one, it returns
0 instead of 1.

This introduces a performance regression, which Dave Chinner reported:

                  v4 filesystem   v5 filesystem
3.11+xfsdev:      220k files/s    225k files/s
3.12-git          180k files/s    185k files/s
3.12-git-revert   245k files/s    247k files/s

You can find more detailed information at the link below.
https://lkml.org/lkml/2013/9/10/1

This patch corrects the return value of the should_we_balance()
function as originally intended.

With this patch, Dave Chinner reports that the regression is gone:

                  v4 filesystem   v5 filesystem
3.11+xfsdev:      220k files/s    225k files/s
3.12-git          180k files/s    185k files/s
3.12-git-revert   245k files/s    247k files/s
3.12-git-fix      249k files/s    248k files/s

Reported-by: Dave Chinner <[email protected]>
Tested-by: Dave Chinner <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f0a5e6..9b3fe1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5151,7 +5151,7 @@ static int should_we_balance(struct lb_env *env)
* First idle cpu or the first cpu(busiest) in this sched group
* is eligible for doing load balancing at this and above domains.
*/
- return balance_cpu != env->dst_cpu;
+ return balance_cpu == env->dst_cpu;
}

/*
--
1.7.9.5

Subject: [tip:sched/urgent] sched: Fix load balancing performance regression in should_we_balance()

Commit-ID: b0cff9d88ce2f3030f73138078c5b1019f17e1cc
Gitweb: http://git.kernel.org/tip/b0cff9d88ce2f3030f73138078c5b1019f17e1cc
Author: Joonsoo Kim <[email protected]>
AuthorDate: Tue, 10 Sep 2013 15:54:49 +0900
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 10 Sep 2013 09:20:42 +0200

sched: Fix load balancing performance regression in should_we_balance()

Commit 23f0d20 ("sched: Factor out code to should_we_balance()")
introduces the should_we_balance() function. This function should
return 1 if this cpu is appropriate for balancing. But the newly
introduced code doesn't do so; it returns 0 instead of 1.

This introduces a performance regression, reported by Dave Chinner:

                  v4 filesystem   v5 filesystem
3.11+xfsdev:      220k files/s    225k files/s
3.12-git          180k files/s    185k files/s
3.12-git-revert   245k files/s    247k files/s

You can find more detailed information at:

https://lkml.org/lkml/2013/9/10/1

This patch corrects the return value of the should_we_balance()
function as originally intended.

With this patch, Dave Chinner reports that the regression is gone:

                  v4 filesystem   v5 filesystem
3.11+xfsdev:      220k files/s    225k files/s
3.12-git          180k files/s    185k files/s
3.12-git-revert   245k files/s    247k files/s
3.12-git-fix      249k files/s    248k files/s

Reported-by: Dave Chinner <[email protected]>
Tested-by: Dave Chinner <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dave Chinner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f0a5e6..9b3fe1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5151,7 +5151,7 @@ static int should_we_balance(struct lb_env *env)
* First idle cpu or the first cpu(busiest) in this sched group
* is eligible for doing load balancing at this and above domains.
*/
- return balance_cpu != env->dst_cpu;
+ return balance_cpu == env->dst_cpu;
}

/*

2013-09-10 08:06:46

by Peter Zijlstra

Subject: Re: [performance regression, bisected] scheduler: should_we_balance() kills filesystem performance

On Tue, Sep 10, 2013 at 01:47:59PM +0900, Joonsoo Kim wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7f0a5e6..9b3fe1c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5151,7 +5151,7 @@ static int should_we_balance(struct lb_env *env)
> * First idle cpu or the first cpu(busiest) in this sched group
> * is eligible for doing load balancing at this and above domains.
> */
> - return balance_cpu != env->dst_cpu;
> + return balance_cpu == env->dst_cpu;
> }

Duh.. /me kicks himself for not seeing that.

Thanks!