Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
MIME-Version: 1.0
References: <cover.1556182964.git.viresh.kumar@linaro.org>
In-Reply-To: <cover.1556182964.git.viresh.kumar@linaro.org>
From:   Song Liu <liu.song.a23@gmail.com>
Date:   Thu, 9 May 2019 17:54:35 -0400
Message-ID: <CAPhsuW6QXn9HfGN31=1w037mLhc7eSZCOSK-Lor7OWRxg8Ct7A@mail.gmail.com>
Subject: Re: [RFC V2 0/2] sched/fair: Fallback to sched-idle CPU for better performance
To:     Viresh Kumar <viresh.kumar@linaro.org>
Cc:     Ingo Molnar <mingo@redhat.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Vincent Guittot <vincent.guittot@linaro.org>, tkjos@google.com,
        Daniel Lezcano <daniel.lezcano@linaro.org>,
        quentin.perret@linaro.org, chris.redpath@arm.com,
        Dietmar.Eggemann@arm.com, open list <linux-kernel@vger.kernel.org>
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

On Thu, Apr 25, 2019 at 5:38 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> Hi,
>
> Here is another attempt to get some benefit out of the sched-idle
> policy. The previous version [1] focused on getting better power numbers
> and this version tries to get better performance or lower response time
> for the tasks.
>
> The first patch is unchanged from v1 and accumulates
> information about sched-idle tasks per CPU.
>
> The second patch changes the way the target CPU is selected in the fast
> path. Currently, we target for an idle CPU in select_idle_sibling() to
> run the next task, but in case we don't find idle CPUs it is better to
> pick a CPU which will run the task the soonest, for performance reason.
> A CPU which isn't idle but has only SCHED_IDLE activity queued on it
> should be a good target based on this criteria as any normal fair task
> will most likely preempt the currently running SCHED_IDLE task
> immediately. In fact, choosing a SCHED_IDLE CPU shall give better
> results as it should be able to run the task sooner than an idle CPU
> (which requires to be woken up from an idle state).
>
> Basic testing is done with the help of rt-app currently to make sure the
> task is getting placed correctly.

I run some test with this set on our (Facebook's) web service (main job)
and ffmpeg (side job). The result looks promising.

For all the tests below, I run the web service with same load level; and
the side job with SCHED_IDLE. I presented schedule latency distribution
of the main job. The latency distribution is measured with the runqlat tool:
     https://github.com/iovisor/bpftrace/blob/master/tools/runqlat.bt

I modified the tool to only track wake up latency of the main job.

4 cases are compared here:

1. w/o this set, w/o side job;
2. w/ this set, w/o side job;
3. w/o this set, w/ side job;
4. w/ this set, w/ side job;


Case #1. w/o this set, w/o side job
@usecs:
[1]                 1705 |                                                    |
[2, 4)           1102086 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)            329160 |@@@@@@@@@@@@@@@                                     |
[8, 16)            34135 |@                                                   |
[16, 32)           37466 |@                                                   |
[32, 64)           15700 |                                                    |
[64, 128)           8759 |                                                    |
[128, 256)          5714 |                                                    |
[256, 512)          3718 |                                                    |
[512, 1K)           2152 |                                                    |
[1K, 2K)             882 |                                                    |
[2K, 4K)             256 |                                                    |
[4K, 8K)              48 |                                                    |
[8K, 16K)              2 |                                                    |

Case #2. w/ this set, w/o side job;
@usecs:
[1]                 2121 |                                                    |
[2, 4)           1251877 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)            401517 |@@@@@@@@@@@@@@@@                                    |
[8, 16)            64325 |@@                                                  |
[16, 32)           74352 |@@@                                                 |
[32, 64)           40583 |@                                                   |
[64, 128)          26129 |@                                                   |
[128, 256)         18612 |                                                    |
[256, 512)         12863 |                                                    |
[512, 1K)           8304 |                                                    |
[1K, 2K)            4072 |                                                    |
[2K, 4K)            1569 |                                                    |
[4K, 8K)             411 |                                                    |
[8K, 16K)             70 |                                                    |
[16K, 32K)             1 |                                                    |

From #1 and #2, we see this set probably adds a little overhead to
scheduling when there is no side job. But the overhead is clearly very
small.


Case #3. w/o this set, w/ side job;
@usecs:
[1]                 1282 |                                                    |
[2, 4)            260977 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[4, 8)            446120 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)           136927 |@@@@@@@@@@@@@@@                                     |
[16, 32)          148758 |@@@@@@@@@@@@@@@@@                                   |
[32, 64)          160291 |@@@@@@@@@@@@@@@@@@                                  |
[64, 128)         177292 |@@@@@@@@@@@@@@@@@@@@                                |
[128, 256)        184573 |@@@@@@@@@@@@@@@@@@@@@                               |
[256, 512)        173405 |@@@@@@@@@@@@@@@@@@@@                                |
[512, 1K)         149662 |@@@@@@@@@@@@@@@@@                                   |
[1K, 2K)          120217 |@@@@@@@@@@@@@@                                      |
[2K, 4K)           80275 |@@@@@@@@@                                           |
[4K, 8K)           36108 |@@@@                                                |
[8K, 16K)          11121 |@                                                   |
[16K, 32K)           736 |                                                    |
[32K, 64K)            19 |                                                    |

Case #4. w/ this set, w/ side job;
@usecs:
[1]                  407 |                                                    |
[2, 4)            535378 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[4, 8)            795526 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)            93032 |@@@@@@                                              |
[16, 32)           89890 |@@@@@                                               |
[32, 64)           82775 |@@@@@                                               |
[64, 128)          84413 |@@@@@                                               |
[128, 256)         84413 |@@@@@                                               |
[256, 512)         77202 |@@@@@                                               |
[512, 1K)          66043 |@@@@                                                |
[1K, 2K)           49276 |@@@                                                 |
[2K, 4K)           30114 |@                                                   |
[4K, 8K)           11145 |                                                    |
[8K, 16K)           2328 |                                                    |
[16K, 32K)            88 |                                                    |

#3 and #4 clearly showed the benefit of this set. With this set, we see
significantly fewer latency values in the 8usecs+ ranges.

Thanks,
Song

>
> --
> viresh
>
> Viresh Kumar (2):
>   sched: Start tracking SCHED_IDLE tasks count in cfs_rq
>   sched/fair: Fallback to sched-idle CPU if idle CPU isn't found
>
>  kernel/sched/fair.c  | 42 +++++++++++++++++++++++++++++++++---------
>  kernel/sched/sched.h |  2 ++
>  2 files changed, 35 insertions(+), 9 deletions(-)
>
> --
> 2.21.0.rc0.269.g1a574e7a288b
>
> [1] https://lore.kernel.org/lkml/cover.1543229820.git.viresh.kumar@linaro.org/