MIME-Version: 1.0
In-Reply-To: <1299022433-17233-1-git-send-email-venki@google.com>
References: <1299022433-17233-1-git-send-email-venki@google.com>
From: Paul Turner <pjt@google.com>
Date: Tue, 1 Mar 2011 21:43:48 -0800
Message-ID: <AANLkTinyAyXCH6sh9pOe36dCeTO_C8wbtmZriKanGtt3@mail.gmail.com>
Subject: Re: [PATCH] sched: next buddy hint on sleep and preempt path
To: Venkatesh Pallipadi <venki@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@elte.hu>,
        linux-kernel@vger.kernel.org, Mike Galbraith <efault@gmx.de>,
        Rik van Riel <riel@redhat.com>

On Tue, Mar 1, 2011 at 3:33 PM, Venkatesh Pallipadi <venki@google.com> wrote:
> When a task in a taskgroup sleeps, pick_next_task starts all the way back at
> the root and picks the task/taskgroup with the min vruntime across all
> runnable tasks. But when there are many frequently sleeping tasks
> across different taskgroups, it makes better sense to stay with the same
> taskgroup for its slice period (or until all tasks in the taskgroup sleep)
> instead of switching across taskgroups on each sleep after a short runtime.
> This helps specifically where a taskgroup corresponds to a process with
> multiple threads. The change reduces the number of CR3 switches in this case.
>
> Example:
> Two taskgroups with 2 threads each which are running for 2ms and
> sleeping for 1ms. Looking at sched:sched_switch shows -
>
> BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
>      cpu-soaker-5004  [003]  3683.391089
>      cpu-soaker-5016  [003]  3683.393106
>      cpu-soaker-5005  [003]  3683.395119
>      cpu-soaker-5017  [003]  3683.397130
>      cpu-soaker-5004  [003]  3683.399143
>      cpu-soaker-5016  [003]  3683.401155
>      cpu-soaker-5005  [003]  3683.403168
>      cpu-soaker-5017  [003]  3683.405170
>
> AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
>      cpu-soaker-21890 [003]   865.895494
>      cpu-soaker-21935 [003]   865.897506
>      cpu-soaker-21934 [003]   865.899520
>      cpu-soaker-21935 [003]   865.901532
>      cpu-soaker-21934 [003]   865.903543
>      cpu-soaker-21935 [003]   865.905546
>      cpu-soaker-21891 [003]   865.907548
>      cpu-soaker-21890 [003]   865.909560
>      cpu-soaker-21891 [003]   865.911571
>      cpu-soaker-21890 [003]   865.913582
>      cpu-soaker-21891 [003]   865.915594
>      cpu-soaker-21934 [003]   865.917606
>
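To put a number on the two traces above, here is a minimal sketch that counts how often consecutive entries belong to different taskgroups. The PIDs and group membership are copied from the quoted traces; the helper names (in_group, count_group_switches) are made up for this illustration and are not kernel functions. A cross-taskgroup switch only implies a CR3 switch under the patch description's assumption that each taskgroup maps to one multi-threaded process.

```c
#include <stddef.h>

/* Return 1 if pid appears in group[0..n), else 0. */
static int in_group(int pid, const int *group, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (group[i] == pid)
            return 1;
    return 0;
}

/*
 * Count adjacent trace entries that fall in different taskgroups.
 * Only membership in group1 is tested; with two taskgroups, every
 * other PID is treated as belonging to the second group.
 */
static int count_group_switches(const int *trace, size_t len,
                                const int *group1, size_t n1)
{
    int switches = 0;

    for (size_t i = 1; i < len; i++)
        if (in_group(trace[i - 1], group1, n1) !=
            in_group(trace[i], group1, n1))
            switches++;
    return switches;
}

/* PIDs in the order they appear in the quoted sched_switch output. */
static const int before_trace[] = {
    5004, 5016, 5005, 5017, 5004, 5016, 5005, 5017
};
static const int before_group1[] = { 5004, 5005 };

static const int after_trace[] = {
    21890, 21935, 21934, 21935, 21934, 21935,
    21891, 21890, 21891, 21890, 21891, 21934
};
static const int after_group1[] = { 21890, 21891 };
```

Calling count_group_switches() on before_trace (8 entries) gives 7 cross-group switches out of 7 transitions; on after_trace (12 entries) it gives 3 out of 11.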
> A similar problem exists when there are multiple taskgroups and, say, a task A
> preempts the currently running task B of taskgroup_1. On schedule,
> pick_next_task can pick an unrelated task in taskgroup_2. Here it would be
> better to give some preference to task B in pick_next_task.
>
> A simple (maybe extreme) benchmark I tried was tbench with 2 tbench
> client processes with 2 threads each running on a single CPU. Avg throughput
> across five 50-second runs was -
> BEFORE: 105.84 MB/sec
> AFTER: 112.42 MB/sec
>
> Signed-off-by: Venkatesh Pallipadi <venki@google.com>
> ---
>  kernel/sched_fair.c |   20 ++++++++++++++++++--
>  1 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 3a88dee..36e8f02 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1339,6 +1339,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>          hrtick_update(rq);
>  }
>
> +static void set_next_buddy(struct sched_entity *se);
> +
>  /*
>   * The dequeue_task method is called before nr_running is
>   * decreased. We remove the task from the rbtree and
> @@ -1348,14 +1350,22 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
>          struct cfs_rq *cfs_rq;
>          struct sched_entity *se = &p->se;
> +        int task_flags = flags;

simpler: int voluntary = flags & DEQUEUE_SLEEP;
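As a rough userspace illustration of why the sleep bit has to be captured before the loop: the loop ORs DEQUEUE_SLEEP into flags as it walks up to parent entities, so testing flags itself at a higher level would report a sleep even for a non-sleep dequeue. Everything below (struct toy_se, toy_dequeue, TOY_DEQUEUE_SLEEP and its value) is a hypothetical stand-in for this sketch, not kernel code.

```c
#include <stddef.h>

#define TOY_DEQUEUE_SLEEP 0x01  /* hypothetical stand-in, not the kernel's value */

/* Toy stand-in for a sched_entity in a group hierarchy. */
struct toy_se {
    struct toy_se *parent;        /* group entity owning the runqueue we sit on */
    unsigned long rq_weight_left; /* weight left on that runqueue once we are removed */
    int next_buddy;               /* set when this entity is marked as next buddy */
};

/*
 * Walk up the hierarchy as the quoted dequeue loop does.
 * Returns 1 if a next-buddy hint was set, 0 otherwise.
 */
static int toy_dequeue(struct toy_se *se, int flags)
{
    /* Snapshot of the caller's intent, taken before flags is modified. */
    int voluntary = flags & TOY_DEQUEUE_SLEEP;

    for (; se; se = se->parent) {
        /* Stop at the first runqueue that still has other entities. */
        if (se->rq_weight_left) {
            if (voluntary && se->parent) {
                se->parent->next_buddy = 1;
                return 1;
            }
            break;
        }
        /*
         * Parents are dequeued as if sleeping. In the quoted code this
         * feeds dequeue_entity(); here it only shows why testing flags
         * instead of the snapshot would be wrong.
         */
        flags |= TOY_DEQUEUE_SLEEP;
    }
    return 0;
}
```

With a three-level chain (task -> group -> root group) where the task's runqueue empties and the group's runqueue does not, toy_dequeue(&task, 0) sets no hint, while toy_dequeue(&task, TOY_DEQUEUE_SLEEP) marks the root group entity.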
>
>          for_each_sched_entity(se) {
>                  cfs_rq = cfs_rq_of(se);
>                  dequeue_entity(cfs_rq, se, flags);
>
>                  /* Don't dequeue parent if it has other entities besides us */
> -                if (cfs_rq->load.weight)
> +                if (cfs_rq->load.weight) {
> +                        /*
> +                         * Bias pick_next to pick a task from this cfs_rq, as
> +                         * p is sleeping when it is within its sched_slice.
> +                         */
> +                        if (task_flags & DEQUEUE_SLEEP && se->parent)
> +                                set_next_buddy(se->parent);

re-using last_buddy would seem like a more natural fit here; it also
doesn't have a clobber race with a wakeup

>                          break;
> +                }
>                  flags |= DEQUEUE_SLEEP;
>          }
>
> @@ -1887,8 +1897,14 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>          update_curr(cfs_rq);
>          find_matching_se(&se, &pse);
>          BUG_ON(!pse);
> -        if (wakeup_preempt_entity(se, pse) == 1)
> +        if (wakeup_preempt_entity(se, pse) == 1) {
> +                /*
> +                 * Bias pick_next to pick the sched entity that is
> +                 * triggering this preemption.
> +                 */
> +                set_next_buddy(pse);

this probably wants some sort of unification with the scale-based next
buddy above

>                  goto preempt;
> +        }
>
>          return;
>
> --
> 1.7.3.1
>
>