Message-ID: <1362675134.10972.21.camel@laptop>
Subject: Re: [PATCH] sched: wakeup buddy
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Michael Wang <wangyun@linux.vnet.ibm.com>
Cc: LKML <linux-kernel@vger.kernel.org>, Ingo Molnar <mingo@kernel.org>,
        Mike Galbraith <efault@gmx.de>, Namhyung Kim <namhyung@kernel.org>,
        Alex Shi <alex.shi@intel.com>, Paul Turner <pjt@google.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        "Nikunj A. Dadhania" <nikunj@linux.vnet.ibm.com>,
        Ram Pai <linuxram@us.ibm.com>
Date: Thu, 07 Mar 2013 17:52:14 +0100
In-Reply-To: <51386207.5040808@linux.vnet.ibm.com>
References: <5136EB06.2050905@linux.vnet.ibm.com>
	 <1362645372.2606.11.camel@laptop> <51386207.5040808@linux.vnet.ibm.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4073
Lines: 90

On Thu, 2013-03-07 at 17:46 +0800, Michael Wang wrote:

> On 03/07/2013 04:36 PM, Peter Zijlstra wrote:
> > On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
> > 
> >> wake_affine() stuff is trying to bind related tasks closely, but it doesn't
> >> work well according to the test on 'perf bench sched pipe' (thanks to Peter).
> > 
> > so sched-pipe is a poor benchmark for this.. 
> > 
> > Ideally we'd write a new benchmark that has some actual data footprint
> > and we'd measure the cost of tasks being apart on the various cache
> > metrics and see what affine wakeup does for it.
> 
> I think sched-pipe is still somewhat capable, 

Yeah, it's not entirely crap for this, but it's not ideal either. The
very big difference you see between it running on a single cpu and on
say two threads of a single core is mostly due to preemption
'artifacts' though. Not because of cache.

So we have 2 tasks -- let's call them A and B -- involved in a single
word ping-pong. So we're both doing write(); read(); loops. Now what
happens on a single cpu is that A's write()->wakeup() of B makes B
preempt A before A hits read() and blocks. This in turn ensures that
B's write()->wakeup() of A finds an already running A and doesn't
actually need to do the full (and expensive) wakeup thing (and vice
versa).

So by constantly preempting one another they avoid the expensive bit of
going to sleep and waking up again.
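
For reference, the write(); read(); ping-pong described above can be
sketched in a few lines. This is only an illustration of the shape of
the workload, not the perf bench implementation: the pingpong() name,
its arguments and the use of two pipes are assumptions made for the
example, and Python's own overhead will dominate any timing it gives.

```python
import os
import struct
import time


def pingpong(rounds, cpu=None):
    """Bounce a 4-byte word between two tasks `rounds` times.

    Hypothetical helper for illustration. Returns
    (completed_rounds, elapsed_seconds) as measured by task A.
    """
    old_mask = None
    if cpu is not None and hasattr(os, "sched_setaffinity"):
        # Pin the caller; the child inherits the mask across fork().
        old_mask = os.sched_getaffinity(0)
        os.sched_setaffinity(0, {cpu})

    a_to_b_r, a_to_b_w = os.pipe()
    b_to_a_r, b_to_a_w = os.pipe()
    word = struct.pack("i", 1)

    pid = os.fork()
    if pid == 0:
        # Task B: read(); write(); loop.
        os.close(a_to_b_w)
        os.close(b_to_a_r)
        try:
            for _ in range(rounds):
                if not os.read(a_to_b_r, len(word)):
                    break
                os.write(b_to_a_w, word)
        finally:
            os._exit(0)

    # Task A: write(); read(); loop.
    os.close(a_to_b_r)
    os.close(b_to_a_w)
    done = 0
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(a_to_b_w, word)
        if len(os.read(b_to_a_r, len(word))) != len(word):
            break
        done += 1
    elapsed = time.perf_counter() - start

    os.close(a_to_b_w)
    os.close(b_to_a_r)
    os.waitpid(pid, 0)
    if old_mask is not None:
        os.sched_setaffinity(0, old_mask)
    return done, elapsed
```

Comparing pingpong(n, cpu=0) against an unpinned pingpong(n) gives a
rough feel for the single-cpu versus spread-out cases, with the caveat
that there is no data footprint here either, so it says nothing about
cache effects.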

wake_affine() OTOH still has a (supposed) benefit if it gets the tasks
running 'closer' (in a cache hierarchy sense) since then the data
sharing is less expensive.

> the problem is that the
> select_idle_sibling() doesn't take care of the wakeup-related case; it
> doesn't contain the logic to locate an idle cpu closely.

I'm not entirely sure if I understand what you mean, do you mean to say
its idea of 'closely' is not quite correct? If so, I tend to agree, see
further down.

> So even if we detect the relationship successfully, select_idle_sibling()
> can only help to make sure the target cpu won't be outside of the
> current package, it's a package level bind, not mc or smp level.

That is the entire point of select_idle_sibling(), selecting a cpu
'near' the target cpu that is currently idle.

Not too long ago we had a bit of a discussion on the unholy mess that
is select_idle_sibling() and if it actually does the right thing.
Arguably it doesn't for machines that have an effective L2 cache. The 
issue is that the arch<->sched interface only knows about
last-level-cache (L3 on anything modern) and SMT.

Expanding the topology description in a way that makes sense (and
doesn't make it a bigger mess) is somewhere on the todo-list.
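
To make the L2 point concrete, here is a toy model of picking an idle
cpu 'near' a target. It is explicitly not the kernel's
select_idle_sibling(): the Cpu record, the pick_idle() helper, the
use_l2 switch and the "first idle cpu in scan order" rule are all
assumptions made up for the sketch. It only shows that a picker which
knows about SMT and LLC, but not L2, cannot prefer the cpu that shares
an L2 with the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    cpu_id: int
    core: int  # cpus with the same core id are SMT siblings
    l2: int    # hypothetical L2 domain id, not visible to the picker by default
    llc: int   # last-level-cache domain id


def pick_idle(cpus, idle, target, use_l2=False):
    """Return an idle cpu id near `target`, or `target` if none is idle.

    `cpus` is a list indexed by cpu id, `idle` is a set of idle cpu ids.
    """
    if target in idle:
        return target
    t = cpus[target]
    # Search outward: SMT siblings first, then (optionally) L2, then LLC.
    levels = [lambda c: c.core == t.core]
    if use_l2:
        levels.append(lambda c: c.l2 == t.l2)
    levels.append(lambda c: c.llc == t.llc)
    for same_domain in levels:
        for c in cpus:
            if c.cpu_id in idle and same_domain(c):
                return c.cpu_id
    return target


# Example box: 8 cpus, no SMT, L2 shared by pairs, one LLC for all.
cpus = [Cpu(cpu_id=i, core=i, l2=i // 2, llc=0) for i in range(8)]
idle = {0, 3}

without_l2 = pick_idle(cpus, idle, target=2)             # 0: any LLC cpu will do
with_l2 = pick_idle(cpus, idle, target=2, use_l2=True)   # 3: shares L2 with cpu 2
```

In this made-up topology the LLC-only picker settles for cpu 0 even
though cpu 3 shares an L2 with the target and is also idle.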

> > Before doing something like what you're proposing, I'd have a hard look
> > at WF_SYNC, it is possible we should disable/fix select_idle_sibling
> > for sync wakeups.
> 
> The patch is supposed to stop using wake_affine() blindly, not improve
> the wake_affine() stuff itself, the whole stuff still works, but since
> we rely on select_idle_sibling() to make the choice, the benefit is not
> so significant, especially on my one node box...

OK, I'll have to go read the actual patch for that, I'll get back to
you on that :-)

> > The idea behind sync wakeups is that we try and detect the case where
> > we wake up one task only to go to sleep ourselves and try and avoid
> > the regular ping-pong this would otherwise create on account of the
> > waking task still being alive and so the current cpu isn't actually
> > idle yet but we know it's going to be idle soon.
> 
> Are you suggesting that we should separate the process of wakeup related
> case, not just pass current cpu to select_idle_sibling()?

Depends a bit on what you're trying to fix, so far I'm just trying to
write down what I remember about stuff and reacting to half-read
changelogs ;-)
