From: Josef Bacik
Date: Mon, 6 Jul 2015 15:41:02 -0400
To: Mike Galbraith
Cc: Peter Zijlstra, ..., kernel-team
Subject: Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

On 07/06/2015 02:36 PM, Mike Galbraith wrote:
> On Mon, 2015-07-06 at 10:34 -0400, Josef Bacik wrote:
>> On 07/06/2015 01:13 AM, Mike Galbraith wrote:
>>> Hm.  Piddling with pgbench, which doesn't seem to collapse into a
>>> quivering heap when load exceeds cores these days, deltas weren't all
>>> that impressive, but it does appreciate the extra effort a bit, and a
>>> bit more when clients receive it as well.
>>>
>>> If you test, and have time to piddle, you could try letting wake_wide()
>>> return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is
>>> the dispatcher.
>>>
>>> Numbers from my little desktop box.
>>>
>>> NO_WAKE_WIDE_IDLE
>>> postgres@homer:~> pgbench.sh
>>> clients  8   tps = 116697.697662
>>> clients 12   tps = 115160.230523
>>> clients 16   tps = 115569.804548
>>> clients 20   tps = 117879.230514
>>> clients 24   tps = 118281.753040
>>> clients 28   tps = 116974.796627
>>> clients 32   tps = 119082.163998   avg 117092.239   1.000
>>>
>>> WAKE_WIDE_IDLE
>>> postgres@homer:~> pgbench.sh
>>> clients  8   tps = 124351.735754
>>> clients 12   tps = 124419.673135
>>> clients 16   tps = 125050.716498
>>> clients 20   tps = 124813.042352
>>> clients 24   tps = 126047.442307
>>> clients 28   tps = 125373.719401
>>> clients 32   tps = 126711.243383   avg 125252.510   1.069   1.000
>>>
>>> WAKE_WIDE_IDLE (clients as well as server)
>>> postgres@homer:~> pgbench.sh
>>> clients  8   tps = 130539.795246
>>> clients 12   tps = 128984.648554
>>> clients 16   tps = 130564.386447
>>> clients 20   tps = 129149.693118
>>> clients 24   tps = 130211.119780
>>> clients 28   tps = 130325.355433
>>> clients 32   tps = 129585.656963   avg 129908.665   1.109   1.037
>
> I had a typo in my script, so those desktop box numbers were all doing
> the same number of clients.
> It doesn't invalidate anything, but the
> individual deltas are just run to run variance.. not to mention that
> single cache box is not all that interesting for this anyway.  That
> happens when interconnect becomes a player.
>
>> I have time for twiddling, we're carrying ye olde WAKE_IDLE until we get
>> this solved upstream and then I'll rip out the old and put in the new,
>> I'm happy to screw around until we're all happy.  I'll throw this in a
>> kernel this morning and run stuff today.  Barring any issues with the
>> testing infrastructure I should have results today.  Thanks,
>
> I'll be interested in your results.  Taking pgbench to a little NUMA
> box, I'm seeing _nada_ outside of variance with master (crap).  I have a
> way to win significantly for _older_ kernels, and that win over master
> _may_ provide some useful insight, but I don't trust postgres/pgbench as
> far as I can toss the planet, so don't have a warm fuzzy about trying to
> use it to approximate your real world load.
>
> BTW, what's your topology look like (numactl --hardware).
>

So the NO_WAKE_WIDE_IDLE results are very good: almost the same as the
baseline, with a slight regression at lower RPS and a slight improvement
at high RPS.  I'm running with WAKE_WIDE_IDLE set now; that should be
done soonish, and then I'll do the 1 + sched_feat(WAKE_WIDE_IDLE) thing
next (a rough sketch of that tweak is at the end of this mail), so those
results should come in the morning.

Here is the NUMA information from one of the boxes in the test cluster:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 15890 MB
node 0 free: 2651 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 16125 MB
node 1 free: 2063 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

Thanks,

Josef
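For reference, a rough sketch of the tweak being discussed: wake_wide()
returning 1 + sched_feat(WAKE_WIDE_IDLE) unconditionally, rather than
adding the bonus only when the wakee is the dispatcher.  It assumes the
wakee_flips/llc-size heuristic used elsewhere in this series and an
experimental WAKE_WIDE_IDLE feature bit; the real test patch may differ,
and how select_task_rq_fair() consumes a return value of 2 is not shown.

/* kernel/sched/features.h: experimental feature bit (assumed, default off) */
SCHED_FEAT(WAKE_WIDE_IDLE, false)

/* kernel/sched/fair.c */
static int wake_wide(struct task_struct *p)
{
	unsigned int master = current->wakee_flips;
	unsigned int slave = p->wakee_flips;
	int factor = this_cpu_read(sd_llc_size);

	/* Treat whichever task has the most wakee flips as the dispatcher. */
	if (master < slave)
		swap(master, slave);

	/* Not enough flipping going on: stick with the affine wakeup path. */
	if (slave < factor || master < slave * factor)
		return 0;

	/*
	 * Suggested experiment: always fold in the WAKE_WIDE_IDLE hint,
	 * instead of adding it only when the wakee is the dispatcher.
	 */
	return 1 + sched_feat(WAKE_WIDE_IDLE);
}

With the feature bit clear this degenerates to a plain "return 1", so the
two behaviours can be compared at run time by flipping the bit in
/sys/kernel/debug/sched_features (with CONFIG_SCHED_DEBUG enabled).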