Subject: Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc
To: Steven Sistare, mingo@redhat.com, peterz@infradead.org
Cc: subhra.mazumdar@oracle.com, dhaval.giani@oracle.com, rohit.k.jain@oracle.com,
    daniel.m.jordan@oracle.com, pavel.tatashin@microsoft.com, matt@codeblueprint.co.uk,
    umgwanakikbuti@gmail.com, riel@redhat.com, jbacik@fb.com, juri.lelli@redhat.com,
    linux-kernel@vger.kernel.org
References: <1540220381-424433-1-git-send-email-steven.sistare@oracle.com>
            <1540220381-424433-8-git-send-email-steven.sistare@oracle.com>
From: Valentin Schneider
Date: Mon, 29 Oct 2018 19:34:50 +0000
On 26/10/2018 19:28, Steven Sistare wrote:
> On 10/26/2018 2:04 PM, Valentin Schneider wrote:
[...]
>>
>> I was thinking that perhaps we could have scenarios where some rq's
>> keep stealing tasks off of each other and we end up circulating tasks
>> between CPUs. Now, that would only happen if we had a handful of tasks
>> with a very tiny period, and I'm not aware of any (real) hyperactive
>> workloads similar to those generated by hackbench where that could happen.
>
> That will not happen with the current code, as it only steals if nr_running > 1.
> The src loses a task, the dst gains it and has nr_running == 1, so it will not
> be re-stolen.
>

That's indeed fine, I was thinking of something like this:

Suppose you have 2 rq's sharing a workload of 3 tasks. You get one rq with
nr_running == 1 (r_1) and one rq with nr_running == 2 (r_2). As soon as the
task on r_1 ends/blocks, we'll go through idle balancing and can potentially
steal the non-running task from r_2. Sometime later, the task that was
running on r_1 wakes up, and we end up with r_1->nr_running == 2 and
r_2->nr_running == 1. IOW, we've swapped their roles in that example, and
the whole thing can repeat.

The shorter the period of those tasks, the more often we'll migrate them
between rq's, which is why I wonder if we shouldn't have some sort of
throttling.

> If we modify the code to handle misfits, we may steal when src nr_running == 1,
> but a fast CPU will only steal the lone task from a slow one, never fast from fast,
> and never slow from fast, so no tug of war.
>
>> In short, I wonder if we should have task_hot() in there. Drawing a
>> parallel with load_balance(), even if load-balancing is happening between
>> rqs of the same LLC, we do go check task_hot().
>> Have you already experimented with adding a task_hot() check in here?
>
> I tried task_hot, to see if L1/L2 cache warmth matters much on L1/L2/L3 systems,
> and it reduced steals and overall performance.
>

Mmm, so task_hot() mainly implements two mechanisms: the CACHE_HOT_BUDDY
sched feature and the exec_start threshold.

The first one should be sidestepped in the stealing case since we won't pass
(if env->dst_rq->nr_running), which leaves us with the threshold. We might
want to sidestep it when we are doing balancing within an LLC domain
(env->sd->flags & SD_SHARE_PKG_RESOURCES) - or use a lower threshold in such
cases.

In any case, I think it would make sense to add some LLC conditions to
task_hot() so that:
- regular load_balance() can also benefit from them
- task stealing has at least some sort of throttling

On a sidenote, I find it a bit odd that the exec_start threshold depends on
sysctl_sched_migration_cost, which to me is more about idle_balance() cost
than about "how long does it take for a previously-run task to go cache cold".

>> I've run some iterations of hackbench (hackbench 2 process 100000) to
>> investigate this task bouncing, but I didn't really see any of it. That was
>> just a 4+4 big.LITTLE system though, I'll try to get numbers on a system
>> with more CPUs.
>>
>> ----->8-----
>>
>> activations: # of task activations (task starts running)
>> cpu_migrations: # of activations where cpu != prev_cpu
>> % stats are percentiles
>>
>> - STEAL:
>>
>> | stat  | cpu_migrations | activations |
>> |-------+----------------+-------------|
>> | count |    2005.000000 | 2005.000000 |
>> | mean  |      16.244888 |  290.608479 |
>> | std   |      38.963138 |  253.003528 |
>> | min   |       0.000000 |    3.000000 |
>> | 50%   |       3.000000 |  239.000000 |
>> | 75%   |       8.000000 |  436.000000 |
>> | 90%   |      45.000000 |  626.000000 |
>> | 99%   |     188.960000 | 1073.000000 |
>> | max   |     369.000000 | 1417.000000 |
>>
>> - NO_STEAL:
>>
>> | stat  | cpu_migrations | activations |
>> |-------+----------------+-------------|
>> | count |    2005.000000 | 2005.000000 |
>> | mean  |      15.260848 |  297.860848 |
>> | std   |      46.331890 |  253.210813 |
>> | min   |       0.000000 |    3.000000 |
>> | 50%   |       3.000000 |  252.000000 |
>> | 75%   |       7.000000 |  444.000000 |
>> | 90%   |      32.600000 |  643.600000 |
>> | 99%   |     214.880000 | 1127.520000 |
>> | max   |     467.000000 | 1547.000000 |
>>
>> ----->8-----
>>
>> Otherwise, my only other concern at the moment is that since stealing
>> doesn't care about load, we could steal a task that would cause a big
>> imbalance, which wouldn't have happened with a call to load_balance().
>>
>> I don't think this can be triggered with a symmetrical workload like
>> hackbench, so I'll go explore something else.
>
> The dst is about to go idle with zero load, so stealing can only improve the
> instantaneous balance between src and dst. For longer term average load, we
> still rely on periodic load_balance to make adjustments.
>

Right, so my line of thinking was that by not doing a load_balance() and
taking a shortcut (stealing a task), we may end up just postponing a
load_balance() to after we've stolen a task. I guess in those cases there's
no magic trick to be found and we just have to deal with it.

And then there's some of the logic like we have in update_sd_pick_busiest(),
where we e.g.
try to prevent misfit tasks from running on LITTLEs, but then if such tasks
are waiting to be run and a LITTLE frees itself up, I *think* it's okay to
steal it.

> All good questions, keep them coming.
>
> - Steve
>