Subject: Re: [RFC PATCH v4 00/14] sched: packing small tasks
From: Vincent Guittot
Date: Fri, 26 Apr 2013 14:08:27 +0200
To: linux-kernel, LAK, linaro-kernel@lists.linaro.org, Peter Zijlstra,
    Ingo Molnar, Russell King - ARM Linux, Paul Turner, Santosh,
    Morten Rasmussen, Chander Kashyap, cmetcalf@tilera.com,
    tony.luck@intel.com, Alex Shi, Preeti U Murthy
Cc: Paul McKenney, Thomas Gleixner, Len Brown, Arjan van de Ven,
    Amit Kucheria, Jonathan Corbet, Lukasz Majewski, Vincent Guittot

Hi,

The patches are available in this git tree:
git://git.linaro.org/people/vingu/kernel.git sched-pack-small-tasks-v4-fixed

Vincent

On 25 April 2013 19:23, Vincent Guittot wrote:
> Hi,
>
> This patchset takes advantage of the new per-task load tracking that is
> available in the kernel to pack tasks in as few CPUs/clusters/cores as
> possible. It has two packing modes:
> - The 1st mode packs small tasks when the system is not too busy. The
>   main goal is to reduce power consumption in low-system-load use cases
>   by minimizing the number of power domains that are enabled, while
>   keeping the default, performance-oriented behavior.
> - The 2nd mode packs all tasks in as few power domains as possible in
>   order to improve the power consumption of the system, at the cost of a
>   possible performance decrease due to the increased rate of resource
>   sharing compared to the default mode.
>
> The packing is done in 3 steps (the last step only applies to the
> aggressive packing mode):
>
> The 1st step looks for the best place to pack tasks in a system according
> to its topology, and it defines a 1st pack buddy CPU for each CPU, if one
> is available. The policy for defining a buddy CPU is that we want to pack
> at levels where a group of CPUs can be power gated independently from the
> others. To describe this capability, a new flag, SD_SHARE_POWERDOMAIN,
> has been introduced; it indicates whether the groups of CPUs of a
> scheduling domain share their power state. By default, this flag is set
> in all sched_domains in order to keep the current behavior of the
> scheduler unchanged, and only the ARM platform clears the
> SD_SHARE_POWERDOMAIN flag at the MC and CPU levels.
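To make the buddy selection concrete, here is a rough standalone sketch of
the idea: walk the topology levels of a CPU and pick a buddy at the first
level whose groups do not share a power domain. The data layout and helper
names below are invented for illustration; the real code walks the
kernel's sched_domain hierarchy:

    #include <stdio.h>

    #define SD_SHARE_POWERDOMAIN (1 << 0)

    /* One entry per topology level of a given CPU (e.g. MC, then CPU). */
    struct sd_level {
        int flags;      /* SD_* topology flags at this level              */
        int buddy_cpu;  /* candidate CPU in a sibling group at this level */
    };

    /*
     * Return a pack buddy for a CPU, i.e. a CPU in a group that can be
     * power gated independently, or -1 when every level shares its power
     * domain (the default), which keeps the scheduler behavior unchanged.
     */
    static int find_pack_buddy(const struct sd_level *levels, int nr_levels)
    {
        for (int i = 0; i < nr_levels; i++)
            if (!(levels[i].flags & SD_SHARE_POWERDOMAIN))
                return levels[i].buddy_cpu;
        return -1;
    }

    int main(void)
    {
        /* Default: the flag is set at every level, so no packing buddy. */
        struct sd_level dflt[] = {
            { SD_SHARE_POWERDOMAIN, 0 },   /* MC  */
            { SD_SHARE_POWERDOMAIN, 0 },   /* CPU */
        };
        /* ARM clears the flag at the MC and CPU levels, so another CPU
         * can pack its small tasks onto CPU 0 at the MC level. */
        struct sd_level arm[] = {
            { 0, 0 },                      /* MC  */
            { 0, 0 },                      /* CPU */
        };
        printf("default buddy: %d\n", find_pack_buddy(dflt, 2)); /* -1 */
        printf("ARM buddy:     %d\n", find_pack_buddy(arm, 2));  /*  0 */
        return 0;
    }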
> In a 2nd step, the scheduler checks the load average of a task that wakes
> up as well as the load average of its buddy CPU, and it can decide to
> migrate light tasks to a buddy that is not busy. This check is done at
> wake-up time because small tasks tend to wake up between periodic load
> balances and asynchronously to each other, which prevents the default
> mechanism from catching and migrating them efficiently. A light task is
> defined by a runnable_avg_sum that is less than 20% of the
> runnable_avg_period. In fact, this condition encloses two others: the
> average CPU load of the task must be less than 20%, and the task must
> have been runnable for less than 10 ms when it last woke up, in order to
> be eligible for the packing migration. So a task that runs 1 ms every
> 5 ms will be considered a small task, but a task that runs 50 ms with a
> period of 500 ms will not.
> Then, the busyness of the buddy CPU depends on the load average of its rq
> and the number of running tasks. A CPU with a load average greater than
> 50% will be considered busy whatever the number of running tasks is, and
> this threshold is reduced further by the number of running tasks, so as
> not to increase the wake-up latency of a task too much. When the buddy
> CPU is busy, the scheduler falls back to the default CFS policy.
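The two checks above can be sketched as follows. The struct fields mirror
the per-entity load-tracking counters (runnable_avg_sum and
runnable_avg_period); the helper names and exact arithmetic are educated
guesses from the description above, not a quote of the patch code:

    #include <stdbool.h>

    struct entity_load {
        unsigned long runnable_avg_sum;    /* decayed runnable time   */
        unsigned long runnable_avg_period; /* decayed tracking period */
    };

    struct rq_state {
        struct entity_load avg;
        unsigned int nr_running;
    };

    /*
     * A task is "light" when its runnable time is below 20% of its
     * tracked period: sum * 5 < period avoids integer division. Since
     * the load-tracking window is bounded (~48 ms once fully decayed),
     * 20% of it also caps the runnable time at roughly 10 ms, which is
     * where the second condition in the text comes from.
     */
    static bool is_light_task(const struct entity_load *p)
    {
        return p->runnable_avg_sum * 5 < p->runnable_avg_period;
    }

    /*
     * A buddy is "busy" above 50% load, and the threshold shrinks as
     * nr_running grows: with 0 running tasks the limit is period / 2
     * (50%), with 1 task period / 3, and so on, so that packing does
     * not pile wake-up latency onto an already loaded buddy.
     */
    static bool is_buddy_busy(const struct rq_state *rq)
    {
        return rq->avg.runnable_avg_sum >
               rq->avg.runnable_avg_period / (rq->nr_running + 2);
    }

In this model, the packing migration would only be attempted when
is_light_task() holds for the waking task and is_buddy_busy() does not
hold for its buddy CPU.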
> The 3rd step is only used when the aggressive packing mode is enabled.
> In this case, the CPUs pack their tasks onto their buddy until it
> becomes full. Unlike the previous step, we can't keep the same buddy
> forever, so it is updated during load balance. During the periodic load
> balance, the scheduler computes the activity of the system thanks to the
> runnable_avg_sum and the cpu_power of all CPUs, and then it defines the
> CPUs that will be used to handle the current activity. The selected CPUs
> will be their own buddy and will participate in the default
> load-balancing mechanism in order to share the tasks in a fair way,
> whereas the non-selected CPUs will not, and their buddy will be the last
> selected CPU. The behavior can be summarized as: the scheduler defines
> how many CPUs are required to handle the current activity, keeps the
> tasks on these CPUs, and performs normal load balancing (or any
> evolution of the current load balancer, like the use of runnable load
> avg from Alex: https://lkml.org/lkml/2013/4/1/580) on this limited
> number of CPUs. Like the other steps, the CPUs are selected to minimize
> the number of power domains that must stay on.
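How the number of required CPUs could fall out of this activity
computation is sketched below. The formula (summing each CPU's runnable
ratio scaled by its cpu_power, then dividing by the nominal per-CPU
capacity) is an assumption based on the description above, not the actual
patch code:

    #define SCHED_POWER_SCALE 1024UL

    struct cpu_activity {
        unsigned long runnable_avg_sum;
        unsigned long runnable_avg_period;
        unsigned long cpu_power;  /* capacity; SCHED_POWER_SCALE = 100% */
    };

    /*
     * Illustrative only: estimate how many CPUs the current activity
     * needs. Each CPU contributes its load ratio scaled by its
     * cpu_power; the total is divided by the nominal capacity of one
     * CPU and rounded up.
     */
    static unsigned int cpus_needed(const struct cpu_activity *cpus,
                                    unsigned int nr_cpus)
    {
        unsigned long activity = 0;

        for (unsigned int i = 0; i < nr_cpus; i++)
            activity += cpus[i].runnable_avg_sum * cpus[i].cpu_power /
                        (cpus[i].runnable_avg_period + 1); /* +1: avoid /0 */

        /* Round up: any leftover activity needs one more CPU. */
        return (activity + SCHED_POWER_SCALE - 1) / SCHED_POWER_SCALE;
    }

The first cpus_needed() CPUs, chosen so as to minimize the number of
power domains that stay on, would then become their own buddy, while the
remaining CPUs point at the last selected one.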
> Changes since V3:
>
> - Take into account comments on the previous version.
> - Add an aggressive packing mode and a knob to select between the
>   various modes.
>
> Changes since V2:
>
> - Migrate only a task that wakes up.
> - Change the light-task threshold to 20%.
> - Change the loaded-CPU threshold to not pull tasks if the current
>   number of running tasks is null but the load average is already
>   greater than 50%.
> - Fix the algorithm for selecting the buddy CPU.
>
> Changes since V1:
>
> Patch 2/6
> - Change the flag name, which was not clear. The new name is
>   SD_SHARE_POWERDOMAIN.
> - Create an architecture-dependent function to tune the sched_domain
>   flags.
> Patch 3/6
> - Fix issues in the algorithm that looks for the best buddy CPU.
> - Use pr_debug instead of pr_info.
> - Fix for uniprocessor.
> Patch 4/6
> - Remove the use of usage_avg_sum, which has not been merged.
> Patch 5/6
> - Change the way the coherency of runnable_avg_sum and
>   runnable_avg_period is ensured.
> Patch 6/6
> - Use the arch-dependent function to set/clear SD_SHARE_POWERDOMAIN for
>   the ARM platform.
>
> Previous results for V3:
>
> This series has been tested with hackbench on an ARM platform and the
> results don't show any performance regression.
>
> Hackbench                3.9-rc2    +patches
> Mean Time (10 tests):    2.048      2.015
> stdev:                   0.047      0.068
>
> Previous results for V2:
>
> This series has been tested with MP3 playback on an ARM platform:
> TC2 HMP (dual CA-15 and 3xCA-7 cluster).
>
> The measurements have been done on an Ubuntu image during 60 seconds of
> playback, and the results have been normalized to 100.
>
>           | CA15 |  CA7 | total |
> -------------------------------------
>  default  |  81  |  97  |  178  |
>  pack     |  13  | 100  |  113  |
> -------------------------------------
>
> Previous results for V1:
>
> The patchset has been tested on ARM platforms: quad CA-9 SMP and TC2 HMP
> (dual CA-15 and 3xCA-7 cluster). For the ARM platform, the results have
> demonstrated that it's worth packing small tasks at all topology levels.
>
> The performance tests have been done on both platforms with sysbench.
> The results don't show any performance regression. These results are
> consistent with the policy, which keeps the normal behavior for heavy
> use cases.
>
> test: sysbench --test=cpu --num-threads=N --max-requests=R run
>
> The results below are the average duration of 3 tests on the quad CA-9.
> "default" is the current scheduler behavior (the pack buddy CPU is -1);
> "pack" is the scheduler with the pack mechanism.
>
>              | default |  pack   |
> -----------------------------------
> N=8;  R=200  |  3.1999 |  3.1921 |
> N=8;  R=2000 | 31.4939 | 31.4844 |
> N=12; R=200  |  3.2043 |  3.2084 |
> N=12; R=2000 | 31.4897 | 31.4831 |
> N=16; R=200  |  3.1774 |  3.1824 |
> N=16; R=2000 | 31.4899 | 31.4897 |
> -----------------------------------
>
> The power consumption tests have been done only on the TC2 platform,
> which has accessible power lines, and I have used cyclictest to simulate
> small tasks. The tests show some power consumption improvements.
>
> test: cyclictest -t 8 -q -e 1000000 -D 20 & cyclictest -t 8 -q -e 1000000 -D 20
>
> The measurements have been done during 16 seconds and the results have
> been normalized to 100.
>
>           | CA15 |  CA7 | total |
> -------------------------------------
>  default  | 100  |  40  |  140  |
>  pack     |  <1  |  45  |  <46  |
> -------------------------------------
>
> The A15 cluster is less power efficient than the A7 cluster, but if we
> assume that the tasks are well spread over both clusters, we can roughly
> estimate that the power consumption on a dual cluster of CA7 would have
> been, for a default kernel:
>
>           | CA7  |  CA7 | total |
> -------------------------------------
>  default  |  40  |  40  |   80  |
> -------------------------------------
>
> Vincent Guittot (14):
>   Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for
>     load-tracking"
>   sched: add a new SD_SHARE_POWERDOMAIN flag for sched_domain
>   sched: pack small tasks
>   sched: pack the idle load balance
>   ARM: sched: clear SD_SHARE_POWERDOMAIN
>   sched: add a knob to choose the packing level
>   sched: agressively pack at wake/fork/exec
>   sched: trig ILB on an idle buddy
>   sched: evaluate the activity level of the system
>   sched: update the buddy CPU
>   sched: filter task pull request
>   sched: create a new field with available capacity
>   sched: update the cpu_power
>   sched: force migration on buddy CPU
>
>  arch/arm/kernel/topology.c       |   9 +
>  arch/ia64/include/asm/topology.h |   1 +
>  arch/tile/include/asm/topology.h |   1 +
>  include/linux/sched.h            |  11 +-
>  include/linux/sched/sysctl.h     |   8 +
>  include/linux/topology.h         |   4 +
>  kernel/sched/core.c              |  14 +-
>  kernel/sched/fair.c              | 393 +++++++++++++++++++++++++++++++++++---
>  kernel/sched/sched.h             |  15 +-
>  kernel/sysctl.c                  |  13 ++
>  10 files changed, 423 insertions(+), 46 deletions(-)
>
> --
> 1.7.9.5