Date: Tue, 7 Jun 2011 21:15:43 +0530
From: Kamalesh Babulal
To: Paul Turner
Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Bharata B Rao, Dhaval Giani,
    Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Ingo Molnar,
    Pavel Emelyanov
Subject: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
Message-ID: <20110607154542.GA2991@linux.vnet.ibm.com>
References: <20110503092846.022272244@google.com>
In-Reply-To: <20110503092846.022272244@google.com>

Hi All,

While testing the CFS Bandwidth V6 patch set on top of 55922c9d1b84 in our
test environment, we observed 30% to 40% CPU idle time while running a
CPU-bound test with the cgroup tasks not pinned to the CPUs. In the inverse
case, where the cgroup tasks are pinned to the CPUs, the idle time seen is
nearly zero.

Test Scenario
--------------
- 5 cgroups are created and assigned 2, 2, 4, 8 and 16 tasks respectively.

- Each cgroup has N sub-cgroups created under it, where N is the number of
  tasks (NR_TASKS) assigned to that cgroup. E.g. cgroup1 has two sub-cgroups
  under it, with one task attached to each sub-cgroup.

                    ------------
                    | cgroup 1 |
                    ------------
                     /        \
                    /          \
        --------------      --------------
        |sub-cgroup 1|      |sub-cgroup 2|
        |  (task 1)  |      |  (task 2)  |
        --------------      --------------

- Each top-level cgroup is given unlimited quota (cpu.cfs_quota_us = -1) and a
  period of 500ms (cpu.cfs_period_us = 500000), whereas the sub-cgroups are
  given a quota of 250ms (cpu.cfs_quota_us = 250000) with the same 500ms
  period. I.e. the top-level cgroups have unlimited bandwidth, whereas each
  sub-cgroup is throttled after 250ms of run time per period (a small sketch
  of these settings follows the pinning description below).

- Additionally, if required, proportional CPU shares can be assigned via
  cpu.shares as NR_TASKS * 1024, e.g. cgroup1 with 2 tasks gets
  2 * 1024 = 2048 worth of cpu.shares. (In the test results published below,
  all cgroups and sub-cgroups are given an equal share of 1024.)

- One CPU-bound while(1) task is attached to each sub-cgroup.

- The sum-exec time of each cgroup/sub-cgroup is captured from
  /proc/sched_debug after a 60 second run and analysed for the run time of
  the tasks, i.e. the sub-cgroups.

How is the idle CPU time measured?
-----------------------------------
- vmstat statistics are logged every 2 seconds, from the point the last while1
  task is attached to the 16th sub-cgroup of cgroup 5 until the 60 second run
  is over. After the run, the CPU idle% is calculated by summing the idle
  column of the vmstat log and dividing by the number of samples collected,
  after discarding the first record of the log.

How are the tasks pinned to the CPUs?
--------------------------------------
- The cgroup hierarchy is mounted with the cpuset and cpu controllers, and one
  physical CPU is allocated to every 2 sub-cgroups, i.e. the first CPU is
  shared between 1/1 and 1/2 (group 1, sub-cgroups 1 and 2). Similarly,
  CPUs 7 to 15 are allocated to 5/1 to 5/16 (group 5, sub-cgroups 1 to 16).
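To make the per-group settings concrete, the configuration described above for
group 1 in the pinned case amounts to roughly the following (a minimal sketch
only; it assumes the hierarchy is already created and mounted at /cgroup with
the cpu and cpuset controllers, as the attached script does):

    # top-level group 1: unlimited quota, 500ms period, equal shares
    echo -1     > /cgroup/1/cpu.cfs_quota_us
    echo 500000 > /cgroup/1/cpu.cfs_period_us
    echo 1024   > /cgroup/1/cpu.shares

    # its two sub-cgroups: throttled at 250ms per 500ms period, equal shares,
    # and (pinned case only) both placed on the same physical CPU
    for j in 1 2
    do
        echo 250000 > /cgroup/1/$j/cpu.cfs_quota_us
        echo 500000 > /cgroup/1/$j/cpu.cfs_period_us
        echo 1024   > /cgroup/1/$j/cpu.shares
        echo 0      > /cgroup/1/$j/cpuset.cpus    # first CPU shared by 1/1 and 1/2
    done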
Note that the machine used for testing has 16 CPUs.

Result for the non-pinned case
-------------------------------
Only the hierarchy is created as stated above; cpusets are not assigned per
cgroup.

Average CPU Idle percentage 34.8% (measured as explained above)
Bandwidth shared with remaining non-Idle 65.2%

* Note: The ratios are multiplied by 100, i.e. expressed as percentages. In
the results below, 9.2500 for cgroup 1 is the share of the total sum-exec time
captured from /proc/sched_debug consumed by the cgroup 1 tasks (including
sub-cgroups 1 and 2), which in turn is 6.03% of the non-idle CPU time
(derived as 9.2500 * 65.2 / 100).

Bandwidth of Group 1 = 9.2500 i.e = 6.0300% of non-Idle CPU time 65.2%
|...... subgroup 1/1  = 48.7800  i.e = 2.9400% of 6.0300% Groups non-Idle CPU time
|...... subgroup 1/2  = 51.2100  i.e = 3.0800% of 6.0300% Groups non-Idle CPU time

Bandwidth of Group 2 = 9.0400 i.e = 5.8900% of non-Idle CPU time 65.2%
|...... subgroup 2/1  = 51.0200  i.e = 3.0000% of 5.8900% Groups non-Idle CPU time
|...... subgroup 2/2  = 48.9700  i.e = 2.8800% of 5.8900% Groups non-Idle CPU time

Bandwidth of Group 3 = 16.9300 i.e = 11.0300% of non-Idle CPU time 65.2%
|...... subgroup 3/1  = 26.0300  i.e = 2.8700% of 11.0300% Groups non-Idle CPU time
|...... subgroup 3/2  = 25.8800  i.e = 2.8500% of 11.0300% Groups non-Idle CPU time
|...... subgroup 3/3  = 22.7800  i.e = 2.5100% of 11.0300% Groups non-Idle CPU time
|...... subgroup 3/4  = 25.2900  i.e = 2.7800% of 11.0300% Groups non-Idle CPU time

Bandwidth of Group 4 = 27.9300 i.e = 18.2100% of non-Idle CPU time 65.2%
|...... subgroup 4/1  = 16.6000  i.e = 3.0200% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/2  = 8.0000   i.e = 1.4500% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/3  = 9.0000   i.e = 1.6300% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/4  = 7.9600   i.e = 1.4400% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/5  = 12.3500  i.e = 2.2400% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/6  = 16.2500  i.e = 2.9500% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/7  = 12.6100  i.e = 2.2900% of 18.2100% Groups non-Idle CPU time
|...... subgroup 4/8  = 17.1900  i.e = 3.1300% of 18.2100% Groups non-Idle CPU time

Bandwidth of Group 5 = 36.8300 i.e = 24.0100% of non-Idle CPU time 65.2%
|...... subgroup 5/1  = 56.6900  i.e = 13.6100% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/2  = 8.8600   i.e = 2.1200% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/3  = 5.5100   i.e = 1.3200% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/4  = 4.5700   i.e = 1.0900% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/5  = 7.9500   i.e = 1.9000% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/6  = 2.1600   i.e = .5100% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/7  = 2.3400   i.e = .5600% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/8  = 2.1500   i.e = .5100% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/9  = 9.7200   i.e = 2.3300% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/10 = 5.0600   i.e = 1.2100% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/11 = 4.6900   i.e = 1.1200% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/12 = 8.9700   i.e = 2.1500% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/13 = 8.4600   i.e = 2.0300% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/14 = 11.8400  i.e = 2.8400% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/15 = 6.3400   i.e = 1.5200% of 24.0100% Groups non-Idle CPU time
|...... subgroup 5/16 = 5.1500   i.e = 1.2300% of 24.0100% Groups non-Idle CPU time

Pinned case
------------
The cgroup hierarchy is created and cpusets are allocated as described above.

Average CPU Idle percentage 0%
Bandwidth shared with remaining non-Idle 100%

Bandwidth of Group 1 = 6.3400 i.e = 6.3400% of non-Idle CPU time 100%
|...... subgroup 1/1  = 50.0400  i.e = 3.1700% of 6.3400% Groups non-Idle CPU time
|...... subgroup 1/2  = 49.9500  i.e = 3.1600% of 6.3400% Groups non-Idle CPU time

Bandwidth of Group 2 = 6.3200 i.e = 6.3200% of non-Idle CPU time 100%
|...... subgroup 2/1  = 50.0400  i.e = 3.1600% of 6.3200% Groups non-Idle CPU time
|...... subgroup 2/2  = 49.9500  i.e = 3.1500% of 6.3200% Groups non-Idle CPU time

Bandwidth of Group 3 = 12.6300 i.e = 12.6300% of non-Idle CPU time 100%
|...... subgroup 3/1  = 25.0300  i.e = 3.1600% of 12.6300% Groups non-Idle CPU time
|...... subgroup 3/2  = 25.0100  i.e = 3.1500% of 12.6300% Groups non-Idle CPU time
|...... subgroup 3/3  = 25.0000  i.e = 3.1500% of 12.6300% Groups non-Idle CPU time
|...... subgroup 3/4  = 24.9400  i.e = 3.1400% of 12.6300% Groups non-Idle CPU time

Bandwidth of Group 4 = 25.1000 i.e = 25.1000% of non-Idle CPU time 100%
|...... subgroup 4/1  = 12.5400  i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/2  = 12.5100  i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/3  = 12.5300  i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/4  = 12.5000  i.e = 3.1300% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/5  = 12.4900  i.e = 3.1300% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/6  = 12.4700  i.e = 3.1200% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/7  = 12.4700  i.e = 3.1200% of 25.1000% Groups non-Idle CPU time
|...... subgroup 4/8  = 12.4500  i.e = 3.1200% of 25.1000% Groups non-Idle CPU time

Bandwidth of Group 5 = 49.5700 i.e = 49.5700% of non-Idle CPU time 100%
|...... subgroup 5/1  = 49.8500  i.e = 24.7100% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/2  = 6.2900   i.e = 3.1100% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/3  = 6.2800   i.e = 3.1100% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/4  = 6.2700   i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/5  = 6.2700   i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/6  = 6.2600   i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/7  = 6.2500   i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/8  = 6.2400   i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/9  = 6.2400   i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/10 = 6.2300   i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/11 = 6.2300   i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/12 = 6.2200   i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/13 = 6.2100   i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/14 = 6.2100   i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/15 = 6.2100   i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
|...... subgroup 5/16 = 6.2100   i.e = 3.0700% of 49.5700% Groups non-Idle CPU time

So with equal cpu.shares allocated to all the groups/sub-cgroups and CFS
bandwidth configured to allow 100% CPU utilization, we still see CPU idle time
in the un-pinned case. The benchmark used to reproduce the issue is attached
below; just executing the script should report similar numbers.
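For reference, the script takes three 0/1 options matching the setup described
above, and it expects a CPU-bound while(1) binary at /root/while1 (the LOAD
variable). Assuming it is saved as cfs_bw_test.sh (the name is only a
placeholder), a run with the defaults looks like:

    # bandwidth control on, one sub-group per task, equal shares
    ./cfs_bw_test.sh -b 1 -s 1 -p 0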
#!/bin/bash

NR_TASKS1=2
NR_TASKS2=2
NR_TASKS3=4
NR_TASKS4=8
NR_TASKS5=16

BANDWIDTH=1
SUBGROUP=1
PRO_SHARES=0
MOUNT=/cgroup/
LOAD=/root/while1

usage()
{
    echo "Usage $0: [-b 0|1] [-s 0|1] [-p 0|1]"
    echo "-b 1|0 set/unset Cgroups bandwidth control (default set)"
    echo "-s Create sub-groups for every task (default creates sub-group)"
    echo "-p create proportional shares based on cpus"
    exit
}

while getopts ":b:s:p:" arg
do
    case $arg in
    b)
        BANDWIDTH=$OPTARG
        if [ $BANDWIDTH -gt 1 ] || [ $BANDWIDTH -lt 0 ]
        then
            usage
        fi
        ;;
    s)
        SUBGROUP=$OPTARG
        if [ $SUBGROUP -gt 1 ] || [ $SUBGROUP -lt 0 ]
        then
            usage
        fi
        ;;
    p)
        PRO_SHARES=$OPTARG
        if [ $PRO_SHARES -gt 1 ] || [ $PRO_SHARES -lt 0 ]
        then
            usage
        fi
        ;;
    *)
        usage
        ;;
    esac
done

if [ ! -d $MOUNT ]
then
    mkdir -p $MOUNT
fi

test()
{
    echo -n "[ "
    if [ $1 -eq 0 ]
    then
        echo -ne '\E[42;40mOk'
    else
        echo -ne '\E[31;40mFailed'
        tput sgr0
        echo " ]"
        exit
    fi
    tput sgr0
    echo " ]"
}

mount_cgrp()
{
    echo -n "Mounting root cgroup "
    mount -t cgroup -ocpu,cpuset,cpuacct none $MOUNT &> /dev/null
    test $?
}

umount_cgrp()
{
    echo -n "Unmounting root cgroup "
    cd /root/
    umount $MOUNT
    test $?
}

create_hierarchy()
{
    mount_cgrp
    cpuset_mem=`cat $MOUNT/cpuset.mems`
    cpuset_cpu=`cat $MOUNT/cpuset.cpus`
    echo -n "creating groups/sub-groups ..."
    for (( i=1; i<=5; i++ ))
    do
        mkdir $MOUNT/$i
        echo $cpuset_mem > $MOUNT/$i/cpuset.mems
        echo $cpuset_cpu > $MOUNT/$i/cpuset.cpus
        echo -n ".."
        if [ $SUBGROUP -eq 1 ]
        then
            jj=$(eval echo "\$NR_TASKS$i")
            for (( j=1; j<=$jj; j++ ))
            do
                mkdir -p $MOUNT/$i/$j
                echo $cpuset_mem > $MOUNT/$i/$j/cpuset.mems
                echo $cpuset_cpu > $MOUNT/$i/$j/cpuset.cpus
                echo -n ".."
            done
        fi
    done
    echo "."
}

cleanup()
{
    pkill -9 while1 &> /dev/null
    sleep 10
    echo -n "Umount groups/sub-groups .."
    for (( i=1; i<=5; i++ ))
    do
        if [ $SUBGROUP -eq 1 ]
        then
            jj=$(eval echo "\$NR_TASKS$i")
            for (( j=1; j<=$jj; j++ ))
            do
                rmdir $MOUNT/$i/$j
                echo -n ".."
            done
        fi
        rmdir $MOUNT/$i
        echo -n ".."
    done
    echo " "
    umount_cgrp
}

load_tasks()
{
    for (( i=1; i<=5; i++ ))
    do
        jj=$(eval echo "\$NR_TASKS$i")
        shares="1024"
        if [ $PRO_SHARES -eq 1 ]
        then
            eval shares=$(echo "$jj * 1024" | bc)
        fi
        echo $shares > $MOUNT/$i/cpu.shares
        for (( j=1; j<=$jj; j++ ))
        do
            echo "-1" > $MOUNT/$i/cpu.cfs_quota_us
            echo "500000" > $MOUNT/$i/cpu.cfs_period_us
            if [ $SUBGROUP -eq 1 ]
            then
                $LOAD &
                echo $! > $MOUNT/$i/$j/tasks
                echo "1024" > $MOUNT/$i/$j/cpu.shares
                if [ $BANDWIDTH -eq 1 ]
                then
                    echo "500000" > $MOUNT/$i/$j/cpu.cfs_period_us
                    echo "250000" > $MOUNT/$i/$j/cpu.cfs_quota_us
                fi
            else
                $LOAD &
                echo $! > $MOUNT/$i/tasks
                echo $shares > $MOUNT/$i/cpu.shares
                if [ $BANDWIDTH -eq 1 ]
                then
                    echo "500000" > $MOUNT/$i/cpu.cfs_period_us
                    echo "250000" > $MOUNT/$i/cpu.cfs_quota_us
                fi
            fi
        done
    done
    echo "Capturing idle cpu time with vmstat...."
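    # Sample system-wide CPU usage every 2 seconds in the background (up to
    # 100 samples, i.e. longer than the 60 second run); the idle column of
    # vmstat_log is averaged in capture_results() below.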
    vmstat 2 100 &> vmstat_log &
}

pin_tasks()
{
    cpu=0
    count=1
    for (( i=1; i<=5; i++ ))
    do
        if [ $SUBGROUP -eq 1 ]
        then
            jj=$(eval echo "\$NR_TASKS$i")
            for (( j=1; j<=$jj; j++ ))
            do
                if [ $count -gt 2 ]
                then
                    cpu=$((cpu+1))
                    count=1
                fi
                echo $cpu > $MOUNT/$i/$j/cpuset.cpus
                count=$((count+1))
            done
        else
            case $i in
            1) echo 0 > $MOUNT/$i/cpuset.cpus;;
            2) echo 1 > $MOUNT/$i/cpuset.cpus;;
            3) echo "2-3" > $MOUNT/$i/cpuset.cpus;;
            4) echo "4-6" > $MOUNT/$i/cpuset.cpus;;
            5) echo "7-15" > $MOUNT/$i/cpuset.cpus;;
            esac
        fi
    done
}

print_results()
{
    eval gtot=$(cat sched_log|grep -i while|sed 's/R//g'|awk '{gtot+=$7};END{printf "%f", gtot}')
    for (( i=1; i<=5; i++ ))
    do
        eval temp=$(cat sched_log_$i|sed 's/R//g'| awk '{gtot+=$7};END{printf "%f",gtot}')
        eval tavg=$(echo "scale=4;(($temp / $gtot) * $1)/100 " | bc)
        eval avg=$(echo "scale=4;($temp / $gtot) * 100" | bc)
        eval pretty_tavg=$( echo "scale=4; $tavg * 100"| bc) # For pretty format
        echo "Bandwidth of Group $i = $avg i.e = $pretty_tavg% of non-Idle CPU time $1%"
        if [ $SUBGROUP -eq 1 ]
        then
            jj=$(eval echo "\$NR_TASKS$i")
            for (( j=1; j<=$jj; j++ ))
            do
                eval tmp=$(cat sched_log_$i-$j|sed 's/R//g'| awk '{gtot+=$7};END{printf "%f",gtot}')
                eval stavg=$(echo "scale=4;($tmp / $temp) * 100" | bc)
                eval pretty_stavg=$(echo "scale=4;(($tmp / $temp) * $tavg) * 100" | bc)
                echo -n "|"
                echo -e "...... subgroup $i/$j\t= $stavg\ti.e = $pretty_stavg% of $pretty_tavg% Groups non-Idle CPU time"
            done
        fi
        echo " "
        echo " "
    done
}

capture_results()
{
    cat /proc/sched_debug > sched_log
    pkill -9 vmstat -c
    avg=$(cat vmstat_log |grep -iv "system"|grep -iv "swpd"|awk ' { if ( NR != 1) {id+=$15 }}END{print (id/NR)}')
    rem=$(echo "scale=2; 100 - $avg" |bc)
    echo "Average CPU Idle percentage $avg%"
    echo "Bandwidth shared with remaining non-Idle $rem%"
    for (( i=1; i<=5; i++ ))
    do
        cat sched_log |grep -i while1|grep -i " \/$i" > sched_log_$i
        if [ $SUBGROUP -eq 1 ]
        then
            jj=$(eval echo "\$NR_TASKS$i")
            for (( j=1; j<=$jj; j++ ))
            do
                cat sched_log |grep -i while1|grep -i " \/$i\/$j" > sched_log_$i-$j
            done
        fi
    done
    print_results $rem
}

create_hierarchy
pin_tasks
load_tasks
sleep 60
capture_results
cleanup
exit

Thanks,
Kamalesh.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/