Hi all,
Please find attached the latest version of bandwidth control for the normal
scheduling class. This revision has undergone fairly extensive changes since
the previous version, based largely on the observation that many of the edge
conditions requiring special casing around update_curr() were a result of
introducing side effects into that operation. By introducing an interstitial
state, where we recognize that the runqueue is over bandwidth but do not mark
it throttled until we can actually remove it from the CPU, we avoid the
previous possible interactions with throttled entities, which eliminates some
head-scratching corner cases.
In particular I'd like to thank Peter Zijlstra who provided extensive comments
and review for the last series.
Changes since v4:
New features:
- Bandwidth control now properly works with hotplug; throttled tasks are
returned to the rq on cpu-offline so that they can be migrated.
- It is now validated that hierarchies are consistent with their resource
reservations. That is, the sum of a sub-hierarchy's bandwidth requirements
will not exceed the bandwidth provisioned to the parent. (This enforcement
is optional and controlled by a sysctl; a sketch of the constraint follows
this list.)
- It is now tracked whether quota is 'current' or not; this allows for the
expiration of slack quota from prior scheduling periods, as well as the
return of quota by idling cpus.
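For concreteness, one way to state the consistency constraint above,
normalizing each group's quota by its period (this is a sketch of the rule as
described, not necessarily the exact in-kernel check): for a group g with
children c,
    \sum_{c \in \text{children}(g)} \frac{\text{quota}_c}{\text{period}_c}
        \le \frac{\text{quota}_g}{\text{period}_g}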
Major:
- The atomicity of update_curr() is restored; it will now only perform the
accounting required for bandwidth control. The act of checking whether
quota has been exceeded is made explicit. This avoids the corner cases
previously required in enqueue/dequeue_entity.
- The act of throttling is now deferred until we reach put_task(). This means
that the transition to throttled is atomic, and the special-case interactions
with a running-but-throttled entity (in the case where we previously could not
immediately handle a resched) are no longer needed.
- The correction for shares accounting during a throttled period has been
extended to work for the children of a throttled run-queue.
- Throttled cfs_rqs are now explicitly tracked using a list; this avoids the
need to revisit every cfs_rq on period expiration on large systems.
Minor:
- Hierarchical task accounting is no longer a separate hierarchy evaluation.
- (Buglet) nr_running accounting added to sched::stoptask
- (Buglet) Will no longer load balance the child hierarchies of a throttled
entity.
- (Fixlet) don't process dequeued entities twice in dequeue_task_fair()
- walk_tg_tree refactored to allow for partial sub-tree evaluations.
- Dropped some #ifdefs
- Fixed some compile warnings with various CONFIG permutations
- Local bandwidth is now consumed "negatively"
- Quota slices now 5ms
Probably some others that I missed; there was a lot of refactoring and cleanup.
Interface:
----------
Three new cgroupfs files are exported by the cpu subsystem:
cpu.cfs_period_us : period over which bandwidth is to be regulated
cpu.cfs_quota_us : bandwidth available for consumption per period
cpu.stat : statistics (such as number of throttled periods and
total throttled time)
One important interface change that this introduces (versus the rate-limits
proposal) is that the defined bandwidth becomes an absolute quantity.
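For illustration, a minimal usage sketch (the mount point and group name
below are arbitrary, assumed only for the example):
    mount -t cgroup -o cpu none /mnt
    mkdir /mnt/limited
    # allow the group at most half a CPU: 50ms of quota per 100ms period
    echo 100000 > /mnt/limited/cpu.cfs_period_us
    echo 50000 > /mnt/limited/cpu.cfs_quota_us
    # move the current shell into the group, then inspect its statistics
    echo $$ > /mnt/limited/tasks
    cat /mnt/limited/cpu.stat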
Previous postings:
-----------------
v4:
https://lkml.org/lkml/2011/2/23/44
v3:
https://lkml.org/lkml/2010/10/12/44
v2:
http://lkml.org/lkml/2010/4/28/88
Original posting:
http://lkml.org/lkml/2010/2/12/393
Prior approaches:
http://lkml.org/lkml/2010/1/5/44 ["CFS Hard limits v5"]
Thanks,
- Paul
On Tue, Mar 22, 2011 at 08:03:26PM -0700, Paul Turner wrote:
> Hi all,
>
> Please find attached the latest version of bandwidth control for the normal
> scheduling class.
(snip)
I am seeing hard lockups occasionally, not always reproducible. This particular
one occurred when I had 1 task in a bandwidth-constrained parent group and 10
tasks in its child group, which has infinite bandwidth, on a 16-CPU system.
Here is the log...
WARNING: at kernel/watchdog.c:226 watchdog_overflow_callback+0x98/0xc0()
Hardware name: System x3650 M2 -[794796Q]-
Watchdog detected hard LOCKUP on cpu 0
Modules linked in: autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 ext4 jbd2 dm_mirror dm_region_hash dm_log dm_mod kvm_intel kvm uinput matroxfb_base matroxfb_DAC1064 matroxfb_accel matroxfb_Ti3026 matroxfb_g450 g450_pll matroxfb_misc cdc_ether usbnet mii ses enclosure sg serio_raw pcspkr i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca i7core_edac edac_core bnx2 ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif usb_storage pata_acpi ata_generic ata_piix megaraid_sas qla2xxx scsi_transport_fc scsi_tgt [last unloaded: microcode]
Pid: 0, comm: swapper Not tainted 2.6.38-tip #6
Call Trace:
<NMI> [<ffffffff8106558f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff81065686>] warn_slowpath_fmt+0x46/0x50
[<ffffffff810d8158>] watchdog_overflow_callback+0x98/0xc0
[<ffffffff8110fb39>] __perf_event_overflow+0x99/0x250
[<ffffffff8110d2dd>] ? perf_event_update_userpage+0xbd/0x140
[<ffffffff8110d220>] ? perf_event_update_userpage+0x0/0x140
[<ffffffff81110234>] perf_event_overflow+0x14/0x20
[<ffffffff8101eb66>] intel_pmu_handle_irq+0x306/0x560
[<ffffffff8150e4c1>] ? hw_breakpoint_exceptions_notify+0x21/0x200
[<ffffffff8150faf6>] ? kprobe_exceptions_notify+0x16/0x450
[<ffffffff8150e6f0>] perf_event_nmi_handler+0x50/0xc0
[<ffffffff81510aa4>] notifier_call_chain+0x94/0xd0
[<ffffffff81510b4c>] __atomic_notifier_call_chain+0x6c/0xa0
[<ffffffff81510ae0>] ? __atomic_notifier_call_chain+0x0/0xa0
[<ffffffff81510b96>] atomic_notifier_call_chain+0x16/0x20
[<ffffffff81510bce>] notify_die+0x2e/0x30
[<ffffffff8150d89a>] do_nmi+0xda/0x2a0
[<ffffffff8150d4e0>] nmi+0x20/0x39
[<ffffffff8109f4a3>] ? register_lock_class+0xb3/0x550
<<EOE>> <IRQ> [<ffffffff81013e73>] ? native_sched_clock+0x13/0x60
[<ffffffff810131e9>] ? sched_clock+0x9/0x10
[<ffffffff81090e0d>] ? sched_clock_cpu+0xcd/0x110
[<ffffffff810a2348>] __lock_acquire+0x98/0x15c0
[<ffffffff810a2628>] ? __lock_acquire+0x378/0x15c0
[<ffffffff81013e73>] ? native_sched_clock+0x13/0x60
[<ffffffff810131e9>] ? sched_clock+0x9/0x10
[<ffffffff81049880>] ? tg_unthrottle_down+0x0/0x50
[<ffffffff810a3928>] lock_acquire+0xb8/0x150
[<ffffffff81059e9c>] ? distribute_cfs_bandwidth+0xfc/0x1d0
[<ffffffff8150c146>] _raw_spin_lock+0x36/0x70
[<ffffffff81059e9c>] ? distribute_cfs_bandwidth+0xfc/0x1d0
[<ffffffff81059e9c>] distribute_cfs_bandwidth+0xfc/0x1d0
[<ffffffff81059da0>] ? distribute_cfs_bandwidth+0x0/0x1d0
[<ffffffff8105a0eb>] sched_cfs_period_timer+0x9b/0x100
[<ffffffff8105a050>] ? sched_cfs_period_timer+0x0/0x100
[<ffffffff8108e631>] __run_hrtimer+0x91/0x1f0
[<ffffffff8108e9fa>] hrtimer_interrupt+0xda/0x250
[<ffffffff8109a5d9>] tick_do_broadcast+0x49/0x90
[<ffffffff8109a71c>] tick_handle_oneshot_broadcast+0xfc/0x140
[<ffffffff8100ecae>] timer_interrupt+0x1e/0x30
[<ffffffff810d8bcd>] handle_irq_event_percpu+0x5d/0x230
[<ffffffff810d8e28>] handle_irq_event+0x58/0x80
[<ffffffff810dbaae>] ? handle_edge_irq+0x1e/0xe0
[<ffffffff810dbaff>] handle_edge_irq+0x6f/0xe0
[<ffffffff8100e449>] handle_irq+0x49/0xa0
[<ffffffff81516bed>] do_IRQ+0x5d/0xe0
[<ffffffff8150ce53>] ret_from_intr+0x0/0x1a
<EOI> [<ffffffff8109dbbd>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff812dd074>] ? acpi_idle_enter_bm+0x242/0x27a
[<ffffffff812dd06d>] ? acpi_idle_enter_bm+0x23b/0x27a
[<ffffffff813ee532>] cpuidle_idle_call+0xc2/0x260
[<ffffffff8100c07c>] cpu_idle+0xbc/0x110
[<ffffffff814f0937>] rest_init+0xb7/0xc0
[<ffffffff814f0880>] ? rest_init+0x0/0xc0
[<ffffffff81dfffa2>] start_kernel+0x41c/0x427
[<ffffffff81dff346>] x86_64_start_reservations+0x131/0x135
[<ffffffff81dff44d>] x86_64_start_kernel+0x103/0x112
On 03/23/2011 11:03 AM, Paul Turner wrote:
> Hi all,
>
> Please find attached the latest version of bandwidth control for the normal
> scheduling class.
(snip)
Hi Paul,
I have written some code to test this patchset. When running the attached test
case, the test program is blocked; it seems the child tasks cannot be killed.
After one or two hours, the watchdog expires and the box crashes.
On Thu, Mar 31, 2011 at 12:57 AM, Xiao Guangrong
<[email protected]> wrote:
> On 03/23/2011 11:03 AM, Paul Turner wrote:
>> Hi all,
>>
>> Please find attached the latest version of bandwidth control for the normal
>> scheduling class.
> (snip)
>
> Hi Paul,
>
> I have written some code to test this patchset. When running the attached
> test case, the test program is blocked; it seems the child tasks cannot be
> killed. After one or two hours, the watchdog expires and the box crashes.
>
Sorry, I was out last week.
I'm taking a look at this now; I believe the interaction lies
somewhere within the quota return paths.
Thanks for taking the time to test!
- Paul
On Tue, 2011-03-22 at 20:03 -0700, Paul Turner wrote:
>
> In particular I'd like to thank Peter Zijlstra who provided extensive comments
> and review for the last series.
While I still loathe the feature, the patches look much saner than last
time.
Hi Paul,
I'm sorry for sending this mail as a new thread; I didn't
receive your v6 patchset from LKML.
It seems the patchset cannot be applied; there is a conflict between
patch 3 and patch 5:
========Quote========
[patch 03/15] sched: introduce primitives to account for CFS bandwidth tracking
+#ifdef CONFIG_CFS_BANDWIDTH
+ int runtime_enabled;
+ s64 runtime_remaining;
+#endif
#endif
};
+#ifdef CONFIG_CFS_BANDWIDTH
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+ return &tg->cfs_bandwidth;
+}
+
+static inline u64 default_cfs_period(void);
+
[patch 05/15] sched: add a timer to handle CFS bandwidth refresh
@@ -394,12 +400,38 @@ static inline struct cfs_bandwidth *tg_c
#ifdef CONFIG_CFS_BANDWIDTH
static inline u64 default_cfs_period(void);
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+ struct cfs_bandwidth *cfs_b =
+ container_of(timer, struct cfs_bandwidth, period_timer);
+ ktime_t now;
+ int overrun;
========End quote========
I downloaded the patchset from the Internet; did I miss a newer version?
I have done some testing after fixing the conflict by hand; the test below can
cause a box crash:
========Quote cpu_hotplug.sh ========
#!/bin/sh
ROOT_PATH="/mnt"
def_quota=30000
def_period=100000
pid=
creat_process()
{
	nice -n $1 cat /dev/zero > /dev/null &
	pid=$!
	if [ $2 -ne -1 ]; then
		taskset -pc $2 $pid &> /dev/null
	fi
}
HOTPLUG_PATH=$ROOT_PATH/cpu-hotplug
mount -t cgroup -o cpu none $ROOT_PATH
mkdir $HOTPLUG_PATH
echo $def_quota > $HOTPLUG_PATH/cpu.cfs_quota_us
echo $def_period > $HOTPLUG_PATH/cpu.cfs_period_us
# create 3 tasks pinned to each of cpus 1, 2 and 3
for((i=0;i<3;i++))
{
	creat_process -6 1
	echo $pid > $HOTPLUG_PATH/tasks
}
for((i=0;i<3;i++))
{
	creat_process -6 2
	echo $pid > $HOTPLUG_PATH/tasks
}
for((i=0;i<3;i++))
{
	creat_process -6 3
	echo $pid > $HOTPLUG_PATH/tasks
}
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online
echo 0 > /sys/devices/system/cpu/cpu2/online
echo 1 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu3/online
echo 1 > /sys/devices/system/cpu/cpu3/online
killall -9 cat
rmdir $HOTPLUG_PATH
umount $ROOT_PATH
======== End quote cpu_hotplug.sh ========
Sorry to disturb you if the bug is already known.
Thanks!
Hi Paul and Xiao,
Please check/test a fix at the foot of this mail.
(2011/05/20 11:12), Xiao Guangrong wrote:
> Hi Paul,
>
> I'm sorry for sending this mail as a new thread; I didn't
> receive your v6 patchset from LKML.
>
> It seems the patchset cannot be applied; there is a conflict between
> patch 3 and patch 5:
>
> ========Quote========
(snip)
> ========End quote========
Maybe I've fixed it by hand, or git-am is just that good.
I believe Paul will get it right next time.
>
> I downloaded the patchset from the Internet; did I miss a newer version?
>
> I have done some testing after fixing the conflict by hand; the test below
> can cause a box crash:
>
> ========Quote cpu_hotplug.sh ========
(snip)
> ======== End quote cpu_hotplug.sh ========
>
> Sorry to disturb you if the bug is already known.
>
> Thanks!
Thank you for reporting it, Xiao!
I confirmed that running your test causes a hang-up on my box.
And after some investigation, I found that this is an infinite loop
in migrate_task() due to a miscalculation of rq->nr_running: when a
task is queued to a throttled entity, nr_running is incremented both
at queuing and at unthrottling.
I made a fix for this bug and it seems to work well for me.
Could you try this patch and give us your feedback, Xiao?
Thanks,
H.Seto
---
kernel/sched_fair.c | 28 +++++++++++++---------------
1 files changed, 13 insertions(+), 15 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 3936393..544072f 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1537,7 +1537,7 @@ static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_down, tg_nop,
(void *)&udd);
- if (!cfs_rq->load.weight)
+ if (!cfs_rq->h_nr_running)
return;
task_delta = cfs_rq->h_nr_running;
@@ -1843,10 +1843,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq->h_nr_running++;
/* end evaluation on throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq)) {
- se = NULL;
- break;
- }
+ if (cfs_rq_throttled(cfs_rq))
+ goto done;
+
flags = ENQUEUE_WAKEUP;
}
@@ -1855,14 +1854,14 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq->h_nr_running++;
if (cfs_rq_throttled(cfs_rq))
- break;
+ goto done;
update_cfs_load(cfs_rq, 0);
update_cfs_shares(cfs_rq);
}
- if (!se)
- inc_nr_running(rq);
+ inc_nr_running(rq);
+done:
hrtick_update(rq);
}
@@ -1885,10 +1884,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq->h_nr_running--;
/* end evaluation on throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq)) {
- se = NULL;
- break;
- }
+ if (cfs_rq_throttled(cfs_rq))
+ goto done;
+
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
/* Avoid double update below. */
@@ -1910,14 +1908,14 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq->h_nr_running--;
if (cfs_rq_throttled(cfs_rq))
- break;
+ goto done;
update_cfs_load(cfs_rq, 0);
update_cfs_shares(cfs_rq);
}
- if (!se)
- dec_nr_running(rq);
+ dec_nr_running(rq);
+done:
hrtick_update(rq);
}
--
1.7.4.4
On 05/24/2011 08:53 AM, Hidetoshi Seto wrote:
> I confirmed that running your test causes a hang-up on my box.
> And after some investigation, I found that this is an infinite loop
> in migrate_task() due to a miscalculation of rq->nr_running: when a
> task is queued to a throttled entity, nr_running is incremented both
> at queuing and at unthrottling.
>
> I made a fix for this bug and it seems to work well for me.
> Could you try this patch and give us your feedback, Xiao?
>
Thanks for your nice fix, Seto! Everything goes well now!
Tested-by: Xiao Guangrong <[email protected]>
[ Sorry for the delayed response, I was out on vacation for the second
half of May until last week -- I've now caught up on email and am
preparing the next posting ]
On Mon, May 23, 2011 at 5:53 PM, Hidetoshi Seto
<[email protected]> wrote:
> Hi Paul and Xiao,
>
> Please check/test a fix at the foot of this mail.
>
> (2011/05/20 11:12), Xiao Guangrong wrote:
>> Hi Paul,
>>
>> I'm sorry for sending this mail as a new thread; I didn't
>> receive your v6 patchset from LKML.
>>
>> It seems the patchset cannot be applied; there is a conflict between
>> patch 3 and patch 5:
>>
>> ========Quote========
> (snip)
>> ========End quote========
>
> Maybe I've fixed it by hand, or git-am is just that good.
>
> I believe Paul will get it right next time.
>
>>
>> I downloaded the patchset from the Internet; did I miss a newer version?
>>
>> I have done some testing after fixing the conflict by hand; the test below
>> can cause a box crash:
>>
>> ========Quote cpu_hotplug.sh ========
> (snip)
>> ======== End quote cpu_hotplug.sh ========
>>
>> Sorry to disturb you if the bug is already known.
>>
>> Thanks!
>
> Thank you for reporting it, Xiao!
>
> I confirmed that running your test causes a hang-up on my box.
> And after some investigation, I found that this is an infinite loop
> in migrate_task() due to a miscalculation of rq->nr_running: when a
> task is queued to a throttled entity, nr_running is incremented both
> at queuing and at unthrottling.
>
> I made a fix for this bug and it seems to work well for me.
> Could you try this patch and give us your feedback, Xiao?
>
> Thanks,
> H.Seto
>
> ---
> kernel/sched_fair.c | 28 +++++++++++++---------------
> 1 files changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 3936393..544072f 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1537,7 +1537,7 @@ static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_down, tg_nop,
> (void *)&udd);
>
> - if (!cfs_rq->load.weight)
> + if (!cfs_rq->h_nr_running)
> return;
>
Why change here?
> task_delta = cfs_rq->h_nr_running;
> @@ -1843,10 +1843,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> cfs_rq->h_nr_running++;
>
> /* end evaluation on throttled cfs_rq */
> - if (cfs_rq_throttled(cfs_rq)) {
> - se = NULL;
Hmm.. yeah, this is a casualty of moving the h_nr_running computations
in-line as part of the previous refactoring within the last few
releases. This optimization (setting se = NULL to skip the second
half) obviously won't work properly for detecting whether we made it
to the end of the tree.
> - break;
> - }
> + if (cfs_rq_throttled(cfs_rq))
> + goto done;
> +
> flags = ENQUEUE_WAKEUP;
> }
>
> @@ -1855,14 +1854,14 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> cfs_rq->h_nr_running++;
>
> if (cfs_rq_throttled(cfs_rq))
> - break;
> + goto done;
>
> update_cfs_load(cfs_rq, 0);
> update_cfs_shares(cfs_rq);
> }
>
> - if (!se)
> - inc_nr_running(rq);
> + inc_nr_running(rq);
> +done:
> hrtick_update(rq);
> }
>
> @@ -1885,10 +1884,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> cfs_rq->h_nr_running--;
>
> /* end evaluation on throttled cfs_rq */
> - if (cfs_rq_throttled(cfs_rq)) {
> - se = NULL;
> - break;
> - }
> + if (cfs_rq_throttled(cfs_rq))
> + goto done;
> +
> /* Don't dequeue parent if it has other entities besides us */
> if (cfs_rq->load.weight) {
> /* Avoid double update below. */
> @@ -1910,14 +1908,14 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> cfs_rq->h_nr_running--;
>
> if (cfs_rq_throttled(cfs_rq))
> - break;
> + goto done;
>
> update_cfs_load(cfs_rq, 0);
> update_cfs_shares(cfs_rq);
> }
>
> - if (!se)
> - dec_nr_running(rq);
> + dec_nr_running(rq);
> +done:
> hrtick_update(rq);
> }
>
> --
> 1.7.4.4
>
>
>
How about instead something like the following? We can actually take
advantage of the second loop always executing by deferring the
accounting update on a throttled entity. This keeps the control flow
within dequeue_task_fair linear.
What do you think of (untested):
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1744,13 +1744,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
break;
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, flags);
- cfs_rq->h_nr_running++;
- /* end evaluation on throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq)) {
- se = NULL;
+ /* note: ordering with throttle check to perform
+  * h_nr_running accounting on throttled entity below */
+ if (cfs_rq_throttled(cfs_rq))
break;
- }
+
+ cfs_rq->h_nr_running++;
flags = ENQUEUE_WAKEUP;
}
@@ -1786,13 +1785,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);
- cfs_rq->h_nr_running--;
- /* end evaluation on throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq)) {
- se = NULL;
+ /* note: ordering with throttle check to perform
+  * h_nr_running accounting on throttled entity below */
+ if (cfs_rq_throttled(cfs_rq))
break;
- }
+
+ cfs_rq->h_nr_running--;
/* Don't dequeue parent if it has other entities besides
(2011/06/08 11:54), Paul Turner wrote:
> On Mon, May 23, 2011 at 5:53 PM, Hidetoshi Seto
> <[email protected]> wrote:
>
>> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
>> index 3936393..544072f 100644
>> --- a/kernel/sched_fair.c
>> +++ b/kernel/sched_fair.c
>> @@ -1537,7 +1537,7 @@ static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>> walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_down, tg_nop,
>> (void *)&udd);
>>
>> - if (!cfs_rq->load.weight)
>> + if (!cfs_rq->h_nr_running)
>> return;
>>
>
> Why change here?
I'm confused a bit - just curious whether by any chance there is a throttled
cfs_rq that has (load.weight, h_nr_running) = (0, >0).
>> task_delta = cfs_rq->h_nr_running;
>> @@ -1843,10 +1843,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> cfs_rq->h_nr_running++;
>>
>> /* end evaluation on throttled cfs_rq */
>> - if (cfs_rq_throttled(cfs_rq)) {
>> - se = NULL;
>
> Hmm.. yeah, this is a casualty of moving the h_nr_running computations
> in-line as part of the previous refactoring within the last few
> releases. This optimization (setting se = NULL to skip the second
> half) obviously won't work properly for detecting whether we made it
> to the end of the tree.
>
(snip)
>
> How about instead something like the following? We can actually take
> advantage of the second loop always executing by deferring the
> accounting update on a throttled entity. This keeps the control flow
> within dequeue_task_fair linear.
>
> What do you think of (untested):
>
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1744,13 +1744,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> break;
> cfs_rq = cfs_rq_of(se);
> enqueue_entity(cfs_rq, se, flags);
> - cfs_rq->h_nr_running++;
>
> - /* end evaluation on throttled cfs_rq */
> - if (cfs_rq_throttled(cfs_rq)) {
> - se = NULL;
> + /* note: ordering with throttle check to perform
> +  * h_nr_running accounting on throttled entity below */
> + if (cfs_rq_throttled(cfs_rq))
> break;
> - }
> +
> + cfs_rq->h_nr_running++;
> flags = ENQUEUE_WAKEUP;
> }
>
> @@ -1786,13 +1785,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> dequeue_entity(cfs_rq, se, flags);
> - cfs_rq->h_nr_running--;
>
> - /* end evaluation on throttled cfs_rq */
> - if (cfs_rq_throttled(cfs_rq)) {
> - se = NULL;
> + /* note: ordering with throttle check to perform
> +  * h_nr_running accounting on throttled entity below */
> + if (cfs_rq_throttled(cfs_rq))
> break;
> - }
> +
> + cfs_rq->h_nr_running--;
> /* Don't dequeue parent if it has other entities besides
Looks good, if it abides by the nature of the scheduler code ;-)
Thanks,
H.Seto