2011-02-16 03:20:42

by Paul Turner

Subject: [CFS Bandwidth Control v4 0/7] Introduction

Hi all,

Please find attached v4 of CFS bandwidth control; while this rebase against
some of the latest SCHED_NORMAL code is new, the features and methodology are
fairly mature at this point and have proved both effective and stable for
several workloads.

As always, all comments/feedback welcome.

Changes since v3:
- Rebased to current tip, updated to work with the new group scheduling accounting
- (Bug fix) Fixed a race with unthrottling (due to changing the global limit)
- (Bug fix) Fixed buddy interactions -- in particular, prevent buddy
nominations from re-picking throttled entities

The skeleton of our approach is as follows:
- We maintain a global (per-tg) pool of unassigned quota. Within it
we track the bandwidth period, the quota per period, and the runtime remaining in
the current period. As bandwidth is used within a period it is decremented
from runtime. Runtime is currently synchronized using a spinlock; in the
current implementation there's no reason this couldn't be done using
atomic ops instead, however the spinlock allows a little more flexibility
for experimentation with other schemes.
- When a cfs_rq participating in a bandwidth-constrained task_group executes,
it acquires time from the global pool in sysctl_sched_cfs_bandwidth_slice-sized
chunks (currently defaulting to 10ms); this synchronizes under rq->lock and
is part of the update_curr path.
- Throttled entities are dequeued; we protect against their re-introduction to
the scheduling hierarchy by checking for a per-cfs_rq throttled bit.

Interface:
----------
Three new cgroupfs files are exported by the cpu subsystem:
cpu.cfs_period_us : period over which bandwidth is to be regulated
cpu.cfs_quota_us : bandwidth available for consumption per period
cpu.stat : statistics (such as number of throttled periods and
total throttled time)
One important interface change that this introduces (versus the rate limits
proposal) is that the defined bandwidth becomes an absolute quantifier.
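As a minimal usage sketch (the group name and the values below are purely
illustrative): with a 500ms period and a 250ms quota the group is limited to
half of one CPU; because the quota is absolute, a quota larger than the period
(e.g. 1s per 500ms) would grant two CPUs worth of bandwidth.

# mount -t cgroup -o cpu none /mnt
# mkdir /mnt/limited
# echo 500000 > /mnt/limited/cpu.cfs_period_us   # 500ms period
# echo 250000 > /mnt/limited/cpu.cfs_quota_us    # 250ms of CPU time per period
# echo $$ > /mnt/limited/tasks                   # move this shell into the group
# cat /mnt/limited/cpu.stat                      # throttled periods / throttled time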

Previous postings:
-----------------
v3:
https://lkml.org/lkml/2010/10/12/44
v2:
http://lkml.org/lkml/2010/4/28/88
Original posting:
http://lkml.org/lkml/2010/2/12/393

Prior approaches:
http://lkml.org/lkml/2010/1/5/44 ("CFS Hard limits v5")

Thanks,

- Paul


2011-02-21 02:46:35

by Xiao Guangrong

Subject: Re: [CFS Bandwidth Control v4 0/7] Introduction

On 02/16/2011 11:18 AM, Paul Turner wrote:
> Hi all,
>
> Please find attached v4 of CFS bandwidth control; while this rebase against
> some of the latest SCHED_NORMAL code is new, the features and methodology are
> fairly mature at this point and have proved both effective and stable for
> several workloads.
>
> As always, all comments/feedback welcome.
>

Hi Paul,

Thanks for the great features!

I applied the patchset to the kvm tree, then tested with a kvm guest; unfortunately,
it does not seem to work normally.

The steps are as follows:

# mount -t cgroup -o cpu none /mnt/
# qemu-system-x86_64 -enable-kvm -smp 4 -m 512M -drive file=fc64.img,index=0,media=disk

I didn't do any configuration in the cgroup, and ran the kvm guest directly (not via libvirt);
the guest booted very slowly and I saw some "soft lockup" bugs reported in the guest.
I also noticed that while the guest was booting, one host CPU's usage was 100% for more
than 60s and the other CPUs were at 10%~30%.

And if cgroup is not mounted, the guest runs well.

The kernel config file is attached and my system cpu info is:

# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 30
model name : Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz
stepping : 5
cpu MHz : 1197.000
cache size : 8192 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 5584.73
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 30
model name : Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz
stepping : 5
cpu MHz : 1197.000
cache size : 8192 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 5585.03
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 30
model name : Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz
stepping : 5
cpu MHz : 1197.000
cache size : 8192 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
apicid : 4
initial apicid : 4
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 5585.03
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 30
model name : Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz
stepping : 5
cpu MHz : 1197.000
cache size : 8192 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 6
initial apicid : 6
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 5585.03
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:



Attachments:
.config (87.13 kB)

2011-02-22 10:27:53

by Bharata B Rao

Subject: Re: [CFS Bandwidth Control v4 0/7] Introduction

On Mon, Feb 21, 2011 at 10:47:12AM +0800, Xiao Guangrong wrote:
> On 02/16/2011 11:18 AM, Paul Turner wrote:
> > Hi all,
> >
> > Please find attached v4 of CFS bandwidth control; while this rebase against
> > some of the latest SCHED_NORMAL code is new, the features and methodology are
> > fairly mature at this point and have proved both effective and stable for
> > several workloads.
> >
> > As always, all comments/feedback welcome.
> >
>
> Hi Paul,
>
> Thanks for the great features!
>
> I applied the patchset to kvm tree, then tested with kvm guest, unfortunately,
> it seems don't work normally.
>
> The steps is follow:
>
> # mount -t cgroup -o cpu none /mnt/
> # qemu-system-x86_64 -enable-kvm -smp 4 -m 512M -drive file=fc64.img,index=0,media=disk
>
> Don't do any configuration in cgroup, and run the kvm guest directly (don't use libvirt),
> the guest booted very slowly and i saw some "soft lockup" bugs reported in the guest,
> i also noticed one CPU usage is 100% for more than 60s and other CPUs is 10%~30% in the host
> when guest was booting.
>
> And if cgroup is not mounted, the guest runs well.
>

Hi Xiao Guangrong,

Thanks for testing the patches. I do see some soft lockups in the
guest when I mount the cgroup and start a VM using qemu-kvm.

Will get back after further investigation.

Regards,
Bharata.

2011-02-23 07:43:28

by Paul Turner

Subject: Re: [CFS Bandwidth Control v4 0/7] Introduction

Thanks for the report Xiao -- I wasn't able to reproduce this yet with
a simple guest; I will try a more modern image tomorrow.

One suspicion is that this might be connected with the missing
runnable accounting in sched_stoptask.c.

On Sun, Feb 20, 2011 at 6:47 PM, Xiao Guangrong
<[email protected]> wrote:
> On 02/16/2011 11:18 AM, Paul Turner wrote:
>> Hi all,
>>
>> Please find attached v4 of CFS bandwidth control; while this rebase against
>> some of the latest SCHED_NORMAL code is new, the features and methodology are
>> fairly mature at this point and have proved both effective and stable for
>> several workloads.
>>
>> As always, all comments/feedback welcome.
>>
>
> Hi Paul,
>
> Thanks for the great features!
>
> I applied the patchset to kvm tree, then tested with kvm guest, unfortunately,
> it seems don't work normally.
>
> The steps is follow:
>
> # mount -t cgroup -o cpu none /mnt/
> # qemu-system-x86_64 -enable-kvm -smp 4 -m 512M -drive file=fc64.img,index=0,media=disk
>
> Don't do any configuration in cgroup, and run the kvm guest directly (don't use libvirt),
> the guest booted very slowly and i saw some "soft lockup" bugs reported in the guest,
> i also noticed one CPU usage is 100% for more than 60s and other CPUs is 10%~30% in the host
> when guest was booting.
>
> And if cgroup is not mounted, the guest runs well.
>
> The kernel config file is attached and my system cpu info is:
>
> # cat /proc/cpuinfo
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 30
> model name : Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz
> stepping : 5
> cpu MHz : 1197.000
> cache size : 8192 KB
> physical id : 0
> siblings : 4
> core id : 0
> cpu cores : 4
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 11
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
> bogomips : 5584.73
> clflush size : 64
> cache_alignment : 64
> address sizes : 36 bits physical, 48 bits virtual
> power management:
>
> processor : 1
> vendor_id : GenuineIntel
> cpu family : 6
> model : 30
> model name : Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz
> stepping : 5
> cpu MHz : 1197.000
> cache size : 8192 KB
> physical id : 0
> siblings : 4
> core id : 1
> cpu cores : 4
> apicid : 2
> initial apicid : 2
> fpu : yes
> fpu_exception : yes
> cpuid level : 11
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
> bogomips : 5585.03
> clflush size : 64
> cache_alignment : 64
> address sizes : 36 bits physical, 48 bits virtual
> power management:
>
> processor : 2
> vendor_id : GenuineIntel
> cpu family : 6
> model : 30
> model name : Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz
> stepping : 5
> cpu MHz : 1197.000
> cache size : 8192 KB
> physical id : 0
> siblings : 4
> core id : 2
> cpu cores : 4
> apicid : 4
> initial apicid : 4
> fpu : yes
> fpu_exception : yes
> cpuid level : 11
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
> bogomips : 5585.03
> clflush size : 64
> cache_alignment : 64
> address sizes : 36 bits physical, 48 bits virtual
> power management:
>
> processor : 3
> vendor_id : GenuineIntel
> cpu family : 6
> model : 30
> model name : Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz
> stepping : 5
> cpu MHz : 1197.000
> cache size : 8192 KB
> physical id : 0
> siblings : 4
> core id : 3
> cpu cores : 4
> apicid : 6
> initial apicid : 6
> fpu : yes
> fpu_exception : yes
> cpuid level : 11
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
> bogomips : 5585.03
> clflush size : 64
> cache_alignment : 64
> address sizes : 36 bits physical, 48 bits virtual
> power management:
>
>
>

2011-02-23 07:51:35

by Balbir Singh

Subject: Re: [CFS Bandwidth Control v4 0/7] Introduction

* Paul Turner <[email protected]> [2011-02-22 23:42:48]:

> Thanks for the report Xiao -- I wasn't able to reproduce this yet with
> a simple guest, I will try a more modern image tomorrow.
>
> One suspicion is that this might be connected with the missing
> runnable accounting in sched_stoptask.c.
>

I can confirm that; my guests work fine after the changes posted this
morning. I still see some lockdep errors, but none associated with the
scheduler.

--
Three Cheers,
Balbir

2011-02-23 07:56:55

by Paul Turner

Subject: Re: [CFS Bandwidth Control v4 0/7] Introduction

On Tue, Feb 22, 2011 at 11:51 PM, Balbir Singh
<[email protected]> wrote:
> * Paul Turner <[email protected]> [2011-02-22 23:42:48]:
>
>> Thanks for the report Xiao -- I wasn't able to reproduce this yet with
>> a simple guest, I will try a more modern image tomorrow.
>>
>> One suspicion is that this might be connected with the missing
>> runnable accounting in sched_stoptask.c.
>>
>
> I can confirm that, my guests work fine after the changes posted this
> morning. I still see some lockdep errors, but none associated with the
> scheduler.
>

Excellent!

Ok, if this is resolved I'll roll this up and repost tomorrow, thanks!

> --
> Three Cheers,
> Balbir
>

2011-02-23 08:31:44

by Bharata B Rao

Subject: Re: [CFS Bandwidth Control v4 0/7] Introduction

On Tue, Feb 22, 2011 at 11:56:20PM -0800, Paul Turner wrote:
> On Tue, Feb 22, 2011 at 11:51 PM, Balbir Singh
> <[email protected]> wrote:
> > * Paul Turner <[email protected]> [2011-02-22 23:42:48]:
> >
> >> Thanks for the report Xiao -- I wasn't able to reproduce this yet with
> >> a simple guest, I will try a more modern image tomorrow.
> >>
> >> One suspicion is that this might be connected with the missing
> >> runnable accounting in sched_stoptask.c.
> >>
> >
> > I can confirm that, my guests work fine after the changes posted this
> > morning. I still see some lockdep errors, but none associated with the
> > scheduler.
> >
>
> Excellent!
>
> Ok, if this is resolved I'll roll this up and repost tomorrow, thanks!

As I said in an earlier reply, I too saw the lockups in guests. However, those
lockups didn't occur consistently. After the sched_stoptask.c changes, I haven't
seen any lockups so far.

BTW, the lockups looked like this for me:

...
Mounting sysfs filesystem
Creating /dev
Creating initial device nodes
Loading /lib/kbd/keymaps/i386/qwerty/us.map
BUG: soft lockup - CPU#0 stuck for 61s! [init:1]
Modules linked in:

Pid: 1, comm: init Not tainted (2.6.27.24-170.2.68.fc10.i686 #1) KVM
EIP: 0060:[<c041b93e>] EFLAGS: 00000297 CPU: 0
EIP is at __ticket_spin_lock+0x13/0x19
EAX: c080ff00 EBX: 00000000 ECX: c05751ca EDX: 00008584
ESI: df92c904 EDI: dd700d80 EBP: df819e50 ESP: df819e50
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: 0805d3dc CR3: 1d706000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c06ab6cc>] lock_kernel+0x1f/0x2d
[<c05751dc>] tty_open+0x12/0x2aa
[<c0494a8c>] ? exact_match+0x0/0x7
[<c0494dc7>] chrdev_open+0x12b/0x142
[<c049141e>] __dentry_open+0x10e/0x1fc
[<c0491593>] nameidata_to_filp+0x1f/0x33
[<c0494c9c>] ? chrdev_open+0x0/0x142
[<c049b0b7>] do_filp_open+0x31c/0x611
[<c0422602>] ? set_next_entity+0x8b/0xf7
[<c041f81a>] ? need_resched+0x18/0x22
[<c049123c>] do_sys_open+0x42/0xb7
[<c04912f3>] sys_open+0x1e/0x26
[<c0404c8a>] syscall_call+0x7/0xb
=======================
BUG: soft lockup - CPU#3 stuck for 61s! [plymouthd:562]
Modules linked in:

Pid: 562, comm: plymouthd Not tainted (2.6.27.24-170.2.68.fc10.i686 #1) KVM
EIP: 0060:[<c041b93e>] EFLAGS: 00000293 CPU: 3
EIP is at __ticket_spin_lock+0x13/0x19
EAX: c080ff00 EBX: 00000000 ECX: df469198 EDX: 00008684
ESI: df469198 EDI: df424018 EBP: dd6b9dc0 ESP: dd6b9dc0
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: 0804e767 CR3: 1d6b7000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c06ab6cc>] lock_kernel+0x1f/0x2d
[<c04c827a>] proc_lookup_de+0x15/0xc0
[<c04c8337>] proc_lookup+0x12/0x17
[<c04c4934>] proc_root_lookup+0x11/0x2b
[<c0498fe2>] do_lookup+0xae/0x11e
[<c049a408>] __link_path_walk+0x57e/0x6b5
[<c049a92b>] path_walk+0x4c/0x9b
[<c049ab27>] do_path_lookup+0x12d/0x175
[<c049abb4>] __path_lookup_intent_open+0x45/0x76
[<c049abf5>] path_lookup_open+0x10/0x12
[<c049ae3c>] do_filp_open+0xa1/0x611
[<c04fadaa>] ? selinux_file_alloc_security+0x22/0x41
[<c052048c>] ? trace_hardirqs_on_thunk+0xc/0x10
[<c0404cd7>] ? restore_nocheck_notrace+0x0/0xe
[<c049123c>] do_sys_open+0x42/0xb7
[<c04912f3>] sys_open+0x1e/0x26
[<c0404c8a>] syscall_call+0x7/0xb
=======================
BUG: soft lockup - CPU#2 stuck for 63s! [setfont:559]
Modules linked in:

Pid: 559, comm: setfont Not tainted (2.6.27.24-170.2.68.fc10.i686 #1) KVM
EIP: 0060:[<c053a892>] EFLAGS: 00010283 CPU: 2
EIP is at vgacon_do_font_op+0x177/0x3ec
EAX: c095a1f8 EBX: df904000 ECX: 00000001 EDX: 00000043
ESI: 00000004 EDI: c095a1f4 EBP: dd743e0c ESP: dd743df0
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: 0826d000 CR3: 1d73c000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c053abac>] vgacon_font_set+0x59/0x208
[<c04427d8>] ? down+0x2b/0x2f
[<c053ab53>] ? vgacon_font_set+0x0/0x208
[<c057f659>] con_font_op+0x15c/0x378
[<c041f81a>] ? need_resched+0x18/0x22
[<c057a80f>] vt_ioctl+0x1338/0x14fd
[<c04fa419>] ? inode_has_perm+0x5b/0x65
[<c05794d7>] ? vt_ioctl+0x0/0x14fd
[<c0574b88>] tty_ioctl+0x665/0x6cf
[<c04fa689>] ? file_has_perm+0x7b/0x84
[<c0574523>] ? tty_ioctl+0x0/0x6cf
[<c049c74a>] vfs_ioctl+0x22/0x69
[<c049c9cc>] do_vfs_ioctl+0x23b/0x247
[<c04fa7ac>] ? selinux_file_ioctl+0x35/0x38
[<c049ca18>] sys_ioctl+0x40/0x5c
[<c0404c8a>] syscall_call+0x7/0xb
=======================
BUG: soft lockup - CPU#1 stuck for 63s! [loadkeys:560]
Modules linked in:

Pid: 560, comm: loadkeys Not tainted (2.6.27.24-170.2.68.fc10.i686 #1) KVM
EIP: 0060:[<c06ab54e>] EFLAGS: 00000282 CPU: 1
EIP is at _spin_unlock_irqrestore+0x2d/0x38
EAX: 00000282 EBX: 00000282 ECX: dd5b7f80 EDX: 00000282
ESI: c096f740 EDI: c096f7fc EBP: c08a5f78 ESP: c08a5f74
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: 009f0210 CR3: 1d73b000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c0593456>] serial8250_handle_port+0x220/0x230
[<c059114e>] ? serial_in+0x5a/0x61
[<c05934af>] serial8250_interrupt+0x49/0xc9
[<c0465313>] handle_IRQ_event+0x2f/0x64
[<c04662b5>] handle_edge_irq+0xb2/0xf4
[<c0466203>] ? handle_edge_irq+0x0/0xf4
[<c0406e6e>] do_IRQ+0xc7/0xfe
[<c0405668>] common_interrupt+0x28/0x30
[<c06ab54e>] ? _spin_unlock_irqrestore+0x2d/0x38
[<c058ef7d>] uart_start+0x4e/0x53
[<c058fa1e>] uart_write+0xce/0xd9
[<c0575d88>] write_chan+0x1e5/0x2b0
[<c0428218>] ? default_wake_function+0x0/0xd
[<c05742c9>] tty_write+0x155/0x1d5
[<c0575ba3>] ? write_chan+0x0/0x2b0
[<c05743a9>] redirected_tty_write+0x60/0x6d
[<c0574349>] ? redirected_tty_write+0x0/0x6d
[<c04930d6>] vfs_write+0x84/0xdf
[<c04931ca>] sys_write+0x3b/0x60
[<c0404c8a>] syscall_call+0x7/0xb
=======================
input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
Setting up hotplug.
Creating block device nodes.
Creating character device nodes.

2011-02-25 10:04:38

by Paul Turner

Subject: Re: [CFS Bandwidth Control v4 0/7] Introduction

On Thu, Feb 24, 2011 at 4:11 PM, jacob pan
<[email protected]> wrote:
> On Tue, 15 Feb 2011 19:18:31 -0800
> Paul Turner <[email protected]> wrote:
>
>> Hi all,
>>
>> Please find attached v4 of CFS bandwidth control; while this rebase
>> against some of the latest SCHED_NORMAL code is new, the features and
>> methodology are fairly mature at this point and have proved both
>> effective and stable for several workloads.
>>
>> As always, all comments/feedback welcome.
>>
>
> Hi Paul,
>
> Your patches provide a very useful but slightly different feature for
> what we need to manage idle time in order to save power. What we
> need is kind of a quota/period in terms of idle time. I have been
> playing with your patches and noticed that when the cgroup cpu usage
> exceeds the quota the effect of throttling is similar to what I have
> been trying to do with freezer subsystem. i.e. freeze and thaw at given
> period and percentage runtime.
> https://lkml.org/lkml/2011/2/15/314
>
> Have you thought about adding such feature (please see detailed
> description in the link above) to your patches?
>

So reading the description it seems like rooting everything in a
'freezer' container and then setting up a quota of

(1 - frozen_percentage) * nr_cpus * frozen_period * sec_to_usec

on a period of

frozen_period * sec_to_usec

Would provide the same functionality. Is there other unduplicated
functionality beyond this?
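
For concreteness, a rough sketch plugging in the numbers from your post (4 CPUs,
a 5 second frozen_period, 90% frozen), assuming the interface accepts a period of
that length and using a purely illustrative group path. The quota works out to
(1 - 0.90) * 4 * 5s = 2s and the period to 5s:

# echo 5000000 > /mnt/bg/cpu.cfs_period_us   # 5s period
# echo 2000000 > /mnt/bg/cpu.cfs_quota_us    # 2s of CPU time per period (10% of 4 CPUs)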

One thing that does seem undesirable about your approach is (as it
seems to be described) threads will not be able to take advantage of
naturally occurring idle cycles and will incur a potential performance
penalty even at use << frozen_percentage.

e.g. From your post

| |<-- 90% frozen - ->| | | |
____| |________________x_| |__________________| |_____

|<---- 5 seconds ---->|


Suppose no threads active until the wake up at x, suppose there is an
accompanying 1 second of work for that thread to do. That execution
time will be dilated to ~1.5 seconds (as it will span the 0.5 seconds
the freezer will stall for). But the true usage for this period is
~20% <<< 90%

> Thanks,
>
> Jacob
>

2011-02-25 13:06:50

by Jacob Pan

Subject: Re: [CFS Bandwidth Control v4 0/7] Introduction

On Fri, 25 Feb 2011 02:03:54 -0800
Paul Turner <[email protected]> wrote:

> On Thu, Feb 24, 2011 at 4:11 PM, jacob pan
> <[email protected]> wrote:
> > On Tue, 15 Feb 2011 19:18:31 -0800
> > Paul Turner <[email protected]> wrote:
> >
> >> Hi all,
> >>
> >> Please find attached v4 of CFS bandwidth control; while this rebase
> >> against some of the latest SCHED_NORMAL code is new, the features
> >> and methodology are fairly mature at this point and have proved
> >> both effective and stable for several workloads.
> >>
> >> As always, all comments/feedback welcome.
> >>
> >
> > Hi Paul,
> >
> > Your patches provide a very useful but slightly different feature
> > for what we need to manage idle time in order to save power. What we
> > need is kind of a quota/period in terms of idle time. I have been
> > playing with your patches and noticed that when the cgroup cpu usage
> > exceeds the quota the effect of throttling is similar to what I have
> > been trying to do with freezer subsystem. i.e. freeze and thaw at
> > given period and percentage runtime.
> > https://lkml.org/lkml/2011/2/15/314
> >
> > Have you thought about adding such feature (please see detailed
> > description in the link above) to your patches?
> >
>
> So reading the description it seems like rooting everything in a
> 'freezer' container and then setting up a quota of
>
> (1 - frozen_percentage) * nr_cpus * frozen_period * sec_to_usec
>
I guess you meant frozen_percentage is less than 1, i.e. 90 is 0.90; my
code treats 90 as 90. Just a clarification.
> on a period of
>
> frozen_period * sec_to_usec
>
> Would provide the same functionality. Is there other unduplicated
> functionality beyond this?
Do you mean the same functionality as your patch? Not really, since my
approach will stop the tasks based on hard time slices, but it seems your
patch will allow them to run if they don't exceed the quota. Am I
missing something?
That is the only functional difference I know of.

As the reviewer of the freezer patch pointed out, it is a more logical
fit to implement such a feature in the scheduler (yours) instead of the freezer. So
I am wondering if your patch can be extended to include limiting quota
on real time.

I did a comparison study between CFS BW and the freezer patch on skype with
identical quota settings, as you pointed out earlier. Both use a 2 sec
period and a .2 sec quota (10%). Skype typically uses 5% of the CPU on my
system when placing a call (below the cfs quota), and it wakes up every 100ms
to do some quick checks. I then ran skype first in the cpu cgroup and then in
the freezer cgroup (with all its children). Here are my results based on
timechart and powertop.

patch name      wakeups         skype call?
------------------------------------------------------------------
CFS BW          10/sec          yes
freezer         1/sec           no

Skype might not be the best example to illustrate the real usage of the
feature, but we are targeting mobile devices, which are mostly off or
often have only one application allowed in the foreground. So we want to
reduce wakeups coming from the tasks that are not in the foreground.

> One thing that does seem undesirable about your approach is (as it
> seems to be described) threads will not be able to take advantage of
> naturally occurring idle cycles and will incur a potential performance
> penalty even at use << frozen_percentage.
>
> e.g. From your post
>
> | |<-- 90% frozen - ->| |
> | | ____| |________________x_| |__________________| |_____
>
> |<---- 5 seconds ---->|
>
>
> Suppose no threads active until the wake up at x, suppose there is an
> accompanying 1 second of work for that thread to do. That execution
> time will be dilated to ~1.5 seconds (as it will span the 0.5 seconds
> the freezer will stall for). But the true usage for this period is
> ~20% <<< 90%
I agree my approach does not consider the natural cycle. But I am not
sure if a thread can wake up at x when FROZEN.

2011-03-08 03:58:15

by Balbir Singh

Subject: Re: [CFS Bandwidth Control v4 0/7] Introduction

* jacob pan <[email protected]> [2011-02-25 05:06:46]:

> On Fri, 25 Feb 2011 02:03:54 -0800
> Paul Turner <[email protected]> wrote:
>
> > On Thu, Feb 24, 2011 at 4:11 PM, jacob pan
> > <[email protected]> wrote:
> > > On Tue, 15 Feb 2011 19:18:31 -0800
> > > Paul Turner <[email protected]> wrote:
> > >
> > >> Hi all,
> > >>
> > >> Please find attached v4 of CFS bandwidth control; while this rebase
> > >> against some of the latest SCHED_NORMAL code is new, the features
> > >> and methodology are fairly mature at this point and have proved
> > >> both effective and stable for several workloads.
> > >>
> > >> As always, all comments/feedback welcome.
> > >>
> > >
> > > Hi Paul,
> > >
> > > Your patches provide a very useful but slightly different feature
> > > for what we need to manage idle time in order to save power. What we
> > > need is kind of a quota/period in terms of idle time. I have been
> > > playing with your patches and noticed that when the cgroup cpu usage
> > > exceeds the quota the effect of throttling is similar to what I have
> > > been trying to do with freezer subsystem. i.e. freeze and thaw at
> > > given period and percentage runtime.
> > > https://lkml.org/lkml/2011/2/15/314
> > >
> > > Have you thought about adding such feature (please see detailed
> > > description in the link above) to your patches?
> > >
> >
> > So reading the description it seems like rooting everything in a
> > 'freezer' container and then setting up a quota of
> >
> > (1 - frozen_percentage) * nr_cpus * frozen_period * sec_to_usec
> >
> I guess you meant frozen_percentage is less than 1, i.e. 90 is .90. my
> code treat 90 as 90. just a clarification.
> > on a period of
> >
> > frozen_period * sec_to_usec
> >
> > Would provide the same functionality. Is there other unduplicated
> > functionality beyond this?
> Do you mean the same functionality as your patch? Not really, since my
> approach will stop the tasks based on hard time slices. But seems your
> patch will allow them to run if they don't exceed the quota. Am i
> missing something?
> That is the only functionality difference i know.
>
> Like the reviewer of freezer patch pointed out, it is a more logical
> fit to implement such feature in scheduler/yours in stead of freezer. So
> i am wondering if your patch can be expended to include limiting quota
> on real time.
>

Do you mean the sched RT group controller? Have you looked at
cpu.rt_runtime_us and cpu.rt_period_us?
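
For reference, a quick sketch of those knobs (paths assume the cpu controller
mounted at /mnt as earlier in this thread; the child group name is illustrative
and the values shown are only the usual defaults):

# cat /mnt/cpu.rt_period_us                 # typically 1000000 (1s)
# cat /mnt/cpu.rt_runtime_us                # typically 950000 on the root group
# echo 100000 > /mnt/bg/cpu.rt_runtime_us   # allow RT tasks in "bg" up to 0.1s per period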

> I did a comparison study between CFS BW and freezer patch on skype with
> identical quota setting as you pointed out earlier. Both use 2 sec
> period and .2 sec quota (10%). Skype typically uses 5% of the CPU on my
> system when placing a call(below cfs quota) and it wakes up every 100ms
> to do some quick checks. Then I run skype in cpu then freezer cgroup
> (with all its children). Here is my result based on timechart and
> powertop.
>
> patch name wakeups skype call?
> ------------------------------------------------------------------
> CFS BW 10/sec yes
> freezer 1/sec no
>

Is this good or bad for CFS BW?

--
Three Cheers,
Balbir

2011-03-08 18:18:17

by Jacob Pan

Subject: Re: [CFS Bandwidth Control v4 0/7] Introduction

On Tue, 8 Mar 2011 09:27:59 +0530, Balbir Singh wrote:
>* jacob pan <[email protected]> [2011-02-25 05:06:46]:
>
>> On Fri, 25 Feb 2011 02:03:54 -0800
>> Paul Turner <[email protected]> wrote:
>>
>> > On Thu, Feb 24, 2011 at 4:11 PM, jacob pan
>> > <[email protected]> wrote:
>> > > On Tue, 15 Feb 2011 19:18:31 -0800
>> > > Paul Turner <[email protected]> wrote:
>> > >
>> > >> Hi all,
>> > >>
>> > >> Please find attached v4 of CFS bandwidth control; while this rebase
>> > >> against some of the latest SCHED_NORMAL code is new, the features
>> > >> and methodology are fairly mature at this point and have proved
>> > >> both effective and stable for several workloads.
>> > >>
>> > >> As always, all comments/feedback welcome.
>> > >>
>> > >
>> > > Hi Paul,
>> > >
>> > > Your patches provide a very useful but slightly different feature
>> > > for what we need to manage idle time in order to save power. What we
>> > > need is kind of a quota/period in terms of idle time. I have been
>> > > playing with your patches and noticed that when the cgroup cpu usage
>> > > exceeds the quota the effect of throttling is similar to what I have
>> > > been trying to do with freezer subsystem. i.e. freeze and thaw at
>> > > given period and percentage runtime.
>> > > https://lkml.org/lkml/2011/2/15/314
>> > >
>> > > Have you thought about adding such feature (please see detailed
>> > > description in the link above) to your patches?
>> > >
>> >
>> > So reading the description it seems like rooting everything in a
>> > 'freezer' container and then setting up a quota of
>> >
>> > (1 - frozen_percentage) * nr_cpus * frozen_period * sec_to_usec
>> >
>> I guess you meant frozen_percentage is less than 1, i.e. 90 is .90. my
>> code treat 90 as 90. just a clarification.
>> > on a period of
>> >
>> > frozen_period * sec_to_usec
>> >
>> > Would provide the same functionality. Is there other unduplicated
>> > functionality beyond this?
>> Do you mean the same functionality as your patch? Not really, since my
>> approach will stop the tasks based on hard time slices. But seems your
>> patch will allow them to run if they don't exceed the quota. Am i
>> missing something?
>> That is the only functionality difference i know.
>>
>> Like the reviewer of freezer patch pointed out, it is a more logical
>> fit to implement such feature in scheduler/yours in stead of freezer. So
>> i am wondering if your patch can be expended to include limiting quota
>> on real time.
>>
>
>Do you mean sched rt group controller? Have you looked at
>cpu.rt_runtime_us and cpu.rt_period_us?
>
>> I did a comparison study between CFS BW and freezer patch on skype with
>> identical quota setting as you pointed out earlier. Both use 2 sec
>> period and .2 sec quota (10%). Skype typically uses 5% of the CPU on my
>> system when placing a call(below cfs quota) and it wakes up every 100ms
>> to do some quick checks. Then I run skype in cpu then freezer cgroup
>> (with all its children). Here is my result based on timechart and
>> powertop.
>>
>> patch name wakeups skype call?
>> ------------------------------------------------------------------
>> CFS BW 10/sec yes
>> freezer 1/sec no
>>
>
>Is this good or bad for CFS BW?
In terms of power saving for this particular use case, it is bad for
CFS BW, since I am trying to use cgroups to manage applications that are
not written with power saving in mind. CFS BW does not prevent
unnecessary wake-ups from these apps; therefore the system consumes
more power than when the freezer duty-cycling patch is used.
In my use case, as soon as skype is switched to the UI foreground, it
will be moved to another cgroup where enough quota will be given to
allow it to place calls. Therefore, not being able to make calls while
being throttled is not a concern.

Mobile devices often have just one app in the foreground, so
throttling background apps may not impact user experience but can still
save power.

Since the CFS BW patch already has the period and quota concept for
bandwidth control, I am asking whether it is worth extending it to have
an idle-time quota, perhaps by adding another parameter to limit idle
time in parallel to cfs_quota.

Rafael (CCed) wants to get an opinion from the scheduler folks before
considering the freezer patch.

Thanks,

Jacob

2011-03-09 10:13:12

by Paul Turner

Subject: Re: [CFS Bandwidth Control v4 0/7] Introduction

On Fri, Feb 25, 2011 at 5:06 AM, jacob pan
<[email protected]> wrote:
> On Fri, 25 Feb 2011 02:03:54 -0800
> Paul Turner <[email protected]> wrote:
>
>> On Thu, Feb 24, 2011 at 4:11 PM, jacob pan
>> <[email protected]> wrote:
>> > On Tue, 15 Feb 2011 19:18:31 -0800
>> > Paul Turner <[email protected]> wrote:
>> >
>> >> Hi all,
>> >>
>> >> Please find attached v4 of CFS bandwidth control; while this rebase
>> >> against some of the latest SCHED_NORMAL code is new, the features
>> >> and methodology are fairly mature at this point and have proved
>> >> both effective and stable for several workloads.
>> >>
>> >> As always, all comments/feedback welcome.
>> >>
>> >
>> > Hi Paul,
>> >
>> > Your patches provide a very useful but slightly different feature
>> > for what we need to manage idle time in order to save power. What we
>> > need is kind of a quota/period in terms of idle time. I have been
>> > playing with your patches and noticed that when the cgroup cpu usage
>> > exceeds the quota the effect of throttling is similar to what I have
>> > been trying to do with freezer subsystem. i.e. freeze and thaw at
>> > given period and percentage runtime.
>> > https://lkml.org/lkml/2011/2/15/314
>> >
>> > Have you thought about adding such feature (please see detailed
>> > description in the link above) to your patches?
>> >
>>
>> So reading the description it seems like rooting everything in a
>> 'freezer' container and then setting up a quota of
>>
>> (1 - frozen_percentage) * nr_cpus * frozen_period * sec_to_usec
>>
> I guess you meant frozen_percentage is less than 1, i.e. 90 is .90. my
> code treat 90 as 90. just a clarification.
>> on a period of
>>
>> frozen_period * sec_to_usec
>>
>> Would provide the same functionality. Is there other unduplicated
>> functionality beyond this?

Sorry -- I was out last week; comments inline.

> Do you mean the same functionality as your patch? Not really, since my
> approach will stop the tasks based on hard time slices
>. But seems your
> patch will allow them to run if they don't exceed the quota. Am i
> missing something?

Right, this is what was discussed above.

> That is the only functionality difference i know.
>
> Like the reviewer of freezer patch pointed out, it is a more logical
> fit to implement such feature in scheduler/yours in stead of freezer. So
> i am wondering if your patch can be expended to include limiting quota
> on real time.

The following two configurations should effectively exactly mirror the
freezer behavior without modification.

A) background while(1) thread on each cpu within the cgroup
This will result in synchronous consumption / exhaustion of quota in a
manner that duplicates the periodic freezing.

Given the goal is power-saving, this is obviously non-ideal. However:

B) A userspace daemon toggles quota at the desired interval

Supposing you wanted a freezer period of 100ms per second: have a daemon
wake up 900ms into the interval and set a quota amount that is effectively
zero, which will "freeze" the group. Said daemon can then release things by
returning the group to an infinite quota 100ms later, and then sleeping for
another 900ms.
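
A minimal sketch of such a daemon, using the 100ms-per-second split above
(the group path is illustrative, "-1" is assumed to mean unlimited quota as
for a newly created group, and 1000us stands in for "effectively zero"):

G=/mnt/bg                               # illustrative cgroup path
echo 1000000 > $G/cpu.cfs_period_us     # 1s period
while true; do
    echo -1 > $G/cpu.cfs_quota_us       # unlimited: group runs freely ("thawed")
    sleep 0.9
    echo 1000 > $G/cpu.cfs_quota_us     # effectively zero quota ("frozen")
    sleep 0.1
done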

Is there particular advantage of doing this in-kernel?


>
> I did a comparison study between CFS BW and freezer patch on skype with
> identical quota setting as you pointed out earlier. Both use 2 sec
> period and .2 sec quota (10%). Skype typically uses 5% of the CPU on my
> system when placing a call(below cfs quota) and it wakes up every 100ms
> to do some quick checks. Then I run skype in cpu then freezer cgroup
> (with all its children). Here is my result based on timechart and
> powertop.
>
> patch name      wakeups         skype call?
> ------------------------------------------------------------------
> CFS BW          10/sec          yes
> freezer         1/sec           no
>

Is this a true saving? While the actual task wake-up has been hidden,
the cpu is still coming out of a halt/idle state and processing the
interrupt/etc.

Have you had the chance to measure the actual comparative power-usage
in this case?

> Skype might not be the best example to illustrate the real usage of the
> feature, but we are targeting mobile device where they are mostly off or
> often have only one application allowed in foreground. So we want to
> reduce wakeups coming from the tasks that are not in the foreground.
>

If reducing wake-ups (at the userspace level) is proven to deliver
performance improvements, then it might be more productive to approach
that directly by considering strategies such as batching wakeups and
processing them periodically.

This would not have the negative performance impact of the current
approach, as well as being more deterministic.

>> One thing that does seem undesirable about your approach is (as it
>> seems to be described) threads will not be able to take advantage of
>> naturally occurring idle cycles and will incur a potential performance
>> penalty even at use << frozen_percentage.
>>
>> e.g. From your post
>>
>>        |  |<-- 90% frozen -     ->|  |
>> |  | ____|  |________________x_|  |__________________|  |_____
>>
>>         |<---- 5 seconds     ---->|
>>
>>
>> Suppose no threads active until the wake up at x, suppose there is an
>> accompanying 1 second of work for that thread to do. That execution
>> time will be dilated to ~1.5 seconds (as it will span the 0.5 seconds
>> the freezer will stall for). But the true usage for this period is
>> ~20% <<< 90%
> I agree my approach does not consider the natural cycle. But I am not
> sure if a thread can wake up at x when FROZEN.
>

While the ascii is a little mailer-mangled, in the diagram above x was
intended to precede the "frozen" time segment, but at a point where
the work it wants to do exceeds the time-before-freeze resulting in
dilation of execution and a performance regression.

2011-03-09 21:57:49

by Jacob Pan

Subject: Re: [CFS Bandwidth Control v4 0/7] Introduction

On Wed, 9 Mar 2011 02:12:36 -0800
Paul Turner <[email protected]> wrote:

> On Fri, Feb 25, 2011 at 5:06 AM, jacob pan
> <[email protected]> wrote:
> > On Fri, 25 Feb 2011 02:03:54 -0800
> > Paul Turner <[email protected]> wrote:
> >
> >> On Thu, Feb 24, 2011 at 4:11 PM, jacob pan
> >> <[email protected]> wrote:
> >> > On Tue, 15 Feb 2011 19:18:31 -0800
> >> > Paul Turner <[email protected]> wrote:
> >> >
> >> >> Hi all,
> >> >>
> >> >> Please find attached v4 of CFS bandwidth control; while this
> >> >> rebase against some of the latest SCHED_NORMAL code is new, the
> >> >> features and methodology are fairly mature at this point and
> >> >> have proved both effective and stable for several workloads.
> >> >>
> >> >> As always, all comments/feedback welcome.
> >> >>
> >> >
> >> > Hi Paul,
> >> >
> >> > Your patches provide a very useful but slightly different feature
> >> > for what we need to manage idle time in order to save power.
> >> > What we need is kind of a quota/period in terms of idle time. I
> >> > have been playing with your patches and noticed that when the
> >> > cgroup cpu usage exceeds the quota the effect of throttling is
> >> > similar to what I have been trying to do with freezer subsystem.
> >> > i.e. freeze and thaw at given period and percentage runtime.
> >> > https://lkml.org/lkml/2011/2/15/314
> >> >
> >> > Have you thought about adding such feature (please see detailed
> >> > description in the link above) to your patches?
> >> >
> >>
> >> So reading the description it seems like rooting everything in a
> >> 'freezer' container and then setting up a quota of
> >>
> >> (1 - frozen_percentage) * nr_cpus * frozen_period * sec_to_usec
> >>
> > I guess you meant frozen_percentage is less than 1, i.e. 90 is .90.
> > my code treat 90 as 90. just a clarification.
> >> on a period of
> >>
> >> frozen_period * sec_to_usec
> >>
> >> Would provide the same functionality. Is there other unduplicated
> >> functionality beyond this?
>
> Sorry -- I was out last week; comments inline.
>
> > Do you mean the same functionality as your patch? Not really, since
> > my approach will stop the tasks based on hard time slices
> >. But seems your
> > patch will allow them to run if they don't exceed the quota. Am i
> > missing something?
>
> Right, this is what was discussed above.
>
> > That is the only functionality difference i know.
> >
> > Like the reviewer of freezer patch pointed out, it is a more logical
> > fit to implement such feature in scheduler/yours in stead of
> > freezer. So i am wondering if your patch can be expended to include
> > limiting quota on real time.
>
> The following two configurations should effectively exactly mirror the
> freezer behavior without modification.
>
> A) background while(1) thread on each cpu within the cgroup
> This will result in synchronous consumption / exhaustion of quota in a
> manor that duplicates the periodic freezing.
>
> Given the goal is power-saving, this is obviously non-ideal. However:
>
> B) A userspace daemon toggles quota at the desired interval
>
> Supposing you wanted a freezer period of 100ms per second, then having
> a daemon wake up at 900ms into the interval and then setting a quota
> amount that is effectively zero will then "freeze" the group. Said
> daemon can then release things by returning the group to an infinite
> quota in 100ms, and then sleeping for another 900ms.
>
> Is there particular advantage of doing this in-kernel?
>
Yes, option B will mirror the behavior of the freezer patch. My concern
is that doing this in user space will be less efficient than doing it
in the kernel. For each period, the user daemon has to wake up
twice to adjust the quota. I guess if the idle-time quota check is done in
the kernel it may not need the extra wake-ups?
I do plan to have multiple cgroups with different periods and runtime
quotas, so the wake-ups will add up.

>
> >
> > I did a comparison study between CFS BW and freezer patch on skype
> > with identical quota setting as you pointed out earlier. Both use 2
> > sec period and .2 sec quota (10%). Skype typically uses 5% of the
> > CPU on my system when placing a call(below cfs quota) and it wakes
> > up every 100ms to do some quick checks. Then I run skype in cpu
> > then freezer cgroup (with all its children). Here is my result
> > based on timechart and powertop.
> >
> > patch name      wakeups         skype call?
> > ------------------------------------------------------------------
> > CFS BW          10/sec          yes
> > freezer         1/sec           no
> >
>
> Is this a true saving? While the actual task wake-up has been hidden,
> the cpu is still coming out of a halt/idle state and processing the
> interrupt/etc.
>
I think it is a true power saving: wake-ups from CPU C-states
result from either timer or device IRQs, and a frozen process will
directly reduce timer IRQs.
> Have you had the chance to measure the actual comparative power-usage
> in this case?
>
I have yet to do such a study; it is in my plan.

> > Skype might not be the best example to illustrate the real usage of
> > the feature, but we are targeting mobile device where they are
> > mostly off or often have only one application allowed in
> > foreground. So we want to reduce wakeups coming from the tasks that
> > are not in the foreground.
> >
>
> If reducing wake-ups (at the userspace level) is proven to deliver
> performance improvements, then it might be more productive to approach
> that directly by considering strategies such as batching wakeups and
> processing them periodically.
>
> This would not have the negative performance impact of the current
> approach, as well as being more deterministic.
>
> >> One thing that does seem undesirable about your approach is (as it
> >> seems to be described) threads will not be able to take advantage
> >> of naturally occurring idle cycles and will incur a potential
> >> performance penalty even at use << frozen_percentage.
> >>
> >> e.g. From your post
> >>
> >>        |  |<-- 90% frozen -     ->|  |
> >> |  | ____|  |________________x_|  |__________________|  |_____
> >>
> >>         |<---- 5 seconds     ---->|
> >>
> >>
> >> Suppose no threads active until the wake up at x, suppose there is
> >> an accompanying 1 second of work for that thread to do. That
> >> execution time will be dilated to ~1.5 seconds (as it will span
> >> the 0.5 seconds the freezer will stall for). But the true usage
> >> for this period is ~20% <<< 90%
> > I agree my approach does not consider the natural cycle. But I am
> > not sure if a thread can wake up at x when FROZEN.
> >
>
> While the ascii is a little mailer-mangled, in the diagram above x was
> intended to precede the "frozen" time segment, but at a point where
> the work it wants to do exceeds the time-before-freeze resulting in
> dilation of execution and a performance regression.
Thanks for explaining again.

Jacob