2009-09-24 23:54:42

by Satoru Takeuchi

Subject: massive_intr on CFS, BFS, and EDF

Hi,

I tried massive_intr, a fairness and throughput test program for process
schedulers under massive numbers of interactive processes, on vanilla 2.6.31,
2.6.31-bfs211 (with BFS) and 2.6.31-edf (latest Linus tree + EDF patch).

CFS and BFS look good. CFS has better fairness, and BFS has better throughput.
EDF looks unfair and is unstable. I tested 3 times and the tendency was the
same.

NOTE:

- The BFS patch is applied to 2.6.31, but the EDF patch is applied to the
latest Linus tree, so they can't be compared strictly. This report is just FYI.

- The EDF kernel shows some strange behavior:

* aptitude (the Debian package management tool) gets stuck partway through
* oocalc doesn't start

- I don't subscribe to LKML at the moment, so if you reply, please CC me.

Thanks,
Satoru

===============================================================================

[ test environment ]

A laptop with a dual-core x86_64 CPU

[ test program ]

# CFS and BFS:
$ massive_intr 30 30

# EDF
$ schedtool -E -d 100000 -b 25000 -e ./massive_intr 30 30

It runs 30 interactive processes simultaneously for 30 secs.
A full description of this program is in the comments of its source code.

URL:
http://people.redhat.com/mingo/cfs-scheduler/tools/massive_intr.c

[ test result ]

+---------------+-----------+---------+---------+---------+-----------+
| kernel | scheduler | avg(*1) | min(*2) | max(*3) | stdev(*4) |
+---------------+-----------+---------+---------+---------+-----------+
| 2.6.31 | CFS | 246 | 240 | 247 | 1.3 |
| 2.6.31-bfs211 | BFS | 254 | 241 | 268 | 7.1 |
+---------------+-----------+---------+---------+---------+-----------+
| 2.6.31-edf(*5)| EDF | 440 | 154 | 1405 | 444.8 |
+---------------+-----------+---------+---------+---------+-----------+

*1) average number of loops among all processes
*2) minimum number of loops among all processes
*3) maximum number of loops among all processes
*4) standard deviation
*5) The EDF kernel hung partway through the run; data covers only 7 processes.

A high average means good throughput, and a low stdev means good fairness.
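
For reference, the summary figures above can be recomputed from the raw
massive_intr output below (one "<pid> <loop count>" line per process) with a
small helper along these lines (a sketch, not part of the original report):

/*
 * intr_stats.c - compute avg/min/max/stdev of massive_intr loop counts.
 * Reads "<pid> <loops>" pairs on stdin.
 * Build: gcc -O2 -o intr_stats intr_stats.c -lm
 * Usage: ./massive_intr 30 30 | ./intr_stats
 */
#include <stdio.h>
#include <math.h>

int main(void)
{
	double sum = 0.0, sumsq = 0.0;
	long pid, loops, n = 0, min = -1, max = -1;

	while (scanf("%ld %ld", &pid, &loops) == 2) {
		sum += loops;
		sumsq += (double)loops * loops;
		if (min < 0 || loops < min)
			min = loops;
		if (loops > max)
			max = loops;
		n++;
	}
	if (!n)
		return 1;

	double avg = sum / n;
	double var = sumsq / n - avg * avg;	/* population variance */

	printf("n=%ld avg=%.1f min=%ld max=%ld stdev=%.1f\n",
	       n, avg, min, max, sqrt(var > 0 ? var : 0));
	return 0;
}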

[raw data]

# vanilla 2.6.31 (CFS)
sat@debian:~/practice/bfs$ uname -r
2.6.31
sat@debian:~/practice/bfs$ ./massive_intr 30 30
003873 00000246
003893 00000246
003898 00000240
003876 00000245
003888 00000245
003870 00000245
003882 00000247
003890 00000245
003872 00000245
003880 00000246
003895 00000246
003892 00000246
003878 00000246
003874 00000246
003896 00000246
003897 00000246
003884 00000246
003891 00000246
003894 00000246
003871 00000246
003886 00000247
003877 00000246
003879 00000246
003889 00000246
003881 00000246
003899 00000244
003887 00000247
003875 00000247
003885 00000247
003883 00000247

# 2.6.31-bfs211
sat@debian:~/practice/bfs$ uname -r
2.6.31-bfs211
sat@debian:~/practice/bfs$ ./massive_intr 30 30
004143 00000248
004127 00000241
004154 00000252
004145 00000255
004137 00000251
004148 00000263
004135 00000261
004153 00000247
004132 00000250
004146 00000248
004140 00000251
004130 00000245
004138 00000267
004136 00000249
004139 00000262
004141 00000255
004147 00000251
004131 00000253
004150 00000254
004152 00000254
004129 00000253
004142 00000242
004151 00000268
004128 00000263
004134 00000260
004144 00000252
004133 00000254
004149 00000265
004126 00000252
004125 00000246

# 2.6.31-edf (latest linus tree + edf patch)
sat@debian:~/practice/bfs$ uname -r
2.6.31-edf
sat@debian:~/practice/bfs$ schedtool-edf/schedtool -E -d 100000 -b
25000 -e ./massive_intr 30 30
Dumping mode: 0xa
Dumping affinity: 0xffffffff
We have 3 args to do
Dump arg 0: ./massive_intr
Dump arg 1: 30
Dump arg 2: 30
003915 00000541
003914 00001405
003916 00000310
003924 00000177
003923 00000154
003917 00000280
003918 00000211
# <kernel hung up here>



2009-09-25 03:28:20

by Satoru Takeuchi

Subject: Re: massive_intr on CFS, BFS, and EDF

Hi Raistlin,

sat wrote:
> NOTE:
....
> - The EDF kernel shows some strange behavior:
>
> * aptitude (the Debian package management tool) gets stuck partway through
> * oocalc doesn't start

In addition, aptitude and oocalc are not SCHED_EDF but SCHED_OTHER
(SCHED_FAIR), so you probably broke some code outside the sched_edf
part.

Thanks,
Satoru

2009-09-25 05:35:22

by Dario Faggioli

Subject: Re: massive_intr on CFS, BFS, and EDF

On Fri, 2009-09-25 at 08:53 +0900, sat wrote:
> Hi,
>
Hi,

> I tried massive_intr, a fairness and throughput test program for process
> schedulers under massive numbers of interactive processes, on vanilla 2.6.31,
> 2.6.31-bfs211 (with BFS) and 2.6.31-edf (latest Linus tree + EDF patch).
>
Hey, thanks a lot, that's interesting!

> CFS and BFS look good. CFS has better fairness, and BFS has better throughput.
> EDF looks unfair and is unstable. I tested 3 times and the tendency was the
> same.
>
> NOTE:
>
> - The BFS patch is applied to 2.6.31, but the EDF patch is applied to the
> latest Linus tree, so they can't be compared strictly. This report is just FYI.
>
> - The EDF kernel shows some strange behavior:
>
> * aptitude (the Debian package management tool) gets stuck partway through
> * oocalc doesn't start
>
> - I don't subscribe to LKML at the moment, so if you reply, please CC me.
>
Ok... I have to run now, but I'll look carefully at the results when
I'm back.

In the meantime, if you want and have the time, here's a new version of the
patch, which fixes some bugs, in case you want to give it another
try...

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 89f7ead..e851625 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -391,6 +391,9 @@
#define __NR_pwritev (__NR_SYSCALL_BASE+362)
#define __NR_rt_tgsigqueueinfo (__NR_SYSCALL_BASE+363)
#define __NR_perf_event_open (__NR_SYSCALL_BASE+364)
+#define __NR_sched_setscheduler_ex (__NR_SYSCALL_BASE+365)
+#define __NR_sched_setparam_ex (__NR_SYSCALL_BASE+366)
+#define __NR_sched_getparam_ex (__NR_SYSCALL_BASE+367)

/*
* The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index fafce1b..42ad362 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -374,6 +374,9 @@
CALL(sys_pwritev)
CALL(sys_rt_tgsigqueueinfo)
CALL(sys_perf_event_open)
+/* 365 */ CALL(sys_sched_setscheduler_ex)
+ CALL(sys_sched_setparam_ex)
+ CALL(sys_sched_getparam_ex)
#ifndef syscalls_counted
.equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
#define syscalls_counted
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 74619c4..fde92c2 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -832,4 +832,7 @@ ia32_sys_call_table:
.quad compat_sys_pwritev
.quad compat_sys_rt_tgsigqueueinfo /* 335 */
.quad sys_perf_event_open
+ .quad sys_sched_setscheduler_ex
+ .quad sys_sched_setparam_ex
+ .quad sys_sched_getparam_ex
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6fb3c20..a83163f 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,6 +342,9 @@
#define __NR_pwritev 334
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_event_open 336
+#define __NR_sched_setscheduler_ex 337
+#define __NR_sched_setparam_ex 338
+#define __NR_sched_getparam_ex 339

#ifdef __KERNEL__

diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 8d3ad0a..a11831c 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -661,6 +661,12 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
__SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
#define __NR_perf_event_open 298
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)
+#define __NR_sched_setscheduler_ex 299
+__SYSCALL(__NR_sched_setscheduler_ex, sys_sched_setscheduler_ex)
+#define __NR_sched_setparam_ex 300
+__SYSCALL(__NR_sched_setparam_ex, sys_sched_setparam_ex)
+#define __NR_sched_getparam_ex 301
+__SYSCALL(__NR_sched_getparam_ex, sys_sched_getparam_ex)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 0157cd2..38f056c 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,6 @@ ENTRY(sys_call_table)
.long sys_pwritev
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_event_open
+ .long sys_sched_setscheduler_ex
+ .long sys_sched_setparam_ex
+ .long sys_sched_getparam_ex
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 848d1f2..feb5ecb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -38,6 +38,7 @@
#define SCHED_BATCH 3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE 5
+#define SCHED_EDF 6
/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
#define SCHED_RESET_ON_FORK 0x40000000

@@ -94,6 +95,19 @@ struct sched_param {

#include <asm/processor.h>

+/*
+ * For a typical EDF periodic or sporadic task we need a runtime
+ * and period (equal to deadline) specification in the sched_param
+ * structure.
+ * However, the original struct sched_param can't be modified, for binary
+ * compatibility reasons, and thus this new one is created.
+ */
+struct sched_param_ex {
+ int sched_priority;
+ struct timespec sched_period;
+ struct timespec sched_runtime;
+};
+
struct exec_domain;
struct futex_pi_state;
struct robust_list_head;
@@ -151,12 +165,15 @@ extern unsigned long get_parent_ip(unsigned long addr);

struct seq_file;
struct cfs_rq;
+struct edf_rq;
struct task_group;
#ifdef CONFIG_SCHED_DEBUG
extern void proc_sched_show_task(struct task_struct *p, struct seq_file *m);
extern void proc_sched_set_task(struct task_struct *p);
extern void
print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
+extern void
+print_edf_rq(struct seq_file *m, int cpu, struct edf_rq *edf_rq);
#else
static inline void
proc_sched_show_task(struct task_struct *p, struct seq_file *m)
@@ -169,6 +186,10 @@ static inline void
print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{
}
+static inline void
+print_edf_rq(struct seq_file *m, int cpu, struct edf_rq *edf_rq)
+{
+}
#endif

extern unsigned long long time_sync_thresh;
@@ -1194,6 +1215,29 @@ struct sched_entity {
#endif
};

+#define EDF_NEW 1
+#define EDF_ENDED 2
+#define EDF_RUNNING 4
+#define EDF_THROTTLED 8
+
+struct sched_edf_entity {
+ struct rb_node rb_node;
+ /* actual scheduling parameters */
+ u64 deadline;
+ s64 runtime;
+ unsigned int edf_flags;
+
+ /* original parameters taken from sched_param_ex */
+ u64 runtime_max;
+ u64 period;
+#ifdef CONFIG_EDF_GROUP_SCHED
+ u64 bw;
+#endif
+
+ int nr_cpus_allowed;
+ struct hrtimer edf_timer;
+};
+
struct sched_rt_entity {
struct list_head run_list;
unsigned long timeout;
@@ -1231,6 +1275,7 @@ struct task_struct {
unsigned int rt_priority;
const struct sched_class *sched_class;
struct sched_entity se;
+ struct sched_edf_entity edf;
struct sched_rt_entity rt;

#ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -1576,6 +1621,18 @@ static inline int rt_task(struct task_struct *p)
return rt_prio(p->prio);
}

+static inline int edf_policy(int policy)
+{
+ if (unlikely(policy == SCHED_EDF))
+ return 1;
+ return 0;
+}
+
+static inline int edf_task(struct task_struct *p)
+{
+ return edf_policy(p->policy);
+}
+
static inline struct pid *task_pid(struct task_struct *task)
{
return task->pids[PIDTYPE_PID].pid;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a990ace..dad0b33 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -33,6 +33,7 @@ struct pollfd;
struct rlimit;
struct rusage;
struct sched_param;
+struct sched_param_ex;
struct semaphore;
struct sembuf;
struct shmid_ds;
@@ -390,11 +391,17 @@ asmlinkage long sys_clock_nanosleep(clockid_t which_clock, int flags,
asmlinkage long sys_nice(int increment);
asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
struct sched_param __user *param);
+asmlinkage long sys_sched_setscheduler_ex(pid_t pid, int policy,
+ struct sched_param_ex __user *param);
asmlinkage long sys_sched_setparam(pid_t pid,
struct sched_param __user *param);
+asmlinkage long sys_sched_setparam_ex(pid_t pid,
+ struct sched_param_ex __user *param);
asmlinkage long sys_sched_getscheduler(pid_t pid);
asmlinkage long sys_sched_getparam(pid_t pid,
struct sched_param __user *param);
+asmlinkage long sys_sched_getparam_ex(pid_t pid,
+ struct sched_param_ex __user *param);
asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
unsigned long __user *user_mask_ptr);
asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
diff --git a/init/Kconfig b/init/Kconfig
index c7bac39..2ec22ee 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -454,6 +454,20 @@ config RT_GROUP_SCHED
realtime bandwidth for them.
See Documentation/scheduler/sched-rt-group.txt for more information.

+config EDF_GROUP_SCHED
+ bool "Group scheduling for SCHED_EDF"
+ depends on EXPERIMENTAL
+ depends on GROUP_SCHED
+ depends on CGROUPS
+ depends on !USER_SCHED
+ select CPUSETS
+ default n
+ help
+ This feature lets you explicitly specify, in terms of runtime
+ and period, the bandwidth of a task control group. This means
+ tasks and other control groups can be added until that bandwidth
+ limit is reached; after that, further additions will fail.
+
choice
depends on GROUP_SCHED
prompt "Basis for grouping tasks"
diff --git a/kernel/fork.c b/kernel/fork.c
index 51ad0b0..56d5a8b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1226,6 +1226,18 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->parent_exec_id = current->self_exec_id;
}

+ /*
+ * When an EDF task forks, both parent and child get half
+ * the bandwidth of the parent (by dividing runtime by 2).
+ * This makes it a lot easier to deal with control group bandwidth, and
+ * also prevents a heavily spawning task from exhausting the system (or
+ * a control group, if enabled) bandwidth.
+ */
+ if (edf_task(p->real_parent)) {
+ p->real_parent->edf.runtime_max /= 2;
+ p->edf.runtime_max = p->real_parent->edf.runtime_max;
+ }
+
spin_lock(&current->sighand->siglock);

/*
diff --git a/kernel/sched.c b/kernel/sched.c
index 2f76e06..d50d42e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -131,6 +131,11 @@ static inline int task_has_rt_policy(struct task_struct *p)
return rt_policy(p->policy);
}

+static inline int task_has_edf_policy(struct task_struct *p)
+{
+ return edf_policy(p->policy);
+}
+
/*
* This is the priority-queue data structure of the RT scheduling class:
*/
@@ -227,6 +232,16 @@ static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
}
#endif

+struct edf_bandwidth {
+ spinlock_t lock;
+ /* runtime and period that determine the bandwidth of the group */
+ u64 runtime_max;
+ u64 period;
+ u64 bw;
+ /* accumulator for the actual allocated bandwidth in a group */
+ u64 total_bw;
+};
+
/*
* sched_domains_mutex serializes calls to arch_init_sched_domains,
* detach_destroy_domains and partition_sched_domains.
@@ -259,6 +274,11 @@ struct task_group {
unsigned long shares;
#endif

+#ifdef CONFIG_EDF_GROUP_SCHED
+ struct edf_bandwidth edf_bandwidth;
+ struct edf_rq **edf_rq;
+#endif
+
#ifdef CONFIG_RT_GROUP_SCHED
struct sched_rt_entity **rt_se;
struct rt_rq **rt_rq;
@@ -296,6 +316,10 @@ static DEFINE_PER_CPU(struct sched_entity, init_sched_entity);
static DEFINE_PER_CPU_SHARED_ALIGNED(struct cfs_rq, init_tg_cfs_rq);
#endif /* CONFIG_FAIR_GROUP_SCHED */

+#ifdef CONFIG_EDF_GROUP_SCHED
+static DEFINE_PER_CPU(struct edf_rq, init_edf_rq) ____cacheline_aligned_in_smp;
+#endif
+
#ifdef CONFIG_RT_GROUP_SCHED
static DEFINE_PER_CPU(struct sched_rt_entity, init_sched_rt_entity);
static DEFINE_PER_CPU_SHARED_ALIGNED(struct rt_rq, init_rt_rq);
@@ -447,6 +471,18 @@ struct cfs_rq {
#endif
};

+struct edf_rq {
+ unsigned long edf_nr_running;
+
+ /* as in CFS, the runqueue is based on an rbtree, deadline ordered */
+ struct rb_root rb_root;
+ struct rb_node *rb_leftmost;
+
+#ifdef CONFIG_EDF_GROUP_SCHED
+ struct rq *rq;
+#endif
+};
+
/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
struct rt_prio_array active;
@@ -544,6 +580,7 @@ struct rq {
u64 nr_migrations_in;

struct cfs_rq cfs;
+ struct edf_rq edf;
struct rt_rq rt;

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -1816,7 +1853,9 @@ static void calc_load_account_active(struct rq *this_rq);
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
+#include "sched_edf.c"
#include "sched_rt.c"
+
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
#endif
@@ -1837,7 +1876,7 @@ static void dec_nr_running(struct rq *rq)

static void set_load_weight(struct task_struct *p)
{
- if (task_has_rt_policy(p)) {
+ if (task_has_rt_policy(p) || task_has_edf_policy(p)) {
p->se.load.weight = prio_to_weight[0] * 2;
p->se.load.inv_weight = prio_to_wmult[0] >> 1;
return;
@@ -2523,7 +2562,8 @@ void sched_fork(struct task_struct *p, int clone_flags)
* Revert to default priority/policy on fork if requested.
*/
if (unlikely(p->sched_reset_on_fork)) {
- if (p->policy == SCHED_FIFO || p->policy == SCHED_RR)
+ if (p->policy == SCHED_FIFO || p->policy == SCHED_RR ||
+ edf_policy(p->policy))
p->policy = SCHED_NORMAL;

if (p->normal_prio < DEFAULT_PRIO)
@@ -2541,8 +2581,12 @@ void sched_fork(struct task_struct *p, int clone_flags)
p->sched_reset_on_fork = 0;
}

- if (!rt_prio(p->prio))
- p->sched_class = &fair_sched_class;
+ if (!rt_prio(p->prio)) {
+ if (edf_policy(p->policy))
+ p->sched_class = &edf_sched_class;
+ else
+ p->sched_class = &fair_sched_class;
+ }

#ifdef CONFIG_SMP
cpu = p->sched_class->select_task_rq(p, SD_BALANCE_FORK, 0);
@@ -2565,6 +2609,68 @@ void sched_fork(struct task_struct *p, int clone_flags)
put_cpu();
}

+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_EDF_GROUP_SCHED)
+static unsigned long to_ratio(u64 period, u64 runtime)
+{
+ if (runtime == RUNTIME_INF)
+ return 1ULL << 20;
+
+ return div64_u64(runtime << 20, period);
+}
+
+#ifdef CONFIG_EDF_GROUP_SCHED
+static inline
+void __edf_tg_clear_task_bw(struct task_group *tg, u64 tsk_bw)
+{
+ tg->edf_bandwidth.total_bw -= tsk_bw;
+}
+
+static inline
+void __edf_tg_add_task_bw(struct task_group *tg, u64 tsk_bw)
+{
+ tg->edf_bandwidth.total_bw += tsk_bw;
+}
+
+static inline
+int __edf_check_task_tg_bw(struct task_struct *p, struct task_group *tg,
+ int policy, struct sched_param_ex *param_ex)
+{
+ u64 tg_bw = tg->edf_bandwidth.bw;
+ u64 bw = p->edf.bw;
+ u64 new_bw = 0;
+
+ if (tg->edf_bandwidth.bw <= 0)
+ return 0;
+
+ if (edf_policy(policy))
+ new_bw = to_ratio(timespec_to_ns(&param_ex->sched_period),
+ timespec_to_ns(&param_ex->sched_runtime));
+
+ /*
+ * Whether a task enters, leaves, or stays SCHED_EDF while changing
+ * its parameters, we need to update the allocated bandwidth of the
+ * control group it is in accordingly.
+ */
+ if (task_has_edf_policy(p) && !edf_policy(policy)) {
+ __edf_tg_clear_task_bw(tg, bw);
+ return 1;
+ } else if (task_has_edf_policy(p) && edf_policy(policy) &&
+ tg_bw >= tg->edf_bandwidth.total_bw - bw + new_bw) {
+ __edf_tg_clear_task_bw(tg, bw);
+ __edf_tg_add_task_bw(tg, new_bw);
+ return 1;
+ } else if (edf_policy(policy) && !task_has_edf_policy(p) &&
+ tg_bw >= tg->edf_bandwidth.total_bw + new_bw) {
+ __edf_tg_add_task_bw(tg, new_bw);
+ return 1;
+ }
+
+ return 0;
+}
+#endif /* CONFIG_EDF_GROUP_SCHED */
+
+#endif /* CONFIG_RT_GROUP_SCHED || CONFIG_EDF_GROUP_SCHED */
+
/*
* wake_up_new_task - wake up a newly created task for the first time.
*
@@ -2582,6 +2688,17 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
update_rq_clock(rq);

p->prio = effective_prio(p);
+ if (edf_task(p)) {
+ struct sched_edf_entity *edf_se = &p->edf;
+
+ /*
+ * If the new task we are waking up is EDF,
+ * some initialization is needed
+ */
+ RB_CLEAR_NODE(&edf_se->rb_node);
+ edf_se->edf_flags = EDF_NEW;
+ init_edf_timer(&edf_se->edf_timer);
+ }

if (!p->sched_class->task_new || !current->se.on_rq) {
activate_task(rq, p, 0);
@@ -2725,6 +2842,23 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
+#ifdef CONFIG_EDF_GROUP_SCHED
+ if (edf_task(prev)) {
+ struct task_group *tg = task_group(prev);
+ unsigned long flags;
+
+ /*
+ * When an EDF task dies, we need to drain its
+ * bandwidth from the cgroup that was accommodating it,
+ * to make space for new tasks.
+ */
+ spin_lock_irqsave(&tg->edf_bandwidth.lock, flags);
+ __edf_tg_clear_task_bw(tg, prev->edf.bw);
+ spin_unlock_irqrestore(&tg->edf_bandwidth.lock, flags);
+
+ hrtimer_cancel(&prev->edf.edf_timer);
+ }
+#endif
/*
* Remove function-return probe instances associated with this
* task and put them back on the free list.
@@ -5964,8 +6098,12 @@ void rt_mutex_setprio(struct task_struct *p, int prio)

if (rt_prio(prio))
p->sched_class = &rt_sched_class;
- else
- p->sched_class = &fair_sched_class;
+ else {
+ if (!edf_task(p))
+ p->sched_class = &fair_sched_class;
+ else
+ p->sched_class = &edf_sched_class;
+ }

p->prio = prio;

@@ -5999,9 +6137,9 @@ void set_user_nice(struct task_struct *p, long nice)
* The RT priorities are set via sched_setscheduler(), but we still
* allow the 'normal' nice value to be set - but as expected
* it wont have any effect on scheduling until the task is
- * SCHED_FIFO/SCHED_RR:
+ * SCHED_EDF, SCHED_FIFO or SCHED_RR:
*/
- if (task_has_rt_policy(p)) {
+ if (unlikely(task_has_rt_policy(p) || task_has_edf_policy(p))) {
p->static_prio = NICE_TO_PRIO(nice);
goto out_unlock;
}
@@ -6148,6 +6286,9 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
case SCHED_IDLE:
p->sched_class = &fair_sched_class;
break;
+ case SCHED_EDF:
+ p->sched_class = &edf_sched_class;
+ break;
case SCHED_FIFO:
case SCHED_RR:
p->sched_class = &rt_sched_class;
@@ -6161,6 +6302,23 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
set_load_weight(p);
}

+static void
+__setscheduler_ex(struct rq *rq, struct task_struct *p, int policy,
+ struct sched_param_ex *param_ex)
+{
+ struct sched_edf_entity *edf_se = &p->edf;
+
+ RB_CLEAR_NODE(&edf_se->rb_node);
+ edf_se->edf_flags = EDF_NEW;
+
+ edf_se->runtime_max = timespec_to_ns(&param_ex->sched_runtime);
+ edf_se->period = timespec_to_ns(&param_ex->sched_period);
+#ifdef CONFIG_EDF_GROUP_SCHED
+ edf_se->bw = to_ratio(edf_se->period, edf_se->runtime_max);
+#endif
+ init_edf_timer(&edf_se->edf_timer);
+}
+
/*
* check the target process has a UID that matches the current process's
*/
@@ -6178,7 +6336,9 @@ static bool check_same_owner(struct task_struct *p)
}

static int __sched_setscheduler(struct task_struct *p, int policy,
- struct sched_param *param, bool user)
+ struct sched_param *param,
+ struct sched_param_ex *param_ex,
+ bool user)
{
int retval, oldprio, oldpolicy = -1, on_rq, running;
unsigned long flags;
@@ -6198,6 +6358,7 @@ recheck:
policy &= ~SCHED_RESET_ON_FORK;

if (policy != SCHED_FIFO && policy != SCHED_RR &&
+ policy != SCHED_EDF &&
policy != SCHED_NORMAL && policy != SCHED_BATCH &&
policy != SCHED_IDLE)
return -EINVAL;
@@ -6212,6 +6373,13 @@ recheck:
(p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
(!p->mm && param->sched_priority > MAX_RT_PRIO-1))
return -EINVAL;
+ if (edf_policy(policy) && param_ex->sched_priority != 0)
+ return -EINVAL;
+ if (edf_policy(policy) && (param_ex == NULL ||
+ timespec_to_ns(&param_ex->sched_period) == 0 ||
+ timespec_to_ns(&param_ex->sched_period) <
+ timespec_to_ns(&param_ex->sched_runtime)))
+ return -EINVAL;
if (rt_policy(policy) != (param->sched_priority != 0))
return -EINVAL;

@@ -6285,6 +6453,24 @@ recheck:
spin_unlock_irqrestore(&p->pi_lock, flags);
goto recheck;
}
+#ifdef CONFIG_EDF_GROUP_SCHED
+ /*
+ * If EDF tasks are involved, check the bandwidth of the task's group,
+ * and do not allow making the task EDF if there is not enough.
+ */
+ if (edf_policy(policy) || edf_task(p)) {
+ struct task_group *tg = task_group(p);
+
+ spin_lock(&tg->edf_bandwidth.lock);
+ if (!__edf_check_task_tg_bw(p, tg, policy, param_ex)) {
+ spin_unlock(&tg->edf_bandwidth.lock);
+ __task_rq_unlock(rq);
+ spin_unlock_irqrestore(&p->pi_lock, flags);
+ return -EPERM;
+ }
+ spin_unlock(&tg->edf_bandwidth.lock);
+ }
+#endif
update_rq_clock(rq);
on_rq = p->se.on_rq;
running = task_current(rq, p);
@@ -6297,6 +6483,8 @@ recheck:

oldprio = p->prio;
__setscheduler(rq, p, policy, param->sched_priority);
+ if (edf_policy(policy))
+ __setscheduler_ex(rq, p, policy, param_ex);

if (running)
p->sched_class->set_curr_task(rq);
@@ -6324,10 +6512,17 @@ recheck:
int sched_setscheduler(struct task_struct *p, int policy,
struct sched_param *param)
{
- return __sched_setscheduler(p, policy, param, true);
+ return __sched_setscheduler(p, policy, param, NULL, true);
}
EXPORT_SYMBOL_GPL(sched_setscheduler);

+int sched_setscheduler_ex(struct task_struct *p, int policy,
+ struct sched_param *param,
+ struct sched_param_ex *param_ex)
+{
+ return __sched_setscheduler(p, policy, param, param_ex, true);
+}
+
/**
* sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
* @p: the task in question.
@@ -6342,7 +6537,7 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
int sched_setscheduler_nocheck(struct task_struct *p, int policy,
struct sched_param *param)
{
- return __sched_setscheduler(p, policy, param, false);
+ return __sched_setscheduler(p, policy, param, NULL, false);
}

static int
@@ -6367,6 +6562,33 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
return retval;
}

+static int
+do_sched_setscheduler_ex(pid_t pid, int policy,
+ struct sched_param_ex __user *param_ex)
+{
+ struct sched_param lparam;
+ struct sched_param_ex lparam_ex;
+ struct task_struct *p;
+ int retval;
+
+ if (!param_ex || pid < 0)
+ return -EINVAL;
+ if (copy_from_user(&lparam_ex, param_ex,
+ sizeof(struct sched_param_ex)))
+ return -EFAULT;
+
+ rcu_read_lock();
+ retval = -ESRCH;
+ p = find_process_by_pid(pid);
+ if (p != NULL) {
+ lparam.sched_priority = lparam_ex.sched_priority;
+ retval = sched_setscheduler_ex(p, policy, &lparam, &lparam_ex);
+ }
+ rcu_read_unlock();
+
+ return retval;
+}
+
/**
* sys_sched_setscheduler - set/change the scheduler policy and RT priority
* @pid: the pid in question.
@@ -6384,6 +6606,22 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy,
}

/**
+ * sys_sched_setscheduler_ex - set/change the scheduler policy and EDF deadline
+ * @pid: the pid in question.
+ * @policy: new policy (should be SCHED_EDF).
+ * @param_ex: structure containing the extended EDF parameters.
+ */
+SYSCALL_DEFINE3(sched_setscheduler_ex, pid_t, pid, int, policy,
+ struct sched_param_ex __user *, param_ex)
+{
+ if (policy < 0)
+ return -EINVAL;
+
+ return do_sched_setscheduler_ex(pid, policy, param_ex);
+}
+
+
+/**
* sys_sched_setparam - set/change the RT priority of a thread
* @pid: the pid in question.
* @param: structure containing the new RT priority.
@@ -6393,6 +6631,12 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
return do_sched_setscheduler(pid, -1, param);
}

+SYSCALL_DEFINE2(sched_setparam_ex, pid_t, pid,
+ struct sched_param_ex __user *, param_ex)
+{
+ return do_sched_setscheduler_ex(pid, -1, param_ex);
+}
+
/**
* sys_sched_getscheduler - get the policy (scheduling class) of a thread
* @pid: the pid in question.
@@ -6457,6 +6701,44 @@ out_unlock:
return retval;
}

+SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
+ struct sched_param_ex __user *, param_ex)
+{
+ struct sched_param_ex lp;
+ struct task_struct *p;
+ int retval;
+
+ if (!param_ex || pid < 0)
+ return -EINVAL;
+
+ read_lock(&tasklist_lock);
+ p = find_process_by_pid(pid);
+ retval = -ESRCH;
+ if (!p)
+ goto out_unlock;
+
+ retval = security_task_getscheduler(p);
+ if (retval)
+ goto out_unlock;
+
+ lp.sched_priority = p->rt_priority;
+ lp.sched_runtime = ns_to_timespec(p->edf.runtime_max);
+ lp.sched_period = ns_to_timespec(p->edf.period);
+ read_unlock(&tasklist_lock);
+
+ /*
+ * This one might sleep, we cannot do it with a spinlock held ...
+ */
+ retval = copy_to_user(param_ex, &lp, sizeof(*param_ex)) ? -EFAULT : 0;
+
+ return retval;
+
+out_unlock:
+ read_unlock(&tasklist_lock);
+ return retval;
+
+}
+
long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
{
cpumask_var_t cpus_allowed, new_mask;
@@ -9222,6 +9504,14 @@ static void init_cfs_rq(struct cfs_rq *cfs_rq, struct rq *rq)
cfs_rq->min_vruntime = (u64)(-(1LL << 20));
}

+static void init_edf_rq(struct edf_rq *edf_rq, struct rq *rq)
+{
+ edf_rq->rb_root = RB_ROOT;
+#ifdef CONFIG_EDF_GROUP_SCHED
+ edf_rq->rq = rq;
+#endif
+}
+
static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
{
struct rt_prio_array *array;
@@ -9287,6 +9577,22 @@ static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
}
#endif

+#ifdef CONFIG_EDF_GROUP_SCHED
+void init_tg_edf_entry(struct task_group *tg, struct edf_rq *edf_rq,
+ struct sched_edf_entity *edf_se, int cpu, int add,
+ struct sched_edf_entity *parent)
+{
+ struct rq *rq = cpu_rq(cpu);
+
+ tg->edf_rq[cpu] = &rq->edf;
+
+ spin_lock_init(&tg->edf_bandwidth.lock);
+ tg->edf_bandwidth.runtime_max = 0;
+ tg->edf_bandwidth.period = 0;
+ tg->edf_bandwidth.total_bw = 0;
+}
+#endif
+
#ifdef CONFIG_RT_GROUP_SCHED
static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
struct sched_rt_entity *rt_se, int cpu, int add,
@@ -9328,6 +9634,10 @@ void __init sched_init(void)
#ifdef CONFIG_RT_GROUP_SCHED
alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
+#ifdef CONFIG_EDF_GROUP_SCHED
+ alloc_size += 2 * nr_cpu_ids * sizeof(void **);
+#endif
+
#ifdef CONFIG_USER_SCHED
alloc_size *= 2;
#endif
@@ -9356,6 +9666,10 @@ void __init sched_init(void)
ptr += nr_cpu_ids * sizeof(void **);
#endif /* CONFIG_USER_SCHED */
#endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_EDF_GROUP_SCHED
+ init_task_group.edf_rq = (struct edf_rq **)ptr;
+ ptr += nr_cpu_ids * sizeof(void **);
+#endif /* CONFIG_EDF_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
init_task_group.rt_se = (struct sched_rt_entity **)ptr;
ptr += nr_cpu_ids * sizeof(void **);
@@ -9415,6 +9729,7 @@ void __init sched_init(void)
rq->calc_load_active = 0;
rq->calc_load_update = jiffies + LOAD_FREQ;
init_cfs_rq(&rq->cfs, rq);
+ init_edf_rq(&rq->edf, rq);
init_rt_rq(&rq->rt, rq);
#ifdef CONFIG_FAIR_GROUP_SCHED
init_task_group.shares = init_task_group_load;
@@ -9462,6 +9777,13 @@ void __init sched_init(void)
#endif
#endif /* CONFIG_FAIR_GROUP_SCHED */

+#ifdef CONFIG_EDF_GROUP_SCHED
+#ifdef CONFIG_CGROUP_SCHED
+ init_tg_edf_entry(&init_task_group, &rq->edf,
+ NULL, i, 1, NULL);
+#endif
+#endif
+
rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
#ifdef CONFIG_RT_GROUP_SCHED
INIT_LIST_HEAD(&rq->leaf_rt_rq_list);
@@ -9772,6 +10094,68 @@ static inline void unregister_fair_sched_group(struct task_group *tg, int cpu)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

+#ifdef CONFIG_EDF_GROUP_SCHED
+static void free_edf_sched_group(struct task_group *tg)
+{
+ kfree(tg->edf_rq);
+}
+
+int alloc_edf_sched_group(struct task_group *tg, struct task_group *parent)
+{
+ struct rq *rq;
+ int i;
+
+ tg->edf_rq = kzalloc(sizeof(struct edf_rq*) * nr_cpu_ids, GFP_KERNEL);
+ if (!tg->edf_rq)
+ return 0;
+
+ for_each_possible_cpu(i) {
+ rq = cpu_rq(i);
+ init_tg_edf_entry(tg, &rq->edf, NULL, i, 0, NULL);
+ }
+
+ return 1;
+}
+
+int sched_edf_can_attach(struct cgroup *cgrp, struct task_struct *tsk)
+{
+ struct task_group *tg = container_of(cgroup_subsys_state(cgrp,
+ cpu_cgroup_subsys_id),
+ struct task_group, css);
+ u64 tg_bw = tg->edf_bandwidth.bw;
+ u64 tsk_bw = tsk->edf.bw;
+
+ if (!edf_task(tsk))
+ return 1;
+
+ /*
+ * Check if the requested group has enough free space
+ * for this particular task.
+ */
+ if (tg_bw < tsk_bw + tg->edf_bandwidth.total_bw)
+ return 0;
+
+ return 1;
+}
+#else
+static inline void free_edf_sched_group(struct task_group *tg)
+{
+}
+
+static inline
+int alloc_edf_sched_group(struct task_group *tg, struct task_group *parent)
+{
+ return 1;
+}
+#endif
+static inline void register_edf_sched_group(struct task_group *tg, int cpu)
+{
+}
+
+static inline void unregister_edf_sched_group(struct task_group *tg, int cpu)
+{
+}
+
#ifdef CONFIG_RT_GROUP_SCHED
static void free_rt_sched_group(struct task_group *tg)
{
@@ -9864,6 +10248,7 @@ static inline void unregister_rt_sched_group(struct task_group *tg, int cpu)
static void free_sched_group(struct task_group *tg)
{
free_fair_sched_group(tg);
+ free_edf_sched_group(tg);
free_rt_sched_group(tg);
kfree(tg);
}
@@ -9882,12 +10267,16 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_fair_sched_group(tg, parent))
goto err;

+ if (!alloc_edf_sched_group(tg, parent))
+ goto err;
+
if (!alloc_rt_sched_group(tg, parent))
goto err;

spin_lock_irqsave(&task_group_lock, flags);
for_each_possible_cpu(i) {
register_fair_sched_group(tg, i);
+ register_edf_sched_group(tg, i);
register_rt_sched_group(tg, i);
}
list_add_rcu(&tg->list, &task_groups);
@@ -9918,12 +10307,31 @@ void sched_destroy_group(struct task_group *tg)
{
unsigned long flags;
int i;
+#ifdef CONFIG_EDF_GROUP_SCHED
+ struct task_group *parent = tg->parent;
+ u64 tg_bw = tg->edf_bandwidth.bw;

spin_lock_irqsave(&task_group_lock, flags);
+
+ /*
+ * If an EDF group goes away, its parent group
+ * (if any), ends up with some free bandwidth that
+ * it might use for other groups/tasks.
+ */
+ spin_lock(&parent->edf_bandwidth.lock);
+ if (tg_bw && parent)
+ parent->edf_bandwidth.total_bw -= tg_bw;
+ spin_unlock(&parent->edf_bandwidth.lock);
+#else
+ spin_lock_irqsave(&task_group_lock, flags);
+#endif
+
for_each_possible_cpu(i) {
unregister_fair_sched_group(tg, i);
+ unregister_edf_sched_group(tg, i);
unregister_rt_sched_group(tg, i);
}
+
list_del_rcu(&tg->list);
list_del_rcu(&tg->siblings);
spin_unlock_irqrestore(&task_group_lock, flags);
@@ -10069,14 +10477,6 @@ unsigned long sched_group_shares(struct task_group *tg)
*/
static DEFINE_MUTEX(rt_constraints_mutex);

-static unsigned long to_ratio(u64 period, u64 runtime)
-{
- if (runtime == RUNTIME_INF)
- return 1ULL << 20;
-
- return div64_u64(runtime << 20, period);
-}
-
/* Must be called with tasklist_lock held */
static inline int tg_has_rt_tasks(struct task_group *tg)
{
@@ -10380,9 +10780,15 @@ static int
cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
struct task_struct *tsk)
{
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_EDF_GROUP_SCHED)
#ifdef CONFIG_RT_GROUP_SCHED
if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
return -EINVAL;
+#endif
+#ifdef CONFIG_EDF_GROUP_SCHED
+ if (!sched_edf_can_attach(cgrp, tsk))
+ return -EINVAL;
+#endif
#else
/* We don't support RT-tasks being in separate groups */
if (tsk->sched_class != &fair_sched_class)
@@ -10396,6 +10802,30 @@ static void
cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
struct cgroup *old_cont, struct task_struct *tsk)
{
+#ifdef CONFIG_EDF_GROUP_SCHED
+ struct task_group *tg = container_of(cgroup_subsys_state(cgrp,
+ cpu_cgroup_subsys_id),
+ struct task_group, css);
+ struct task_group *old_tg = container_of(cgroup_subsys_state(old_cont,
+ cpu_cgroup_subsys_id),
+ struct task_group, css);
+ u64 tsk_bw = tsk->edf.bw;
+
+ /*
+ * An amount of bandwidth equal to the bandwidth of tsk
+ * is freed in the former group of tsk, and declared occupied
+ * in the new one.
+ */
+ spin_lock_irq(&tg->edf_bandwidth.lock);
+ tg->edf_bandwidth.total_bw += tsk_bw;
+ spin_unlock_irq(&tg->edf_bandwidth.lock);
+
+ if (old_tg) {
+ spin_lock_irq(&old_tg->edf_bandwidth.lock);
+ old_tg->edf_bandwidth.total_bw -= tsk_bw;
+ spin_unlock_irq(&old_tg->edf_bandwidth.lock);
+ }
+#endif
sched_move_task(tsk);
}

@@ -10438,6 +10868,100 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
}
#endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_EDF_GROUP_SCHED
+/*
+ * We allow runtime_max > period, since it makes sense to
+ * assign more than 100% bandwidth to a group, if on an SMP
+ * machine.
+ */
+static int __edf_schedulable(struct task_group *tg,
+ u64 runtime_max, u64 period)
+{
+ struct task_group *parent = tg->parent;
+ u64 old_bw = tg->edf_bandwidth.bw;
+ u64 parent_bw = parent ? parent->edf_bandwidth.bw : 0;
+ u64 bw = period > 0 ? to_ratio(period, runtime_max) : 0;
+ int ret = 0;
+
+ if (bw < tg->edf_bandwidth.total_bw)
+ return -EINVAL;
+
+ if (parent) {
+ /*
+ * All we need to do is check if the parent group is able
+ * to accommodate the new bandwidth of this group or not,
+ * given its actual allocated bandwidth, e.g., to other
+ * groups or tasks.
+ */
+ if (parent_bw < parent->edf_bandwidth.total_bw -
+ old_bw + bw)
+ return -EINVAL;
+
+ spin_lock_irq(&parent->edf_bandwidth.lock);
+ parent->edf_bandwidth.total_bw -= old_bw;
+ parent->edf_bandwidth.total_bw += bw;
+ spin_unlock_irq(&parent->edf_bandwidth.lock);
+ }
+
+ if (!ret) {
+ spin_lock_irq(&tg->edf_bandwidth.lock);
+ tg->edf_bandwidth.runtime_max = runtime_max;
+ tg->edf_bandwidth.period = period;
+ tg->edf_bandwidth.bw = bw;
+ spin_unlock_irq(&tg->edf_bandwidth.lock);
+ }
+
+ return ret;
+}
+
+static int cpu_edf_runtime_write_uint(struct cgroup *cgrp,
+ struct cftype *cftype,
+ u64 edf_runtime_us)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+
+ return __edf_schedulable(tg,
+ edf_runtime_us * NSEC_PER_USEC,
+ tg->edf_bandwidth.period);
+}
+
+static u64 cpu_edf_runtime_read_uint(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+ u64 runtime;
+
+ spin_lock_irq(&tg->edf_bandwidth.lock);
+ runtime = tg->edf_bandwidth.runtime_max;
+ spin_unlock_irq(&tg->edf_bandwidth.lock);
+ do_div(runtime, NSEC_PER_USEC);
+
+ return runtime;
+}
+
+static int cpu_edf_period_write_uint(struct cgroup *cgrp, struct cftype *cftype,
+ u64 edf_period_us)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+
+ return __edf_schedulable(tg,
+ tg->edf_bandwidth.runtime_max,
+ edf_period_us * NSEC_PER_USEC);
+}
+
+static u64 cpu_edf_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+ u64 period;
+
+ spin_lock_irq(&tg->edf_bandwidth.lock);
+ period = tg->edf_bandwidth.period;
+ spin_unlock_irq(&tg->edf_bandwidth.lock);
+ do_div(period, NSEC_PER_USEC);
+
+ return period;
+}
+#endif
+
static struct cftype cpu_files[] = {
#ifdef CONFIG_FAIR_GROUP_SCHED
{
@@ -10458,6 +10982,19 @@ static struct cftype cpu_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
#endif
+#ifdef CONFIG_EDF_GROUP_SCHED
+ {
+ .name = "edf_runtime_us",
+ .read_u64 = cpu_edf_runtime_read_uint,
+ .write_u64 = cpu_edf_runtime_write_uint,
+ },
+ {
+ .name = "edf_period_us",
+ .read_u64 = cpu_edf_period_read_uint,
+ .write_u64 = cpu_edf_period_write_uint,
+ },
+#endif
+
};

static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index efb8440..1252a43 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -146,7 +146,8 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
}

#if defined(CONFIG_CGROUP_SCHED) && \
- (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
+ (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED) || \
+ defined(CONFIG_EDF_GROUP_SCHED))
static void task_group_path(struct task_group *tg, char *buf, int buflen)
{
/* may be NULL if the underlying cgroup isn't fully-created yet */
@@ -218,6 +219,38 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
#endif
}

+void print_edf_rq(struct seq_file *m, int cpu, struct edf_rq *edf_rq)
+{
+ s64 min_deadline = -1, max_deadline = -1;
+ struct rq *rq = &per_cpu(runqueues, cpu);
+ struct sched_edf_entity *last;
+ unsigned long flags;
+
+ SEQ_printf(m, "\nedf_rq[%d]:\n", cpu);
+
+ spin_lock_irqsave(&rq->lock, flags);
+ if (edf_rq->rb_leftmost)
+ min_deadline = (rb_entry(edf_rq->rb_leftmost,
+ struct sched_edf_entity,
+ rb_node))->deadline;
+ last = __pick_edf_last_entity(edf_rq);
+ if (last)
+ max_deadline = last->deadline;
+ spin_unlock_irqrestore(&rq->lock, flags);
+
+#define P(x) \
+ SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(x))
+#define PN(x) \
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(x))
+
+ P(edf_rq->edf_nr_running);
+ PN(min_deadline);
+ PN(max_deadline);
+
+#undef PN
+#undef P
+}
+
void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
{
#if defined(CONFIG_CGROUP_SCHED) && defined(CONFIG_RT_GROUP_SCHED)
@@ -300,6 +333,7 @@ static void print_cpu(struct seq_file *m, int cpu)
#undef P
#endif
print_cfs_stats(m, cpu);
+ print_edf_stats(m, cpu);
print_rt_stats(m, cpu);

print_rq(m, rq, cpu);
diff --git a/kernel/sched_edf.c b/kernel/sched_edf.c
new file mode 100644
index 0000000..25dde69
--- /dev/null
+++ b/kernel/sched_edf.c
@@ -0,0 +1,548 @@
+/*
+ * Earliest Deadline First (EDF) Scheduling Class (SCHED_EDF policy)
+ *
+ * Earliest Deadline First scheduling for periodic or sporadic tasks.
+ *
+ * An EDF task specifies:
+ * - a maximum runtime,
+ * - a period or minimum interarrival time.
+ * The scheduler, in turn:
+ * - schedules the task according to its deadline d=t+period,
+ * - enforces that the task does not exceed its bandwidth B=runtime_max/period.
+ *
+ * At the end of each instance of a periodic job (or sporadic activation),
+ * the task may inform the scheduler about such event, by calling
+ * sched_yield(), and then suspending until next activation time.
+ *
+ * The strategy used to confine each task inside its bandwidth reservation
+ * is the Constant Bandwidth Server (CBS) scheduling, a slight variation on
+ * EDF that makes this possible.
+ *
+ * Correct behavior, i.e., no EDF task misses its deadline, is only
+ * guaranteed if the parameters of all the EDF entities in the system are
+ * assigned so as to result in a schedulable solution.
+ * However, thanks to bandwidth isolation, overruns and deadline misses
+ * remain local to a single task, and do not affect any other task in
+ * the system.
+ *
+ * Groups, if configured, have bandwidth as well, and it is enforced that
+ * the sum of the bandwidths of entities (tasks and groups) belonging to
+ * a group stays below its own bandwidth.
+ *
+ * Copyright (C) 2009 Dario Faggioli, Michael Trimarchi
+ */
+
+static const struct sched_class edf_sched_class;
+
+static inline struct task_struct *edf_task_of(struct sched_edf_entity *edf_se)
+{
+ return container_of(edf_se, struct task_struct, edf);
+}
+
+static inline struct rq *rq_of_edf_rq(struct edf_rq *edf_rq)
+{
+ return container_of(edf_rq, struct rq, edf);
+}
+
+static inline struct edf_rq *edf_rq_of_se(struct sched_edf_entity *edf_se)
+{
+ struct task_struct *p = edf_task_of(edf_se);
+ struct rq *rq = task_rq(p);
+
+ return &rq->edf;
+}
+
+static inline int edf_se_boosted(struct sched_edf_entity *edf_se)
+{
+ struct task_struct *p = edf_task_of(edf_se);
+
+ return p->prio != p->normal_prio;
+}
+
+static inline int on_edf_rq(struct sched_edf_entity *edf_se)
+{
+ return !RB_EMPTY_NODE(&edf_se->rb_node);
+}
+
+#define for_each_leaf_edf_rq(edf_rq, rq) \
+ for (edf_rq = &rq->edf; edf_rq; edf_rq = NULL)
+
+static inline int edf_time_before(u64 a, u64 b)
+{
+ return (s64)(a - b) < 0;
+}
+
+static inline u64 edf_max_deadline(u64 a, u64 b)
+{
+ s64 delta = (s64)(b - a);
+ if (delta > 0)
+ a = b;
+
+ return a;
+}
+
+static inline void __edf_new_to_running(struct sched_edf_entity *edf_se)
+{
+ struct edf_rq *edf_rq = edf_rq_of_se(edf_se);
+ struct rq *rq = rq_of_edf_rq(edf_rq);
+
+ edf_se->edf_flags &= ~EDF_NEW;
+ edf_se->edf_flags |= EDF_RUNNING;
+
+ edf_se->runtime = edf_se->runtime_max;
+ edf_se->deadline = rq->clock + edf_se->period;
+}
+
+static void enqueue_edf_entity(struct sched_edf_entity *edf_se);
+static void dequeue_edf_entity(struct sched_edf_entity *edf_se);
+static void check_edf_preempt_curr(struct task_struct *p, struct rq *rq);
+
+static void update_edf_params(struct sched_edf_entity *edf_se)
+{
+ struct edf_rq *edf_rq = edf_rq_of_se(edf_se);
+ struct rq *rq = rq_of_edf_rq(edf_rq);
+ u64 left, right;
+
+ if (edf_se->edf_flags & EDF_NEW) {
+ __edf_new_to_running(edf_se);
+ return;
+ }
+ if (edf_se->edf_flags & EDF_ENDED) {
+ edf_se->edf_flags &= ~EDF_ENDED;
+ edf_se->runtime = edf_se->runtime > 0 ? 0 : edf_se->runtime;
+ goto update;
+ }
+
+ if (edf_se->runtime < 0)
+ goto update;
+
+ /*
+ * Update the deadline to the current time only if:
+ * - the budget has been completely exhausted;
+ * - using the remaining budget, with the current deadline, would
+ * make the task exceed its bandwidth;
+ * - the deadline itself is in the past.
+ *
+ * For the second condition to hold, we check if:
+ * runtime / (deadline - rq->clock) >= runtime_max / period
+ *
+ * Which basically checks whether, in the time left before the current
+ * deadline, the task would exceed its expected runtime by using the
+ * residual budget (left and right are the two sides of the equation,
+ * after a bit of shuffling to use multiplications instead of
+ * divisions).
+ */
+ left = edf_se->period * edf_se->runtime;
+ right = (edf_se->deadline - rq->clock) * edf_se->runtime_max;
+
+ if (edf_time_before(edf_se->deadline, rq->clock) ||
+ edf_time_before(right, left)) {
+update:
+ edf_se->deadline = max(edf_se->deadline, rq->clock);
+ /*
+ * Keep moving the deadline and replenishing runtime by the
+ * proper amount at least until the runtime becomes positive.
+ */
+ while (edf_se->runtime < 0) {
+ edf_se->deadline += edf_se->period;
+ edf_se->runtime += edf_se->runtime_max;
+ }
+ /*
+ * Avoid giving the entity more than its maximum runtime,
+ * which would be wrong!
+ */
+ if (edf_se->runtime > edf_se->runtime_max)
+ edf_se->runtime = edf_se->runtime_max;
+ }
+}
+
+static int start_edf_timer(struct sched_edf_entity *edf_se)
+{
+ struct edf_rq *edf_rq = edf_rq_of_se(edf_se);
+ struct rq *rq = rq_of_edf_rq(edf_rq);
+ ktime_t now, act;
+ s64 delta;
+
+ act = ns_to_ktime(edf_se->deadline);
+ now = hrtimer_cb_get_time(&edf_se->edf_timer);
+ delta = ktime_to_ns(now) - rq->clock;
+ if (delta < 0)
+ act = ktime_sub_ns(act, -delta);
+ else
+ act = ktime_add_ns(act, delta);
+
+ hrtimer_set_expires(&edf_se->edf_timer, act);
+
+ hrtimer_start_expires(&edf_se->edf_timer, HRTIMER_MODE_ABS);
+
+ return hrtimer_active(&edf_se->edf_timer);
+}
+
+static enum hrtimer_restart edf_timer(struct hrtimer *timer)
+{
+ struct sched_edf_entity *edf_se = container_of(timer,
+ struct sched_edf_entity,
+ edf_timer);
+ struct task_struct *p = edf_task_of(edf_se);
+ struct edf_rq *edf_rq = edf_rq_of_se(edf_se);
+ struct rq *rq = rq_of_edf_rq(edf_rq);
+
+ spin_lock(&rq->lock);
+ if (!edf_task(p))
+ goto unlock;
+
+ BUG_ON(on_edf_rq(edf_se));
+
+ edf_se->edf_flags &= ~EDF_THROTTLED;
+ if (p->se.on_rq) {
+ update_edf_params(edf_se);
+ enqueue_edf_entity(edf_se);
+ check_edf_preempt_curr(edf_task_of(edf_se), rq);
+ }
+unlock:
+ spin_unlock(&rq->lock);
+
+ return HRTIMER_NORESTART;
+}
+
+static void init_edf_timer(struct hrtimer *timer)
+{
+ if (hrtimer_active(timer)) {
+ hrtimer_try_to_cancel(timer);
+ return;
+ }
+
+ hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ timer->function = edf_timer;
+}
+
+static int edf_runtime_exceeded(struct sched_edf_entity *edf_se)
+{
+ if (edf_se->runtime > 0 || edf_se_boosted(edf_se))
+ return 0;
+
+ BUG_ON(hrtimer_active(&edf_se->edf_timer));
+
+ dequeue_edf_entity(edf_se);
+ if (!start_edf_timer(edf_se)) {
+ update_edf_params(edf_se);
+ enqueue_edf_entity(edf_se);
+ } else
+ edf_se->edf_flags |= EDF_THROTTLED;
+
+ return 1;
+}
+
+static void update_curr_edf(struct rq *rq)
+{
+ struct task_struct *curr = rq->curr;
+ struct sched_edf_entity *edf_se = &curr->edf;
+ u64 delta_exec;
+
+ if (!edf_task(curr) || !on_edf_rq(edf_se))
+ return;
+
+ delta_exec = rq->clock - curr->se.exec_start;
+ if (unlikely((s64)delta_exec < 0))
+ delta_exec = 0;
+
+ schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec));
+
+ curr->se.sum_exec_runtime += delta_exec;
+ account_group_exec_runtime(curr, delta_exec);
+
+ curr->se.exec_start = rq->clock;
+ cpuacct_charge(curr, delta_exec);
+
+ edf_se->runtime -= delta_exec;
+ /*
+ * If we exhausted our runtime it has to be replenished
+ * either immediately or after one (or more) CBS period(s).
+ */
+ if (edf_runtime_exceeded(edf_se))
+ resched_task(curr);
+}
+
+static void enqueue_edf_entity(struct sched_edf_entity *edf_se)
+{
+ struct edf_rq *edf_rq = edf_rq_of_se(edf_se);
+ struct rb_node **link = &edf_rq->rb_root.rb_node;
+ struct rb_node *parent = NULL;
+ struct sched_edf_entity *entry;
+ int leftmost = 1;
+
+ BUG_ON(!RB_EMPTY_NODE(&edf_se->rb_node));
+
+ while (*link) {
+ parent = *link;
+ entry = rb_entry(parent, struct sched_edf_entity, rb_node);
+ if (!edf_time_before(entry->deadline, edf_se->deadline))
+ link = &parent->rb_left;
+ else {
+ link = &parent->rb_right;
+ leftmost = 0;
+ }
+ }
+
+ if (leftmost)
+ edf_rq->rb_leftmost = &edf_se->rb_node;
+
+ rb_link_node(&edf_se->rb_node, parent, link);
+ rb_insert_color(&edf_se->rb_node, &edf_rq->rb_root);
+
+ edf_rq->edf_nr_running++;
+}
+
+static void dequeue_edf_entity(struct sched_edf_entity *edf_se)
+{
+ struct edf_rq *edf_rq = edf_rq_of_se(edf_se);
+
+ if (RB_EMPTY_NODE(&edf_se->rb_node))
+ return;
+
+ if (edf_rq->rb_leftmost == &edf_se->rb_node) {
+ struct rb_node *next_node;
+ struct sched_edf_entity *next;
+
+ next_node = rb_next(&edf_se->rb_node);
+ edf_rq->rb_leftmost = next_node;
+
+ if (next_node)
+ next = rb_entry(next_node, struct sched_edf_entity,
+ rb_node);
+ }
+
+ rb_erase(&edf_se->rb_node, &edf_rq->rb_root);
+ RB_CLEAR_NODE(&edf_se->rb_node);
+
+ edf_rq->edf_nr_running--;
+}
+
+static void check_preempt_curr_edf(struct rq *rq, struct task_struct *p,
+ int sync)
+{
+ if (unlikely(rt_prio(p->prio)))
+ goto resched;
+
+ if (unlikely(p->sched_class != &edf_sched_class))
+ return;
+
+ if (edf_time_before(p->edf.deadline, rq->curr->edf.deadline))
+resched:
+ resched_task(rq->curr);
+}
+
+static void check_edf_preempt_curr(struct task_struct *p, struct rq *rq)
+{
+ if (rt_prio(rq->curr->prio))
+ return;
+
+ if (rq->curr->sched_class != &edf_sched_class ||
+ edf_time_before(p->edf.deadline, rq->curr->edf.deadline))
+ //if (!task_running(p);
+ resched_task(rq->curr);
+}
+
+static void
+enqueue_task_edf(struct rq *rq, struct task_struct *p, int wakeup)
+{
+ struct sched_edf_entity *edf_se = &p->edf;
+
+ BUG_ON(on_edf_rq(edf_se));
+
+ /*
+ * Only enqueue entities with some remaining runtime.
+ */
+ if (edf_se->edf_flags & EDF_THROTTLED)
+ return;
+
+ update_edf_params(edf_se);
+ enqueue_edf_entity(edf_se);
+}
+
+static void
+dequeue_task_edf(struct rq *rq, struct task_struct *p, int sleep)
+{
+ struct sched_edf_entity *edf_se = &p->edf;
+
+ if (!on_edf_rq(edf_se))
+ return;
+
+ update_curr_edf(rq);
+ dequeue_edf_entity(edf_se);
+}
+
+static void yield_task_edf(struct rq *rq)
+{
+ struct task_struct *p = rq->curr;
+ struct sched_edf_entity *edf_se = &p->edf;
+
+ edf_se->edf_flags |= EDF_ENDED;
+}
+
+#ifdef CONFIG_SCHED_HRTICK
+static void start_hrtick_edf(struct rq *rq, struct task_struct *p)
+{
+ struct sched_edf_entity *edf_se = &p->edf;
+ s64 delta;
+
+ delta = edf_se->runtime_max - edf_se->runtime;
+
+ if (delta > 10000)
+ hrtick_start(rq, delta);
+}
+#else
+static void start_hrtick_edf(struct rq *rq, struct task_struct *p)
+{
+}
+#endif
+
+static struct sched_edf_entity *__pick_edf_last_entity(struct edf_rq *edf_rq)
+{
+ struct rb_node *last = rb_last(&edf_rq->rb_root);
+
+ if (!last)
+ return NULL;
+
+ return rb_entry(last, struct sched_edf_entity, rb_node);
+}
+
+static struct sched_edf_entity *pick_next_edf_entity(struct rq *rq,
+ struct edf_rq *edf_rq)
+{
+ struct rb_node *left = edf_rq->rb_leftmost;
+
+ if (!left)
+ return NULL;
+
+ return rb_entry(left, struct sched_edf_entity, rb_node);
+}
+
+struct task_struct *pick_next_task_edf(struct rq *rq)
+{
+ struct sched_edf_entity *edf_se;
+ struct edf_rq *edf_rq = &rq->edf;
+ struct task_struct *p;
+
+ BUG_ON(edf_rq->edf_nr_running < 0);
+
+ if (!edf_rq->edf_nr_running)
+ return NULL;
+
+ edf_se = pick_next_edf_entity(rq, edf_rq);
+ BUG_ON(!edf_se);
+
+ p = edf_task_of(edf_se);
+ p->se.exec_start = rq->clock;
+#ifdef CONFIG_SCHED_HRTICK
+ if (hrtick_enabled(rq))
+ start_hrtick_edf(rq, p);
+#endif
+ return p;
+}
+
+static void put_prev_task_edf(struct rq *rq, struct task_struct *p)
+{
+ update_curr_edf(rq);
+ p->se.exec_start = 0;
+}
+
+static void task_tick_edf(struct rq *rq, struct task_struct *p, int queued)
+{
+ update_curr_edf(rq);
+}
+
+static void set_curr_task_edf(struct rq *rq)
+{
+ struct task_struct *p = rq->curr;
+
+ p->se.exec_start = rq->clock;
+}
+
+static void prio_changed_edf(struct rq *rq, struct task_struct *p,
+ int oldprio, int running)
+{
+ check_edf_preempt_curr(p, rq);
+}
+
+static void switched_to_edf(struct rq *rq, struct task_struct *p,
+ int running)
+{
+ check_edf_preempt_curr(p, rq);
+}
+
+#ifdef CONFIG_SMP
+static int select_task_rq_edf(struct task_struct *p, int sd_flag, int flags)
+{
+ return task_cpu(p);
+}
+
+static unsigned long
+load_balance_edf(struct rq *this_rq, int this_cpu, struct rq *busiest,
+ unsigned long max_load_move,
+ struct sched_domain *sd, enum cpu_idle_type idle,
+ int *all_pinned, int *this_best_prio)
+{
+ /* for now, don't touch EDF tasks */
+ return 0;
+}
+
+static int
+move_one_task_edf(struct rq *this_rq, int this_cpu, struct rq *busiest,
+ struct sched_domain *sd, enum cpu_idle_type idle)
+{
+ return 0;
+}
+
+static void set_cpus_allowed_edf(struct task_struct *p,
+ const struct cpumask *new_mask)
+{
+ int weight = cpumask_weight(new_mask);
+
+ BUG_ON(!edf_task(p));
+
+ cpumask_copy(&p->cpus_allowed, new_mask);
+ p->edf.nr_cpus_allowed = weight;
+}
+#endif
+
+static const struct sched_class edf_sched_class = {
+ .next = &fair_sched_class,
+ .enqueue_task = enqueue_task_edf,
+ .dequeue_task = dequeue_task_edf,
+ .yield_task = yield_task_edf,
+
+ .check_preempt_curr = check_preempt_curr_edf,
+
+ .pick_next_task = pick_next_task_edf,
+ .put_prev_task = put_prev_task_edf,
+
+#ifdef CONFIG_SMP
+ .select_task_rq = select_task_rq_edf,
+
+ .load_balance = load_balance_edf,
+ .move_one_task = move_one_task_edf,
+ .set_cpus_allowed = set_cpus_allowed_edf,
+#endif
+
+ .set_curr_task = set_curr_task_edf,
+ .task_tick = task_tick_edf,
+
+ .prio_changed = prio_changed_edf,
+ .switched_to = switched_to_edf,
+};
+
+#ifdef CONFIG_SCHED_DEBUG
+void print_edf_rq(struct seq_file *m, int cpu, struct edf_rq *edf_rq);
+
+static void print_edf_stats(struct seq_file *m, int cpu)
+{
+ struct edf_rq *edf_rq = &cpu_rq(cpu)->edf;
+
+ rcu_read_lock();
+ for_each_leaf_edf_rq(edf_rq, cpu_rq(cpu))
+ print_edf_rq(m, cpu, edf_rq);
+ rcu_read_unlock();
+}
+#endif /* CONFIG_SCHED_DEBUG */
+
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ecc637a..cae7fb8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1571,10 +1571,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_

update_curr(cfs_rq);

- if (unlikely(rt_prio(p->prio))) {
+ if (rt_prio(p->prio) || task_has_edf_policy(p))
resched_task(curr);
- return;
- }

if (unlikely(p->sched_class != &fair_sched_class))
return;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a4d790c..4511fc9 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1746,7 +1746,7 @@ unsigned int get_rr_interval_rt(struct task_struct *task)
}

static const struct sched_class rt_sched_class = {
- .next = &fair_sched_class,
+ .next = &edf_sched_class,
.enqueue_task = enqueue_task_rt,
.dequeue_task = dequeue_task_rt,
.yield_task = yield_task_rt,
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 26185d7..d32aaf0 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -24,6 +24,7 @@ static int __read_mostly tracer_enabled;

static struct task_struct *wakeup_task;
static int wakeup_cpu;
+static int wakeup_edf;
static int wakeup_current_cpu;
static unsigned wakeup_prio = -1;
static int wakeup_rt;
@@ -214,6 +215,9 @@ probe_wakeup(struct rq *rq, struct task_struct *p, int success)
tracing_record_cmdline(p);
tracing_record_cmdline(current);

+ if (wakeup_edf && !edf_task(p))
+ return;
+
if ((wakeup_rt && !rt_task(p)) ||
p->prio >= wakeup_prio ||
p->prio >= current->prio)
@@ -340,12 +344,21 @@ static int __wakeup_tracer_init(struct trace_array *tr)

static int wakeup_tracer_init(struct trace_array *tr)
{
+ wakeup_edf = 0;
+ wakeup_rt = 0;
+ return __wakeup_tracer_init(tr);
+}
+
+static int wakeup_edf_tracer_init(struct trace_array *tr)
+{
+ wakeup_edf = 1;
wakeup_rt = 0;
return __wakeup_tracer_init(tr);
}

static int wakeup_rt_tracer_init(struct trace_array *tr)
{
+ wakeup_edf = 0;
wakeup_rt = 1;
return __wakeup_tracer_init(tr);
}
@@ -384,6 +397,20 @@ static struct tracer wakeup_tracer __read_mostly =
#endif
};

+static struct tracer wakeup_edf_tracer __read_mostly =
+{
+ .name = "wakeup_edf",
+ .init = wakeup_edf_tracer_init,
+ .reset = wakeup_tracer_reset,
+ .start = wakeup_tracer_start,
+ .stop = wakeup_tracer_stop,
+ .wait_pipe = poll_wait_pipe,
+ .print_max = 1,
+#ifdef CONFIG_FTRACE_SELFTEST
+ .selftest = trace_selftest_startup_wakeup,
+#endif
+};
+
static struct tracer wakeup_rt_tracer __read_mostly =
{
.name = "wakeup_rt",
@@ -406,6 +433,10 @@ __init static int init_wakeup_tracer(void)
if (ret)
return ret;

+ ret = register_tracer(&wakeup_edf_tracer);
+ if (ret)
+ return ret;
+
ret = register_tracer(&wakeup_rt_tracer);
if (ret)
return ret;
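
Once applied, the new tracer can presumably be exercised like the existing
wakeup/wakeup_rt tracers. A minimal sketch, assuming debugfs is mounted at
/sys/kernel/debug and the latency tracers are enabled in the config:

  # select the tracer registered above by its .name (as root)
  echo wakeup_edf > /sys/kernel/debug/tracing/current_tracer
  # run an EDF task, then read back the worst wakeup latency recorded
  cat /sys/kernel/debug/tracing/tracing_max_latency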

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-09-25 05:35:47

by Dario Faggioli

[permalink] [raw]
Subject: Re: massive_intr on CFS, BFS, and EDF

On Fri, 2009-09-25 at 12:27 +0900, Satoru Takeuchi wrote:
> In addition, aptitude and oocalc are not SCHED_EDF, but SCHED_OTHER
> (SCHED_FAIR). So you probably broke some code other than the sched_edf
> part.
>
Strange... It does not happen here, I'll investigate.

Thanks again,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-09-25 17:08:31

by Michael Trimarchi

[permalink] [raw]
Subject: Re: massive_intr on CFS, BFS, and EDF

Hi,
sat wrote:
> Hi,
>
> I tried massive_intr, a process scheduler's fairness and throughtput testing
> program for massive interactive processes, on vanilla 2.6.31, 2.6.31-bfs211
> (with BFS) and 2.6.31-edf(latest linus tree + EDF patch).
>
> CFS and BFS look good. CFS has better fairness, and BFS has better throughput.
> EDF looks unfair and is unstable. I tested 3 times and the tendency was the
> same.
Thank you for your testing. The instability problems you hit are due to
some bugs in our implementation; we'll try to reproduce and fix them.
Your test shows that we have some problems with how we handle fork():
the code attempts to distribute the bandwidth between the parent and
the child, and the fact that the children hang means that there is
something wrong in how we handle recharges. It is important to note that
the expected behavior of an edf scheduler is *not* a fair one. It has to
do its best to guarantee the deadlines of the admitted tasks.
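
A minimal sketch of the fork-time split being described, for illustration
only: the struct fields and the helper name below are placeholders, not
the patch's actual identifiers; only the halving policy itself comes from
this thread.

  /*
   * Illustrative sketch: on fork(), the parent keeps half of its
   * per-period runtime and the child inherits the other half with the
   * same period, so every fork halves both bandwidths.
   */
  static void edf_split_bandwidth(struct task_struct *parent,
                                  struct task_struct *child)
  {
          parent->edf.max_runtime /= 2;                  /* placeholder field */
          child->edf.max_runtime   = parent->edf.max_runtime;
          child->edf.period        = parent->edf.period; /* same period */
  }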

Regards Michael

2009-09-25 21:33:46

by Chris Friesen

[permalink] [raw]
Subject: Re: massive_intr on CFS, BFS, and EDF

On 09/25/2009 10:07 AM, Michael Trimarchi wrote:
> It is important to note that
> the expected behavior of an edf scheduler is *not* a fair one. It has to
> do its best to guarantee the deadlines of the admitted tasks.

Do you allow oversubscription with EDF? It would seem so based on these
results. Would it maybe make sense to disallow oversubscription, or
make it an option?

If you have massive oversubscription with EDF, what is the design
intent? Do you try to meet the goals on as many tasks as possible,
while the oversubscribed tasks get nothing?

Chris

2009-09-25 22:37:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: massive_intr on CFS, BFS, and EDF

On Fri, 2009-09-25 at 15:31 -0600, Chris Friesen wrote:
> On 09/25/2009 10:07 AM, Michael Trimarchi wrote:
> > It is important to note that
> > the expected behavior of an edf scheduler is *not* a fair one. It has to
> > do its best to guarantee the deadlines of the admitted tasks.
>
> Do you allow oversubscription with EDF? It would seem so based on these
> results. Would it maybe make sense to disallow oversubscription, or
> make it an option?

afaiu he doesn't, he simply splits the task's wcet between parent and
child and (intends?) to feed back on child exit.

> If you have massive oversubscription with EDF, what is the design
> intent? Do you try to meet the goals on as many tasks as possible,
> while the oversubscribed tasks get nothing?

oversubscribing edf isn't in general recommended, iirc u>1 gives
unbounded latencies with edf.
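
For reference, the classic uniprocessor schedulability test for EDF with
implicit deadlines (Liu and Layland, 1973) is

  U = C_1/T_1 + C_2/T_2 + ... + C_n/T_n <= 1

where C_i is the worst-case execution time and T_i the period of task i.
EDF meets every deadline as long as U <= 1; once U > 1 the pending work
grows over time, so lateness is unbounded, which is the behaviour
mentioned above.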

2009-09-25 22:48:18

by Chris Friesen

[permalink] [raw]
Subject: Re: massive_intr on CFS, BFS, and EDF

On 09/25/2009 04:37 PM, Peter Zijlstra wrote:
> On Fri, 2009-09-25 at 15:31 -0600, Chris Friesen wrote:

>> Do you allow oversubscription with EDF? It would seem so based on these
>> results. Would it maybe make sense to disallow oversubscription, or
>> make it an option?
>
> afaiu he doesn't, he simply splits the task's wcet between parent and
> child and (intends?) to feed back on child exit.

How does this work if we have one parent task that forks off a bunch of
kids like massive_intr? Does each child get half the bandwidth of the
previous child?

> oversubscribing edf isn't in general recommended, iirc u>1 gives
> unbounded latencies with edf.

Right...so would it perhaps be more interesting to try a modified test
where each child's bandwidth is equal and small enough that there is no
oversubscription?

Chris

2009-09-26 06:50:33

by Dario Faggioli

[permalink] [raw]
Subject: Re: massive_intr on CFS, BFS, and EDF

On Fri, 2009-09-25 at 15:31 -0600, Chris Friesen wrote:
> Do you allow oversubscription with EDF? It would seem so based on these
> results. Would it maybe make sense to disallow oversubscription, or
> make it an option?
>
What I want is for the user to be able to achieve whatever result he likes.

Thus, with EDF_GROUP_SCHED enabled, you first have to assign a bandwidth
value, at least to the root group. If 100% (e.g., on UP) is used as that
value, oversubscription will never happen.

If EDF_GROUP_SCHED is off, nothing is in place right now, but I have
already planned to add it: I'm introducing something like
sysctl_sched_rt_runtime and sysctl_sched_rt_period in the new version
I'm working on right now.
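
For comparison, the existing RT-throttling knobs that such new sysctls
would mirror look like this (the EDF equivalents do not exist yet in the
posted patch, so any sched_edf_* name would be hypothetical):

  $ cat /proc/sys/kernel/sched_rt_period_us
  1000000
  $ cat /proc/sys/kernel/sched_rt_runtime_us
  950000

i.e. by default the realtime classes may use at most 950 ms of every 1 s;
an analogous runtime/period pair for SCHED_EDF would cap the total
admitted EDF bandwidth even with EDF_GROUP_SCHED disabled.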

> If you have massive oversubscription with EDF, what is the design
> intent? Do you try to meet the goals on as many tasks as possible,
> while the oversubscribed tasks get nothing?
>
No, nothing like that... So using this scheduler without avoiding system
overload would be nonsense, unless some kind of logic (as you were
describing) is implemented in userspace, and that's why I think
oversubscription avoidance should be an available, but also
configurable, feature... Comments on that?

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-09-26 06:56:12

by Dario Faggioli

[permalink] [raw]
Subject: Re: massive_intr on CFS, BFS, and EDF

On Sat, 2009-09-26 at 00:37 +0200, Peter Zijlstra wrote:
> On Fri, 2009-09-25 at 15:31 -0600, Chris Friesen wrote:
> > Do you allow oversubscription with EDF? It would seem so based on these
> > results. Would it maybe make sense to disallow oversubscription, or
> > make it an option?
>
> afaiu he doesn't,
yeah, as explained in the previous mail, this is in place only if group
scheduling is on for now, but I'll add these bits to the non-group
solution; I have already planned that. :-)

> he simply splits the task's wcet between parent and
> child and (intends?) to feed back on child exit.
>
yes again, that's what the submitted patch does, which is an arbitrary
choice among all the non-optimal solutions I'm able to think of, as
explained in the first e-mail! :-(

Now, I'm going to give your suggestion of assigning children 0 bandwidth
a shot, and ask the parent (the children can't run on their own, since
they have no bandwidth!) to give them some runtime/deadline to make them
run.
This also has some issues, I think, but looks more natural... At least
it does not risk reducing the parent's bandwidth to (almost) 0, as the
current approach can!! :-(

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-09-26 07:22:21

by Dario Faggioli

[permalink] [raw]
Subject: Re: massive_intr on CFS, BFS, and EDF

On Fri, 2009-09-25 at 16:46 -0600, Chris Friesen wrote:
> On 09/25/2009 04:37 PM, Peter Zijlstra wrote:
> > afaiu he doesn't, he simply splits the task's wcet between parent and
> > child and (intends?) to feed back on child exit.
>
> How does this work if we have one parent task that forks off a bunch of
> kids like massive_intr? Does each child get half the bandwidth of the
> previous child?
>
Yes, that's exactly how I think it goes... Moreover, the parent's
bandwidth is also halved at each fork, which makes things very odd in
workloads like massive_intr! :-(
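
To put rough numbers on that: if the reservation really is halved at every
fork, the k-th of massive_intr's 30 children only ever receives 1/2^k of
the original budget, and after all 30 forks the parent itself is left with
1/2^30 (about a billionth) of what it started with. With a per-period
budget in the tens of milliseconds, child 20 is already down to tens of
nanoseconds per period, so the later children effectively never run, which
would explain the unfairness observed in the test.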

> Right...so would it perhaps be more interesting to try a modified test
> where each child's bandwidth is equal and small enough that there is no
> oversubscription?
>
So right! I think that should be the way to go, and I'll try to do
something like this as soon as I can... I promise... :-D

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]

