Subject: Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI

Peter, Dario,


On Tue, Dec 17, 2013 at 1:27 PM, Peter Zijlstra <[email protected]> wrote:
> From: Dario Faggioli <[email protected]>
>
> Add the syscalls needed for supporting scheduling algorithms
> with extended scheduling parameters (e.g., SCHED_DEADLINE).
>
> In general, it makes possible to specify a periodic/sporadic task,
> that executes for a given amount of runtime at each instance, and is
> scheduled according to the urgency of their own timing constraints,
> i.e.:
>
> - a (maximum/typical) instance execution time,
> - a minimum interval between consecutive instances,
> - a time constraint by which each instance must be completed.
>
> Thus, both the data structure that holds the scheduling parameters of
> the tasks and the system calls dealing with it must be extended.
> Unfortunately, modifying the existing struct sched_param would break
> the ABI and result in potentially serious compatibility issues with
> legacy binaries.
>
> For these reasons, this patch:
>
> - defines the new struct sched_attr, containing all the fields
> that are necessary for specifying a task in the computational
> model described above;
> - defines and implements the new scheduling related syscalls that
> manipulate it, i.e., sched_setscheduler2(), sched_setattr()
> and sched_getattr().

Is someone (e.g., one of you) planning to write man pages for the new
sched_setattr() and sched_getattr() system calls? (Also, for the
future, please CC [email protected] on patches that change the
API, then those of us who don't follow LKML get a heads up about
upcoming API changes.)

Thanks,

Michael


> Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
> proof of concept and for developing and testing purposes. Making them
> available on other architectures is straightforward.
>
> Since no "user" for these new parameters is introduced in this patch,
> the implementation of the new system calls is just identical to their
> already existing counterpart. Future patches that implement scheduling
> policies able to exploit the new data structure must also take care of
> modifying the sched_*attr() calls accordingly with their own purposes.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Dario Faggioli <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> [ Twiddled the changelog. ]
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> arch/arm/include/asm/unistd.h | 2
> arch/arm/include/uapi/asm/unistd.h | 3
> arch/arm/kernel/calls.S | 3
> arch/x86/syscalls/syscall_32.tbl | 3
> arch/x86/syscalls/syscall_64.tbl | 3
> include/linux/sched.h | 54 ++++++++
> include/linux/syscalls.h | 8 +
> kernel/sched/core.c | 234 +++++++++++++++++++++++++++++++++++--
> 8 files changed, 298 insertions(+), 12 deletions(-)
>
> --- a/arch/arm/include/asm/unistd.h
> +++ b/arch/arm/include/asm/unistd.h
> @@ -15,7 +15,7 @@
>
> #include <uapi/asm/unistd.h>
>
> -#define __NR_syscalls (380)
> +#define __NR_syscalls (383)
> #define __ARM_NR_cmpxchg (__ARM_NR_BASE+0x00fff0)
>
> #define __ARCH_WANT_STAT64
> --- a/arch/arm/include/uapi/asm/unistd.h
> +++ b/arch/arm/include/uapi/asm/unistd.h
> @@ -406,6 +406,9 @@
> #define __NR_process_vm_writev (__NR_SYSCALL_BASE+377)
> #define __NR_kcmp (__NR_SYSCALL_BASE+378)
> #define __NR_finit_module (__NR_SYSCALL_BASE+379)
> +#define __NR_sched_setscheduler2 (__NR_SYSCALL_BASE+380)
> +#define __NR_sched_setattr (__NR_SYSCALL_BASE+381)
> +#define __NR_sched_getattr (__NR_SYSCALL_BASE+382)
>
> /*
> * This may need to be greater than __NR_last_syscall+1 in order to
> --- a/arch/arm/kernel/calls.S
> +++ b/arch/arm/kernel/calls.S
> @@ -389,6 +389,9 @@
> CALL(sys_process_vm_writev)
> CALL(sys_kcmp)
> CALL(sys_finit_module)
> +/* 380 */ CALL(sys_sched_setscheduler2)
> + CALL(sys_sched_setattr)
> + CALL(sys_sched_getattr)
> #ifndef syscalls_counted
> .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
> #define syscalls_counted
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -357,3 +357,6 @@
> 348 i386 process_vm_writev sys_process_vm_writev compat_sys_process_vm_writev
> 349 i386 kcmp sys_kcmp
> 350 i386 finit_module sys_finit_module
> +351 i386 sched_setattr sys_sched_setattr
> +352 i386 sched_getattr sys_sched_getattr
> +353 i386 sched_setscheduler2 sys_sched_setscheduler2
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -320,6 +320,9 @@
> 311 64 process_vm_writev sys_process_vm_writev
> 312 common kcmp sys_kcmp
> 313 common finit_module sys_finit_module
> +314 common sched_setattr sys_sched_setattr
> +315 common sched_getattr sys_sched_getattr
> +316 common sched_setscheduler2 sys_sched_setscheduler2
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -56,6 +56,58 @@ struct sched_param {
>
> #include <asm/processor.h>
>
> +#define SCHED_ATTR_SIZE_VER0 40 /* sizeof first published struct */
> +
> +/*
> + * Extended scheduling parameters data structure.
> + *
> + * This is needed because the original struct sched_param can not be
> + * altered without introducing ABI issues with legacy applications
> + * (e.g., in sched_getparam()).
> + *
> + * However, the possibility of specifying more than just a priority for
> + * the tasks may be useful for a wide variety of application fields, e.g.,
> + * multimedia, streaming, automation and control, and many others.
> + *
> + * This variant (sched_attr) is meant at describing a so-called
> + * sporadic time-constrained task. In such model a task is specified by:
> + * - the activation period or minimum instance inter-arrival time;
> + * - the maximum (or average, depending on the actual scheduling
> + * discipline) computation time of all instances, a.k.a. runtime;
> + * - the deadline (relative to the actual activation time) of each
> + * instance.
> + * Very briefly, a periodic (sporadic) task asks for the execution of
> + * some specific computation --which is typically called an instance--
> + * (at most) every period. Moreover, each instance typically lasts no more
> + * than the runtime and must be completed by time instant t equal to
> + * the instance activation time + the deadline.
> + *
> + * This is reflected by the actual fields of the sched_attr structure:
> + *
> + * @sched_priority task's priority (might still be useful)
> + * @sched_flags for customizing the scheduler behaviour
> + * @sched_deadline representative of the task's deadline
> + * @sched_runtime representative of the task's runtime
> + * @sched_period representative of the task's period
> + *
> + * Given this task model, there are a multiplicity of scheduling algorithms
> + * and policies, that can be used to ensure all the tasks will make their
> + * timing constraints.
> + *
> + * @size size of the structure, for fwd/bwd compat.
> + */
> +struct sched_attr {
> + int sched_priority;
> + unsigned int sched_flags;
> + u64 sched_runtime;
> + u64 sched_deadline;
> + u64 sched_period;
> + u32 size;
> +
> + /* Align to u64. */
> + u32 __reserved;
> +};
> +
> struct exec_domain;
> struct futex_pi_state;
> struct robust_list_head;
> @@ -1960,6 +2012,8 @@ extern int sched_setscheduler(struct tas
> const struct sched_param *);
> extern int sched_setscheduler_nocheck(struct task_struct *, int,
> const struct sched_param *);
> +extern int sched_setscheduler2(struct task_struct *, int,
> + const struct sched_attr *);
> extern struct task_struct *idle_task(int cpu);
> /**
> * is_idle_task - is the specified task an idle task?
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -38,6 +38,7 @@ struct rlimit;
> struct rlimit64;
> struct rusage;
> struct sched_param;
> +struct sched_attr;
> struct sel_arg_struct;
> struct semaphore;
> struct sembuf;
> @@ -277,11 +278,18 @@ asmlinkage long sys_clock_nanosleep(cloc
> asmlinkage long sys_nice(int increment);
> asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
> struct sched_param __user *param);
> +asmlinkage long sys_sched_setscheduler2(pid_t pid, int policy,
> + struct sched_attr __user *attr);
> asmlinkage long sys_sched_setparam(pid_t pid,
> struct sched_param __user *param);
> +asmlinkage long sys_sched_setattr(pid_t pid,
> + struct sched_attr __user *attr);
> asmlinkage long sys_sched_getscheduler(pid_t pid);
> asmlinkage long sys_sched_getparam(pid_t pid,
> struct sched_param __user *param);
> +asmlinkage long sys_sched_getattr(pid_t pid,
> + struct sched_attr __user *attr,
> + unsigned int size);
> asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
> unsigned long __user *user_mask_ptr);
> asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3023,7 +3023,8 @@ static bool check_same_owner(struct task
> }
>
> static int __sched_setscheduler(struct task_struct *p, int policy,
> - const struct sched_param *param, bool user)
> + const struct sched_attr *attr,
> + bool user)
> {
> int retval, oldprio, oldpolicy = -1, on_rq, running;
> unsigned long flags;
> @@ -3053,11 +3054,11 @@ static int __sched_setscheduler(struct t
> * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL,
> * SCHED_BATCH and SCHED_IDLE is 0.
> */
> - if (param->sched_priority < 0 ||
> - (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
> - (!p->mm && param->sched_priority > MAX_RT_PRIO-1))
> + if (attr->sched_priority < 0 ||
> + (p->mm && attr->sched_priority > MAX_USER_RT_PRIO-1) ||
> + (!p->mm && attr->sched_priority > MAX_RT_PRIO-1))
> return -EINVAL;
> - if (rt_policy(policy) != (param->sched_priority != 0))
> + if (rt_policy(policy) != (attr->sched_priority != 0))
> return -EINVAL;
>
> /*
> @@ -3073,8 +3074,8 @@ static int __sched_setscheduler(struct t
> return -EPERM;
>
> /* can't increase priority */
> - if (param->sched_priority > p->rt_priority &&
> - param->sched_priority > rlim_rtprio)
> + if (attr->sched_priority > p->rt_priority &&
> + attr->sched_priority > rlim_rtprio)
> return -EPERM;
> }
>
> @@ -3123,7 +3124,7 @@ static int __sched_setscheduler(struct t
> * If not changing anything there's no need to proceed further:
> */
> if (unlikely(policy == p->policy && (!rt_policy(policy) ||
> - param->sched_priority == p->rt_priority))) {
> + attr->sched_priority == p->rt_priority))) {
> task_rq_unlock(rq, p, &flags);
> return 0;
> }
> @@ -3160,7 +3161,7 @@ static int __sched_setscheduler(struct t
>
> oldprio = p->prio;
> prev_class = p->sched_class;
> - __setscheduler(rq, p, policy, param->sched_priority);
> + __setscheduler(rq, p, policy, attr->sched_priority);
>
> if (running)
> p->sched_class->set_curr_task(rq);
> @@ -3188,10 +3189,20 @@ static int __sched_setscheduler(struct t
> int sched_setscheduler(struct task_struct *p, int policy,
> const struct sched_param *param)
> {
> - return __sched_setscheduler(p, policy, param, true);
> + struct sched_attr attr = {
> + .sched_priority = param->sched_priority
> + };
> + return __sched_setscheduler(p, policy, &attr, true);
> }
> EXPORT_SYMBOL_GPL(sched_setscheduler);
>
> +int sched_setscheduler2(struct task_struct *p, int policy,
> + const struct sched_attr *attr)
> +{
> + return __sched_setscheduler(p, policy, attr, true);
> +}
> +EXPORT_SYMBOL_GPL(sched_setscheduler2);
> +
> /**
> * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
> * @p: the task in question.
> @@ -3208,7 +3219,10 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
> int sched_setscheduler_nocheck(struct task_struct *p, int policy,
> const struct sched_param *param)
> {
> - return __sched_setscheduler(p, policy, param, false);
> + struct sched_attr attr = {
> + .sched_priority = param->sched_priority
> + };
> + return __sched_setscheduler(p, policy, &attr, false);
> }
>
> static int
> @@ -3233,6 +3247,97 @@ do_sched_setscheduler(pid_t pid, int pol
> return retval;
> }
>
> +/*
> + * Mimics kernel/events/core.c perf_copy_attr().
> + */
> +static int sched_copy_attr(struct sched_attr __user *uattr,
> + struct sched_attr *attr)
> +{
> + u32 size;
> + int ret;
> +
> + if (!access_ok(VERIFY_WRITE, uattr, SCHED_ATTR_SIZE_VER0))
> + return -EFAULT;
> +
> + /*
> + * zero the full structure, so that a short copy will be nice.
> + */
> + memset(attr, 0, sizeof(*attr));
> +
> + ret = get_user(size, &uattr->size);
> + if (ret)
> + return ret;
> +
> + if (size > PAGE_SIZE) /* silly large */
> + goto err_size;
> +
> + if (!size) /* abi compat */
> + size = SCHED_ATTR_SIZE_VER0;
> +
> + if (size < SCHED_ATTR_SIZE_VER0)
> + goto err_size;
> +
> + /*
> + * If we're handed a bigger struct than we know of,
> + * ensure all the unknown bits are 0 - i.e. new
> + * user-space does not rely on any kernel feature
> + * extensions we don't know about yet.
> + */
> + if (size > sizeof(*attr)) {
> + unsigned char __user *addr;
> + unsigned char __user *end;
> + unsigned char val;
> +
> + addr = (void __user *)uattr + sizeof(*attr);
> + end = (void __user *)uattr + size;
> +
> + for (; addr < end; addr++) {
> + ret = get_user(val, addr);
> + if (ret)
> + return ret;
> + if (val)
> + goto err_size;
> + }
> + size = sizeof(*attr);
> + }
> +
> + ret = copy_from_user(attr, uattr, size);
> + if (ret)
> + return -EFAULT;
> +
> +out:
> + return ret;
> +
> +err_size:
> + put_user(sizeof(*attr), &uattr->size);
> + ret = -E2BIG;
> + goto out;
> +}
> +
> +static int
> +do_sched_setscheduler2(pid_t pid, int policy,
> + struct sched_attr __user *attr_uptr)
> +{
> + struct sched_attr attr;
> + struct task_struct *p;
> + int retval;
> +
> + if (!attr_uptr || pid < 0)
> + return -EINVAL;
> +
> + if (sched_copy_attr(attr_uptr, &attr))
> + return -EFAULT;
> +
> + rcu_read_lock();
> + retval = -ESRCH;
> + p = find_process_by_pid(pid);
> + if (p != NULL)
> + retval = sched_setscheduler2(p, policy, &attr);
> + rcu_read_unlock();
> +
> + return retval;
> +}
> +
> /**
> * sys_sched_setscheduler - set/change the scheduler policy and RT priority
> * @pid: the pid in question.
> @@ -3252,6 +3357,21 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_
> }
>
> /**
> + * sys_sched_setscheduler2 - same as above, but with extended sched_param
> + * @pid: the pid in question.
> + * @policy: new policy (could use extended sched_param).
> + * @attr: structure containing the extended parameters.
> + */
> +SYSCALL_DEFINE3(sched_setscheduler2, pid_t, pid, int, policy,
> + struct sched_attr __user *, attr)
> +{
> + if (policy < 0)
> + return -EINVAL;
> +
> + return do_sched_setscheduler2(pid, policy, attr);
> +}
> +
> +/**
> * sys_sched_setparam - set/change the RT priority of a thread
> * @pid: the pid in question.
> * @param: structure containing the new RT priority.
> @@ -3264,6 +3384,17 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, p
> }
>
> /**
> + * sys_sched_setattr - same as above, but with extended sched_attr
> + * @pid: the pid in question.
> + * @attr: structure containing the extended parameters.
> + */
> +SYSCALL_DEFINE2(sched_setattr, pid_t, pid,
> + struct sched_attr __user *, attr)
> +{
> + return do_sched_setscheduler2(pid, -1, attr);
> +}
> +
> +/**
> * sys_sched_getscheduler - get the policy (scheduling class) of a thread
> * @pid: the pid in question.
> *
> @@ -3329,6 +3460,87 @@ SYSCALL_DEFINE2(sched_getparam, pid_t, p
> return retval;
>
> out_unlock:
> + rcu_read_unlock();
> + return retval;
> +}
> +
> +static int sched_read_attr(struct sched_attr __user *uattr,
> + struct sched_attr *attr,
> + unsigned int usize)
> +{
> + int ret;
> +
> + if (!access_ok(VERIFY_WRITE, uattr, usize))
> + return -EFAULT;
> +
> + /*
> + * If we're handed a smaller struct than we know of,
> + * ensure all the unknown bits are 0 - i.e. old
> + * user-space does not get incomplete information.
> + */
> + if (usize < sizeof(*attr)) {
> + unsigned char *addr;
> + unsigned char *end;
> +
> + addr = (void *)attr + usize;
> + end = (void *)attr + sizeof(*attr);
> +
> + for (; addr < end; addr++) {
> + if (*addr)
> + goto err_size;
> + }
> +
> + attr->size = usize;
> + }
> +
> + ret = copy_to_user(uattr, attr, usize);
> + if (ret)
> + return -EFAULT;
> +
> +out:
> + return ret;
> +
> +err_size:
> + ret = -E2BIG;
> + goto out;
> +}
> +
> +/**
> + * sys_sched_getattr - same as above, but with extended "sched_param"
> + * @pid: the pid in question.
> + * @attr: structure containing the extended parameters.
> + * @size: sizeof(attr) for fwd/bwd comp.
> + */
> +SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> + unsigned int, size)
> +{
> + struct sched_attr attr = {
> + .size = sizeof(struct sched_attr),
> + };
> + struct task_struct *p;
> + int retval;
> +
> + if (!uattr || pid < 0 || size > PAGE_SIZE ||
> + size < SCHED_ATTR_SIZE_VER0)
> + return -EINVAL;
> +
> + rcu_read_lock();
> + p = find_process_by_pid(pid);
> + retval = -ESRCH;
> + if (!p)
> + goto out_unlock;
> +
> + retval = security_task_getscheduler(p);
> + if (retval)
> + goto out_unlock;
> +
> + attr.sched_priority = p->rt_priority;
> + rcu_read_unlock();
> +
> + retval = sched_read_attr(uattr, &attr, size);
> + return retval;
> +
> +out_unlock:
> rcu_read_unlock();
> return retval;
> }
>
>



--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/


2014-01-21 15:39:34

by Peter Zijlstra

Subject: Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI

On Tue, Jan 21, 2014 at 03:36:37PM +0100, Michael Kerrisk wrote:
> Peter, Dario,

> Is someone (e.g., one of you) planning to write man pages for the new
> sched_setattr() and sched_getattr() system calls? (Also, for the
> future, please CC [email protected] on patches that change the
> API, then those of us who don't follow LKML get a heads up about
> upcoming API changes.)

First draft below, shamelessly stolen from SCHED_SETSCHEDULER(2).

One note on both the original and the draft below: "process" is
ambiguous; the syscalls actually apply to a single thread of a process,
not the entire process.

---


NAME
sched_setattr, sched_getattr - set and get scheduling policy/attributes

SYNOPSIS
#include <sched.h>

struct sched_attr {
u32 size;

u32 sched_policy;
u64 sched_flags;

/* SCHED_NORMAL, SCHED_BATCH */
s32 sched_nice;

/* SCHED_FIFO, SCHED_RR */
u32 sched_priority;

/* SCHED_DEADLINE */
u64 sched_runtime;
u64 sched_deadline;
u64 sched_period;
};

int sched_setattr(pid_t pid, const struct sched_attr *attr);

int sched_getattr(pid_t pid, struct sched_attr *attr, unsigned int size);

DESCRIPTION
sched_setattr() sets both the scheduling policy and the
associated attributes for the process whose ID is specified in
pid. If pid equals zero, the scheduling policy and attributes
of the calling process will be set. The interpretation of the
argument attr depends on the selected policy. Currently, Linux
supports the following "normal" (i.e., non-real-time) scheduling
policies:

SCHED_OTHER the standard "fair" time-sharing policy;

SCHED_BATCH for "batch" style execution of processes; and

SCHED_IDLE for running very low priority background jobs.

The following "real-time" policies are also supported, for
special time-critical applications that need precise control
over the way in which runnable processes are selected for
execution:

SCHED_FIFO a first-in, first-out policy;

SCHED_RR a round-robin policy; and

SCHED_DEADLINE a deadline policy.

The semantics of each of these policies are detailed below.

sched_attr::size must be set to the size of the structure, as in
sizeof(struct sched_attr). If the provided structure is smaller
than the kernel structure, any additional fields are assumed to
be '0'. If the provided structure is larger than the kernel
structure, the kernel verifies that all additional fields are
'0'; if not, the syscall fails with -E2BIG.
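
A minimal user-space sketch of the size handshake described above,
modelled loosely on sched_copy_attr() from the quoted patch. The names
demo_accept_size, DEMO_ATTR_SIZE_VER0 and DEMO_PAGE_SIZE are hypothetical,
and the value 40 is simply the SCHED_ATTR_SIZE_VER0 of that patch; treat
it as illustrative rather than as the final ABI.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ATTR_SIZE_VER0 40u   /* hypothetical: size of the first published layout */
#define DEMO_PAGE_SIZE      4096u /* hypothetical: upper bound on accepted sizes */

/*
 * Decide how many bytes of a caller-supplied buffer a "kernel" that
 * knows ksize bytes would accept. ubuf must hold at least usize bytes
 * whenever usize passes the range checks.
 * Returns the number of bytes to copy, or -E2BIG on a size mismatch.
 * The caller is expected to zero its own structure first, so that
 * fields beyond the returned length read as 0.
 */
static long demo_accept_size(const unsigned char *ubuf, uint32_t usize,
                             uint32_t ksize)
{
    assert(ubuf != NULL);

    if (usize > DEMO_PAGE_SIZE)         /* absurdly large */
        return -E2BIG;
    if (usize == 0)                     /* treat 0 as the first layout */
        usize = DEMO_ATTR_SIZE_VER0;
    if (usize < DEMO_ATTR_SIZE_VER0)    /* smaller than any known layout */
        return -E2BIG;

    if (usize > ksize) {
        /* Newer caller: every byte we do not understand must be 0. */
        for (uint32_t i = ksize; i < usize; i++)
            if (ubuf[i] != 0)
                return -E2BIG;
        usize = ksize;
    }
    return (long)usize;
}
```

With a 48-byte kernel structure, a zero-filled 64-byte buffer is accepted
and truncated to 48 bytes, while the same buffer with any non-zero byte in
the tail is rejected.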

sched_attr::sched_policy is the desired scheduling policy.

sched_attr::sched_flags holds additional flags that can influence
scheduling behaviour. As of Linux kernel 3.14, the only supported
flag is:

SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
on fork().

sched_attr::sched_nice should only be set for SCHED_OTHER and
SCHED_BATCH; it is the desired nice value [-20,19], see NICE(2).

sched_attr::sched_priority should only be set for SCHED_FIFO and
SCHED_RR; it is the desired static priority [1,99].

sched_attr::sched_runtime
sched_attr::sched_deadline
sched_attr::sched_period should only be set for SCHED_DEADLINE
and are the traditional sporadic task model parameters.
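
As a rough sketch of what "sporadic task model parameters" implies for a
caller, the helper below checks the usual ordering runtime <= deadline <=
period before the values are handed to the kernel. The struct and function
names are hypothetical, and both the nanosecond unit and the "zero period
means period == deadline" rule are assumptions here, not something this
draft states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the three SCHED_DEADLINE fields (assumed ns). */
struct demo_dl_params {
    uint64_t runtime;   /* budget per instance */
    uint64_t deadline;  /* relative deadline of each instance */
    uint64_t period;    /* minimum inter-arrival time; 0 = use deadline */
};

/*
 * Return true when the parameters satisfy
 *     0 < runtime <= deadline <= period
 * with a zero period treated as equal to the deadline (assumption).
 */
static bool demo_dl_params_ok(const struct demo_dl_params *p)
{
    assert(p != NULL);

    uint64_t period = p->period ? p->period : p->deadline;

    return p->runtime > 0 &&
           p->runtime <= p->deadline &&
           p->deadline <= period;
}
```

For example, 10 ms of runtime with a 30 ms deadline and a 100 ms period
passes, whereas a runtime larger than the deadline does not.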

sched_getattr() queries the scheduling policy currently applied
to the process identified by pid. If pid equals zero, the
policy of the calling process will be retrieved.

The size argument should reflect the size of struct sched_attr
as known to userspace. The kernel sets sched_attr::size to
the size of its sched_attr structure. If the user-provided
structure is larger, the additional fields are not touched. If
the user-provided structure is smaller, but the kernel needs to
return values outside the provided space, the syscall fails
with -E2BIG.
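
The read direction can be sketched the same way. The function below is a
hypothetical user-space model (demo_read_attr is not a kernel symbol) of
the rule just described: a larger caller buffer keeps its extra bytes
untouched, and a smaller one is only acceptable if nothing non-zero would
be cut off.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy a "kernel" attribute image of ksize bytes into a caller buffer
 * of usize bytes. Returns 0 on success, or -E2BIG if non-zero data
 * would be lost because the caller's buffer is too small.
 */
static int demo_read_attr(unsigned char *ubuf, size_t usize,
                          const unsigned char *kbuf, size_t ksize)
{
    assert(ubuf != NULL && kbuf != NULL);

    if (usize < ksize) {
        /* Older caller: the part it cannot see must be all zero. */
        for (size_t i = usize; i < ksize; i++)
            if (kbuf[i] != 0)
                return -E2BIG;
    }

    /* Never write past what the kernel knows; extra bytes stay as-is. */
    memcpy(ubuf, kbuf, usize < ksize ? usize : ksize);
    return 0;
}
```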

The other sched_attr fields are filled out as described in
sched_setattr().


${insert SCHED_* descriptions}

SCHED_DEADLINE: Sporadic task model deadline scheduling
SCHED_DEADLINE is an implementation of GEDF (Global Earliest
Deadline First) with additional CBS (Constant Bandwidth Server).
The CBS guarantees that tasks that over-run their specified
budget are throttled and do not affect the correct performance
of other SCHED_DEADLINE tasks.

SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN.

Setting SCHED_DEADLINE can fail with -EINVAL when admission
control tests fail.
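
To give a feel for what an admission control test can look like, here is
a deliberately simplified utilization check: a new task is admitted only
if the sum of runtime/period over all tasks stays within a per-CPU limit
times the number of CPUs. All names are hypothetical, the fixed-point
shift is an arbitrary choice, and the kernel's actual test may well differ
from this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BW_SHIFT 20  /* hypothetical fixed-point precision */

/* Hypothetical task descriptor: budget and period in the same unit. */
struct demo_task {
    uint64_t runtime;
    uint64_t period;
};

/* runtime/period as a fixed-point fraction; runtime must be < 2^44. */
static uint64_t demo_bw(uint64_t runtime, uint64_t period)
{
    assert(period != 0);
    return (runtime << DEMO_BW_SHIFT) / period;
}

/*
 * Admit newt only if the total bandwidth of the existing tasks plus
 * the new one does not exceed cpus * limit_bw.
 */
static bool demo_admit(const struct demo_task *tasks, size_t n,
                       const struct demo_task *newt,
                       unsigned int cpus, uint64_t limit_bw)
{
    assert(newt != NULL && (tasks != NULL || n == 0));

    uint64_t total = demo_bw(newt->runtime, newt->period);

    for (size_t i = 0; i < n; i++)
        total += demo_bw(tasks[i].runtime, tasks[i].period);

    return total <= (uint64_t)cpus * limit_bw;
}
```

With two existing tasks at 40% each and a 95% per-CPU limit, a third task
asking for 30% is refused on one CPU but fits on two.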

${NOTE: should we change that to -EBUSY ? }


Other than that it's pretty much the same as the existing
SCHED_SETSCHEDULER(2) page.

2014-01-21 15:46:36

by Peter Zijlstra

Subject: Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI

On Tue, Jan 21, 2014 at 04:38:51PM +0100, Peter Zijlstra wrote:
> SCHED_DEADLINE: Sporadic task model deadline scheduling
> SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> Deadline First) with additional CBS (Constant Bandwidth Server).

We might want to re-word that to:

SCHED_DEADLINE currently is an implementation of GEDF, however
any policy that correctly schedules the sporadic task model is
a valid implementation.

That should make it clear that nobody can rely on the actual
implementation; there are many possible algorithms to schedule the
sporadic task model.

> The CBS guarantees that tasks that over-run their specified
> budget are throttled and do not affect the correct performance
> of other SCHED_DEADLINE tasks.
>
> SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
>
> Setting SCHED_DEADLINE can fail with -EINVAL when admission
> control tests fail.
>

2014-01-21 16:03:06

by Steven Rostedt

Subject: Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI

On Tue, 21 Jan 2014 16:46:03 +0100
Peter Zijlstra <[email protected]> wrote:

> On Tue, Jan 21, 2014 at 04:38:51PM +0100, Peter Zijlstra wrote:
> > SCHED_DEADLINE: Sporadic task model deadline scheduling
> > SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> > Deadline First) with additional CBS (Constant Bandwidth Server).
>
> We might want to re-word that to:
>
> SCHED_DEADLINE currently is an implementation of GEDF, however
> any policy that correctly schedules the sporadic task model is
> a valid implementation.
>
> To make sure we should not rely on the actual implementation; there's
> many possible algorithms to schedule the sporadic task model.

Probably should post some links to GEDF documentation too?

-- Steve

>
> > The CBS guarantees that tasks that over-run their specified
> > budget are throttled and do not affect the correct performance
> > of other SCHED_DEADLINE tasks.
> >
> > SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
> >
> > Setting SCHED_DEADLINE can fail with -EINVAL when admission
> > control tests fail.
> >

2014-01-21 16:07:23

by Peter Zijlstra

Subject: Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI

On Tue, Jan 21, 2014 at 11:02:55AM -0500, Steven Rostedt wrote:
> On Tue, 21 Jan 2014 16:46:03 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Tue, Jan 21, 2014 at 04:38:51PM +0100, Peter Zijlstra wrote:
> > > SCHED_DEADLINE: Sporadic task model deadline scheduling
> > > SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> > > Deadline First) with additional CBS (Constant Bandwidth Server).
> >
> > We might want to re-word that to:
> >
> > SCHED_DEADLINE currently is an implementation of GEDF, however
> > any policy that correctly schedules the sporadic task model is
> > a valid implementation.
> >
> > To make sure we should not rely on the actual implementation; there's
> > many possible algorithms to schedule the sporadic task model.
>
> Probably should post some links to GEDF documentation too?

At best I think we can do something like:

SEE ALSO
Documentation/scheduler/sched_deadline.txt in the Linux kernel
source tree (since kernel 3.14).

Possibly also an ISBN for a good scheduling theory book (if such a thing
exists), but I would have to rely on others to provide one, as my
shelves are devoid of such material.

2014-01-21 16:46:20

by Juri Lelli

Subject: Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI

On 01/21/2014 05:06 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 11:02:55AM -0500, Steven Rostedt wrote:
>> On Tue, 21 Jan 2014 16:46:03 +0100
>> Peter Zijlstra <[email protected]> wrote:
>>
>>> On Tue, Jan 21, 2014 at 04:38:51PM +0100, Peter Zijlstra wrote:
>>>> SCHED_DEADLINE: Sporadic task model deadline scheduling
>>>> SCHED_DEADLINE is an implementation of GEDF (Global Earliest
>>>> Deadline First) with additional CBS (Constant Bandwidth Server).
>>>
>>> We might want to re-word that to:
>>>
>>> SCHED_DEADLINE currently is an implementation of GEDF, however
>>> any policy that correctly schedules the sporadic task model is
>>> a valid implementation.
>>>
>>> To make sure we should not rely on the actual implementation; there's
>>> many possible algorithms to schedule the sporadic task model.
>>
>> Probably should post some links to GEDF documentation too?
>
> At best I think we can do something like:
>
> SEE ALSO
> Documentation/scheduler/sched_deadline.txt in the Linux kernel
> source tree (since kernel 3.14).
>
> Possibly also an ISBN for a good scheduling theory book (if there exists
> such a thing), but I would have to rely on others to provide such as my
> shelfs are devoid of such material.
>

Well, picking just one is not that easy, I'd say (among many others):

- Handbook of Scheduling: Algorithms, Models, and Performance Analysis
by Joseph Y-T. Leung, James H. Anderson - ISBN-10: 1584883979
(especially chap. 30);
- Hard Real-Time Computing Systems by Giorgio C. Buttazzo
ISBN 978-1-4614-0675-4 (even if it is more about UP);
- A survey of hard real-time scheduling for multiprocessor systems
by RI Davis, A Burns - ACM Computing Surveys (CSUR), 2011
(available at http://www-users.cs.york.ac.uk/~robdavis/papers/MPSurveyv5.0.pdf).

The last one is probably the best choice (as it is freely downloadable).
We should add something in the documentation too.

Thanks,

- Juri

Subject: Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI

Peter, Dario,

This is a little late in the day, but I think it's an important point
to just check before this API goes final.

> SYNOPSIS
> #include <sched.h>
>
> struct sched_attr {
> u32 size;
>
> u32 sched_policy;
> u64 sched_flags;
[...]
> };
>
> int sched_setattr(pid_t pid, const struct sched_attr *attr);
>
> int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size);

So, I see that there's a flags field in the structure, which allows for
some extensibility for these calls in the future. However, is it
worthwhile to consider adding a 'flags' argument in the base signature
of either of these calls, to allow for some possible extensions in the
future? (See http://lwn.net/SubscriberLink/585415/7b905c0248a158a2/ ).

Cheers,

Michael

2014-02-14 16:20:05

by Peter Zijlstra

Subject: Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI

On Fri, Feb 14, 2014 at 03:13:22PM +0100, Michael Kerrisk (man-pages) wrote:
> Peter, Dario,
>
> This is a little late in the day, but I think it's an important point
> to just check before this API goes final.
>
> > SYNOPSIS
> > #include <sched.h>
> >
> > struct sched_attr {
> > u32 size;
> >
> > u32 sched_policy;
> > u64 sched_flags;
> [...]
> > };
> >
> > int sched_setattr(pid_t pid, const struct sched_attr *attr);
> >
> > int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size);
>
> So, I see that there's a flags field in the structure, which allows for
> some extensibility for these calls in the future. However, is it
> worthwhile to consider adding a 'flags' argument in the base signature
> of either of these calls, to allow for some possible extensions in the
> future? (See http://lwn.net/SubscriberLink/585415/7b905c0248a158a2/ ).

Sure why not.. I picked 'unsigned long' for the flags argument; I don't
think there's a real standard for this, I've seen: 'int' 'unsigned int'
and 'unsigned long' flags.

Please holler if there is indeed a preference and I picked the wrong
one.

BTW, do you need more text on the manpage thingy I sent you, or was that
sufficient?
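
For anyone wanting to poke at this from userspace, a minimal sketch of
raw syscall(2) wrappers with the extra flags argument follows. The
struct demo_sched_attr mirror of the draft layout and the demo_* names are
hypothetical (no libc wrapper is assumed), SYS_sched_setattr and
SYS_sched_getattr are assumed to come from sufficiently new kernel
headers, and flags must currently be passed as 0.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical user-space mirror of the draft layout; not a libc type. */
struct demo_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;        /* SCHED_NORMAL, SCHED_BATCH */
    uint32_t sched_priority;    /* SCHED_FIFO, SCHED_RR */
    uint64_t sched_runtime;     /* SCHED_DEADLINE */
    uint64_t sched_deadline;
    uint64_t sched_period;
};

/* Zero the structure and fill in the two fields every call needs. */
static void demo_attr_init(struct demo_sched_attr *attr, uint32_t policy)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->size = (uint32_t)sizeof(*attr);
    attr->sched_policy = policy;
}

/* Raw wrapper; returns 0 on success, -1 with errno set on failure. */
static int demo_sched_setattr(pid_t pid, const struct demo_sched_attr *attr,
                              unsigned long flags)
{
    return (int)syscall(SYS_sched_setattr, pid, attr, flags);
}

/* Raw wrapper; returns 0 on success, -1 with errno set on failure. */
static int demo_sched_getattr(pid_t pid, struct demo_sched_attr *attr,
                              unsigned int size, unsigned long flags)
{
    return (int)syscall(SYS_sched_getattr, pid, attr, size, flags);
}
```

A caller would typically run demo_attr_init(&attr, policy), fill in the
policy-specific fields, and then call demo_sched_setattr(0, &attr, 0) for
the calling thread; any non-zero flags value is expected to be refused.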

---
Subject: sched: Add 'flags' argument to sched_{set,get}attr() syscalls

Because of a recent syscall design debate, it's deemed appropriate for
each syscall to have a flags argument for future extension, without
immediately requiring new syscalls.

Suggested-by: Michael Kerrisk <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/syscalls.h | 6 ++++--
kernel/sched/core.c | 11 ++++++-----
2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40ed9e9a77e5..bf41aeb09078 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -281,13 +281,15 @@ asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
asmlinkage long sys_sched_setparam(pid_t pid,
struct sched_param __user *param);
asmlinkage long sys_sched_setattr(pid_t pid,
- struct sched_attr __user *attr);
+ struct sched_attr __user *attr,
+ unsigned long flags);
asmlinkage long sys_sched_getscheduler(pid_t pid);
asmlinkage long sys_sched_getparam(pid_t pid,
struct sched_param __user *param);
asmlinkage long sys_sched_getattr(pid_t pid,
struct sched_attr __user *attr,
- unsigned int size);
+ unsigned int size,
+ unsigned long flags);
asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
unsigned long __user *user_mask_ptr);
asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fb9764fbc537..deeaa54fdf92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3631,13 +3631,14 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
* @pid: the pid in question.
* @uattr: structure containing the extended parameters.
*/
-SYSCALL_DEFINE2(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr)
+SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
+ unsigned long, flags)
{
struct sched_attr attr;
struct task_struct *p;
int retval;

- if (!uattr || pid < 0)
+ if (!uattr || pid < 0 || flags)
return -EINVAL;

if (sched_copy_attr(uattr, &attr))
@@ -3774,8 +3775,8 @@ static int sched_read_attr(struct sched_attr __user *uattr,
* @uattr: structure containing the extended parameters.
* @size: sizeof(attr) for fwd/bwd comp.
*/
-SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
- unsigned int, size)
+SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
+ unsigned int, size, unsigned long, flags)
{
struct sched_attr attr = {
.size = sizeof(struct sched_attr),
@@ -3784,7 +3785,7 @@ SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
int retval;

if (!uattr || pid < 0 || size > PAGE_SIZE ||
- size < SCHED_ATTR_SIZE_VER0)
+ size < SCHED_ATTR_SIZE_VER0 || flags)
return -EINVAL;

rcu_read_lock();

2014-02-15 12:52:58

by Ingo Molnar

Subject: Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI


* Peter Zijlstra <[email protected]> wrote:

> > > SYNOPSIS
> > > #include <sched.h>
> > >
> > > struct sched_attr {
> > > u32 size;
> > >
> > > u32 sched_policy;
> > > u64 sched_flags;
> > [...]
> > > };
> > >
> > > int sched_setattr(pid_t pid, const struct sched_attr *attr);
> > >
> > > int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size);
> >
> > So, I see that there's a flags field in the structure, which allows for
> > some extensibility for these calls in the future. However, is it
> > worthwhile to consider adding a 'flags' argument in the base signature
> > of either of these calls, to allow for some possible extensions in the
> > future. (See http://lwn.net/SubscriberLink/585415/7b905c0248a158a2/ ).
>
> Sure why not.. I picked 'unsigned long' for the flags argument; I
> don't think there's a real standard for this, I've seen: 'int'
> 'unsigned int' and 'unsigned long' flags.
>
> Please holler if there is indeed a preference and I picked the wrong
> one.

Yo!

So, since this is an ABI, if it's a true 64-bit flags value then
please make it u64 - and 'unsigned int' or u32 otherwise. I don't
think we have many (any?) 'long' argument syscall ABIs.

'unsigned long' is generally a bad choice because it's u32 on 32-bit
platforms and u64 on 64-bit platforms.

Now, for syscall argument ABIs it's not a lethal mistake to make (as
compared to say ABI data structures), because syscall arguments have
their own types and width anyway, so any definition mistake can
usually be fixed after the fact.
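
To make the width difference concrete, here is a minimal sketch (not from the patch; the helper names ulong_bits() and print_flag_type_widths() are made up for illustration). It reports how wide 'unsigned long' is for whatever ABI it is compiled for, next to the fixed-width types, which do not change:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Width in bits of 'unsigned long' for the ABI this file is compiled for. */
static unsigned int ulong_bits(void)
{
	return (unsigned int)(sizeof(unsigned long) * CHAR_BIT);
}

/* Fixed-width types have the same size under every ABI. */
static void print_flag_type_widths(void)
{
	printf("unsigned long: %u bits\n", ulong_bits());
	printf("uint32_t:      %u bits\n",
	       (unsigned int)(sizeof(uint32_t) * CHAR_BIT));
	printf("uint64_t:      %u bits\n",
	       (unsigned int)(sizeof(uint64_t) * CHAR_BIT));
}
```

Building the same file with and without -m32 on an x86-64 toolchain should show the first line change while the other two stay put.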

Thanks,

Ingo

Subject: Re: [PATCH 01/13] sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI

Hello Peter,

On 02/14/2014 05:19 PM, Peter Zijlstra wrote:
> On Fri, Feb 14, 2014 at 03:13:22PM +0100, Michael Kerrisk (man-pages) wrote:
>> Peter, Dario,
>>
>> This is a little late in the day, but I think it's an important point
>> to just check before this API goes final.
>>
>>> SYNOPSIS
>>> #include <sched.h>
>>>
>>> struct sched_attr {
>>> u32 size;
>>>
>>> u32 sched_policy;
>>> u64 sched_flags;
>> [...]
>>> };
>>>
>>> int sched_setattr(pid_t pid, const struct sched_attr *attr);
>>>
>>> int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size);
>>
>> So, I see that there's a flags field in the structure, which allows for
>> some extensibility for these calls in the future. However, is it
>> worthwhile to consider adding a 'flags' argument in the base signature
>> of either of these calls, to allow for some possible extensions in the
>> future. (See http://lwn.net/SubscriberLink/585415/7b905c0248a158a2/ ).
>
> Sure why not..

Well, it doesn't need to be added gratuitously -- just if you think there's
some nonzero chance it might prove useful in the future ;-).

> I picked 'unsigned long' for the flags argument; I don't
> think there's a real standard for this, I've seen: 'int' 'unsigned int'
> and 'unsigned long' flags.
>
> Please holler if there is indeed a preference and I picked the wrong
> one.
>
> BTW; do you need more text on the manpage thingy I send you or was that
> sufficient?

If you could take another pass through your existing text, to incorporate the
new flags stuff, and then send a page to me + linux-man@
that would be great.

Cheers,

Michael


> ---
> Subject: sched: Add 'flags' argument to sched_{set,get}attr() syscalls
>
> Because of a recent syscall design debate; its deemed appropriate for
> each syscall to have a flags argument for future extension; without
> immediately requiring new syscalls.
>
> Suggested-by: Michael Kerrisk <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> include/linux/syscalls.h | 6 ++++--
> kernel/sched/core.c | 11 ++++++-----
> 2 files changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 40ed9e9a77e5..bf41aeb09078 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -281,13 +281,15 @@ asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
> asmlinkage long sys_sched_setparam(pid_t pid,
> struct sched_param __user *param);
> asmlinkage long sys_sched_setattr(pid_t pid,
> - struct sched_attr __user *attr);
> + struct sched_attr __user *attr,
> + unsigned long flags);
> asmlinkage long sys_sched_getscheduler(pid_t pid);
> asmlinkage long sys_sched_getparam(pid_t pid,
> struct sched_param __user *param);
> asmlinkage long sys_sched_getattr(pid_t pid,
> struct sched_attr __user *attr,
> - unsigned int size);
> + unsigned int size,
> + unsigned long flags);
> asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
> unsigned long __user *user_mask_ptr);
> asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fb9764fbc537..deeaa54fdf92 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3631,13 +3631,14 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
> * @pid: the pid in question.
> * @uattr: structure containing the extended parameters.
> */
> -SYSCALL_DEFINE2(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr)
> +SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
> + unsigned long, flags)
> {
> struct sched_attr attr;
> struct task_struct *p;
> int retval;
>
> - if (!uattr || pid < 0)
> + if (!uattr || pid < 0 || flags)
> return -EINVAL;
>
> if (sched_copy_attr(uattr, &attr))
> @@ -3774,8 +3775,8 @@ static int sched_read_attr(struct sched_attr __user *uattr,
> * @uattr: structure containing the extended parameters.
> * @size: sizeof(attr) for fwd/bwd comp.
> */
> -SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> - unsigned int, size)
> +SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> + unsigned int, size, unsigned long, flags)
> {
> struct sched_attr attr = {
> .size = sizeof(struct sched_attr),
> @@ -3784,7 +3785,7 @@ SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> int retval;
>
> if (!uattr || pid < 0 || size > PAGE_SIZE ||
> - size < SCHED_ATTR_SIZE_VER0)
> + size < SCHED_ATTR_SIZE_VER0 || flags)
> return -EINVAL;
>
> rcu_read_lock();
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: [tip:sched/urgent] sched: Add 'flags' argument to sched_{set,get}attr() syscalls

Commit-ID: 6d35ab48090b10c5ea5604ed5d6e91f302dc6060
Gitweb: http://git.kernel.org/tip/6d35ab48090b10c5ea5604ed5d6e91f302dc6060
Author: Peter Zijlstra <[email protected]>
AuthorDate: Fri, 14 Feb 2014 17:19:29 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 21 Feb 2014 21:27:10 +0100

sched: Add 'flags' argument to sched_{set,get}attr() syscalls

Because of a recent syscall design debate; its deemed appropriate for
each syscall to have a flags argument for future extension; without
immediately requiring new syscalls.

Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Suggested-by: Michael Kerrisk <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/syscalls.h | 6 ++++--
kernel/sched/core.c | 11 ++++++-----
2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40ed9e9..a747a77 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -281,13 +281,15 @@ asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
asmlinkage long sys_sched_setparam(pid_t pid,
struct sched_param __user *param);
asmlinkage long sys_sched_setattr(pid_t pid,
- struct sched_attr __user *attr);
+ struct sched_attr __user *attr,
+ unsigned int flags);
asmlinkage long sys_sched_getscheduler(pid_t pid);
asmlinkage long sys_sched_getparam(pid_t pid,
struct sched_param __user *param);
asmlinkage long sys_sched_getattr(pid_t pid,
struct sched_attr __user *attr,
- unsigned int size);
+ unsigned int size,
+ unsigned int flags);
asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
unsigned long __user *user_mask_ptr);
asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a6e7470..6edbef2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3661,13 +3661,14 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
* @pid: the pid in question.
* @uattr: structure containing the extended parameters.
*/
-SYSCALL_DEFINE2(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr)
+SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
+ unsigned int, flags)
{
struct sched_attr attr;
struct task_struct *p;
int retval;

- if (!uattr || pid < 0)
+ if (!uattr || pid < 0 || flags)
return -EINVAL;

if (sched_copy_attr(uattr, &attr))
@@ -3804,8 +3805,8 @@ err_size:
* @uattr: structure containing the extended parameters.
* @size: sizeof(attr) for fwd/bwd comp.
*/
-SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
- unsigned int, size)
+SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
+ unsigned int, size, unsigned int, flags)
{
struct sched_attr attr = {
.size = sizeof(struct sched_attr),
@@ -3814,7 +3815,7 @@ SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
int retval;

if (!uattr || pid < 0 || size > PAGE_SIZE ||
- size < SCHED_ATTR_SIZE_VER0)
+ size < SCHED_ATTR_SIZE_VER0 || flags)
return -EINVAL;

rcu_read_lock();

2014-04-09 09:25:45

by Peter Zijlstra

Subject: sched_{set,get}attr() manpage

On Mon, Feb 17, 2014 at 02:20:29PM +0100, Michael Kerrisk (man-pages) wrote:
> If you could take another pass through your existing text, to incorporate the
> new flags stuff, and then send a page to me + linux-man@
> that would be great.


Sorry, this slipped my mind. An updated version below. Heavy borrowing
from SCHED_SETSCHEDULER(2) as before.

---

NAME
sched_setattr, sched_getattr - set and get scheduling policy/attributes

SYNOPSIS
#include <sched.h>

struct sched_attr {
u32 size;
u32 sched_policy;
u64 sched_flags;

/* SCHED_OTHER, SCHED_BATCH */
s32 sched_nice;
/* SCHED_FIFO, SCHED_RR */
u32 sched_priority;
/* SCHED_DEADLINE */
u64 sched_runtime;
u64 sched_deadline;
u64 sched_period;
};

int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);

int sched_getattr(pid_t pid, struct sched_attr *attr, unsigned int size, unsigned int flags);

DESCRIPTION
sched_setattr() sets both the scheduling policy and the
associated attributes for the process whose ID is specified in
pid. If pid equals zero, the scheduling policy and attributes
of the calling process will be set. The interpretation of the
argument attr depends on the selected policy. Currently, Linux
supports the following "normal" (i.e., non-real-time) scheduling
policies:

SCHED_OTHER the standard "fair" time-sharing policy;

SCHED_BATCH for "batch" style execution of processes; and

SCHED_IDLE for running very low priority background jobs.

The following "real-time" policies are also supported, for
special time-critical applications that need precise control
over the way in which runnable processes are selected for
execution:

SCHED_FIFO a first-in, first-out policy;

SCHED_RR a round-robin policy; and

SCHED_DEADLINE a deadline policy.

The semantics of each of these policies are detailed below.

sched_attr::size must be set to the size of the structure, as in
sizeof(struct sched_attr). If the provided structure is smaller
than the kernel structure, any additional fields are assumed to
be '0'. If the provided structure is larger than the kernel
structure, the kernel verifies that all additional fields are
'0'; if not, the syscall will fail with -E2BIG.
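
The size rule above can be sketched in plain userspace C. This is only a model of the described behaviour, not the kernel's code: copy_attr_model() is a hypothetical helper, and both structures are treated as raw byte buffers, with 'ksize' standing in for the kernel's sizeof(struct sched_attr).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of the set-side size handling. 'user'/'usize' is the caller's
 * structure, 'kern'/'ksize' the kernel's copy. Hypothetical helper.
 */
static int copy_attr_model(const unsigned char *user, size_t usize,
			   unsigned char *kern, size_t ksize)
{
	size_t i;

	/* Start from all zeroes so fields the caller did not supply read as 0. */
	memset(kern, 0, ksize);

	if (usize > ksize) {
		/* Caller knows about newer fields: they must all be 0. */
		for (i = ksize; i < usize; i++)
			if (user[i] != 0)
				return -E2BIG;
		usize = ksize;
	}

	memcpy(kern, user, usize);
	return 0;
}
```

The point of the scheme is that old binaries keep working against a kernel with a larger structure, and new binaries work against an older kernel as long as they leave the fields it does not know about at 0.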

sched_attr::sched_policy is the desired scheduling policy.

sched_attr::sched_flags holds additional flags that can influence
scheduling behaviour. As of Linux kernel 3.14, the only supported
flag is:

SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
on fork().

sched_attr::sched_nice should only be set for SCHED_OTHER and
SCHED_BATCH; it is the desired nice value [-20,19], see NICE(2).

sched_attr::sched_priority should only be set for SCHED_FIFO and
SCHED_RR; it is the desired static priority [1,99].

sched_attr::sched_runtime
sched_attr::sched_deadline
sched_attr::sched_period should only be set for SCHED_DEADLINE
and are the traditional sporadic task model parameters.
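
As an illustration of the three sporadic task parameters, here is a small sketch. Several things in it are assumptions rather than statements from the text above: the values are taken to be in nanoseconds, the ordering runtime <= deadline <= period is assumed as the sanity condition, and struct dl_params and both helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The three SCHED_DEADLINE parameters, assumed to be in nanoseconds. */
struct dl_params {
	uint64_t runtime;   /* execution budget per instance */
	uint64_t deadline;  /* relative deadline of each instance */
	uint64_t period;    /* minimum interval between instances */
};

/* Sanity check assumed here: 0 < runtime <= deadline <= period. */
static bool dl_params_sane(const struct dl_params *p)
{
	return p->runtime > 0 &&
	       p->runtime <= p->deadline &&
	       p->deadline <= p->period;
}

/*
 * Fraction of one CPU the task may consume, in parts per million.
 * Assumes runtime is small enough that runtime * 1000000 fits in 64 bits
 * and that period is nonzero (true whenever dl_params_sane() holds).
 */
static uint64_t dl_utilization_ppm(const struct dl_params *p)
{
	return p->runtime * 1000000ULL / p->period;
}
```

For example, a task with 10 ms of runtime, a 30 ms deadline and a 100 ms period asks for one tenth of a CPU.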

The flags argument should be 0.

sched_getattr() queries the scheduling policy currently applied
to the process identified by pid. If pid equals zero, the
policy of the calling process will be retrieved.

The size argument should reflect the size of struct sched_attr
as known to userspace. The kernel sets sched_attr::size to
the size of its sched_attr structure. If the user-provided
structure is larger, additional fields are not touched. If the
user-provided structure is smaller, but the kernel needs to
return values outside the provided space, the syscall will fail
with -E2BIG.
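
The read side can be modelled the same way. Again this is a sketch of the behaviour described, not kernel code: read_attr_model() is a hypothetical helper working on raw byte buffers, and it does not model the update of the size field itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of the get-side size handling: 'kern'/'ksize' stand in for the
 * kernel's struct sched_attr, 'user'/'usize' for the caller's buffer.
 */
static int read_attr_model(const unsigned char *kern, size_t ksize,
			   unsigned char *user, size_t usize)
{
	size_t i;

	if (usize < ksize) {
		/* Data that would not fit may only be dropped if it is all 0. */
		for (i = usize; i < ksize; i++)
			if (kern[i] != 0)
				return -E2BIG;
		ksize = usize;
	}

	/* Copy what fits; bytes of 'user' beyond ksize are left untouched. */
	memcpy(user, kern, ksize);
	return 0;
}
```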

The flags argument should be 0.

The other sched_attr fields are filled out as described in
sched_setattr().

Scheduling Policies
The scheduler is the kernel component that decides which runnable
process will be executed by the CPU next. Each process has an associ‐
ated scheduling policy and a static scheduling priority, sched_prior‐
ity; these are the settings that are modified by sched_setscheduler().
The scheduler makes its decisions based on knowledge of the scheduling
policy and static priority of all processes on the system.

For processes scheduled under one of the normal scheduling policies
(SCHED_OTHER, SCHED_IDLE, SCHED_BATCH), sched_priority is not used in
scheduling decisions (it must be specified as 0).

Processes scheduled under one of the real-time policies (SCHED_FIFO,
SCHED_RR) have a sched_priority value in the range 1 (low) to 99
(high). (As the numbers imply, real-time processes always have higher
priority than normal processes.) Note well: POSIX.1-2001 only requires
an implementation to support a minimum of 32 distinct priority levels for
the real-time policies, and some systems supply just this minimum.
Portable programs should use sched_get_priority_min(2) and
sched_get_priority_max(2) to find the range of priorities supported for
a particular policy.
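
A portable range lookup might look like the following sketch; rt_priority_range() is a hypothetical helper name, while sched_get_priority_min() and sched_get_priority_max() are the calls named above.

```c
#include <assert.h>
#include <sched.h>

/*
 * Look up the static priority range for 'policy'.
 * Returns 0 on success, -1 if either call fails (errno is set).
 */
static int rt_priority_range(int policy, int *min, int *max)
{
	*min = sched_get_priority_min(policy);
	*max = sched_get_priority_max(policy);
	if (*min == -1 || *max == -1)
		return -1;
	return 0;
}
```

A program would then pick a priority relative to the returned bounds, for instance a few levels above the minimum, instead of hard-coding a number between 1 and 99.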

Conceptually, the scheduler maintains a list of runnable processes for
each possible sched_priority value. In order to determine which
process runs next, the scheduler looks for the nonempty list with the
highest static priority and selects the process at the head of this
list.
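
The conceptual model in the previous paragraph can be written down as a toy sketch. It is only the mental model, not how the kernel is implemented; the names, the fixed list capacity and the use of integer task ids are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PRIO   100  /* static priorities 0..99 */
#define MAX_QUEUED 8    /* capacity per list, arbitrary for the sketch */

/* One FIFO list of runnable task ids per static priority. */
struct runlist {
	int task[MAX_QUEUED];
	size_t len;
};

static struct runlist lists[MAX_PRIO];

/* Append 'id' at the tail of the list for 'prio'; -1 if invalid or full. */
static int enqueue_tail(int prio, int id)
{
	if (prio < 0 || prio >= MAX_PRIO || lists[prio].len == MAX_QUEUED)
		return -1;
	lists[prio].task[lists[prio].len++] = id;
	return 0;
}

/* Return the head of the highest-priority nonempty list, or -1 if idle. */
static int pick_next(void)
{
	int prio;

	for (prio = MAX_PRIO - 1; prio >= 0; prio--)
		if (lists[prio].len > 0)
			return lists[prio].task[0];
	return -1;
}
```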

A process's scheduling policy determines where it will be inserted into
the list of processes with equal static priority and how it will move
inside this list.

All scheduling is preemptive: if a process with a higher static prior‐
ity becomes ready to run, the currently running process will be pre‐
empted and returned to the wait list for its static priority level.
The scheduling policy only determines the ordering within the list of
runnable processes with equal static priority.

SCHED_DEADLINE: Sporadic task model deadline scheduling
SCHED_DEADLINE is an implementation of GEDF (Global Earliest
Deadline First) with additional CBS (Constant Bandwidth Server).
The CBS guarantees that tasks that over-run their specified
budget are throttled and do not affect the correct performance
of other SCHED_DEADLINE tasks.

SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN.

Setting SCHED_DEADLINE can fail with -EINVAL when admission
control tests fail.

SCHED_FIFO: First In-First Out scheduling
SCHED_FIFO can only be used with static priorities higher than 0, which
means that when a SCHED_FIFO process becomes runnable, it will always
immediately preempt any currently running SCHED_OTHER, SCHED_BATCH, or
SCHED_IDLE process. SCHED_FIFO is a simple scheduling algorithm with‐
out time slicing. For processes scheduled under the SCHED_FIFO policy,
the following rules apply:

* A SCHED_FIFO process that has been preempted by another process of
higher priority will stay at the head of the list for its priority
and will resume execution as soon as all processes of higher prior‐
ity are blocked again.

* When a SCHED_FIFO process becomes runnable, it will be inserted at
the end of the list for its priority.

* A call to sched_setscheduler() or sched_setparam(2) will put the
SCHED_FIFO (or SCHED_RR) process identified by pid at the start of
the list if it was runnable. As a consequence, it may preempt the
currently running process if it has the same priority.
(POSIX.1-2001 specifies that the process should go to the end of the
list.)

* A process calling sched_yield(2) will be put at the end of the list.

No other events will move a process scheduled under the SCHED_FIFO pol‐
icy in the wait list of runnable processes with equal static priority.

A SCHED_FIFO process runs until either it is blocked by an I/O request,
it is preempted by a higher priority process, or it calls
sched_yield(2).

SCHED_RR: Round Robin scheduling
SCHED_RR is a simple enhancement of SCHED_FIFO. Everything described
above for SCHED_FIFO also applies to SCHED_RR, except that each process
is only allowed to run for a maximum time quantum. If a SCHED_RR
process has been running for a time period equal to or longer than the
time quantum, it will be put at the end of the list for its priority.
A SCHED_RR process that has been preempted by a higher priority process
and subsequently resumes execution as a running process will complete
the unexpired portion of its round robin time quantum. The length of
the time quantum can be retrieved using sched_rr_get_interval(2).
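
Retrieving the quantum could look like this sketch; rr_quantum_ns() is a hypothetical helper around the real sched_rr_get_interval(2) call, and the conversion to nanoseconds is just a convenience.

```c
#include <assert.h>
#include <sched.h>
#include <sys/types.h>
#include <time.h>

/*
 * Fetch the round-robin quantum for 'pid' (0 means the caller) in
 * nanoseconds. Returns 0 on success, -1 on error with errno set.
 */
static int rr_quantum_ns(pid_t pid, long long *ns)
{
	struct timespec ts;

	if (sched_rr_get_interval(pid, &ts) == -1)
		return -1;
	*ns = (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
	return 0;
}
```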

SCHED_OTHER: Default Linux time-sharing scheduling
SCHED_OTHER can only be used at static priority 0. SCHED_OTHER is the
standard Linux time-sharing scheduler that is intended for all pro‐
cesses that do not require the special real-time mechanisms. The
process to run is chosen from the static priority 0 list based on a
dynamic priority that is determined only inside this list. The dynamic
priority is based on the nice value (set by nice(2) or setpriority(2))
and increased for each time quantum the process is ready to run, but
denied to run by the scheduler. This ensures fair progress among all
SCHED_OTHER processes.

SCHED_BATCH: Scheduling batch processes
(Since Linux 2.6.16.) SCHED_BATCH can only be used at static priority
0. This policy is similar to SCHED_OTHER in that it schedules the
process according to its dynamic priority (based on the nice value).
The difference is that this policy will cause the scheduler to always
assume that the process is CPU-intensive. Consequently, the scheduler
will apply a small scheduling penalty with respect to wakeup behaviour,
so that this process is mildly disfavored in scheduling decisions.

This policy is useful for workloads that are noninteractive, but do not
want to lower their nice value, and for workloads that want a determin‐
istic scheduling policy without interactivity causing extra preemptions
(between the workload's tasks).

SCHED_IDLE: Scheduling very low priority jobs
(Since Linux 2.6.23.) SCHED_IDLE can only be used at static priority
0; the process nice value has no influence for this policy.

This policy is intended for running jobs at extremely low priority
(lower even than a +19 nice value with the SCHED_OTHER or SCHED_BATCH
policies).

RETURN VALUE
On success, sched_setattr() and sched_getattr() return 0. On
error, -1 is returned, and errno is set appropriately.
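
A caller without a C library wrapper could reach the call through syscall(2), which gives exactly this return convention. The sketch below rests on some assumptions: the installed headers define SYS_sched_getattr, and the structure is mirrored locally under the hypothetical name struct demo_sched_attr so that it cannot clash with any definition in system headers. demo_sched_getattr() is likewise a made-up name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local mirror of the struct sched_attr layout from the SYNOPSIS. */
struct demo_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

/* Thin wrapper: returns 0 on success, -1 with errno set on failure. */
static int demo_sched_getattr(pid_t pid, struct demo_sched_attr *attr,
			      unsigned int size, unsigned int flags)
{
#ifdef SYS_sched_getattr
	return (int)syscall(SYS_sched_getattr, pid, attr, size, flags);
#else
	(void)pid; (void)attr; (void)size; (void)flags;
	errno = ENOSYS;   /* headers predate the syscall */
	return -1;
#endif
}
```

Calling demo_sched_getattr(0, &attr, sizeof(attr), 0) queries the calling thread; passing a nonzero flags value is expected to fail with errno set to EINVAL, matching the flags check in the patch.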

ERRORS
EINVAL The scheduling policy is not one of the recognized policies,
attr is NULL, or attr does not make sense for the policy.

EPERM The calling process does not have appropriate privileges.

ESRCH The process whose ID is pid could not be found.

E2BIG The provided storage for struct sched_attr is either too
big, see sched_setattr(), or too small, see sched_getattr().

NOTES
While the text above (and in SCHED_SETSCHEDULER(2)) talks about
processes, in actual fact these system calls are thread specific.

2014-04-09 15:22:46

by Henrik Austad

Subject: Re: sched_{set,get}attr() manpage

On Wed, Apr 09, 2014 at 11:25:10AM +0200, Peter Zijlstra wrote:
> On Mon, Feb 17, 2014 at 02:20:29PM +0100, Michael Kerrisk (man-pages) wrote:
> > If you could take another pass through your existing text, to incorporate the
> > new flags stuff, and then send a page to me + linux-man@
> > that would be great.
>
>
> Sorry, this slipped my mind. An updated version below. Heavy borrowing
> from SCHED_SETSCHEDULER(2) as before.
>
> ---
>
> NAME
> sched_setattr, sched_getattr - set and get scheduling policy/attributes
>
> SYNOPSIS
> #include <sched.h>
>
> struct sched_attr {
> u32 size;
> u32 sched_policy;
> u64 sched_flags;
>
> /* SCHED_NORMAL, SCHED_BATCH */
> s32 sched_nice;
> /* SCHED_FIFO, SCHED_RR */
> u32 sched_priority;
> /* SCHED_DEADLINE */
> u64 sched_runtime;
> u64 sched_deadline;
> u64 sched_period;
> };
> int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);
>
> int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);
>
> DESCRIPTION
> sched_setattr() sets both the scheduling policy and the
> associated attributes for the process whose ID is specified in
> pid. If pid equals zero, the scheduling policy and attributes
> of the calling process will be set. The interpretation of the
> argument attr depends on the selected policy. Currently, Linux
> supports the following "normal" (i.e., non-real-time) scheduling
> policies:
>
> SCHED_OTHER the standard "fair" time-sharing policy;
>
> SCHED_BATCH for "batch" style execution of processes; and
>
> SCHED_IDLE for running very low priority background jobs.
>
> The following "real-time" policies are also supported, for

why the "'s?

> special time-critical applications that need precise control
> over the way in which runnable processes are selected for
> execution:
>
> SCHED_FIFO a first-in, first-out policy;
>
> SCHED_RR a round-robin policy; and
>
> SCHED_DEADLINE a deadline policy.
>
> The semantics of each of these policies are detailed below.
>
> sched_attr::size must be set to the size of the structure, as in
> sizeof(struct sched_attr), if the provided structure is smaller
> than the kernel structure, any additional fields are assumed
> '0'. If the provided structure is larger than the kernel
> structure, the kernel verifies all additional fields are '0' if
> not the syscall will fail with -E2BIG.
>
> sched_attr::sched_policy the desired scheduling policy.
>
> sched_attr::sched_flags additional flags that can influence
> scheduling behaviour. Currently as per Linux kernel 3.14:
>
> SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
> to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
> on fork().
>
> is the only supported flag.
>
> sched_attr::sched_nice should only be set for SCHED_OTHER,
> SCHED_BATCH, the desired nice value [-20,19], see NICE(2).
>
> sched_attr::sched_priority should only be set for SCHED_FIFO,
> SCHED_RR, the desired static priority [1,99].
>
> sched_attr::sched_runtime
> sched_attr::sched_deadline
> sched_attr::sched_period should only be set for SCHED_DEADLINE
> and are the traditional sporadic task model parameters.
>
> The flags argument should be 0.
>
> sched_getattr() queries the scheduling policy currently applied
> to the process identified by pid. If pid equals zero, the
> policy of the calling process will be retrieved.
>
> The size argument should reflect the size of struct sched_attr
> as known to userspace. The kernel fills out sched_attr::size to
> the size of its sched_attr structure. If the user provided
> structure is larger, additional fields are not touched. If the
> user provided structure is smaller, but the kernel needs to
> return values outside the provided space, the syscall will fail
> with -E2BIG.
>
> The flags argument should be 0.

What about SCHED_FLAG_RESET_ON_FORK?

> The other sched_attr fields are filled out as described in
> sched_setattr().
>
> Scheduling Policies
> The scheduler is the kernel component that decides which runnable
> process will be executed by the CPU next. Each process has an associ‐
> ated scheduling policy and a static scheduling priority, sched_prior‐
> ity; these are the settings that are modified by sched_setscheduler().
> The scheduler makes it decisions based on knowledge of the scheduling
> policy and static priority of all processes on the system.

Isn't this last sentence redundant/slightly repetitive?

> For processes scheduled under one of the normal scheduling policies
> (SCHED_OTHER, SCHED_IDLE, SCHED_BATCH), sched_priority is not used in
> scheduling decisions (it must be specified as 0).
>
> Processes scheduled under one of the real-time policies (SCHED_FIFO,
> SCHED_RR) have a sched_priority value in the range 1 (low) to 99
> (high). (As the numbers imply, real-time processes always have higher
> priority than normal processes.) Note well: POSIX.1-2001 only requires
> an implementation to support a minimum 32 distinct priority levels for
> the real-time policies, and some systems supply just this minimum.
> Portable programs should use sched_get_priority_min(2) and
> sched_get_priority_max(2) to find the range of priorities supported for
> a particular policy.
>
> Conceptually, the scheduler maintains a list of runnable processes for
> each possible sched_priority value. In order to determine which
> process runs next, the scheduler looks for the nonempty list with the
> highest static priority and selects the process at the head of this
> list.
>
> A process's scheduling policy determines where it will be inserted into
> the list of processes with equal static priority and how it will move
> inside this list.
>
> All scheduling is preemptive: if a process with a higher static prior‐
> ity becomes ready to run, the currently running process will be pre‐
> empted and returned to the wait list for its static priority level.
> The scheduling policy only determines the ordering within the list of
> runnable processes with equal static priority.
>
> SCHED_DEADLINE: Sporadic task model deadline scheduling
> SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> Deadline First) with additional CBS (Constant Bandwidth Server).
> The CBS guarantees that tasks that over-run their specified
> budget are throttled and do not affect the correct performance
> of other SCHED_DEADLINE tasks.
>
> SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
>
> Setting SCHED_DEADLINE can fail with -EINVAL when admission
> control tests fail.

Perhaps add a note about the deadline-class having higher priority than the
other classes; i.e. if a deadline-task is runnable, it will preempt any
other SCHED_(RR|FIFO) regardless of priority?

> SCHED_FIFO: First In-First Out scheduling
> SCHED_FIFO can only be used with static priorities higher than 0, which
> means that when a SCHED_FIFO processes becomes runnable, it will always
> immediately preempt any currently running SCHED_OTHER, SCHED_BATCH, or
> SCHED_IDLE process. SCHED_FIFO is a simple scheduling algorithm with‐
> out time slicing. For processes scheduled under the SCHED_FIFO policy,
> the following rules apply:
>
> * A SCHED_FIFO process that has been preempted by another process of
> higher priority will stay at the head of the list for its priority
> and will resume execution as soon as all processes of higher prior‐
> ity are blocked again.
>
> * When a SCHED_FIFO process becomes runnable, it will be inserted at
> the end of the list for its priority.
>
> * A call to sched_setscheduler() or sched_setparam(2) will put the
> SCHED_FIFO (or SCHED_RR) process identified by pid at the start of
> the list if it was runnable. As a consequence, it may preempt the
> currently running process if it has the same priority.
> (POSIX.1-2001 specifies that the process should go to the end of the
> list.)
>
> * A process calling sched_yield(2) will be put at the end of the list.

How about the recent discussion regarding sched_yield(). Is this correct?

lkml.kernel.org/r/[email protected]

Is this the correct place to add a note explaining the potential pitfalls
of using sched_yield?

> No other events will move a process scheduled under the SCHED_FIFO pol‐
> icy in the wait list of runnable processes with equal static priority.
>
> A SCHED_FIFO process runs until either it is blocked by an I/O request,
> it is preempted by a higher priority process, or it calls
> sched_yield(2).
>
> SCHED_RR: Round Robin scheduling
> SCHED_RR is a simple enhancement of SCHED_FIFO. Everything described
> above for SCHED_FIFO also applies to SCHED_RR, except that each process
> is only allowed to run for a maximum time quantum. If a SCHED_RR
> process has been running for a time period equal to or longer than the
> time quantum, it will be put at the end of the list for its priority.
> A SCHED_RR process that has been preempted by a higher priority process
> and subsequently resumes execution as a running process will complete
> the unexpired portion of its round robin time quantum. The length of
> the time quantum can be retrieved using sched_rr_get_interval(2).

-> Default is 0.1 * HZ jiffies, i.e. 100 ms

This is a question I get from time to time, having this in the manpage
would be helpful.

> SCHED_OTHER: Default Linux time-sharing scheduling
> SCHED_OTHER can only be used at static priority 0. SCHED_OTHER is the
> standard Linux time-sharing scheduler that is intended for all pro‐
> cesses that do not require the special real-time mechanisms. The
> process to run is chosen from the static priority 0 list based on a
> dynamic priority that is determined only inside this list. The dynamic
> priority is based on the nice value (set by nice(2) or setpriority(2))
> and increased for each time quantum the process is ready to run, but
> denied to run by the scheduler. This ensures fair progress among all
> SCHED_OTHER processes.
>
> SCHED_BATCH: Scheduling batch processes
> (Since Linux 2.6.16.) SCHED_BATCH can only be used at static priority
> 0. This policy is similar to SCHED_OTHER in that it schedules the
> process according to its dynamic priority (based on the nice value).
> The difference is that this policy will cause the scheduler to always
> assume that the process is CPU-intensive. Consequently, the scheduler
> will apply a small scheduling penalty with respect to wakeup behaviour,
> so that this process is mildly disfavored in scheduling decisions.
>
> This policy is useful for workloads that are noninteractive, but do not
> want to lower their nice value, and for workloads that want a determin‐
> istic scheduling policy without interactivity causing extra preemptions
> (between the workload's tasks).
>
> SCHED_IDLE: Scheduling very low priority jobs
> (Since Linux 2.6.23.) SCHED_IDLE can only be used at static priority
> 0; the process nice value has no influence for this policy.
>
> This policy is intended for running jobs at extremely low priority
> (lower even than a +19 nice value with the SCHED_OTHER or SCHED_BATCH
> policies).
>
> RETURN VALUE
> On success, sched_setattr() and sched_getattr() return 0. On
> error, -1 is returned, and errno is set appropriately.
>
> ERRORS
> EINVAL The scheduling policy is not one of the recognized policies,
> param is NULL, or param does not make sense for the policy.
>
> EPERM The calling process does not have appropriate privileges.
>
> ESRCH The process whose ID is pid could not be found.
>
> E2BIG The provided storage for struct sched_attr is either too
> big, see sched_setattr(), or too small, see sched_getattr().

Where's the EBUSY? It can throw this from __sched_setscheduler() when it
checks if there's enough bandwidth to run the task.

>
> NOTES
> While the text above (and in SCHED_SETSCHEDULER(2)) talks about
> processes, in actual fact these system calls are thread specific.


--
Henrik Austad

2014-04-09 15:42:33

by Peter Zijlstra

Subject: Re: sched_{set,get}attr() manpage

On Wed, Apr 09, 2014 at 05:19:11PM +0200, Henrik Austad wrote:
> > The following "real-time" policies are also supported, for
>
> why the "'s?

I borrowed those from SCHED_SETSCHEDULER(2).

> > sched_attr::sched_flags additional flags that can influence
> > scheduling behaviour. Currently as per Linux kernel 3.14:
> >
> > SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
> > to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
> > on fork().
> >
> > is the only supported flag.

...

> > The flags argument should be 0.
>
> What about SCHED_FLAG_RESET_ON_FORK?

Different flags. One is sched_attr::sched_flags; the other is the
flags argument of sched_setattr().

> > The other sched_attr fields are filled out as described in
> > sched_setattr().
> >
> > Scheduling Policies
> > The scheduler is the kernel component that decides which runnable
> > process will be executed by the CPU next. Each process has an associ‐
> > ated scheduling policy and a static scheduling priority, sched_prior‐
> > ity; these are the settings that are modified by sched_setscheduler().
> > The scheduler makes it decisions based on knowledge of the scheduling
> > policy and static priority of all processes on the system.
>
> Isn't this last sentence redundant/slightly repetitive?

I borrowed that from SCHED_SETSCHEDULER(2) again.

> > SCHED_DEADLINE: Sporadic task model deadline scheduling
> > SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> > Deadline First) with additional CBS (Constant Bandwidth Server).
> > The CBS guarantees that tasks that over-run their specified
> > budget are throttled and do not affect the correct performance
> > of other SCHED_DEADLINE tasks.
> >
> > SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
> >
> > Setting SCHED_DEADLINE can fail with -EINVAL when admission
> > control tests fail.
>
> Perhaps add a note about the deadline-class having higher priority than the
> other classes; i.e. if a deadline-task is runnable, it will preempt any
> other SCHED_(RR|FIFO) regardless of priority?

Yes, good point, will do.

> > SCHED_FIFO: First In-First Out scheduling
> > SCHED_FIFO can only be used with static priorities higher than 0, which
> > means that when a SCHED_FIFO processes becomes runnable, it will always
> > immediately preempt any currently running SCHED_OTHER, SCHED_BATCH, or
> > SCHED_IDLE process. SCHED_FIFO is a simple scheduling algorithm with‐
> > out time slicing. For processes scheduled under the SCHED_FIFO policy,
> > the following rules apply:
> >
> > * A SCHED_FIFO process that has been preempted by another process of
> > higher priority will stay at the head of the list for its priority
> > and will resume execution as soon as all processes of higher prior‐
> > ity are blocked again.
> >
> > * When a SCHED_FIFO process becomes runnable, it will be inserted at
> > the end of the list for its priority.
> >
> > * A call to sched_setscheduler() or sched_setparam(2) will put the
> > SCHED_FIFO (or SCHED_RR) process identified by pid at the start of
> > the list if it was runnable. As a consequence, it may preempt the
> > currently running process if it has the same priority.
> > (POSIX.1-2001 specifies that the process should go to the end of the
> > list.)
> >
> > * A process calling sched_yield(2) will be put at the end of the list.
>
> How about the recent discussion regarding sched_yield(). Is this correct?
>
> lkml.kernel.org/r/[email protected]
>
> Is this the correct place to add a note explaining the potential pitfalls
> of using sched_yield()?

I'm not sure; there's a SCHED_YIELD(2) manpage to fill with that
nonsense.

Also; I realized I have not described the DEADLINE sched_yield()
behaviour.

> > No other events will move a process scheduled under the SCHED_FIFO pol‐
> > icy in the wait list of runnable processes with equal static priority.
> >
> > A SCHED_FIFO process runs until either it is blocked by an I/O request,
> > it is preempted by a higher priority process, or it calls
> > sched_yield(2).
> >
> > SCHED_RR: Round Robin scheduling
> > SCHED_RR is a simple enhancement of SCHED_FIFO. Everything described
> > above for SCHED_FIFO also applies to SCHED_RR, except that each process
> > is only allowed to run for a maximum time quantum. If a SCHED_RR
> > process has been running for a time period equal to or longer than the
> > time quantum, it will be put at the end of the list for its priority.
> > A SCHED_RR process that has been preempted by a higher priority process
> > and subsequently resumes execution as a running process will complete
> > the unexpired portion of its round robin time quantum. The length of
> > the time quantum can be retrieved using sched_rr_get_interval(2).
>
> -> Default is 0.1HZ ms
>
> This is a question I get from time to time, having this in the manpage
> would be helpful.

Again, brazenly stolen from SCHED_SETSCHEDULER(2); but yes. Also I'm not
sure I'd call RR an enhancement of anything much at all ;-)

> > ERRORS
> > EINVAL The scheduling policy is not one of the recognized policies,
> > param is NULL, or param does not make sense for the policy.
> >
> > EPERM The calling process does not have appropriate privileges.
> >
> > ESRCH The process whose ID is pid could not be found.
> >
> > E2BIG The provided storage for struct sched_attr is either too
> > big, see sched_setattr(), or too small, see sched_getattr().
>
> Where's the EBUSY? It can throw this from __sched_setscheduler() when it
> checks if there's enough bandwidth to run the task.

Uhhm.. it got lost :-) /me quickly adds.

2014-04-10 07:47:27

by Juri Lelli

Subject: Re: sched_{set,get}attr() manpage

Hi all,

On Wed, 9 Apr 2014 17:42:04 +0200
Peter Zijlstra <[email protected]> wrote:

> On Wed, Apr 09, 2014 at 05:19:11PM +0200, Henrik Austad wrote:
> > > The following "real-time" policies are also supported, for
> >
> > why the "'s?
>
> I borrowed those from SCHED_SETSCHEDULER(2).
>
> > > sched_attr::sched_flags additional flags that can influence
> > > scheduling behaviour. Currently as per Linux kernel 3.14:
> > >
> > > SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
> > > to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
> > > on fork().
> > >
> > > is the only supported flag.
>
> ...
>
> > > The flags argument should be 0.
> >
> > What about SCHED_FLAG_RESET_ON_FOR?
>
> Different flags. The one is sched_attr::flags the other is
> sched_setattr(.flags).
>
> > > The other sched_attr fields are filled out as described in
> > > sched_setattr().
> > >
> > > Scheduling Policies
> > > The scheduler is the kernel component that decides which runnable
> > > process will be executed by the CPU next. Each process has an associ‐
> > > ated scheduling policy and a static scheduling priority, sched_prior‐
> > > ity; these are the settings that are modified by sched_setscheduler().
> > > The scheduler makes it decisions based on knowledge of the scheduling
> > > policy and static priority of all processes on the system.
> >
> > Isn't this last sentence redundant/sliglhtly repetitive?
>
> I borrowed that from SCHED_SETSCHEDULER(2) again.
>
> > > SCHED_DEADLINE: Sporadic task model deadline scheduling
> > > SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> > > Deadline First) with additional CBS (Constant Bandwidth Server).
> > > The CBS guarantees that tasks that over-run their specified
> > > budget are throttled and do not affect the correct performance
> > > of other SCHED_DEADLINE tasks.
> > >
> > > SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
> > >
> > > Setting SCHED_DEADLINE can fail with -EINVAL when admission
> > > control tests fail.
> >
> > Perhaps add a note about the deadline-class having higher priority than the
> > other classes; i.e. if a deadline-task is runnable, it will preempt any
> > other SCHED_(RR|FIFO) regardless of priority?
>
> Yes, good point, will do.
>
> > > SCHED_FIFO: First In-First Out scheduling
> > > SCHED_FIFO can only be used with static priorities higher than 0, which
> > > means that when a SCHED_FIFO processes becomes runnable, it will always
> > > immediately preempt any currently running SCHED_OTHER, SCHED_BATCH, or
> > > SCHED_IDLE process. SCHED_FIFO is a simple scheduling algorithm with‐
> > > out time slicing. For processes scheduled under the SCHED_FIFO policy,
> > > the following rules apply:
> > >
> > > * A SCHED_FIFO process that has been preempted by another process of
> > > higher priority will stay at the head of the list for its priority
> > > and will resume execution as soon as all processes of higher prior‐
> > > ity are blocked again.
> > >
> > > * When a SCHED_FIFO process becomes runnable, it will be inserted at
> > > the end of the list for its priority.
> > >
> > > * A call to sched_setscheduler() or sched_setparam(2) will put the
> > > SCHED_FIFO (or SCHED_RR) process identified by pid at the start of
> > > the list if it was runnable. As a consequence, it may preempt the
> > > currently running process if it has the same priority.
> > > (POSIX.1-2001 specifies that the process should go to the end of the
> > > list.)
> > >
> > > * A process calling sched_yield(2) will be put at the end of the list.
> >
> > How about the recent discussion regarding sched_yield(). Is this correct?
> >
> > lkml.kernel.org/r/[email protected]
> >
> > Is this the correct place to add a note explaining te potentional pitfalls
> > using sched_yield?
>
> I'm not sure; there's a SCHED_YIELD(2) manpage to fill with that
> nonsense.
>
> Also; I realized I have not described the DEADLINE sched_yield()
> behaviour.
>

So, for SCHED_DEADLINE we currently have this behaviour:

/*
* Yield task semantic for -deadline tasks is:
*
* get off from the CPU until our next instance, with
* a new runtime. This is of little use now, since we
* don't have a bandwidth reclaiming mechanism. Anyway,
* bandwidth reclaiming is planned for the future, and
* yield_task_dl will indicate that some spare budget
* is available for other task instances to use it.
*/

But, considering also the discussion above, I'm less sure now that's
what we want. Still, I think we will want some way in the future to be
able to say "I'm finished with my current job, give this remaining
runtime to someone else", like another syscall or something.

Thanks,

- Juri


2014-04-10 09:59:38

by Claudio Scordino

Subject: Re: sched_{set,get}attr() manpage

Il 10/04/2014 09:47, Juri Lelli ha scritto:
> Hi all,
>
> On Wed, 9 Apr 2014 17:42:04 +0200
> Peter Zijlstra <[email protected]> wrote:
>
>> On Wed, Apr 09, 2014 at 05:19:11PM +0200, Henrik Austad wrote:
>>>> The following "real-time" policies are also supported, for
>>> why the "'s?
>> I borrowed those from SCHED_SETSCHEDULER(2).
>>
>>>> sched_attr::sched_flags additional flags that can influence
>>>> scheduling behaviour. Currently as per Linux kernel 3.14:
>>>>
>>>> SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
>>>> to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
>>>> on fork().
>>>>
>>>> is the only supported flag.
>> ...
>>
>>>> The flags argument should be 0.
>>> What about SCHED_FLAG_RESET_ON_FOR?
>> Different flags. The one is sched_attr::flags the other is
>> sched_setattr(.flags).
>>
>>>> The other sched_attr fields are filled out as described in
>>>> sched_setattr().
>>>>
>>>> Scheduling Policies
>>>> The scheduler is the kernel component that decides which runnable
>>>> process will be executed by the CPU next. Each process has an associ‐
>>>> ated scheduling policy and a static scheduling priority, sched_prior‐
>>>> ity; these are the settings that are modified by sched_setscheduler().
>>>> The scheduler makes it decisions based on knowledge of the scheduling
>>>> policy and static priority of all processes on the system.
>>> Isn't this last sentence redundant/sliglhtly repetitive?
>> I borrowed that from SCHED_SETSCHEDULER(2) again.
>>
>>>> SCHED_DEADLINE: Sporadic task model deadline scheduling
>>>> SCHED_DEADLINE is an implementation of GEDF (Global Earliest
>>>> Deadline First) with additional CBS (Constant Bandwidth Server).
>>>> The CBS guarantees that tasks that over-run their specified
>>>> budget are throttled and do not affect the correct performance
>>>> of other SCHED_DEADLINE tasks.
>>>>
>>>> SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
>>>>
>>>> Setting SCHED_DEADLINE can fail with -EINVAL when admission
>>>> control tests fail.
>>> Perhaps add a note about the deadline-class having higher priority than the
>>> other classes; i.e. if a deadline-task is runnable, it will preempt any
>>> other SCHED_(RR|FIFO) regardless of priority?
>> Yes, good point, will do.
>>
>>>> SCHED_FIFO: First In-First Out scheduling
>>>> SCHED_FIFO can only be used with static priorities higher than 0, which
>>>> means that when a SCHED_FIFO processes becomes runnable, it will always
>>>> immediately preempt any currently running SCHED_OTHER, SCHED_BATCH, or
>>>> SCHED_IDLE process. SCHED_FIFO is a simple scheduling algorithm with‐
>>>> out time slicing. For processes scheduled under the SCHED_FIFO policy,
>>>> the following rules apply:
>>>>
>>>> * A SCHED_FIFO process that has been preempted by another process of
>>>> higher priority will stay at the head of the list for its priority
>>>> and will resume execution as soon as all processes of higher prior‐
>>>> ity are blocked again.
>>>>
>>>> * When a SCHED_FIFO process becomes runnable, it will be inserted at
>>>> the end of the list for its priority.
>>>>
>>>> * A call to sched_setscheduler() or sched_setparam(2) will put the
>>>> SCHED_FIFO (or SCHED_RR) process identified by pid at the start of
>>>> the list if it was runnable. As a consequence, it may preempt the
>>>> currently running process if it has the same priority.
>>>> (POSIX.1-2001 specifies that the process should go to the end of the
>>>> list.)
>>>>
>>>> * A process calling sched_yield(2) will be put at the end of the list.
>>> How about the recent discussion regarding sched_yield(). Is this correct?
>>>
>>> lkml.kernel.org/r/[email protected]
>>>
>>> Is this the correct place to add a note explaining te potentional pitfalls
>>> using sched_yield?
>> I'm not sure; there's a SCHED_YIELD(2) manpage to fill with that
>> nonsense.
>>
>> Also; I realized I have not described the DEADLINE sched_yield()
>> behaviour.
>>
> So, for SCHED_DEADLINE we currently have this behaviour:
>
> /*
> * Yield task semantic for -deadline tasks is:
> *
> * get off from the CPU until our next instance, with
> * a new runtime. This is of little use now, since we
> * don't have a bandwidth reclaiming mechanism. Anyway,
> * bandwidth reclaiming is planned for the future, and
> * yield_task_dl will indicate that some spare budget
> * is available for other task instances to use it.
> */
>
> But, considering also the discussion above, I'm less sure now that's
> what we want. Still, I think we will want some way in the future to be
> able to say "I'm finished with my current job, give this remaining
> runtime to someone else", like another syscall or something.

Hi Juri, hi Peter,

my two cents:

A syscall to block the task until its next instance is definitely useful.
This way, a periodic task doesn't have to sleep anymore: the kernel
takes care of unblocking the task at the right moment.
This would be easier (for user-level) and more efficient too.
I don't know if using sched_yield() to get this behavior is a good
choice or not. You have way more experience than me :)

Best,

Claudio

Subject: Re: sched_{set,get}attr() manpage

Hi Peter,

Following the review comments that one or two people sent, are you
planning to send in a revised version of this page? Also, is there any
test code lying about somewhere that I could play with?

Thanks,

Michael





--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2014-04-27 19:35:30

by Peter Zijlstra

Subject: Re: sched_{set,get}attr() manpage

On Sun, Apr 27, 2014 at 05:47:25PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Peter,
>
> Following the review comments that one or two people sent, are you
> planning to send in a revised version of this page?

Yes, I just suck at getting around to it :-(, I'll do it first thing
tomorrow.

> Also, is there any test code lying about somewhere that I could play with?

Juri?

2014-04-27 19:46:04

by Steven Rostedt

Subject: Re: sched_{set,get}attr() manpage

On Sun, 27 Apr 2014 21:34:49 +0200
Peter Zijlstra <[email protected]> wrote:

> > Also, is there any test code lying about somewhere that I could play with?

I have a deadline program you can play with too:

http://rostedt.homelinux.com/private/deadline.c

-- Steve

2014-04-28 07:38:30

by Juri Lelli

Subject: Re: sched_{set,get}attr() manpage

On Sun, 27 Apr 2014 21:34:49 +0200
Peter Zijlstra <[email protected]> wrote:

> On Sun, Apr 27, 2014 at 05:47:25PM +0200, Michael Kerrisk (man-pages) wrote:
> > Hi Peter,
> >
> > Following the review comments that one or two people sent, are you
> > planning to send in a revised version of this page?
>
> Yes, I just suck at getting around to it :-(, I'll do it first thing
> tomorrow.
>
> > Also, is there any test code lying about somewhere that I could play with?
>
> Juri?

Yes. I use these two tools:

- rt-app (to create periodic workloads, including non-RT/DL ones)
https://github.com/gbagnoli/rt-app

- schedtool-dl (patched version of schedtool)
https://github.com/jlelli/schedtool-dl

Both are aligned to the latest interface.

Best,

- Juri

2014-04-28 08:19:27

by Peter Zijlstra

Subject: sched_{set,get}attr() manpage

Hi Michael,

Find below an updated manpage. I did not apply the comments on parts
that are identical to SCHED_SETSCHEDULER(2), in order to keep these texts
in alignment. I feel that if we change one we should also change the
other, and such a 'patch' is best done separately from the new manpage
itself.

I did add the missing EBUSY error, and amended the text where it said
we'd return EINVAL in that case.

I added a paragraph stating that SCHED_DEADLINE preempts anything else
userspace can do (with the explicit mention of userspace to leave me
wriggle room for the kernel's stop task :-).

I also did a short paragraph on the deadline sched_yield(). For further
deadline yield details we should maybe add to the SCHED_YIELD(2)
manpage.

Re Juri/Claudio: no, I think sched_yield() as implemented for deadline
makes sense. The only other yield semantics that would make sense for it
is a NOP, and since we already have the syscall we might as well make it
do something useful.
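
To illustrate the usage pattern this implies, here is a minimal sketch of a job loop that does one job per instance and then calls sched_yield(2). The names demo_run_jobs() and demo_count_job() are hypothetical. The sketch does not itself switch the caller to SCHED_DEADLINE; under that policy the yield would, per the kernel comment quoted earlier in the thread, give up the CPU until the next instance, while under the other policies it only moves the caller to the end of its list.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

typedef void (*demo_job_fn)(void *arg);

/* Hypothetical sample job: increments the int that arg points to. */
static void demo_count_job(void *arg)
{
    (*(int *)arg)++;
}

/*
 * Hypothetical helper: run up to njobs jobs, yielding after each one.
 * Returns the number of jobs completed.
 */
static int demo_run_jobs(int njobs, demo_job_fn job, void *arg)
{
    int done = 0;

    assert(job != NULL);
    while (done < njobs) {
        job(arg);          /* the work of this instance */
        done++;
        /* "I'm finished with my current job." */
        if (sched_yield() != 0)
            break;
    }
    return done;
}
```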


---

NAME
sched_setattr, sched_getattr - set and get scheduling policy/attributes

SYNOPSIS
#include <sched.h>

struct sched_attr {
    u32 size;
    u32 sched_policy;
    u64 sched_flags;

    /* SCHED_OTHER, SCHED_BATCH */
    s32 sched_nice;
    /* SCHED_FIFO, SCHED_RR */
    u32 sched_priority;
    /* SCHED_DEADLINE */
    u64 sched_runtime;
    u64 sched_deadline;
    u64 sched_period;
};

int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);

int sched_getattr(pid_t pid, struct sched_attr *attr, unsigned int size, unsigned int flags);

DESCRIPTION
sched_setattr() sets both the scheduling policy and the
associated attributes for the process whose ID is specified in
pid. If pid equals zero, the scheduling policy and attributes
of the calling process will be set. The interpretation of the
argument attr depends on the selected policy. Currently, Linux
supports the following "normal" (i.e., non-real-time) scheduling
policies:

SCHED_OTHER the standard "fair" time-sharing policy;

SCHED_BATCH for "batch" style execution of processes; and

SCHED_IDLE for running very low priority background jobs.

The following "real-time" policies are also supported, for
special time-critical applications that need precise control
over the way in which runnable processes are selected for
execution:

SCHED_FIFO a first-in, first-out policy;

SCHED_RR a round-robin policy; and

SCHED_DEADLINE a deadline policy.

The semantics of each of these policies are detailed below.

sched_attr::size must be set to the size of the structure, as in
sizeof(struct sched_attr). If the provided structure is smaller
than the kernel structure, any additional fields are assumed to
be '0'. If the provided structure is larger than the kernel
structure, the kernel verifies that all additional fields are
'0'; if not, the syscall will fail with -E2BIG.
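
The size rule above can be sketched in plain userspace C. This is only an illustration of the described behaviour, not the kernel's code: struct demo_sched_attr mirrors the layout from the SYNOPSIS under a different name (so it cannot clash with system headers), and demo_copy_attr() is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as struct sched_attr in the SYNOPSIS (48 bytes). */
struct demo_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
};

/*
 * Hypothetical helper: copy a caller buffer of user_size bytes into the
 * layout known to this code. Returns 0 on success, or -E2BIG if the
 * caller supplied unknown trailing bytes that are not all zero.
 */
static int demo_copy_attr(struct demo_sched_attr *dst,
                          const void *user, size_t user_size)
{
    const unsigned char *bytes = user;
    size_t known = sizeof(*dst);
    size_t i;

    assert(dst != NULL && user != NULL);

    /* Smaller structure: fields the caller did not supply are 0. */
    memset(dst, 0, sizeof(*dst));

    /* Larger structure: every byte past the known layout must be 0. */
    for (i = known; i < user_size; i++)
        if (bytes[i] != 0)
            return -E2BIG;

    memcpy(dst, bytes, user_size < known ? user_size : known);
    dst->size = (uint32_t)known;
    return 0;
}
```

The point of the rule is forward and backward compatibility: old binaries keep working against a kernel with a larger structure, and new binaries work against an old kernel as long as they leave the fields it does not know about at zero.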

sched_attr::sched_policy is the desired scheduling policy.

sched_attr::sched_flags holds additional flags that can influence
scheduling behaviour. As of Linux kernel 3.14, the only supported
flag is:

SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
on fork().

sched_attr::sched_nice should only be set for SCHED_OTHER and
SCHED_BATCH; it is the desired nice value [-20,19], see NICE(2).

sched_attr::sched_priority should only be set for SCHED_FIFO and
SCHED_RR; it is the desired static priority [1,99].

sched_attr::sched_runtime,
sched_attr::sched_deadline and
sched_attr::sched_period should only be set for SCHED_DEADLINE
and are the traditional sporadic task model parameters.

The flags argument should be 0.

sched_getattr() queries the scheduling policy currently applied
to the process identified by pid. If pid equals zero, the
policy of the calling process will be retrieved.

The size argument should reflect the size of struct sched_attr
as known to userspace. The kernel sets sched_attr::size to
the size of its sched_attr structure. If the user-provided
structure is larger, additional fields are not touched. If the
user-provided structure is smaller, but the kernel needs to
return values outside the provided space, the syscall will fail
with -E2BIG.

The flags argument should be 0.

The other sched_attr fields are filled out as described in
sched_setattr().
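
A minimal sketch of calling sched_getattr() from userspace, assuming the C library provides no wrapper and the raw syscall(2) interface with SYS_sched_getattr has to be used. struct demo_sched_attr (the SYNOPSIS layout under another name) and demo_sched_getattr() are hypothetical names for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Same field layout as struct sched_attr in the SYNOPSIS. */
struct demo_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
};

/*
 * Hypothetical wrapper: query the attributes of pid (0 = caller).
 * Returns 0 on success, -1 with errno set on error.
 */
static int demo_sched_getattr(pid_t pid, struct demo_sched_attr *attr)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    /* size = what userspace knows about; flags must be 0. */
    return (int)syscall(SYS_sched_getattr, pid, attr,
                        (unsigned int)sizeof(*attr), 0U);
}
```

After a successful call, attr->sched_policy holds the current policy and only the fields that apply to that policy are meaningful.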

Scheduling Policies
The scheduler is the kernel component that decides which runnable
process will be executed by the CPU next. Each process has an associ‐
ated scheduling policy and a static scheduling priority, sched_prior‐
ity; these are the settings that are modified by sched_setscheduler().
The scheduler makes its decisions based on knowledge of the scheduling
policy and static priority of all processes on the system.

For processes scheduled under one of the normal scheduling policies
(SCHED_OTHER, SCHED_IDLE, SCHED_BATCH), sched_priority is not used in
scheduling decisions (it must be specified as 0).

Processes scheduled under one of the real-time policies (SCHED_FIFO,
SCHED_RR) have a sched_priority value in the range 1 (low) to 99
(high). (As the numbers imply, real-time processes always have higher
priority than normal processes.) Note well: POSIX.1-2001 only requires
an implementation to support a minimum 32 distinct priority levels for
the real-time policies, and some systems supply just this minimum.
Portable programs should use sched_get_priority_min(2) and
sched_get_priority_max(2) to find the range of priorities supported for
a particular policy.
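
A portable program might query the range like this. It is only a
sketch: priority_range() and clamp_priority() are hypothetical helper
names, and the code relies on nothing beyond the two range calls
named above.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Query the static priority range supported for policy.
 * Returns 0 and fills *lo and *hi, or -1 if either query failed. */
int priority_range(int policy, int *lo, int *hi)
{
    assert(lo != NULL && hi != NULL);
    *lo = sched_get_priority_min(policy);
    *hi = sched_get_priority_max(policy);
    return (*lo == -1 || *hi == -1) ? -1 : 0;
}

/* Clamp a requested priority into the supported range for policy.
 * Returns the clamped value, or -1 if the range is unavailable. */
int clamp_priority(int policy, int wanted)
{
    int lo, hi;

    if (priority_range(policy, &lo, &hi) == -1)
        return -1;
    if (wanted < lo)
        return lo;
    if (wanted > hi)
        return hi;
    return wanted;
}
```

Clamping is one possible design choice; a stricter program could
instead reject out-of-range requests outright.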

Conceptually, the scheduler maintains a list of runnable processes for
each possible sched_priority value. In order to determine which
process runs next, the scheduler looks for the nonempty list with the
highest static priority and selects the process at the head of this
list.

A process's scheduling policy determines where it will be inserted into
the list of processes with equal static priority and how it will move
inside this list.

All scheduling is preemptive: if a process with a higher static prior‐
ity becomes ready to run, the currently running process will be pre‐
empted and returned to the wait list for its static priority level.
The scheduling policy only determines the ordering within the list of
runnable processes with equal static priority.
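
The conceptual model above (one list per static priority, highest
nonempty list wins) can be sketched as a toy data structure. This is
an illustration of the description only, not kernel code; struct
runlists, enqueue_tail(), pick_next() and the MAX_TASKS capacity are
all made up for the example, and task IDs are assumed non-negative.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PRIO  99   /* static priorities 0..99, as in the text */
#define MAX_TASKS 8    /* arbitrary per-list capacity for this model */

/* One FIFO list of runnable task IDs per static priority. */
struct runlists {
    int    task[MAX_PRIO + 1][MAX_TASKS];
    size_t count[MAX_PRIO + 1];
};

/* Append a task at the tail of the list for its static priority.
 * Returns 0 on success, -1 on a bad priority or a full list. */
int enqueue_tail(struct runlists *rl, int prio, int task_id)
{
    assert(rl != NULL);
    if (prio < 0 || prio > MAX_PRIO || rl->count[prio] == MAX_TASKS)
        return -1;
    rl->task[prio][rl->count[prio]++] = task_id;
    return 0;
}

/* Return the head of the highest-priority nonempty list,
 * or -1 if nothing is runnable. */
int pick_next(const struct runlists *rl)
{
    assert(rl != NULL);
    for (int prio = MAX_PRIO; prio >= 0; prio--)
        if (rl->count[prio] > 0)
            return rl->task[prio][0];
    return -1;
}
```

The policy-specific rules described in the sections below only change
where a task is inserted into, or moved within, one of these lists.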

SCHED_DEADLINE: Sporadic task model deadline scheduling
SCHED_DEADLINE is an implementation of GEDF (Global Earliest
Deadline First) with additional CBS (Constant Bandwidth Server).
The CBS guarantees that tasks that over-run their specified
budget are throttled and do not affect the correct performance
of other SCHED_DEADLINE tasks.

SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN.

Setting SCHED_DEADLINE can fail with -EBUSY when admission
control tests fail.

Because of the nature of (G)EDF, SCHED_DEADLINE tasks are the
highest priority (user controllable) tasks in the system; if any
SCHED_DEADLINE task is runnable, it will preempt any
FIFO/RR/OTHER/BATCH/IDLE task out there.

A SCHED_DEADLINE task calling sched_yield() will 'yield' the
current job and wait for a new period to begin.
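
The text does not spell out the admission test, so the following is
only a sketch of one classical approach, a utilization bound, and
should not be read as what the kernel actually computes. It assumes
the three parameters are in nanoseconds and satisfy runtime <=
deadline <= period; struct dl_params, dl_params_sane(), dl_admit()
and the cap_percent parameter are hypothetical names introduced for
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the three SCHED_DEADLINE parameters (ns). */
struct dl_params {
    uint64_t runtime;
    uint64_t deadline;
    uint64_t period;
};

/* Sanity check assumed here: 0 < runtime <= deadline <= period. */
int dl_params_sane(const struct dl_params *p)
{
    assert(p != NULL);
    return p->runtime > 0 &&
           p->runtime <= p->deadline &&
           p->deadline <= p->period;
}

/* Utilization test: admit the set if the sum of runtime/period does
 * not exceed cpus * cap_percent / 100. Utilization is computed in
 * parts per million with integer division, so each term is rounded
 * down; very large runtimes (above roughly 5 hours) would overflow
 * the multiplication. Returns 1 if admitted, 0 otherwise. */
int dl_admit(const struct dl_params *set, size_t n,
             unsigned cpus, unsigned cap_percent)
{
    uint64_t total_ppm = 0;

    assert(set != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!dl_params_sane(&set[i]))
            return 0;
        total_ppm += set[i].runtime * 1000000ULL / set[i].period;
    }
    return total_ppm <= (uint64_t)cpus * cap_percent * 10000ULL;
}
```

For example, a task with 10ms runtime every 30ms plus one with 20ms
every 100ms uses about 53% of one CPU, so this sketch admits the pair
under a 95% cap and rejects it under a 50% cap.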

SCHED_FIFO: First In-First Out scheduling
SCHED_FIFO can only be used with static priorities higher than 0, which
means that when a SCHED_FIFO process becomes runnable, it will always
immediately preempt any currently running SCHED_OTHER, SCHED_BATCH, or
SCHED_IDLE process. SCHED_FIFO is a simple scheduling algorithm with‐
out time slicing. For processes scheduled under the SCHED_FIFO policy,
the following rules apply:

* A SCHED_FIFO process that has been preempted by another process of
higher priority will stay at the head of the list for its priority
and will resume execution as soon as all processes of higher prior‐
ity are blocked again.

* When a SCHED_FIFO process becomes runnable, it will be inserted at
the end of the list for its priority.

* A call to sched_setscheduler() or sched_setparam(2) will put the
SCHED_FIFO (or SCHED_RR) process identified by pid at the start of
the list if it was runnable. As a consequence, it may preempt the
currently running process if it has the same priority.
(POSIX.1-2001 specifies that the process should go to the end of the
list.)

* A process calling sched_yield(2) will be put at the end of the list.

No other events will move a process scheduled under the SCHED_FIFO pol‐
icy in the wait list of runnable processes with equal static priority.

A SCHED_FIFO process runs until either it is blocked by an I/O request,
it is preempted by a higher priority process, or it calls
sched_yield(2).

SCHED_RR: Round Robin scheduling
SCHED_RR is a simple enhancement of SCHED_FIFO. Everything described
above for SCHED_FIFO also applies to SCHED_RR, except that each process
is only allowed to run for a maximum time quantum. If a SCHED_RR
process has been running for a time period equal to or longer than the
time quantum, it will be put at the end of the list for its priority.
A SCHED_RR process that has been preempted by a higher priority process
and subsequently resumes execution as a running process will complete
the unexpired portion of its round robin time quantum. The length of
the time quantum can be retrieved using sched_rr_get_interval(2).

SCHED_OTHER: Default Linux time-sharing scheduling
SCHED_OTHER can only be used at static priority 0. SCHED_OTHER is the
standard Linux time-sharing scheduler that is intended for all pro‐
cesses that do not require the special real-time mechanisms. The
process to run is chosen from the static priority 0 list based on a
dynamic priority that is determined only inside this list. The dynamic
priority is based on the nice value (set by nice(2) or setpriority(2))
and increased for each time quantum the process is ready to run, but
denied to run by the scheduler. This ensures fair progress among all
SCHED_OTHER processes.

SCHED_BATCH: Scheduling batch processes
(Since Linux 2.6.16.) SCHED_BATCH can only be used at static priority
0. This policy is similar to SCHED_OTHER in that it schedules the
process according to its dynamic priority (based on the nice value).
The difference is that this policy will cause the scheduler to always
assume that the process is CPU-intensive. Consequently, the scheduler
will apply a small scheduling penalty with respect to wakeup behaviour,
so that this process is mildly disfavored in scheduling decisions.

This policy is useful for workloads that are noninteractive, but do not
want to lower their nice value, and for workloads that want a determin‐
istic scheduling policy without interactivity causing extra preemptions
(between the workload's tasks).

SCHED_IDLE: Scheduling very low priority jobs
(Since Linux 2.6.23.) SCHED_IDLE can only be used at static priority
0; the process nice value has no influence for this policy.

This policy is intended for running jobs at extremely low priority
(lower even than a +19 nice value with the SCHED_OTHER or SCHED_BATCH
policies).

RETURN VALUE
On success, sched_setattr() and sched_getattr() return 0. On
error, -1 is returned, and errno is set appropriately.

ERRORS
EINVAL The scheduling policy is not one of the recognized policies,
attr is NULL, or attr does not make sense for the policy.

EPERM The calling process does not have appropriate privileges.

ESRCH The process whose ID is pid could not be found.

E2BIG The provided storage for struct sched_attr is either too
big, see sched_setattr(), or too small, see sched_getattr().

EBUSY SCHED_DEADLINE admission control failure

NOTES
While the text above (and in SCHED_SETSCHEDULER(2)) talks about
processes, in actual fact these system calls are thread specific.

Subject: Re: sched_{set,get}attr() manpage

Hi Peter,

On 04/28/2014 10:18 AM, Peter Zijlstra wrote:
> Hi Michael,
>
> find below an updated manpage, I did not apply the comments on parts
> that are identical to SCHED_SETSCHEDULER(2) in order to keep these texts
> in alignment. I feel that if we change one we should also change the
> other, and such a 'patch' is best done separate from the new manpage
> itself.
>
> I did add the missing EBUSY error, and amended the text where it said
> we'd return EINVAL in that case.
>
> I added a paragraph stating that SCHED_DEADLINE preempted anything else
> userspace can do (with the explicit mention of userspace to leave me
> wriggle room for the kernel's stop task :-).
>
> I also did a short paragraph on the deadline sched_yield(). For further
> deadline yield details we should maybe add to the SCHED_YIELD(2)
> manpage.
>
> Re juri/claudio; no I think sched_yield() as implemented for deadline
> makes sense, no other yield semantics other than NOP makes sense for it,
> and since we have the syscall already might as well make it do something
> useful.

Thanks for the updated page. Would you be willing
to revise as per the comments below?


> NAME
> sched_setattr, sched_getattr - set and get scheduling policy/attributes
>
> SYNOPSIS
> #include <sched.h>
>
> struct sched_attr {
> u32 size;
> u32 sched_policy;
> u64 sched_flags;
>
> /* SCHED_NORMAL, SCHED_BATCH */
> s32 sched_nice;
> /* SCHED_FIFO, SCHED_RR */
> u32 sched_priority;
> /* SCHED_DEADLINE */
> u64 sched_runtime;
> u64 sched_deadline;
> u64 sched_period;
> };
> int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);
>
> int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);
>
> DESCRIPTION
> sched_setattr() sets both the scheduling policy and the
> associated attributes for the process whose ID is specified in
> pid.

Around about here, I think there needs to be a sentence explaining
that sched_setattr() provides a superset of the functionality of
sched_setscheduler(2) and setpriority(2). I mean, it can do all that
those two calls can do, right?

> If pid equals zero, the scheduling policy and attributes
> of the calling process will be set. The interpretation of the
> argument attr depends on the selected policy. Currently, Linux
> supports the following "normal" (i.e., non-real-time) scheduling
> policies:
>
> SCHED_OTHER the standard "fair" time-sharing policy;
>
> SCHED_BATCH for "batch" style execution of processes; and
>
> SCHED_IDLE for running very low priority background jobs.
>
> The following "real-time" policies are also supported, for
> special time-critical applications that need precise control
> over the way in which runnable processes are selected for
> execution:
>
> SCHED_FIFO a first-in, first-out policy;
>
> SCHED_RR a round-robin policy; and
>
> SCHED_DEADLINE a deadline policy.
>
> The semantics of each of these policies are detailed below.

The semantics of each of these policies are detailed in sched(7).

[See my comments below]

>
> sched_attr::size must be set to the size of the structure, as in
> sizeof(struct sched_attr), if the provided structure is smaller
> than the kernel structure, any additional fields are assumed
> '0'. If the provided structure is larger than the kernel
> structure, the kernel verifies all additional fields are '0' if
> not the syscall will fail with -E2BIG.
>
> sched_attr::sched_policy the desired scheduling policy.
>
> sched_attr::sched_flags additional flags that can influence
> scheduling behaviour. Currently as per Linux kernel 3.14:
>
> SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
> to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
> on fork().
>
> is the only supported flag.
>
> sched_attr::sched_nice should only be set for SCHED_OTHER,
> SCHED_BATCH, the desired nice value [-20,19], see NICE(2).
>
> sched_attr::sched_priority should only be set for SCHED_FIFO,
> SCHED_RR, the desired static priority [1,99].
>
> sched_attr::sched_runtime
> sched_attr::sched_deadline
> sched_attr::sched_period should only be set for SCHED_DEADLINE
> and are the traditional sporadic task model parameters.

Could you add (a lot ;-)) more detail on these three fields? Assume the
reader does not know about this traditional sporadic task model, and
then give some explanation of what these three fields do. Probably, at
this point you can work in some statement about the admission control
test.

[but, see my comment below. It may be that sched(7) is a better
place for this detail.]

> The flags argument should be 0.
>
> sched_getattr() queries the scheduling policy currently applied
> to the process identified by pid. If pid equals zero, the
> policy of the calling process will be retrieved.
>
> The size argument should reflect the size of struct sched_attr
> as known to userspace. The kernel fills out sched_attr::size to
> the size of its sched_attr structure. If the user provided
> structure is larger, additional fields are not touched. If the
> user provided structure is smaller, but the kernel needs to
> return values outside the provided space, the syscall will fail
> with -E2BIG.
>
> The flags argument should be 0.
>
> The other sched_attr fields are filled out as described in
> sched_setattr().

I assume that everything between my [[[ and ]]] blocks below is taken straight
from sched_setscheduler(2). (If that is not true, please let me know.)
This reminds me that there is a structural fault in this part of man-pages ;-).
The problem is sched_setscheduler(2) currently tries to do two things:

[a] Document the sched_setscheduler() and sched_getscheduler() system calls
[b] Provide an overview of scheduling policies and parameters.

It should really only do the former. I have now gone through the task of
separating [b] out into a separate page, sched(7), which other pages,
such as sched_setscheduler(2) and sched_setattr(2) can refer to. You
can see the current versions of sched_setscheduler.2 and sched.7 in Git
(https://www.kernel.org/doc/man-pages/download.html )

So, what I would ideally like to see

[1] A page describing the sched_setattr() and sched_getattr() APIs
[2] A piece of text describing the SCHED_DEADLINE policy, which I can
drop into sched(7).

Could you revise like that?

[[[[
> Scheduling Policies
> The scheduler is the kernel component that decides which runnable
> process will be executed by the CPU next. Each process has an associ‐
> ated scheduling policy and a static scheduling priority, sched_prior‐
> ity; these are the settings that are modified by sched_setscheduler().
> The scheduler makes its decisions based on knowledge of the scheduling
> policy and static priority of all processes on the system.
>
> For processes scheduled under one of the normal scheduling policies
> (SCHED_OTHER, SCHED_IDLE, SCHED_BATCH), sched_priority is not used in
> scheduling decisions (it must be specified as 0).
>
> Processes scheduled under one of the real-time policies (SCHED_FIFO,
> SCHED_RR) have a sched_priority value in the range 1 (low) to 99
> (high). (As the numbers imply, real-time processes always have higher
> priority than normal processes.) Note well: POSIX.1-2001 only requires
> an implementation to support a minimum 32 distinct priority levels for
> the real-time policies, and some systems supply just this minimum.
> Portable programs should use sched_get_priority_min(2) and
> sched_get_priority_max(2) to find the range of priorities supported for
> a particular policy.
>
> Conceptually, the scheduler maintains a list of runnable processes for
> each possible sched_priority value. In order to determine which
> process runs next, the scheduler looks for the nonempty list with the
> highest static priority and selects the process at the head of this
> list.
>
> A process's scheduling policy determines where it will be inserted into
> the list of processes with equal static priority and how it will move
> inside this list.
>
> All scheduling is preemptive: if a process with a higher static prior‐
> ity becomes ready to run, the currently running process will be pre‐
> empted and returned to the wait list for its static priority level.
> The scheduling policy only determines the ordering within the list of
> runnable processes with equal static priority.
]]]]

> SCHED_DEADLINE: Sporadic task model deadline scheduling
> SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> Deadline First) with additional CBS (Constant Bandwidth Server).
> The CBS guarantees that tasks that over-run their specified
> budget are throttled and do not affect the correct performance
> of other SCHED_DEADLINE tasks.
>
> SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
>
> Setting SCHED_DEADLINE can fail with -EBUSY when admission
> control tests fail.
>
> Because of the nature of (G)EDF, SCHED_DEADLINE tasks are the
> highest priority (user controllable) tasks in the system; if any
> SCHED_DEADLINE task is runnable, it will preempt any
> FIFO/RR/OTHER/BATCH/IDLE task out there.
>
> A SCHED_DEADLINE task calling sched_yield() will 'yield' the
> current job and wait for a new period to begin.

This is the piece that could go into sched(7), but I'd like it to include
a discussion of deadline, period, and runtime.

[[[[

> SCHED_FIFO: First In-First Out scheduling
> SCHED_FIFO can only be used with static priorities higher than 0, which
> means that when a SCHED_FIFO process becomes runnable, it will always
> immediately preempt any currently running SCHED_OTHER, SCHED_BATCH, or
> SCHED_IDLE process. SCHED_FIFO is a simple scheduling algorithm with‐
> out time slicing. For processes scheduled under the SCHED_FIFO policy,
> the following rules apply:
>
> * A SCHED_FIFO process that has been preempted by another process of
> higher priority will stay at the head of the list for its priority
> and will resume execution as soon as all processes of higher prior‐
> ity are blocked again.
>
> * When a SCHED_FIFO process becomes runnable, it will be inserted at
> the end of the list for its priority.
>
> * A call to sched_setscheduler() or sched_setparam(2) will put the
> SCHED_FIFO (or SCHED_RR) process identified by pid at the start of
> the list if it was runnable. As a consequence, it may preempt the
> currently running process if it has the same priority.
> (POSIX.1-2001 specifies that the process should go to the end of the
> list.)
>
> * A process calling sched_yield(2) will be put at the end of the list.
>
> No other events will move a process scheduled under the SCHED_FIFO pol‐
> icy in the wait list of runnable processes with equal static priority.
>
> A SCHED_FIFO process runs until either it is blocked by an I/O request,
> it is preempted by a higher priority process, or it calls
> sched_yield(2).
>
> SCHED_RR: Round Robin scheduling
> SCHED_RR is a simple enhancement of SCHED_FIFO. Everything described
> above for SCHED_FIFO also applies to SCHED_RR, except that each process
> is only allowed to run for a maximum time quantum. If a SCHED_RR
> process has been running for a time period equal to or longer than the
> time quantum, it will be put at the end of the list for its priority.
> A SCHED_RR process that has been preempted by a higher priority process
> and subsequently resumes execution as a running process will complete
> the unexpired portion of its round robin time quantum. The length of
> the time quantum can be retrieved using sched_rr_get_interval(2).
>
> SCHED_OTHER: Default Linux time-sharing scheduling
> SCHED_OTHER can only be used at static priority 0. SCHED_OTHER is the
> standard Linux time-sharing scheduler that is intended for all pro‐
> cesses that do not require the special real-time mechanisms. The
> process to run is chosen from the static priority 0 list based on a
> dynamic priority that is determined only inside this list. The dynamic
> priority is based on the nice value (set by nice(2) or setpriority(2))
> and increased for each time quantum the process is ready to run, but
> denied to run by the scheduler. This ensures fair progress among all
> SCHED_OTHER processes.
>
> SCHED_BATCH: Scheduling batch processes
> (Since Linux 2.6.16.) SCHED_BATCH can only be used at static priority
> 0. This policy is similar to SCHED_OTHER in that it schedules the
> process according to its dynamic priority (based on the nice value).
> The difference is that this policy will cause the scheduler to always
> assume that the process is CPU-intensive. Consequently, the scheduler
> will apply a small scheduling penalty with respect to wakeup behaviour,
> so that this process is mildly disfavored in scheduling decisions.
>
> This policy is useful for workloads that are noninteractive, but do not
> want to lower their nice value, and for workloads that want a determin‐
> istic scheduling policy without interactivity causing extra preemptions
> (between the workload's tasks).
>
> SCHED_IDLE: Scheduling very low priority jobs
> (Since Linux 2.6.23.) SCHED_IDLE can only be used at static priority
> 0; the process nice value has no influence for this policy.
>
> This policy is intended for running jobs at extremely low priority
> (lower even than a +19 nice value with the SCHED_OTHER or SCHED_BATCH
> policies).
]]]]

> RETURN VALUE
> On success, sched_setattr() and sched_getattr() return 0. On
> error, -1 is returned, and errno is set appropriately.
>
> ERRORS
> EINVAL The scheduling policy is not one of the recognized policies,
> param is NULL, or param does not make sense for the policy.
>
> EPERM The calling process does not have appropriate privileges.
>
> ESRCH The process whose ID is pid could not be found.
>
> E2BIG The provided storage for struct sched_attr is either too
> big, see sched_setattr(), or too small, see sched_getattr().
>
> EBUSY SCHED_DEADLINE admission control failure

The above is the only place on the page that mentions admission control.
As well as the suggestions above, it would be nice to have somewhere a
summary of how admission control is calculated.

> NOTES
> While the text above (and in SCHED_SETSCHEDULER(2)) talks about
> processes, in actual fact these system calls are thread specific.
>

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2014-04-29 14:22:32

by Peter Zijlstra

Subject: Re: sched_{set,get}attr() manpage

On Tue, Apr 29, 2014 at 03:08:55PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Peter,
>
> On 04/28/2014 10:18 AM, Peter Zijlstra wrote:
> > Hi Michael,
> >
> > find below an updated manpage, I did not apply the comments on parts
> > that are identical to SCHED_SETSCHEDULER(2) in order to keep these texts
> > in alignment. I feel that if we change one we should also change the
> > other, and such a 'patch' is best done separate from the new manpage
> > itself.
> >
> > I did add the missing EBUSY error, and amended the text where it said
> > we'd return EINVAL in that case.
> >
> > I added a paragraph stating that SCHED_DEADLINE preempted anything else
> > userspace can do (with the explicit mention of userspace to leave me
> > wriggle room for the kernel's stop task :-).
> >
> > I also did a short paragraph on the deadline sched_yield(). For further
> > deadline yield details we should maybe add to the SCHED_YIELD(2)
> > manpage.
> >
> > Re juri/claudio; no I think sched_yield() as implemented for deadline
> > makes sense, no other yield semantics other than NOP makes sense for it,
> > and since we have the syscall already might as well make it do something
> > useful.
>
> Thanks for the updated page. Would you be willing
> to revise as per the comments below.

Ok.

>
> > NAME
> > sched_setattr, sched_getattr - set and get scheduling policy/attributes
> >
> > SYNOPSIS
> > #include <sched.h>
> >
> > struct sched_attr {
> > u32 size;
> > u32 sched_policy;
> > u64 sched_flags;
> >
> > /* SCHED_NORMAL, SCHED_BATCH */
> > s32 sched_nice;
> > /* SCHED_FIFO, SCHED_RR */
> > u32 sched_priority;
> > /* SCHED_DEADLINE */
> > u64 sched_runtime;
> > u64 sched_deadline;
> > u64 sched_period;
> > };
> > int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);
> >
> > int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);
> >
> > DESCRIPTION
> > sched_setattr() sets both the scheduling policy and the
> > associated attributes for the process whose ID is specified in
> > pid.
>
> Around about here, I think there needs to be a sentence explaining
> that sched_setattr() provides a superset of the functionality of
> sched_setscheduler(2) and setpriority(2). I mean, it can do all that
> those two calls can do, right?

Almost; setpriority() has the .which argument which we don't have. So
while that syscall can change the nice value for an entire process group
or user, sched_setattr() can only change the nice value for 1 task.

But yes, I can mention something along those lines.

> > If pid equals zero, the scheduling policy and attributes
> > of the calling process will be set. The interpretation of the
> > argument attr depends on the selected policy. Currently, Linux
> > supports the following "normal" (i.e., non-real-time) scheduling
> > policies:
> >
> > SCHED_OTHER the standard "fair" time-sharing policy;
> >
> > SCHED_BATCH for "batch" style execution of processes; and
> >
> > SCHED_IDLE for running very low priority background jobs.
> >
> > The following "real-time" policies are also supported, for
> > special time-critical applications that need precise control
> > over the way in which runnable processes are selected for
> > execution:
> >
> > SCHED_FIFO a first-in, first-out policy;
> >
> > SCHED_RR a round-robin policy; and
> >
> > SCHED_DEADLINE a deadline policy.
> >
> > The semantics of each of these policies are detailed below.
>
> The semantics of each of these policies are detailed in sched(7).

I don't appear to have SCHED(7), how new is that?

> [See my comments below]
>
> >
> > sched_attr::size must be set to the size of the structure, as in
> > sizeof(struct sched_attr), if the provided structure is smaller
> > than the kernel structure, any additional fields are assumed
> > '0'. If the provided structure is larger than the kernel
> > structure, the kernel verifies all additional fields are '0' if
> > not the syscall will fail with -E2BIG.
> >
> > sched_attr::sched_policy the desired scheduling policy.
> >
> > sched_attr::sched_flags additional flags that can influence
> > scheduling behaviour. Currently as per Linux kernel 3.14:
> >
> > SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
> > to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
> > on fork().
> >
> > is the only supported flag.
> >
> > sched_attr::sched_nice should only be set for SCHED_OTHER,
> > SCHED_BATCH, the desired nice value [-20,19], see NICE(2).
> >
> > sched_attr::sched_priority should only be set for SCHED_FIFO,
> > SCHED_RR, the desired static priority [1,99].
> >
> > sched_attr::sched_runtime
> > sched_attr::sched_deadline
> > sched_attr::sched_period should only be set for SCHED_DEADLINE
> > and are the traditional sporadic task model parameters.
>
> Could you add (a lot ;-)) more detail on these three fields? Assume the
> reader does not know about this traditional sporadic task model, and
> then give some explanation of what these three fields do. Probably, at
> this point you can work in some statement about the admission control
> test.
>
> [but, see my comment below. It may be that sched(7) is a better
> place for this detail.]

Yes, I think SCHED(7) would be a better place; also I think I forgot to
put a reference in to Documentation/scheduler/sched-deadline.txt

I'll try and write something concise. This is the stuff of books, not
paragraphs :/

> > The flags argument should be 0.
> >
> > sched_getattr() queries the scheduling policy currently applied
> > to the process identified by pid. If pid equals zero, the
> > policy of the calling process will be retrieved.
> >
> > The size argument should reflect the size of struct sched_attr
> > as known to userspace. The kernel fills out sched_attr::size to
> > the size of its sched_attr structure. If the user provided
> > structure is larger, additional fields are not touched. If the
> > user provided structure is smaller, but the kernel needs to
> > return values outside the provided space, the syscall will fail
> > with -E2BIG.
> >
> > The flags argument should be 0.
> >
> > The other sched_attr fields are filled out as described in
> > sched_setattr().
>
> I assume that everything between my [[[ and ]]] blocks below is taken straight
> from sched_setscheduler(2). (If that is not true, please let me know.)

That did indeed look about right.

> This reminds me that there is a structural fault in this part of man-pages ;-).
> The problem is sched_setscheduler(2) currently tries to do two things:
>
> [a] Document the sched_setscheduler() and sched_getscheduler() system calls
> [b] Provide an overview of scheduling policies and parameters.
>
> It should really only do the former. I have now gone through the task of
> separating [b] out into a separate page, sched(7), which other pages,
> such as sched_setscheduler(2) and sched_setattr(2) can refer to. You
> can see the current versions of sched_setscheduler.2 and sched.7 in Git
> (https://www.kernel.org/doc/man-pages/download.html )
>
> So, what I would ideally like to see
>
> [1] A page describing the sched_setattr() and sched_getattr() APIs
> [2] A piece of text describing the SCHED_DEADLINE policy, which I can
> drop into sched(7).
>
> Could you revise like that?

ACK.

> [[[[

> ]]]]
>
> > SCHED_DEADLINE: Sporadic task model deadline scheduling
> > SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> > Deadline First) with additional CBS (Constant Bandwidth Server).
> > The CBS guarantees that tasks that over-run their specified
> > budget are throttled and do not affect the correct performance
> > of other SCHED_DEADLINE tasks.
> >
> > SCHED_DEADLINE tasks will fail FORK(2) with -EAGAIN
> >
> > Setting SCHED_DEADLINE can fail with -EBUSY when admission
> > control tests fail.
> >
> > Because of the nature of (G)EDF, SCHED_DEADLINE tasks are the
> > highest priority (user controllable) tasks in the system; if any
> > SCHED_DEADLINE task is runnable, it will preempt any
> > FIFO/RR/OTHER/BATCH/IDLE task out there.
> >
> > A SCHED_DEADLINE task calling sched_yield() will 'yield' the
> > current job and wait for a new period to begin.
>
> This is the piece that could go into sched(7), but I'd like it to include
> a discussion of deadline, period, and runtime.
>
> [[[[

> ]]]]
>
> > RETURN VALUE
> > On success, sched_setattr() and sched_getattr() return 0. On
> > error, -1 is returned, and errno is set appropriately.
> >
> > ERRORS
> > EINVAL The scheduling policy is not one of the recognized policies,
> > param is NULL, or param does not make sense for the policy.
> >
> > EPERM The calling process does not have appropriate privileges.
> >
> > ESRCH The process whose ID is pid could not be found.
> >
> > E2BIG The provided storage for struct sched_attr is either too
> > big, see sched_setattr(), or too small, see sched_getattr().
> >
> > EBUSY SCHED_DEADLINE admission control failure
>
> The above is the only place on the page that mentions admission control.
> As well as the suggestions above, it would be nice to have somewhere a
> summary of how admission control is calculated.

I think I'll write down what admission control is without specifics.
Giving specifics pins you down on the implementation. In general
admission control enforces a bound on the schedulability of the task
set. New and interesting ways of computing schedulability are the
subject of papers each year.

2014-04-29 16:05:15

by Peter Zijlstra

Subject: Re: sched_{set,get}attr() manpage

On Tue, Apr 29, 2014 at 03:08:55PM +0200, Michael Kerrisk (man-pages) wrote:

Juri, Dario, Can you have a look at the 2nd part; I'm not at all sure I
got the activate/release the right way around.

My current thinking was that we activate first, and then release it to
go run. But googling the terms only confused me more. I suppose it's one
of those things that's not actually _that_ well defined. And I hope the
ASCII art actually clarifies things better than the terms used.

> [1] A page describing the sched_setattr() and sched_getattr() APIs

NAME
sched_setattr, sched_getattr - set and get scheduling policy/attributes

SYNOPSIS
#include <sched.h>

struct sched_attr {
u32 size;
u32 sched_policy;
u64 sched_flags;

/* SCHED_NORMAL, SCHED_BATCH */
s32 sched_nice;

/* SCHED_FIFO, SCHED_RR */
u32 sched_priority;

/* SCHED_DEADLINE */
u64 sched_runtime;
u64 sched_deadline;
u64 sched_period;
};

int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);

int sched_getattr(pid_t pid, struct sched_attr *attr, unsigned int size, unsigned int flags);

DESCRIPTION
sched_setattr() sets both the scheduling policy and the
associated attributes for the process whose ID is specified in
pid.

sched_setattr() replaces sched_setscheduler(), sched_setparam(),
nice() and some of setpriority().

If pid equals zero, the scheduling policy and attributes
of the calling process will be set. The interpretation of the
argument attr depends on the selected policy. Currently, Linux
supports the following "normal" (i.e., non-real-time) scheduling
policies:

SCHED_OTHER the standard "fair" time-sharing policy;

SCHED_BATCH for "batch" style execution of processes; and

SCHED_IDLE for running very low priority background jobs.

The following "real-time" policies are also supported, for
special time-critical applications that need precise control
over the way in which runnable processes are selected for
execution:

SCHED_FIFO a static priority first-in, first-out policy;

SCHED_RR a static priority round-robin policy; and

SCHED_DEADLINE a dynamic priority deadline policy.

The semantics of each of these policies are detailed in
sched(7).

sched_attr::size must be set to the size of the structure, as in
sizeof(struct sched_attr), if the provided structure is smaller
than the kernel structure, any additional fields are assumed
'0'. If the provided structure is larger than the kernel
structure, the kernel verifies all additional fields are '0' if
not the syscall will fail with -E2BIG.

sched_attr::sched_policy the desired scheduling policy.

sched_attr::sched_flags additional flags that can influence
scheduling behaviour. Currently as per Linux kernel 3.14:

SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
on fork().

is the only supported flag.

sched_attr::sched_nice should only be set for SCHED_OTHER,
SCHED_BATCH, the desired nice value [-20,19], see sched(7).

sched_attr::sched_priority should only be set for SCHED_FIFO,
SCHED_RR, the desired static priority [1,99], see sched(7).

sched_attr::sched_runtime
sched_attr::sched_deadline
sched_attr::sched_period should only be set for SCHED_DEADLINE
and are the traditional sporadic task model parameters, see
sched(7).

The flags argument should be 0.

sched_getattr() queries the scheduling policy currently applied
to the process identified by pid.

Similar to sched_setattr(), sched_getattr() replaces
sched_getscheduler(), sched_getparam() and some of
getpriority().

If pid equals zero, the policy of the calling process will be
retrieved.

The size argument should reflect the size of struct sched_attr
as known to userspace. The kernel fills out sched_attr::size to
the size of its sched_attr structure. If the user provided
structure is larger, additional fields are not touched. If the
user provided structure is smaller, but the kernel needs to
return values outside the provided space, the syscall will fail
with -E2BIG.

The flags argument should be 0.

The other sched_attr fields are filled out as described in
sched_setattr().

RETURN VALUE
On success, sched_setattr() and sched_getattr() return 0. On
error, -1 is returned, and errno is set appropriately.

ERRORS
EINVAL The scheduling policy is not one of the recognized policies,
param is NULL, or param does not make sense for the selected
policy.

EPERM The calling process does not have appropriate privileges.

ESRCH The process whose ID is pid could not be found.

E2BIG The provided storage for struct sched_attr is either too
big, see sched_setattr(), or too small, see sched_getattr().

EBUSY SCHED_DEADLINE admission control failure, see sched(7).

NOTES
While the text above (and in sched_setscheduler(2)) talks about
processes, in actual fact these system calls are thread specific.

> [2] A piece of text describing the SCHED_DEADLINE policy, which I can
> drop into sched(7).

SCHED_DEADLINE: Sporadic task model deadline scheduling
SCHED_DEADLINE is an implementation of GEDF (Global Earliest
Deadline First) with additional CBS (Constant Bandwidth Server).

A sporadic task is on that has a sequence of jobs, where each job
is activated at most once per period [us]. Each job will have an
absolute deadline relative to its activation before which it must
finish its execution, and it shall at no time run longer
than runtime [us] after its release.

       activation/wakeup       absolute deadline
       |        release        |
       v        v              v
-------x--------x--------------x--------x-------
                |<- Runtime -->|
       |<---------- Deadline ->|
       |<---------- Period  ----------->|

This gives: runtime <= (rel) deadline <= period.

The CBS guarantees that tasks that over-run their specified
runtime are throttled and do not affect the correct performance
of other SCHED_DEADLINE tasks.

In general a task set of such tasks it not feasible/schedulable
within the given constraints. Therefore we must do an admittance
test on setting/changing SCHED_DEADLINE policy/attributes.

This admission test calculates that the task set is
feasible/schedulable, failing this, sched_setattr() will return
-EBUSY.

For example, it is required (but not sufficient) for the total
utilization to be less or equal to the total amount of cpu time
available. That is, since each job can maximally run for runtime
[us] per period [us], that task's utilization is runtime/period.
Summing this over all tasks must be less than the total amount of
CPUs present.

SCHED_DEADLINE tasks will fail fork(2) with -EAGAIN.

Because of the nature of (G)EDF, SCHED_DEADLINE tasks are the
highest priority (user controllable) tasks in the system, if any
SCHED_DEADLINE task is runnable it will preempt anything
FIFO/RR/OTHER/BATCH/IDLE task out there.

A SCHED_DEADLINE task calling sched_yield() will 'yield' the
current job and wait for a new period to begin.

Subject: Re: sched_{set,get}attr() manpage

Hi Peter,

Thanks for the revision. More comments below. Could you revise in
the light of those comments, and hopefully also after feedback from
Juri and Dario?

On 04/29/2014 06:04 PM, Peter Zijlstra wrote:
> On Tue, Apr 29, 2014 at 03:08:55PM +0200, Michael Kerrisk (man-pages) wrote:
>
> Juri, Dario, Can you have a look at the 2nd part; I'm not at all sure I
> got the activate/release the right way around.
>
> My current thinking was that we activate first, and then release it to
> go run. But googling the terms only confused me more. I suppose it's one
> of those things that's not actually _that_ well defined. And I hope the
> ASCII art actually clarifies things better than the terms used.
>
>> [1] A page describing the sched_setattr() and sched_getattr() APIs
>
> NAME
> sched_setattr, sched_getattr - set and get scheduling policy/attributes
>
> SYNOPSIS
> #include <sched.h>
>
> struct sched_attr {
> u32 size;
> u32 sched_policy;
> u64 sched_flags;
>
> /* SCHED_NORMAL, SCHED_BATCH */
> s32 sched_nice;
>
> /* SCHED_FIFO, SCHED_RR */
> u32 sched_priority;
>
> /* SCHED_DEADLINE */
> u64 sched_runtime;
> u64 sched_deadline;
> u64 sched_period;
> };
>
> int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);
>
> int sched_getattr(pid_t pid, const struct sched_attr *attr, unsigned int size, unsigned int flags);
>
> DESCRIPTION
> sched_setattr() sets both the scheduling policy and the
> associated attributes for the process whose ID is specified in
> pid.
>
> sched_setattr() replaces sched_setscheduler(), sched_setparam(),
> nice() and some of setpriority().
>
> If pid equals zero, the scheduling policy and attributes
> of the calling process will be set. The interpretation of the
> argument attr depends on the selected policy. Currently, Linux
> supports the following "normal" (i.e., non-real-time) scheduling
> policies:
>
> SCHED_OTHER the standard "fair" time-sharing policy;
>
> SCHED_BATCH for "batch" style execution of processes; and
>
> SCHED_IDLE for running very low priority background jobs.
>
> The following "real-time" policies are also supported, for
> special time-critical applications that need precise control
> over the way in which runnable processes are selected for
> execution:
>
> SCHED_FIFO a static priority first-in, first-out policy;
>
> SCHED_RR a static priority round-robin policy; and
>
> SCHED_DEADLINE a dynamic priority deadline policy.
>
> The semantics of each of these policies are detailed in
> sched(7).
>
> sched_attr::size must be set to the size of the structure, as in
> sizeof(struct sched_attr), if the provided structure is smaller
> than the kernel structure, any additional fields are assumed
> '0'. If the provided structure is larger than the kernel
> structure, the kernel verifies all additional fields are '0' if
> not the syscall will fail with -E2BIG.
>
> sched_attr::sched_policy the desired scheduling policy.
>
> sched_attr::sched_flags additional flags that can influence
> scheduling behaviour. Currently as per Linux kernel 3.14:
>
> SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
> to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
> on fork().
>
> is the only supported flag.
>
> sched_attr::sched_nice should only be set for SCHED_OTHER,
> SCHED_BATCH, the desired nice value [-20,19], see sched(7).
>
> sched_attr::sched_priority should only be set for SCHED_FIFO,
> SCHED_RR, the desired static priority [1,99], see sched(7).
>
> sched_attr::sched_runtime
> sched_attr::sched_deadline
> sched_attr::sched_period should only be set for SCHED_DEADLINE
> and are the traditional sporadic task model parameters, see
> sched(7).

So, are there fields expressed in some unit (presumably microseconds)?
Best to mention that here.

> The flags argument should be 0.
>
> sched_getattr() queries the scheduling policy currently applied
> to the process identified by pid.
>
> Similar to sched_setattr(), sched_getattr() replaces
> sched_getscheduler(), sched_getparam() and some of
> getpriority().
>
> If pid equals zero, the policy of the calling process will be
> retrieved.
>
> The size argument should reflect the size of struct sched_attr
> as known to userspace. The kernel fills out sched_attr::size to
> the size of its sched_attr structure. If the user provided
> structure is larger, additional fields are not touched. If the
> user provided structure is smaller, but the kernel needs to
> return values outside the provided space, the syscall will fail
> with -E2BIG.
>
> The flags argument should be 0.
>
> The other sched_attr fields are filled out as described in
> sched_setattr().
>
> RETURN VALUE
> On success, sched_setattr() and sched_getattr() return 0. On
> error, -1 is returned, and errno is set appropriately.
>
> ERRORS
> EINVAL The scheduling policy is not one of the recognized policies,
> param is NULL, or param does not make sense for the selected
> policy.
>
> EPERM The calling process does not have appropriate privileges.
>
> ESRCH The process whose ID is pid could not be found.
>
> E2BIG The provided storage for struct sched_attr is either too
> big, see sched_setattr(), or too small, see sched_getattr().
>
> EBUSY SCHED_DEADLINE admission control failure, see sched(7).
>
> NOTES
> While the text above (and in sched_setscheduler(2)) talks about
> processes, in actual fact these system calls are thread specific.
>
>> [2] A piece of text describing the SCHED_DEADLINE policy, which I can
>> drop into sched(7).
>
> SCHED_DEADLINE: Sporadic task model deadline scheduling
> SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> Deadline First) with additional CBS (Constant Bandwidth Server).
>
> A sporadic task is on that has a sequence of jobs, where each job
> is activated at most once per period [us]. Each job will have an
> absolute deadline relative to its activation before which it must
> finish its execution, and it shall at no time run longer
> than runtime [us] after its release.
>
>        activation/wakeup       absolute deadline
>        |        release        |
>        v        v              v
> -------x--------x--------------x--------x-------
>                 |<- Runtime -->|
>        |<---------- Deadline ->|
>        |<---------- Period  ----------->|
>
> This gives: runtime <= (rel) deadline <= period.

So, the 'sched_deadline' field in the 'sched_attr' expresses the release
deadline? (I had initially thought it was the "absolute deadline".
Could you make this clearer in the text please.

> The CBS guarantees that tasks that over-run their specified
> runtime are throttled and do not affect the correct performance
> of other SCHED_DEADLINE tasks.
>
> In general a task set of such tasks it not feasible/schedulable

That last line is garbled. Could you fix, please.

Also, could you add some words to explain what you mean by 'task set'.

> within the given constraints. Therefore we must do an admittance
> test on setting/changing SCHED_DEADLINE policy/attributes.
>
> This admission test calculates that the task set is
> feasible/schedulable, failing this, sched_setattr() will return
> -EBUSY.
>
> For example, it is required (but not sufficient) for the total
> utilization to be less or equal to the total amount of cpu time
> available. That is, since each job can maximally run for runtime
> [us] per period [us], that task's utilization is runtime/period.
> Summing this over all tasks must be less than the total amount of
> CPUs present.
>
> SCHED_DEADLINE tasks will fail fork(2) with -EAGAIN.

Except if SCHED_RESET_ON_FORK was set, right? If yes, that should be
mentioned here.

> Because of the nature of (G)EDF, SCHED_DEADLINE tasks are the
> highest priority (user controllable) tasks in the system, if any
> SCHED_DEADLINE task is runnable it will preempt anything
> FIFO/RR/OTHER/BATCH/IDLE task out there.
>
> A SCHED_DEADLINE task calling sched_yield() will 'yield' the
> current job and wait for a new period to begin.

So, I'm trying to naively understand how this all works. If different
processes specify different deadline periods, how does the kernel deal
with that? Is it worth adding some detail on this point?

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2014-04-30 12:36:07

by Peter Zijlstra

Subject: Re: sched_{set,get}attr() manpage

On Wed, Apr 30, 2014 at 01:09:25PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Peter,
>
> Thanks for the revision. More comments below. Could you revise in
> the light of those comments, and hopefully also after feedback from
> Juri and Dario?
>
> >
> > sched_attr::sched_runtime
> > sched_attr::sched_deadline
> > sched_attr::sched_period should only be set for SCHED_DEADLINE
> > and are the traditional sporadic task model parameters, see
> > sched(7).
>
> So, are there fields expressed in some unit (presumably microseconds)?
> Best to mention that here.

Oh wait, no, it's nanoseconds. Which means I should amend the text below.

> >> [2] A piece of text describing the SCHED_DEADLINE policy, which I can
> >> drop into sched(7).
> >
> > SCHED_DEADLINE: Sporadic task model deadline scheduling
> > SCHED_DEADLINE is an implementation of GEDF (Global Earliest
> > Deadline First) with additional CBS (Constant Bandwidth Server).
> >
> > A sporadic task is on that has a sequence of jobs, where each job
> > is activated at most once per period [us]. Each job will have an
> > absolute deadline relative to its activation before which it must

(A)

> > finish its execution, and it shall at no time run longer
> > than runtime [us] after its release.
> >
> >        activation/wakeup       absolute deadline
> >        |        release        |
> >        v        v              v
> > -------x--------x--------------x--------x-------
> >                 |<- Runtime -->|
> >        |<---------- Deadline ->|
> >        |<---------- Period  ----------->|
> >
> > This gives: runtime <= (rel) deadline <= period.
>
> So, the 'sched_deadline' field in the 'sched_attr' expresses the release
> deadline? (I had initially thought it was the "absolute deadline".
> Could you make this clearer in the text please.

No, and yes; sched_attr::sched_deadline is a relative deadline wrt
the activation, as said at (A).

So we get: absolute deadline = activation + relative deadline.

And we must be done running at that point, so the very last possible
release moment is: absolute deadline - runtime.

And therefore, it too is a release deadline, since we must not release
later than that.
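
The arithmetic above can be sketched in C as follows. This is only an
illustration: the names dl_job_times and dl_compute_job_times are made up
for this example and are not kernel or libc interfaces. All values are in
nanoseconds, and the sketch assumes runtime <= relative deadline.

```c
#include <stdint.h>

/* Hypothetical helper type for illustration; not a kernel structure. */
struct dl_job_times {
    uint64_t abs_deadline;   /* activation + relative deadline */
    uint64_t latest_release; /* last moment the job may start running */
};

/*
 * Derive the absolute deadline and the latest possible release for one
 * job. Assumes runtime <= rel_deadline, so the subtraction cannot wrap.
 */
struct dl_job_times dl_compute_job_times(uint64_t activation,
                                         uint64_t runtime,
                                         uint64_t rel_deadline)
{
    struct dl_job_times jt;

    jt.abs_deadline = activation + rel_deadline;
    jt.latest_release = jt.abs_deadline - runtime;
    return jt;
}
```

For a job activated at t = 1000 with runtime 5 and relative deadline 10,
this gives an absolute deadline of 1010 and a latest release of 1005.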

> > The CBS guarantees that tasks that over-run their specified
> > runtime are throttled and do not affect the correct performance
> > of other SCHED_DEADLINE tasks.
> >
> > In general a task set of such tasks it not feasible/schedulable
>
> That last line is garbled. Could you fix, please.

s/it/is/

> Also, could you add some words to explain what you mean by 'task set'.

A set of tasks? :-) In particular all tasks in the system of
SCHED_DEADLINE, indicated by 'of such'.

> > within the given constraints. Therefore we must do an admittance
> > test on setting/changing SCHED_DEADLINE policy/attributes.
> >
> > This admission test calculates that the task set is
> > feasible/schedulable, failing this, sched_setattr() will return
> > -EBUSY.
> >
> > For example, it is required (but not sufficient) for the total
> > utilization to be less or equal to the total amount of cpu time
> > available. That is, since each job can maximally run for runtime
> > [us] per period [us], that task's utilization is runtime/period.
> > Summing this over all tasks must be less than the total amount of
> > CPUs present.
> >
> > SCHED_DEADLINE tasks will fail fork(2) with -EAGAIN.
>
> Except if SCHED_RESET_ON_FORK was set, right? If yes, that should be
> mentioned here.

Ah, indeed.

> > Because of the nature of (G)EDF, SCHED_DEADLINE tasks are the
> > highest priority (user controllable) tasks in the system, if any
> > SCHED_DEADLINE task is runnable it will preempt anything
> > FIFO/RR/OTHER/BATCH/IDLE task out there.
> >
> > A SCHED_DEADLINE task calling sched_yield() will 'yield' the
> > current job and wait for a new period to begin.
>
> So, I'm trying to naively understand how this all works. If different
> processes specify different deadline periods, how does the kernel deal
> with that? Is it worth adding some detail on this point?

Userspace should not rely on any implementation details there. Saying
it's a (G)EDF scheduler is maybe already too much. All userspace should
really care about is that its tasks _should_ be scheduled such that they
meet the specified requirements.

There are multiple scheduling algorithms that can be employed to make it
so, and I don't want to pin us to whatever we chose to implement this
time.

That said, the current (G)EDF is a soft realtime scheduler in that it
guarantees a bounded tardiness (which is the time we can miss the
deadline by), but it is not hard realtime, since the bound is not 0.

Anyway, for your elucidation: assume no overhead, a UP system
(SMP is a right headache), and further that deadline == period.
It is then reasonably straightforward to see that scheduling the
task with the earliest deadline will satisfy the constraints IFF the
total utilization (\Sum runtime_i / deadline_i) <= 1.
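
As a rough illustration of that bound, here is a small C sketch. The
struct and function names (dl_params, dl_total_utilization,
dl_up_schedulable) are invented for this example, and it only models the
idealized UP case with deadline == period; it is not how the kernel
computes admission.

```c
#include <stddef.h>

/* Hypothetical per-task parameters; any consistent time unit works. */
struct dl_params {
    double runtime;
    double deadline; /* assumed equal to the period in this sketch */
};

/* Sum of runtime_i / deadline_i over all tasks. */
double dl_total_utilization(const struct dl_params *tasks, size_t n)
{
    double u = 0.0;

    for (size_t i = 0; i < n; i++)
        u += tasks[i].runtime / tasks[i].deadline;
    return u;
}

/*
 * Non-zero if the set passes the UP bound (total utilization <= 1).
 * Floating-point rounding can matter right at the boundary.
 */
int dl_up_schedulable(const struct dl_params *tasks, size_t n)
{
    return dl_total_utilization(tasks, n) <= 1.0;
}
```

Two tasks { 5, 10 } and { 10, 20 } sum to exactly 1 and pass; adding a
third task { 1, 4 } pushes the total to 1.25 and fails.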

Suppose two tasks: A := { 5, 10 } and B := { 10, 20 } with strict
periodic activation:

  A1,B1     A2        Ad2
  |         Ad1       Bd1
  v         v         v
--AAAAABBBBBAAAAABBBBBx--
--AAAAABBBBBBBBBBAAAAAx--

Where A# is the #th activation, Ad# is the corresponding #th deadline
before which we must have sufficient time.

Since we're perfectly synced up there is a tie and we get two possible
outcomes. But note that in either case A has gotten 2x its 5 As and B
has gotten its 10 Bs.
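
For those who want to play with it, the two timelines can be reproduced
with a toy unit-step EDF simulation. This is only a sketch under the same
assumptions as above (UP, no overhead, deadline == period, strictly
periodic activation starting at t = 0); edf_task and edf_simulate are
hypothetical names and bear no relation to the kernel implementation.

```c
#define EDF_MAX_TASKS 8

/* Hypothetical task description; periods must be > 0. */
struct edf_task {
    char name;   /* label used in the output timeline */
    int runtime; /* time units needed per period */
    int period;  /* also the relative deadline in this sketch */
};

/*
 * Simulate 'horizon' time units of EDF on one CPU and write one
 * character per unit into 'out' (needs horizon + 1 bytes).
 * On equal deadlines the lowest index wins, unless prefer_last_on_tie
 * is set. Returns 0 if no deadline was missed, -1 otherwise.
 */
int edf_simulate(const struct edf_task *tasks, int n, int horizon,
                 int prefer_last_on_tie, char *out)
{
    int remaining[EDF_MAX_TASKS] = {0};
    int deadline[EDF_MAX_TASKS] = {0};
    int missed = 0;

    if (n < 0 || n > EDF_MAX_TASKS)
        return -1;

    for (int t = 0; t < horizon; t++) {
        int pick = -1;

        for (int i = 0; i < n; i++) {
            if (t % tasks[i].period == 0) {
                /* New activation: unfinished work means a miss. */
                if (remaining[i] > 0)
                    missed = 1;
                remaining[i] = tasks[i].runtime;
                deadline[i] = t + tasks[i].period;
            }
            if (remaining[i] == 0)
                continue;
            if (pick < 0 || deadline[i] < deadline[pick] ||
                (deadline[i] == deadline[pick] && prefer_last_on_tie))
                pick = i;
        }
        if (pick >= 0) {
            remaining[pick]--;
            out[t] = tasks[pick].name;
        } else {
            out[t] = '-'; /* idle */
        }
    }
    /* Work still pending at a deadline inside the horizon is a miss. */
    for (int i = 0; i < n; i++)
        if (remaining[i] > 0 && deadline[i] <= horizon)
            missed = 1;
    out[horizon] = '\0';
    return missed ? -1 : 0;
}
```

With A = { 5, 10 } and B = { 10, 20 } over 20 time units, breaking the
tie at t = 10 in favour of A yields AAAAABBBBBAAAAABBBBB, and breaking
it in favour of B yields AAAAABBBBBBBBBBAAAAA.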

Non-periodic activation, and deadline != period make the thing more
interesting, but at that point I would ask Juri (or others) to refer you
to a paper/book.

Now, let me go update the texts yet again :-)

2014-04-30 13:09:49

by Peter Zijlstra

Subject: Re: sched_{set,get}attr() manpage

On Wed, Apr 30, 2014 at 01:09:25PM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Peter,
>
> Thanks for the revision. More comments below. Could you revise in
> the light of those comments, and hopefully also after feedback from
> Juri and Dario?

New text below; hopefully a little clearer. If not, do holler.

---
> [1] A page describing the sched_setattr() and sched_getattr() APIs

NAME
sched_setattr, sched_getattr - set and get scheduling policy/attributes

SYNOPSIS
#include <sched.h>

struct sched_attr {
u32 size;
u32 sched_policy;
u64 sched_flags;

/* SCHED_NORMAL, SCHED_BATCH */
s32 sched_nice;

/* SCHED_FIFO, SCHED_RR */
u32 sched_priority;

/* SCHED_DEADLINE */
u64 sched_runtime;
u64 sched_deadline;
u64 sched_period;
};

int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags);

int sched_getattr(pid_t pid, struct sched_attr *attr, unsigned int size, unsigned int flags);

DESCRIPTION
sched_setattr() sets both the scheduling policy and the
associated attributes for the process whose ID is specified in
pid.

sched_setattr() replaces sched_setscheduler(), sched_setparam(),
nice() and some of setpriority().

If pid equals zero, the scheduling policy and attributes
of the calling process will be set. The interpretation of the
argument attr depends on the selected policy. Currently, Linux
supports the following "normal" (i.e., non-real-time) scheduling
policies:

SCHED_OTHER the standard "fair" time-sharing policy;

SCHED_BATCH for "batch" style execution of processes; and

SCHED_IDLE for running very low priority background jobs.

The following "real-time" policies are also supported, for
special time-critical applications that need precise control
over the way in which runnable processes are selected for
execution:

SCHED_FIFO a static priority first-in, first-out policy;

SCHED_RR a static priority round-robin policy; and

SCHED_DEADLINE a dynamic priority deadline policy.

The semantics of each of these policies are detailed in
sched(7).

sched_attr::size must be set to the size of the structure, as in
sizeof(struct sched_attr). If the provided structure is smaller
than the kernel structure, any additional fields are assumed
'0'. If the provided structure is larger than the kernel
structure, the kernel verifies that all additional fields are '0';
if not, the syscall will fail with -E2BIG.

sched_attr::sched_policy the desired scheduling policy.

sched_attr::sched_flags additional flags that can influence
scheduling behaviour. As of Linux kernel 3.14, the only supported
flag is:

SCHED_FLAG_RESET_ON_FORK - resets the scheduling policy
to: (struct sched_attr){ .sched_policy = SCHED_OTHER, }
on fork().

sched_attr::sched_nice the desired nice value [-20,19]; should
only be set for SCHED_OTHER and SCHED_BATCH, see sched(7).

sched_attr::sched_priority the desired static priority [1,99];
should only be set for SCHED_FIFO and SCHED_RR, see sched(7).

sched_attr::sched_runtime in nanoseconds,
sched_attr::sched_deadline in nanoseconds,
sched_attr::sched_period in nanoseconds, should only be set for
SCHED_DEADLINE and are the traditional sporadic task model
parameters, see sched(7).

The flags argument should be 0.

sched_getattr() queries the scheduling policy currently applied
to the process identified by pid.

Similar to sched_setattr(), sched_getattr() replaces
sched_getscheduler(), sched_getparam() and some of
getpriority().

If pid equals zero, the policy of the calling process will be
retrieved.

The size argument should reflect the size of struct sched_attr
as known to userspace. The kernel fills out sched_attr::size to
the size of its sched_attr structure. If the user provided
structure is larger, additional fields are not touched. If the
user provided structure is smaller, but the kernel needs to
return values outside the provided space, the syscall will fail
with -E2BIG.

The flags argument should be 0.

The other sched_attr fields are filled out as described in
sched_setattr().
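
The size handling described for sched_setattr() can be modelled in user
space roughly as below. This is a sketch of the rule as stated in the text
above, not kernel code; attr_copy_in is a made-up name, and the buffers
stand in for the caller's structure and the kernel's structure.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of copying a caller-provided structure of 'usize' bytes into a
 * kernel-side structure of 'ksize' bytes:
 *  - smaller caller structure: missing fields are treated as 0;
 *  - larger caller structure: the extra bytes must all be 0,
 *    otherwise the copy fails with -E2BIG.
 */
int attr_copy_in(void *kattr, size_t ksize, const void *uattr, size_t usize)
{
    const unsigned char *u = uattr;

    if (usize > ksize) {
        for (size_t i = ksize; i < usize; i++)
            if (u[i] != 0)
                return -E2BIG;
        usize = ksize; /* only the part we understand is copied */
    }
    memset(kattr, 0, ksize);
    memcpy(kattr, u, usize);
    return 0;
}
```

The point of the rule is that old binaries keep working on new kernels,
and new binaries work on old kernels as long as they do not actually use
fields the kernel does not know about.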

RETURN VALUE
On success, sched_setattr() and sched_getattr() return 0. On
error, -1 is returned, and errno is set appropriately.

ERRORS
EINVAL The scheduling policy is not one of the recognized policies,
attr is NULL, or attr does not make sense for the selected
policy.

EPERM The calling process does not have appropriate privileges.

ESRCH The process whose ID is pid could not be found.

E2BIG The provided storage for struct sched_attr is either too
big, see sched_setattr(), or too small, see sched_getattr().

EBUSY SCHED_DEADLINE admission control failure, see sched(7).

NOTES
While the text above (and in sched_setscheduler(2)) talks about
processes, in actual fact these system calls are thread specific.

While the SCHED_DEADLINE parameters are in nanoseconds, current
kernels truncate the lower 10 bits and we get an effective
microsecond resolution.
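
A possible way to call this from user space, for testing, is via
syscall(2). The sketch below makes several assumptions: the ex_ prefixed
names are invented here to avoid clashing with whatever a C library may
declare, the structure layout simply mirrors the SYNOPSIS above with
fixed-width types, and the value 6 for SCHED_DEADLINE is taken from the
kernel headers. It needs kernel headers that define SYS_sched_setattr.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define EX_SCHED_DEADLINE 6 /* SCHED_DEADLINE in the kernel headers */

/* Mirrors struct sched_attr from the SYNOPSIS; hypothetical name. */
struct ex_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;  /* ns */
    uint64_t sched_deadline; /* ns */
    uint64_t sched_period;   /* ns */
};

/*
 * Fill 'attr' for SCHED_DEADLINE. Returns 0, or -1 if the parameters
 * violate runtime <= deadline <= period (or runtime is 0).
 */
int ex_fill_deadline_attr(struct ex_sched_attr *attr, uint64_t runtime_ns,
                          uint64_t deadline_ns, uint64_t period_ns)
{
    if (runtime_ns == 0 || runtime_ns > deadline_ns ||
        deadline_ns > period_ns)
        return -1;
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->sched_policy = EX_SCHED_DEADLINE;
    attr->sched_runtime = runtime_ns;
    attr->sched_deadline = deadline_ns;
    attr->sched_period = period_ns;
    return 0;
}

/* Thin wrapper; returns 0 on success, -1 with errno set on error. */
int ex_sched_setattr(pid_t pid, const struct ex_sched_attr *attr,
                     unsigned int flags)
{
#ifdef SYS_sched_setattr
    return (int)syscall(SYS_sched_setattr, pid, attr, flags);
#else
    (void)pid; (void)attr; (void)flags;
    errno = ENOSYS; /* headers too old to know the syscall number */
    return -1;
#endif
}
```

A caller would fill the structure with, say, 10 ms runtime and a 30 ms
deadline and period, then call ex_sched_setattr(0, &attr, 0). As listed
under ERRORS, that call can fail with EPERM without the appropriate
privileges, or with EBUSY if admission control rejects the task.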

> [2] A piece of text describing the SCHED_DEADLINE policy, which I can
> drop into sched(7).

SCHED_DEADLINE: Sporadic task model deadline scheduling
SCHED_DEADLINE is currently implemented using GEDF (Global
Earliest Deadline First) with additional CBS (Constant Bandwidth
Server).

A sporadic task is one that has a sequence of jobs, where each job
is activated at most once per period [ns]. Each job has an
absolute deadline (its activation plus the relative deadline)
before which it must finish its execution, and it shall at no time
run longer than runtime [ns] after its release.

       activation/wakeup       absolute deadline
       |        release        |
       v        v              v
-------x--------x--------------x--------x-------
                |<- Runtime -->|
       |<---------- Deadline ->|
       |<---------- Period  ----------->|

This gives: runtime <= (rel) deadline <= period.

The CBS guarantees non-interference between tasks, by throttling
tasks that attempt to over-run their specified runtime.

In general the set of all SCHED_DEADLINE tasks is not
feasible/schedulable within the given constraints. Therefore we
must do an admittance test on setting/changing SCHED_DEADLINE
policy/attributes.

This admission test checks whether the task set is
feasible/schedulable; failing this, sched_setattr() will return
-EBUSY.

For example, it is required (but not necessarily sufficient) for
the total utilization to be less than or equal to the total number
of CPUs available, where, since each task can maximally run for
runtime [ns] per period [ns], that task's utilization is its
runtime/period.

Because we must be able to calculate admittance, SCHED_DEADLINE
tasks are the highest priority (user controllable) tasks in the
system; if any SCHED_DEADLINE task is runnable it will preempt
any FIFO/RR/OTHER/BATCH/IDLE task.

SCHED_DEADLINE tasks will fail fork(2) with -EAGAIN, except when
the forking task has SCHED_FLAG_RESET_ON_FORK set.

A SCHED_DEADLINE task calling sched_yield() will 'yield' the
current job and wait for a new period to begin.