2012-10-24 21:54:51

by Juri Lelli

Subject: [RFC][PATCH 00/16] sched: SCHED_DEADLINE v6

Hello everyone,

This is take 6 of the SCHED_DEADLINE patchset.

The patchset introduces a new deadline based real-time task scheduling
policy --called SCHED_DEADLINE-- with bandwidth isolation (aka "resource
reservation") capabilities. It supports global/clustered multiprocessor
scheduling through dynamic task migrations.

From the previous releases[1]:
- comments and fixes from the reviews we received have been considered
and applied;

- the use of nr_cpus_allowed has been unified;

- this release is on top of tip/master (as of today), so this is also
a rebase on top of 3.7-rc2;

- patch 14/16 modifies real-time bandwidth management and makes dl_bw
a subquota of rt_bw (comments on this are very welcome!);

- tested on ARM (thanks to Claudio Scordino for testing and patches).

(My) TODOs:
- keep up with mainline (as usual);
- rebase on top of 3.6.2-rt4 PREEMPT_RT patchset;
- provide details and numbers about possible use-cases;
- power aware scheduling for HMP (some code is there, but not yet
ready for submission);
- setup a website to collect all information regarding the project
in just one single place.

The development is taking place at:
https://github.com/jlelli/sched-deadline

Main branches:

- mainline-dl: tracking tip/master (raw commits);
- linux-rt-dl: tracking PREEMPT_RT releases (outdated);
- sched-dl-V6: this patchset on top of tip/master.

Check the repositories frequently if you're interested, and feel free to
e-mail me for any issue you run into.

Test application:
https://github.com/gbagnoli/rt-app

Development mailing list: linux-dl; you can subscribe from here:
http://feanor.sssup.it/mailman/listinfo/linux-dl
or via e-mail (send a message to [email protected] with
just the word `help' as subject or in the body to receive info).

There is also a parallel branch maintained by Insop Song (from Ericsson)
(https://github.com/insop/sched-deadline2). Ericsson is in fact evaluating
the use of SCHED_DEADLINE for CPE (Customer Premise Equipment) devices in
order to reserve CPU bandwidth for processes.

The code has been jointly developed by ReTiS Lab (http://retis.sssup.it)
and Evidence S.r.l. (http://www.evidence.eu.com) in the context of the ACTORS
EU-funded project (http://www.actors-project.eu). It is now also supported by
the S(o)OS EU-funded project (http://www.soos-project.eu/).
It also has users in both academic and applied research: we received positive
feedback from Ericsson (see above), Wind River, and the universities of
Porto (ISEP), Trento, Lund and Malardalen.

Acknowledgements:
I owe special thanks to Fabio Checconi and Dario Faggioli for technical (and
moral :P) support. Thanks to Peter Zijlstra, Steven Rostedt and all the others
that reviewed and/or contributed to improve the patchset quality. Thanks also
to Insop Song for trying to give the project an "industrial" use case and to
Claudio Scordino for testing and advertisement.

As usual, any kind of feedback is welcome and appreciated.

Thanks in advance and regards,

- Juri

Dario Faggioli (9):
sched: add sched_class->task_dead.
sched: add extended scheduling interface.
sched: SCHED_DEADLINE structures & implementation.
sched: SCHED_DEADLINE avg_update accounting.
sched: add schedstats for -deadline tasks.
sched: add latency tracing for -deadline tasks.
sched: drafted deadline inheritance logic.
sched: add bandwidth management for sched_dl.
sched: add sched_dl documentation.

Harald Gustafsson (1):
sched: add period support for -deadline tasks.

Juri Lelli (3):
sched: SCHED_DEADLINE SMP-related data structures & logic.
sched: make dl_bw a sub-quota of rt_bw
sched: speed up -dl pushes with a push-heap.

Peter Zijlstra (3):
math128: Introduce various 128bit primitives
math128, x86_64: Implement {mul,add}_u128 in 64bit asm
rtmutex: turn the plist into an rb-tree.

Documentation/scheduler/sched-deadline.txt | 164 +++
arch/alpha/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/arm/include/asm/unistd.h | 2 +-
arch/arm/include/uapi/asm/unistd.h | 3 +
arch/arm/kernel/calls.S | 3 +
arch/avr32/include/asm/Kbuild | 2 +
arch/blackfin/include/asm/Kbuild | 1 +
arch/c6x/include/asm/Kbuild | 1 +
arch/cris/include/asm/Kbuild | 1 +
arch/frv/include/asm/Kbuild | 3 +
arch/h8300/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/ia64/include/asm/Kbuild | 1 +
arch/m32r/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/mn10300/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 2 +-
arch/powerpc/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 2 +-
arch/score/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/tile/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 2 +-
arch/unicore32/include/asm/Kbuild | 1 +
arch/x86/include/asm/Kbuild | 1 +
arch/x86/include/asm/math128.h | 39 +
arch/x86/syscalls/syscall_32.tbl | 3 +
arch/x86/syscalls/syscall_64.tbl | 4 +-
arch/xtensa/include/asm/Kbuild | 1 +
include/asm-generic/math128.h | 4 +
include/linux/init_task.h | 10 +
include/linux/math128.h | 180 +++
include/linux/rtmutex.h | 18 +-
include/linux/sched.h | 152 ++-
include/linux/syscalls.h | 7 +
include/uapi/linux/sched.h | 1 +
kernel/fork.c | 8 +-
kernel/futex.c | 2 +
kernel/hrtimer.c | 2 +-
kernel/rtmutex-debug.c | 10 +-
kernel/rtmutex.c | 163 ++-
kernel/rtmutex_common.h | 22 +-
kernel/sched/Makefile | 4 +-
kernel/sched/core.c | 645 ++++++++++-
kernel/sched/cpudl.c | 208 ++++
kernel/sched/cpudl.h | 33 +
kernel/sched/debug.c | 46 +
kernel/sched/dl.c | 1650 ++++++++++++++++++++++++++++
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 139 +++
kernel/sched/stop_task.c | 2 +-
kernel/sysctl.c | 7 +
kernel/trace/trace_sched_wakeup.c | 44 +-
kernel/trace/trace_selftest.c | 28 +-
lib/Makefile | 2 +-
lib/math128.c | 40 +
61 files changed, 3555 insertions(+), 125 deletions(-)
create mode 100644 Documentation/scheduler/sched-deadline.txt
create mode 100644 arch/x86/include/asm/math128.h
create mode 100644 include/asm-generic/math128.h
create mode 100644 include/linux/math128.h
create mode 100644 kernel/sched/cpudl.c
create mode 100644 kernel/sched/cpudl.h
create mode 100644 kernel/sched/dl.c
create mode 100644 lib/math128.c

--
1.7.9.5


2012-10-24 21:55:14

by Juri Lelli

Subject: [PATCH 01/16] math128: Introduce various 128bit primitives

From: Peter Zijlstra <[email protected]>

Grow rudimentary u128 support without relying on gcc/libgcc.
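
As a quick, purely illustrative sketch (not part of the patch), this is the
usage pattern the primitives are meant for, e.g. comparing two 64x64-bit
products without risking u64 overflow -- the pattern SCHED_DEADLINE relies on
later in the series (see dl_entity_overflow() in patch 05/16). The helper name
is hypothetical:

#include <linux/math128.h>

/* Hypothetical helper, only for illustration: true if a*b > c*d. */
static inline bool u64_prod_gt(u64 a, u64 b, u64 c, u64 d)
{
	u128 left  = mul_u64_u64(a, b);	/* full 128-bit products, */
	u128 right = mul_u64_u64(c, d);	/* no intermediate overflow */

	return cmp_u128(left, right) > 0;
}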

Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
---
arch/alpha/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/avr32/include/asm/Kbuild | 2 +
arch/blackfin/include/asm/Kbuild | 1 +
arch/c6x/include/asm/Kbuild | 1 +
arch/cris/include/asm/Kbuild | 1 +
arch/frv/include/asm/Kbuild | 3 +
arch/h8300/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/ia64/include/asm/Kbuild | 1 +
arch/m32r/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/mn10300/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 2 +-
arch/powerpc/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 2 +-
arch/score/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/tile/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 2 +-
arch/unicore32/include/asm/Kbuild | 1 +
arch/x86/include/asm/Kbuild | 1 +
arch/xtensa/include/asm/Kbuild | 1 +
include/asm-generic/math128.h | 4 +
include/linux/math128.h | 180 ++++++++++++++++++++++++++++++++++++
lib/Makefile | 2 +-
lib/math128.c | 40 ++++++++
31 files changed, 255 insertions(+), 4 deletions(-)
create mode 100644 include/asm-generic/math128.h
create mode 100644 include/linux/math128.h
create mode 100644 lib/math128.c

diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 64ffc9e..e012ed5 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -11,3 +11,4 @@ header-y += reg.h
header-y += regdef.h
header-y += sysinfo.h
generic-y += exec.h
+generic-y += math128.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index f70ae17..07023d4 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -33,3 +33,4 @@ generic-y += termios.h
generic-y += timex.h
generic-y += types.h
generic-y += unaligned.h
+generic-y += math128.h
diff --git a/arch/avr32/include/asm/Kbuild b/arch/avr32/include/asm/Kbuild
index 4807ded..4384224 100644
--- a/arch/avr32/include/asm/Kbuild
+++ b/arch/avr32/include/asm/Kbuild
@@ -1,3 +1,5 @@

generic-y += clkdev.h
generic-y += exec.h
+generic-y += math128.h
+header-y += cachectl.h
diff --git a/arch/blackfin/include/asm/Kbuild b/arch/blackfin/include/asm/Kbuild
index 5a0625a..6836e68 100644
--- a/arch/blackfin/include/asm/Kbuild
+++ b/arch/blackfin/include/asm/Kbuild
@@ -47,3 +47,4 @@ generic-y += xor.h
header-y += bfin_sport.h
header-y += cachectl.h
header-y += fixed_code.h
+generic-y += math128.h
diff --git a/arch/c6x/include/asm/Kbuild b/arch/c6x/include/asm/Kbuild
index 112a496..ab11744 100644
--- a/arch/c6x/include/asm/Kbuild
+++ b/arch/c6x/include/asm/Kbuild
@@ -53,3 +53,4 @@ generic-y += types.h
generic-y += ucontext.h
generic-y += user.h
generic-y += vga.h
+generic-y += math128.h
diff --git a/arch/cris/include/asm/Kbuild b/arch/cris/include/asm/Kbuild
index 6d43a95..7674e82 100644
--- a/arch/cris/include/asm/Kbuild
+++ b/arch/cris/include/asm/Kbuild
@@ -11,3 +11,4 @@ header-y += sync_serial.h
generic-y += clkdev.h
generic-y += exec.h
generic-y += module.h
+generic-y += math128.h
diff --git a/arch/frv/include/asm/Kbuild b/arch/frv/include/asm/Kbuild
index 4a159da..732d864 100644
--- a/arch/frv/include/asm/Kbuild
+++ b/arch/frv/include/asm/Kbuild
@@ -1,3 +1,6 @@

generic-y += clkdev.h
generic-y += exec.h
+generic-y += math128.h
+header-y += registers.h
+header-y += termios.h
diff --git a/arch/h8300/include/asm/Kbuild b/arch/h8300/include/asm/Kbuild
index 50bbf38..1270ae0 100644
--- a/arch/h8300/include/asm/Kbuild
+++ b/arch/h8300/include/asm/Kbuild
@@ -3,3 +3,4 @@ include include/asm-generic/Kbuild.asm
generic-y += clkdev.h
generic-y += exec.h
generic-y += module.h
+generic-y += math128.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index 3bfa9b3..8c179f4 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -52,3 +52,4 @@ generic-y += types.h
generic-y += ucontext.h
generic-y += unaligned.h
generic-y += xor.h
+generic-y += math128.h
diff --git a/arch/ia64/include/asm/Kbuild b/arch/ia64/include/asm/Kbuild
index dd02f09..f10618b 100644
--- a/arch/ia64/include/asm/Kbuild
+++ b/arch/ia64/include/asm/Kbuild
@@ -2,3 +2,4 @@
generic-y += clkdev.h
generic-y += exec.h
generic-y += kvm_para.h
+generic-y += math128.h
diff --git a/arch/m32r/include/asm/Kbuild b/arch/m32r/include/asm/Kbuild
index 50bbf38..1270ae0 100644
--- a/arch/m32r/include/asm/Kbuild
+++ b/arch/m32r/include/asm/Kbuild
@@ -3,3 +3,4 @@ include include/asm-generic/Kbuild.asm
generic-y += clkdev.h
generic-y += exec.h
generic-y += module.h
+generic-y += math128.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index 88fa3ac..46d4b99 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -27,3 +27,4 @@ generic-y += topology.h
generic-y += types.h
generic-y += word-at-a-time.h
generic-y += xor.h
+generic-y += math128.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index 8653072..4809e13 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -3,3 +3,4 @@ include include/asm-generic/Kbuild.asm
header-y += elf.h
generic-y += clkdev.h
generic-y += exec.h
+generic-y += math128.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index 533053d..0de09e8 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -1 +1,2 @@
# MIPS headers
+generic-y += math128.h
diff --git a/arch/mn10300/include/asm/Kbuild b/arch/mn10300/include/asm/Kbuild
index 4a159da..6b54375 100644
--- a/arch/mn10300/include/asm/Kbuild
+++ b/arch/mn10300/include/asm/Kbuild
@@ -1,3 +1,4 @@

generic-y += clkdev.h
generic-y += exec.h
+generic-y += math128.h
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index 78de680..fa6fa87 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -64,3 +64,4 @@ generic-y += types.h
generic-y += ucontext.h
generic-y += user.h
generic-y += word-at-a-time.h
+generic-y += math128.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index bac8deb..cab5ff7 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -2,4 +2,4 @@
generic-y += word-at-a-time.h auxvec.h user.h cputime.h emergency-restart.h \
segment.h topology.h vga.h device.h percpu.h hw_irq.h mutex.h \
div64.h irq_regs.h kdebug.h kvm_para.h local64.h local.h param.h \
- poll.h xor.h clkdev.h exec.h
+ poll.h xor.h clkdev.h exec.h math128.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index a4fe15e..61d8f6e 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -2,3 +2,4 @@

generic-y += clkdev.h
generic-y += rwsem.h
+generic-y += math128.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 0633dc6..daa2d19 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -1,3 +1,3 @@

-
generic-y += clkdev.h
+generic-y += math128.h
diff --git a/arch/score/include/asm/Kbuild b/arch/score/include/asm/Kbuild
index ec697ae..e14c1ed 100644
--- a/arch/score/include/asm/Kbuild
+++ b/arch/score/include/asm/Kbuild
@@ -3,3 +3,4 @@ include include/asm-generic/Kbuild.asm
header-y +=

generic-y += clkdev.h
+generic-y += math128.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index 29f83be..2cf354a 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -33,3 +33,4 @@ generic-y += termbits.h
generic-y += termios.h
generic-y += ucontext.h
generic-y += xor.h
+generic-y += math128.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 645a58d..ba284f9 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -9,3 +9,4 @@ generic-y += irq_regs.h
generic-y += local.h
generic-y += module.h
generic-y += word-at-a-time.h
+generic-y += math128.h
diff --git a/arch/tile/include/asm/Kbuild b/arch/tile/include/asm/Kbuild
index 6948015..e3a37ac 100644
--- a/arch/tile/include/asm/Kbuild
+++ b/arch/tile/include/asm/Kbuild
@@ -36,3 +36,4 @@ generic-y += termbits.h
generic-y += termios.h
generic-y += types.h
generic-y += xor.h
+generic-y += math128.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index 0f6e7b3..f1a5a8f 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -1,4 +1,4 @@
generic-y += bug.h cputime.h device.h emergency-restart.h futex.h hardirq.h
generic-y += hw_irq.h irq_regs.h kdebug.h percpu.h sections.h topology.h xor.h
generic-y += ftrace.h pci.h io.h param.h delay.h mutex.h current.h exec.h
-generic-y += switch_to.h clkdev.h
+generic-y += switch_to.h clkdev.h math128.h
diff --git a/arch/unicore32/include/asm/Kbuild b/arch/unicore32/include/asm/Kbuild
index c910c98..3a5e70e 100644
--- a/arch/unicore32/include/asm/Kbuild
+++ b/arch/unicore32/include/asm/Kbuild
@@ -60,3 +60,4 @@ generic-y += unaligned.h
generic-y += user.h
generic-y += vga.h
generic-y += xor.h
+generic-y += math128.h
diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 66e5f0e..0a34aef 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -28,3 +28,4 @@ genhdr-y += unistd_64.h
genhdr-y += unistd_x32.h

generic-y += clkdev.h
+generic-y += math128.h
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index 6d13027..edb183d 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -26,3 +26,4 @@ generic-y += statfs.h
generic-y += termios.h
generic-y += topology.h
generic-y += xor.h
+generic-y += math128.h
diff --git a/include/asm-generic/math128.h b/include/asm-generic/math128.h
new file mode 100644
index 0000000..3582691
--- /dev/null
+++ b/include/asm-generic/math128.h
@@ -0,0 +1,4 @@
+#ifndef _ASM_GENERIC_MATH128_H
+#define _ASM_GENERIC_MATH128_H
+
+#endif /*_ASM_GENERIC_MATH128_H */
diff --git a/include/linux/math128.h b/include/linux/math128.h
new file mode 100644
index 0000000..5b0eef6
--- /dev/null
+++ b/include/linux/math128.h
@@ -0,0 +1,180 @@
+#ifndef _LINUX_MATH128_H
+#define _LINUX_MATH128_H
+
+#include <linux/types.h>
+
+typedef union {
+ struct {
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+ u64 lo, hi;
+#else
+ u64 hi, lo;
+#endif
+ };
+#ifdef __SIZEOF_INT128__ /* gcc-4.6+ */
+ unsigned __int128 val;
+#endif
+} u128;
+
+#define U128_INIT(_hi, _lo) (u128){{ .hi = (_hi), .lo = (_lo) }}
+
+#include <asm/math128.h>
+
+/*
+ * Make usage of __int128 dependent on arch code so they can
+ * judge if gcc is doing the right thing for them and can over-ride
+ * any funnies.
+ */
+
+#ifndef ARCH_HAS_INT128
+
+#ifndef add_u128
+static inline u128 add_u128(u128 a, u128 b)
+{
+ a.hi += b.hi;
+ a.lo += b.lo;
+ if (a.lo < b.lo)
+ a.hi++;
+
+ return a;
+}
+#endif /* add_u128 */
+
+#ifndef mul_u64_u64
+extern u128 mul_u64_u64(u64 a, u64 b);
+#endif
+
+#ifndef mul_u64_u32_shr
+static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
+{
+ u32 ah, al;
+ u64 t1, t2;
+
+ ah = a >> 32;
+ al = a;
+
+ t1 = ((u64)al * mul) >> shift;
+ t2 = ((u64)ah * mul) << (32 - shift);
+
+ return t1 + t2;
+}
+#endif /* mul_u64_u32_shr */
+
+#ifndef shl_u128
+static inline u128 shl_u128(u128 x, unsigned int n)
+{
+ u128 res;
+
+ if (!n)
+ return x;
+
+ if (n < 64) {
+ res.hi = x.hi << n;
+ res.hi |= x.lo >> (64 - n);
+ res.lo = x.lo << n;
+ } else {
+ res.lo = 0;
+ res.hi = x.lo << (n - 64);
+ }
+
+ return res;
+}
+#endif /* shl_u128 */
+
+#ifndef shr_u128
+static inline u128 shr_u128(u128 x, unsigned int n)
+{
+ u128 res;
+
+ if (!n)
+ return x;
+
+ if (n < 64) {
+ res.lo = x.lo >> n;
+ res.lo |= x.hi << (64 - n);
+ res.hi = x.hi >> n;
+ } else {
+ res.hi = 0;
+ res.lo = x.hi >> (n - 64);
+ }
+
+ return res;
+}
+#endif /* shr_u128 */
+
+#ifndef cmp_u128
+static inline int cmp_u128(u128 a, u128 b)
+{
+ if (a.hi > b.hi)
+ return 1;
+ if (a.hi < b.hi)
+ return -1;
+ if (a.lo > b.lo)
+ return 1;
+ if (a.lo < b.lo)
+ return -1;
+
+ return 0;
+}
+#endif /* cmp_u128 */
+
+#else /* ARCH_HAS_INT128 */
+
+#ifndef add_u128
+static inline u128 add_u128(u128 a, u128 b)
+{
+ a.val += b.val;
+ return a;
+}
+#endif /* add_u128 */
+
+#ifndef mul_u64_u64
+static inline u128 mul_u64_u64(u64 a, u64 b)
+{
+ u128 res;
+
+ res.val = a;
+ res.val *= b;
+
+ return res;
+}
+#define mul_u64_u64 mul_u64_u64
+#endif
+
+#ifndef mul_u64_u32_shr
+static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
+{
+ return (u64)(((unsigned __int128)a * mul) >> shift);
+}
+#endif /* mul_u64_u32_shr */
+
+#ifndef shl_u128
+static inline u128 shl_u128(u128 x, unsigned int n)
+{
+ x.val <<= n;
+ return x;
+}
+#endif /* shl_u128 */
+
+#ifndef shr_u128
+static inline u128 shr_u128(u128 x, unsigned int n)
+{
+ x.val >>= n;
+ return x;
+}
+#endif /* shr_u128 */
+
+#ifndef cmp_u128
+static inline int cmp_u128(u128 a, u128 b)
+{
+ if (a.val < b.val)
+ return -1;
+ if (a.val > b.val)
+ return 1;
+ return 0;
+}
+#endif /* cmp_u128 */
+
+#endif /* ARCH_HAS_INT128 */
+
+#endif /* _LINUX_MATH128_H */
diff --git a/lib/Makefile b/lib/Makefile
index 821a162..367c62c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
idr.o int_sqrt.o extable.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
- is_single_threaded.o plist.o decompress.o
+ is_single_threaded.o plist.o decompress.o math128.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/math128.c b/lib/math128.c
new file mode 100644
index 0000000..55b123a
--- /dev/null
+++ b/lib/math128.c
@@ -0,0 +1,40 @@
+#include <linux/math128.h>
+
+#ifndef mul_u64_u64
+/*
+ * a * b = (ah * 2^32 + al) * (bh * 2^32 + bl) =
+ * ah*bh * 2^64 + (ah*bl + bh*al) * 2^32 + al*bl
+ */
+u128 mul_u64_u64(u64 a, u64 b)
+{
+ u128 t1, t2, t3, t4;
+ u32 ah, al;
+ u32 bh, bl;
+
+ ah = a >> 32;
+ al = a;
+
+ bh = b >> 32;
+ bl = b;
+
+ t1.lo = 0;
+ t1.hi = (u64)ah * bh;
+
+ t2.lo = (u64)ah * bl;
+ t2.hi = t2.lo >> 32;
+ t2.lo <<= 32;
+
+ t3.lo = (u64)al * bh;
+ t3.hi = t3.lo >> 32;
+ t3.lo <<= 32;
+
+ t4.lo = (u64)al * bl;
+ t4.hi = 0;
+
+ t1 = add_u128(t1, t2);
+ t1 = add_u128(t1, t3);
+ t1 = add_u128(t1, t4);
+
+ return t1;
+}
+#endif /* mul_u64_u64 */
--
1.7.9.5

2012-10-24 21:55:32

by Juri Lelli

Subject: [PATCH 02/16] math128, x86_64: Implement {mul,add}_u128 in 64bit asm

From: Peter Zijlstra <[email protected]>

Enable __int128 usage when available; if not, provide asm versions of
mul_u64_u64 and add_u128.
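
For reference, here is a stand-alone user-space sketch (not part of the patch,
all names local to the example) that checks the mulq sequence below against
gcc's __int128 on x86_64; the addq/adcq pair can be verified the same way:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t a = 0xdeadbeefcafebabeULL, b = 0x123456789abcdef0ULL;
	uint64_t lo, hi;
	unsigned __int128 ref = (unsigned __int128)a * b;

	/* same constraints as the kernel version: RDX:RAX = RAX * b */
	asm("mulq %2" : "=a" (lo), "=d" (hi) : "rm" (b), "0" (a));

	printf("mulq:     %016llx%016llx\n",
	       (unsigned long long)hi, (unsigned long long)lo);
	printf("__int128: %016llx%016llx\n",
	       (unsigned long long)(ref >> 64), (unsigned long long)ref);
	return 0;
}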

Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
---
arch/x86/include/asm/math128.h | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
create mode 100644 arch/x86/include/asm/math128.h

diff --git a/arch/x86/include/asm/math128.h b/arch/x86/include/asm/math128.h
new file mode 100644
index 0000000..c0e2a6c
--- /dev/null
+++ b/arch/x86/include/asm/math128.h
@@ -0,0 +1,39 @@
+#ifndef _ASM_MATH128_H
+#define _ASM_MATH128_H
+
+#ifdef CONFIG_X86_64
+
+#ifdef __SIZEOF_INT128__
+#define ARCH_HAS_INT128
+#endif
+
+#ifndef ARCH_HAS_INT128
+
+static inline u128 mul_u64_u64(u64 a, u64 b)
+{
+ u128 res;
+
+ asm("mulq %2"
+ : "=a" (res.lo), "=d" (res.hi)
+ : "rm" (b), "0" (a));
+
+ return res;
+}
+#define mul_u64_u64 mul_u64_u64
+
+static inline u128 add_u128(u128 a, u128 b)
+{
+ u128 res;
+
+ asm("addq %2,%0;\n"
+ "adcq %3,%1;\n"
+ : "=rm" (res.lo), "=rm" (res.hi)
+ : "r" (b.lo), "r" (b.hi), "0" (a.lo), "1" (a.hi));
+
+ return res;
+}
+#define add_u128 add_u128
+
+#endif /* ARCH_HAS_INT128 */
+#endif /* CONFIG_X86_64 */
+#endif /* _ASM_MATH128_H */
--
1.7.9.5

2012-10-24 21:55:46

by Juri Lelli

Subject: [PATCH 03/16] sched: add sched_class->task_dead.

From: Dario Faggioli <[email protected]>

Add a new function to the scheduling class interface. It is called
at the end of a context switch, if the prev task is in TASK_DEAD state.

It might be useful for scheduling classes that want to be notified
when one of their tasks dies, e.g. to perform some cleanup actions.
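
Purely as an illustration (nothing below is part of the patch), a class that
wants the notification only needs to fill in the new hook; the names here are
hypothetical:

/* Hypothetical scheduling class using the new callback. */
static void task_dead_dummy(struct task_struct *p)
{
	/* release per-task state owned by this class (timers, stats, ...) */
}

static const struct sched_class dummy_sched_class = {
	.task_dead	= task_dead_dummy,
	/* the other, mandatory callbacks are omitted for brevity */
};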

Signed-off-by: Dario Faggioli <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 3 +++
2 files changed, 4 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0dd42a0..c8955a2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1077,6 +1077,7 @@ struct sched_class {
void (*set_curr_task) (struct rq *rq);
void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
void (*task_fork) (struct task_struct *p);
+ void (*task_dead) (struct task_struct *p);

void (*switched_from) (struct rq *this_rq, struct task_struct *task);
void (*switched_to) (struct rq *this_rq, struct task_struct *task);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fc0f7d8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1774,6 +1774,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
+ if (prev->sched_class->task_dead)
+ prev->sched_class->task_dead(prev);
+
/*
* Remove function-return probe instances associated with this
* task and put them back on the free list.
--
1.7.9.5

2012-10-24 21:56:03

by Juri Lelli

[permalink] [raw]
Subject: [PATCH 04/16] sched: add extended scheduling interface.

From: Dario Faggioli <[email protected]>

Add the interface bits needed for supporting scheduling algorithms
with extended parameters (e.g., SCHED_DEADLINE).

In general, it makes it possible to specify a periodic/sporadic task
that executes for a given amount of runtime at each instance and is
scheduled according to the urgency of its own timing constraints,
i.e.:
- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.

Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.

For these reasons, this patch:
- defines the new struct sched_param2, containing all the fields
that are necessary for specifying a task in the computational
model described above;
- defines and implements the new scheduling related syscalls that
manipulate it, i.e., sched_setscheduler2(), sched_setparam2()
and sched_getparam2().

Syscalls are introduced for x86 (32 and 64 bit) and ARM only, as a
proof of concept and for development and testing purposes. Making them
available on other architectures is straightforward.

Since this patch introduces no user of the new parameters, the
implementation of the new system calls is identical to that of their
already existing counterparts. Future patches that implement scheduling
policies able to exploit the new data structure must also adapt the
*2() calls to their own purposes.
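
Purely for illustration, a minimal user-space sketch of the intended usage
(nothing below is part of the patch): it fills a sched_param2 and calls the
new syscall directly. The syscall number is the x86_64 one added here (315),
the structure mirrors the sched_param2 layout above, while SCHED_DEADLINE and
the nanosecond interpretation of the time fields only materialize later in the
series, so they are assumptions of this sketch:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_sched_setscheduler2	315	/* x86_64, from this patch */
#define SCHED_DEADLINE			6	/* added in patch 05/16 */

struct sched_param2 {
	int		sched_priority;
	unsigned int	sched_flags;
	uint64_t	sched_runtime;
	uint64_t	sched_deadline;
	uint64_t	sched_period;
	uint64_t	__unused[12];
};

int main(void)
{
	struct sched_param2 p2;

	memset(&p2, 0, sizeof(p2));
	p2.sched_runtime  =  10ULL * 1000 * 1000;	/* 10ms of budget...  */
	p2.sched_deadline = 100ULL * 1000 * 1000;	/* ...every 100ms     */

	/* pid 0 means the calling task */
	if (syscall(__NR_sched_setscheduler2, 0, SCHED_DEADLINE, &p2))
		perror("sched_setscheduler2");

	/* ...the task would now run its periodic job under SCHED_DEADLINE */
	return 0;
}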

Signed-off-by: Dario Faggioli <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
arch/arm/include/asm/unistd.h | 2 +-
arch/arm/include/uapi/asm/unistd.h | 3 +
arch/arm/kernel/calls.S | 3 +
arch/x86/syscalls/syscall_32.tbl | 3 +
arch/x86/syscalls/syscall_64.tbl | 4 +-
include/linux/sched.h | 50 ++++++++++++++++
include/linux/syscalls.h | 7 +++
kernel/sched/core.c | 110 +++++++++++++++++++++++++++++++++++-
8 files changed, 177 insertions(+), 5 deletions(-)

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 8f60b6e..a408d0c 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -15,7 +15,7 @@

#include <uapi/asm/unistd.h>

-#define __NR_syscalls (380)
+#define __NR_syscalls (383)
#define __ARM_NR_cmpxchg (__ARM_NR_BASE+0x00fff0)

#define __ARCH_WANT_STAT64
diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index ac03bdb..81e792d 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -405,6 +405,9 @@
#define __NR_process_vm_readv (__NR_SYSCALL_BASE+376)
#define __NR_process_vm_writev (__NR_SYSCALL_BASE+377)
/* 378 for kcmp */
+#define __NR_sched_setscheduler2 (__NR_SYSCALL_BASE+379)
+#define __NR_sched_setparam2 (__NR_SYSCALL_BASE+380)
+#define __NR_sched_getparam2 (__NR_SYSCALL_BASE+381)

/*
* This may need to be greater than __NR_last_syscall+1 in order to
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 831cd38..388d53d 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -388,6 +388,9 @@
CALL(sys_process_vm_readv)
CALL(sys_process_vm_writev)
CALL(sys_ni_syscall) /* reserved for sys_kcmp */
+ CALL(sys_sched_setscheduler2)
+/* 380 */ CALL(sys_sched_setparam2)
+ CALL(sys_sched_getparam2)
#ifndef syscalls_counted
.equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
#define syscalls_counted
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index a47103f..6c9b93f 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -356,3 +356,6 @@
347 i386 process_vm_readv sys_process_vm_readv compat_sys_process_vm_readv
348 i386 process_vm_writev sys_process_vm_writev compat_sys_process_vm_writev
349 i386 kcmp sys_kcmp
+350 i386 sched_setparam2 sys_sched_setparam2
+351 i386 sched_getparam2 sys_sched_getparam2
+352 i386 sched_setscheduler2 sys_sched_setscheduler2
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a582bfe..a35c02d 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -319,7 +319,9 @@
310 64 process_vm_readv sys_process_vm_readv
311 64 process_vm_writev sys_process_vm_writev
312 common kcmp sys_kcmp
-
+313 common sched_setparam2 sys_sched_setparam2
+314 common sched_getparam2 sys_sched_getparam2
+315 common sched_setscheduler2 sys_sched_setscheduler2
#
# x32-specific system call numbers start at 512 to avoid cache impact
# for native 64-bit operation.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c8955a2..2bc420c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -54,6 +54,54 @@ struct sched_param {

#include <asm/processor.h>

+/*
+ * Extended scheduling parameters data structure.
+ *
+ * This is needed because the original struct sched_param can not be
+ * altered without introducing ABI issues with legacy applications
+ * (e.g., in sched_getparam()).
+ *
+ * However, the possibility of specifying more than just a priority for
+ * the tasks may be useful for a wide variety of application fields, e.g.,
+ * multimedia, streaming, automation and control, and many others.
+ *
+ * This variant (sched_param2) is meant to describe a so-called
+ * sporadic time-constrained task. In such a model a task is specified by:
+ * - the activation period or minimum instance inter-arrival time;
+ * - the maximum (or average, depending on the actual scheduling
+ * discipline) computation time of all instances, a.k.a. runtime;
+ * - the deadline (relative to the actual activation time) of each
+ * instance.
+ * Very briefly, a periodic (sporadic) task asks for the execution of
+ * some specific computation --which is typically called an instance--
+ * (at most) every period. Moreover, each instance typically lasts no more
+ * than the runtime and must be completed by time instant t equal to
+ * the instance activation time + the deadline.
+ *
+ * This is reflected by the actual fields of the sched_param2 structure:
+ *
+ * @sched_priority task's priority (might still be useful)
+ * @sched_deadline representative of the task's deadline
+ * @sched_runtime representative of the task's runtime
+ * @sched_period representative of the task's period
+ * @sched_flags for customizing the scheduler behaviour
+ *
+ * Given this task model, there is a multiplicity of scheduling algorithms
+ * and policies that can be used to ensure all the tasks will meet their
+ * timing constraints.
+ *
+ * @__unused padding to allow future expansion without ABI issues
+ */
+struct sched_param2 {
+ int sched_priority;
+ unsigned int sched_flags;
+ u64 sched_runtime;
+ u64 sched_deadline;
+ u64 sched_period;
+
+ u64 __unused[12];
+};
+
struct exec_domain;
struct futex_pi_state;
struct robust_list_head;
@@ -2073,6 +2121,8 @@ extern int sched_setscheduler(struct task_struct *, int,
const struct sched_param *);
extern int sched_setscheduler_nocheck(struct task_struct *, int,
const struct sched_param *);
+extern int sched_setscheduler2(struct task_struct *, int,
+ const struct sched_param2 *);
extern struct task_struct *idle_task(int cpu);
/**
* is_idle_task - is the specified task an idle task?
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 727f0cd..7c2a981 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -38,6 +38,7 @@ struct rlimit;
struct rlimit64;
struct rusage;
struct sched_param;
+struct sched_param2;
struct sel_arg_struct;
struct semaphore;
struct sembuf;
@@ -328,11 +329,17 @@ asmlinkage long sys_clock_nanosleep(clockid_t which_clock, int flags,
asmlinkage long sys_nice(int increment);
asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
struct sched_param __user *param);
+asmlinkage long sys_sched_setscheduler2(pid_t pid, int policy,
+ struct sched_param2 __user *param);
asmlinkage long sys_sched_setparam(pid_t pid,
struct sched_param __user *param);
+asmlinkage long sys_sched_setparam2(pid_t pid,
+ struct sched_param2 __user *param);
asmlinkage long sys_sched_getscheduler(pid_t pid);
asmlinkage long sys_sched_getparam(pid_t pid,
struct sched_param __user *param);
+asmlinkage long sys_sched_getparam2(pid_t pid,
+ struct sched_param2 __user *param);
asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
unsigned long __user *user_mask_ptr);
asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc0f7d8..11f69ea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3707,7 +3707,8 @@ static bool check_same_owner(struct task_struct *p)
}

static int __sched_setscheduler(struct task_struct *p, int policy,
- const struct sched_param *param, bool user)
+ const struct sched_param2 *param,
+ bool user)
{
int retval, oldprio, oldpolicy = -1, on_rq, running;
unsigned long flags;
@@ -3870,10 +3871,20 @@ recheck:
int sched_setscheduler(struct task_struct *p, int policy,
const struct sched_param *param)
{
- return __sched_setscheduler(p, policy, param, true);
+ struct sched_param2 param2 = {
+ .sched_priority = param->sched_priority
+ };
+ return __sched_setscheduler(p, policy, &param2, true);
}
EXPORT_SYMBOL_GPL(sched_setscheduler);

+int sched_setscheduler2(struct task_struct *p, int policy,
+ const struct sched_param2 *param2)
+{
+ return __sched_setscheduler(p, policy, param2, true);
+}
+EXPORT_SYMBOL_GPL(sched_setscheduler2);
+
/**
* sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
* @p: the task in question.
@@ -3888,7 +3899,10 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
int sched_setscheduler_nocheck(struct task_struct *p, int policy,
const struct sched_param *param)
{
- return __sched_setscheduler(p, policy, param, false);
+ struct sched_param2 param2 = {
+ .sched_priority = param->sched_priority
+ };
+ return __sched_setscheduler(p, policy, &param2, false);
}

static int
@@ -3913,6 +3927,31 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
return retval;
}

+static int
+do_sched_setscheduler2(pid_t pid, int policy,
+ struct sched_param2 __user *param2)
+{
+ struct sched_param2 lparam2;
+ struct task_struct *p;
+ int retval;
+
+ if (!param2 || pid < 0)
+ return -EINVAL;
+
+ memset(&lparam2, 0, sizeof(struct sched_param2));
+ if (copy_from_user(&lparam2, param2, sizeof(struct sched_param2)))
+ return -EFAULT;
+
+ rcu_read_lock();
+ retval = -ESRCH;
+ p = find_process_by_pid(pid);
+ if (p != NULL)
+ retval = sched_setscheduler2(p, policy, &lparam2);
+ rcu_read_unlock();
+
+ return retval;
+}
+
/**
* sys_sched_setscheduler - set/change the scheduler policy and RT priority
* @pid: the pid in question.
@@ -3930,6 +3969,21 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy,
}

/**
+ * sys_sched_setscheduler2 - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @policy: new policy (could use extended sched_param).
+ * @param2: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE3(sched_setscheduler2, pid_t, pid, int, policy,
+ struct sched_param2 __user *, param2)
+{
+ if (policy < 0)
+ return -EINVAL;
+
+ return do_sched_setscheduler2(pid, policy, param2);
+}
+
+/**
* sys_sched_setparam - set/change the RT priority of a thread
* @pid: the pid in question.
* @param: structure containing the new RT priority.
@@ -3940,6 +3994,17 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
}

/**
+ * sys_sched_setparam2 - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @param2: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE2(sched_setparam2, pid_t, pid,
+ struct sched_param2 __user *, param2)
+{
+ return do_sched_setscheduler2(pid, -1, param2);
+}
+
+/**
* sys_sched_getscheduler - get the policy (scheduling class) of a thread
* @pid: the pid in question.
*/
@@ -4003,6 +4068,45 @@ out_unlock:
return retval;
}

+/**
+ * sys_sched_getparam2 - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @param2: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE2(sched_getparam2, pid_t, pid,
+ struct sched_param2 __user *, param2)
+{
+ struct sched_param2 lp;
+ struct task_struct *p;
+ int retval;
+
+ if (!param2 || pid < 0)
+ return -EINVAL;
+
+ rcu_read_lock();
+ p = find_process_by_pid(pid);
+ retval = -ESRCH;
+ if (!p)
+ goto out_unlock;
+
+ retval = security_task_getscheduler(p);
+ if (retval)
+ goto out_unlock;
+
+ lp.sched_priority = p->rt_priority;
+ rcu_read_unlock();
+
+ retval = copy_to_user(param2, &lp,
+ sizeof(struct sched_param2)) ? -EFAULT : 0;
+
+ return retval;
+
+out_unlock:
+ rcu_read_unlock();
+ return retval;
+
+}
+
long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
{
cpumask_var_t cpus_allowed, new_mask;
--
1.7.9.5

2012-10-24 21:56:17

by Juri Lelli

Subject: [PATCH 05/16] sched: SCHED_DEADLINE structures & implementation.

From: Dario Faggioli <[email protected]>

Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.

Core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belongs to the new policy
are also added where they are needed.

Adds a scheduling class, in sched/dl.c, and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks from one another.

The typical -deadline task is made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such a computation is called the task's
runtime; the time interval by which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.

The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that
each task runs for at most its runtime in every (relative) deadline-long
time interval, avoiding any interference between different tasks
(bandwidth isolation).
Thanks to this feature, even tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.
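
As a concrete example (numbers purely illustrative): a task admitted with
dl_runtime = 25ms and dl_deadline = 100ms that activates at t = 1000ms gets
an absolute deadline of 1100ms and a 25ms budget; if it overruns, the CBS
replenishment postpones its deadline to 1200ms while refilling the 25ms
budget, so the other -deadline tasks never see it consume more than its
25ms/100ms share of the CPU.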

To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.

Signed-off-by: Dario Faggioli <[email protected]>
Signed-off-by: Michael Trimarchi <[email protected]>
Signed-off-by: Fabio Checconi <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/sched.h | 69 ++++-
include/uapi/linux/sched.h | 1 +
kernel/fork.c | 4 +-
kernel/hrtimer.c | 2 +-
kernel/sched/Makefile | 2 +-
kernel/sched/core.c | 111 ++++++-
kernel/sched/dl.c | 691 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 26 ++
kernel/sched/stop_task.c | 2 +-
9 files changed, 886 insertions(+), 22 deletions(-)
create mode 100644 kernel/sched/dl.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2bc420c..85d33f5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -91,6 +91,10 @@ struct sched_param {
* timing constraints.
*
* @__unused padding to allow future expansion without ABI issues
+ *
+ * As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
+ * only user of this new interface. More information about the algorithm
+ * available in the scheduling class file or in Documentation/.
*/
struct sched_param2 {
int sched_priority;
@@ -1091,6 +1095,7 @@ struct sched_domain;
#else
#define ENQUEUE_WAKING 0
#endif
+#define ENQUEUE_REPLENISH 8

#define DEQUEUE_SLEEP 1

@@ -1221,6 +1226,47 @@ struct sched_rt_entity {
#endif
};

+struct sched_dl_entity {
+ struct rb_node rb_node;
+ int nr_cpus_allowed;
+
+ /*
+ * Original scheduling parameters. Copied here from sched_param2
+ * during sched_setscheduler2(), they will remain the same until
+ * the next sched_setscheduler2().
+ */
+ u64 dl_runtime; /* maximum runtime for each instance */
+ u64 dl_deadline; /* relative deadline of each instance */
+
+ /*
+ * Actual scheduling parameters. Initialized with the values above,
+ * they are continuously updated during task execution. Note that
+ * the remaining runtime could be < 0 in case we are in overrun.
+ */
+ s64 runtime; /* remaining runtime for this instance */
+ u64 deadline; /* absolute deadline for this instance */
+ unsigned int flags; /* specifying the scheduler behaviour */
+
+ /*
+ * Some bool flags:
+ *
+ * @dl_throttled tells if we exhausted the runtime. If so, the
+ * task has to wait for a replenishment to be performed at the
+ * next firing of dl_timer.
+ *
+ * @dl_new tells if a new instance arrived. If so we must
+ * start executing it with full runtime and reset its absolute
+ * deadline;
+ */
+ int dl_throttled, dl_new;
+
+ /*
+ * Bandwidth enforcement timer. Each -deadline task has its
+ * own bandwidth to be enforced, thus we need one timer per task.
+ */
+ struct hrtimer dl_timer;
+};
+
/*
* default timeslice is 100 msecs (used only for SCHED_RR tasks).
* Timeslices get refilled after they expire.
@@ -1257,6 +1303,7 @@ struct task_struct {
#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
#endif
+ struct sched_dl_entity dl;

#ifdef CONFIG_PREEMPT_NOTIFIERS
/* list of struct preempt_notifier: */
@@ -1613,6 +1660,10 @@ struct task_struct {
* user-space. This allows kernel threads to set their
* priority to a value higher than any user task. Note:
* MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
+ *
+ * SCHED_DEADLINE tasks have negative priorities, reflecting
+ * the fact that any of them has higher prio than RT and
+ * NORMAL/BATCH tasks.
*/

#define MAX_USER_RT_PRIO 100
@@ -1621,9 +1672,23 @@ struct task_struct {
#define MAX_PRIO (MAX_RT_PRIO + 40)
#define DEFAULT_PRIO (MAX_RT_PRIO + 20)

+#define MAX_DL_PRIO 0
+
+static inline int dl_prio(int prio)
+{
+ if (unlikely(prio < MAX_DL_PRIO))
+ return 1;
+ return 0;
+}
+
+static inline int dl_task(struct task_struct *p)
+{
+ return dl_prio(p->prio);
+}
+
static inline int rt_prio(int prio)
{
- if (unlikely(prio < MAX_RT_PRIO))
+ if ((unsigned)prio < MAX_RT_PRIO)
return 1;
return 0;
}
@@ -2202,7 +2267,7 @@ extern void wake_up_new_task(struct task_struct *tsk);
#else
static inline void kick_process(struct task_struct *tsk) { }
#endif
-extern void sched_fork(struct task_struct *p);
+extern int sched_fork(struct task_struct *p);
extern void sched_dead(struct task_struct *p);

extern void proc_caches_init(void);
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5a0f945..2d5e49a 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -39,6 +39,7 @@
#define SCHED_BATCH 3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE 5
+#define SCHED_DEADLINE 6
/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
#define SCHED_RESET_ON_FORK 0x40000000

diff --git a/kernel/fork.c b/kernel/fork.c
index 8b20ab7..d34cc64 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1287,7 +1287,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
#endif

/* Perform scheduler related setup. Assign this task to a CPU. */
- sched_fork(p);
+ retval = sched_fork(p);
+ if (retval)
+ goto bad_fork_cleanup_policy;

retval = perf_event_init_task(p);
if (retval)
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 6db7a5e..3700ba5 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1586,7 +1586,7 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp,
unsigned long slack;

slack = current->timer_slack_ns;
- if (rt_task(current))
+ if (dl_task(current) || rt_task(current))
slack = 0;

hrtimer_init_on_stack(&t.timer, clockid, mode);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index f06d249..622046c 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -11,7 +11,7 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
endif

-obj-y += core.o clock.o cputime.o idle_task.o fair.o rt.o stop_task.o
+obj-y += core.o clock.o cputime.o idle_task.o fair.o rt.o dl.o stop_task.o
obj-$(CONFIG_SMP) += cpupri.o
obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 11f69ea..9e2d26d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -849,7 +849,9 @@ static inline int normal_prio(struct task_struct *p)
{
int prio;

- if (task_has_rt_policy(p))
+ if (task_has_dl_policy(p))
+ prio = MAX_DL_PRIO-1;
+ else if (task_has_rt_policy(p))
prio = MAX_RT_PRIO-1 - p->rt_priority;
else
prio = __normal_prio(p);
@@ -1528,6 +1530,12 @@ static void __sched_fork(struct task_struct *p)
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
#endif

+ RB_CLEAR_NODE(&p->dl.rb_node);
+ hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ p->dl.dl_runtime = p->dl.runtime = 0;
+ p->dl.dl_deadline = p->dl.deadline = 0;
+ p->dl.flags = 0;
+
INIT_LIST_HEAD(&p->rt.run_list);

#ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -1538,7 +1546,7 @@ static void __sched_fork(struct task_struct *p)
/*
* fork()/clone()-time setup:
*/
-void sched_fork(struct task_struct *p)
+int sched_fork(struct task_struct *p)
{
unsigned long flags;
int cpu = get_cpu();
@@ -1560,7 +1568,7 @@ void sched_fork(struct task_struct *p)
* Revert to default priority/policy on fork if requested.
*/
if (unlikely(p->sched_reset_on_fork)) {
- if (task_has_rt_policy(p)) {
+ if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
p->policy = SCHED_NORMAL;
p->static_prio = NICE_TO_PRIO(0);
p->rt_priority = 0;
@@ -1577,8 +1585,14 @@ void sched_fork(struct task_struct *p)
p->sched_reset_on_fork = 0;
}

- if (!rt_prio(p->prio))
+ if (dl_prio(p->prio)) {
+ put_cpu();
+ return -EAGAIN;
+ } else if (rt_prio(p->prio)) {
+ p->sched_class = &rt_sched_class;
+ } else {
p->sched_class = &fair_sched_class;
+ }

if (p->sched_class->task_fork)
p->sched_class->task_fork(p);
@@ -1610,6 +1624,7 @@ void sched_fork(struct task_struct *p)
#endif

put_cpu();
+ return 0;
}

/*
@@ -3459,7 +3474,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
struct rq *rq;
const struct sched_class *prev_class;

- BUG_ON(prio < 0 || prio > MAX_PRIO);
+ BUG_ON(prio > MAX_PRIO);

rq = __task_rq_lock(p);

@@ -3491,7 +3506,9 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
if (running)
p->sched_class->put_prev_task(rq, p);

- if (rt_prio(prio))
+ if (dl_prio(prio))
+ p->sched_class = &dl_sched_class;
+ else if (rt_prio(prio))
p->sched_class = &rt_sched_class;
else
p->sched_class = &fair_sched_class;
@@ -3525,9 +3542,9 @@ void set_user_nice(struct task_struct *p, long nice)
* The RT priorities are set via sched_setscheduler(), but we still
* allow the 'normal' nice value to be set - but as expected
* it wont have any effect on scheduling until the task is
- * SCHED_FIFO/SCHED_RR:
+ * SCHED_DEADLINE, SCHED_FIFO or SCHED_RR:
*/
- if (task_has_rt_policy(p)) {
+ if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
p->static_prio = NICE_TO_PRIO(nice);
goto out_unlock;
}
@@ -3683,7 +3700,9 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
p->normal_prio = normal_prio(p);
/* we are holding p->pi_lock already */
p->prio = rt_mutex_getprio(p);
- if (rt_prio(p->prio))
+ if (dl_prio(p->prio))
+ p->sched_class = &dl_sched_class;
+ else if (rt_prio(p->prio))
p->sched_class = &rt_sched_class;
else
p->sched_class = &fair_sched_class;
@@ -3691,6 +3710,50 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
}

/*
+ * This function initializes the sched_dl_entity of a newly becoming
+ * SCHED_DEADLINE task.
+ *
+ * Only the static values are considered here, the actual runtime and the
+ * absolute deadline will be properly calculated when the task is enqueued
+ * for the first time with its new policy.
+ */
+static void
+__setparam_dl(struct task_struct *p, const struct sched_param2 *param2)
+{
+ struct sched_dl_entity *dl_se = &p->dl;
+
+ init_dl_task_timer(dl_se);
+ dl_se->dl_runtime = param2->sched_runtime;
+ dl_se->dl_deadline = param2->sched_deadline;
+ dl_se->flags = param2->sched_flags;
+ dl_se->dl_throttled = 0;
+ dl_se->dl_new = 1;
+}
+
+static void
+__getparam_dl(struct task_struct *p, struct sched_param2 *param2)
+{
+ struct sched_dl_entity *dl_se = &p->dl;
+
+ param2->sched_priority = p->rt_priority;
+ param2->sched_runtime = dl_se->dl_runtime;
+ param2->sched_deadline = dl_se->dl_deadline;
+ param2->sched_flags = dl_se->flags;
+}
+
+/*
+ * This function validates the new parameters of a -deadline task.
+ * We ask for the deadline to be non-zero, and greater than or
+ * equal to the runtime.
+ */
+static bool
+__checkparam_dl(const struct sched_param2 *prm)
+{
+ return prm && prm->sched_deadline != 0 &&
+ (s64)(prm->sched_deadline - prm->sched_runtime) >= 0;
+}
+
+/*
* check the target process has a UID that matches the current process's
*/
static bool check_same_owner(struct task_struct *p)
@@ -3727,7 +3790,8 @@ recheck:
reset_on_fork = !!(policy & SCHED_RESET_ON_FORK);
policy &= ~SCHED_RESET_ON_FORK;

- if (policy != SCHED_FIFO && policy != SCHED_RR &&
+ if (policy != SCHED_DEADLINE &&
+ policy != SCHED_FIFO && policy != SCHED_RR &&
policy != SCHED_NORMAL && policy != SCHED_BATCH &&
policy != SCHED_IDLE)
return -EINVAL;
@@ -3742,7 +3806,8 @@ recheck:
(p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
(!p->mm && param->sched_priority > MAX_RT_PRIO-1))
return -EINVAL;
- if (rt_policy(policy) != (param->sched_priority != 0))
+ if ((dl_policy(policy) && !__checkparam_dl(param)) ||
+ (rt_policy(policy) != (param->sched_priority != 0)))
return -EINVAL;

/*
@@ -3808,7 +3873,8 @@ recheck:
* If not changing anything there's no need to proceed further:
*/
if (unlikely(policy == p->policy && (!rt_policy(policy) ||
- param->sched_priority == p->rt_priority))) {
+ param->sched_priority == p->rt_priority) &&
+ !dl_policy(policy))) {
task_rq_unlock(rq, p, &flags);
return 0;
}
@@ -3845,7 +3911,11 @@ recheck:

oldprio = p->prio;
prev_class = p->sched_class;
- __setscheduler(rq, p, policy, param->sched_priority);
+ if (dl_policy(policy)) {
+ __setparam_dl(p, param);
+ __setscheduler(rq, p, policy, param->sched_priority);
+ } else
+ __setscheduler(rq, p, policy, param->sched_priority);

if (running)
p->sched_class->set_curr_task(rq);
@@ -3945,8 +4015,11 @@ do_sched_setscheduler2(pid_t pid, int policy,
rcu_read_lock();
retval = -ESRCH;
p = find_process_by_pid(pid);
- if (p != NULL)
+ if (p != NULL) {
+ if (dl_policy(policy))
+ lparam2.sched_priority = 0;
retval = sched_setscheduler2(p, policy, &lparam2);
+ }
rcu_read_unlock();

return retval;
@@ -4093,7 +4166,10 @@ SYSCALL_DEFINE2(sched_getparam2, pid_t, pid,
if (retval)
goto out_unlock;

- lp.sched_priority = p->rt_priority;
+ if (task_has_dl_policy(p))
+ __getparam_dl(p, &lp);
+ else
+ lp.sched_priority = p->rt_priority;
rcu_read_unlock();

retval = copy_to_user(param2, &lp,
@@ -4498,6 +4574,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
case SCHED_RR:
ret = MAX_USER_RT_PRIO-1;
break;
+ case SCHED_DEADLINE:
case SCHED_NORMAL:
case SCHED_BATCH:
case SCHED_IDLE:
@@ -4523,6 +4600,7 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
case SCHED_RR:
ret = 1;
break;
+ case SCHED_DEADLINE:
case SCHED_NORMAL:
case SCHED_BATCH:
case SCHED_IDLE:
@@ -6921,6 +6999,7 @@ void __init sched_init(void)
rq->calc_load_update = jiffies + LOAD_FREQ;
init_cfs_rq(&rq->cfs);
init_rt_rq(&rq->rt, rq);
+ init_dl_rq(&rq->dl, rq);
#ifdef CONFIG_FAIR_GROUP_SCHED
root_task_group.shares = ROOT_TASK_GROUP_LOAD;
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
@@ -7101,7 +7180,7 @@ void normalize_rt_tasks(void)
p->se.statistics.block_start = 0;
#endif

- if (!rt_task(p)) {
+ if (!dl_task(p) && !rt_task(p)) {
/*
* Renice negative nice level userspace
* tasks back to 0:
diff --git a/kernel/sched/dl.c b/kernel/sched/dl.c
new file mode 100644
index 0000000..7e12ceb
--- /dev/null
+++ b/kernel/sched/dl.c
@@ -0,0 +1,691 @@
+/*
+ * Deadline Scheduling Class (SCHED_DEADLINE)
+ *
+ * Earliest Deadline First (EDF) + Constant Bandwidth Server (CBS).
+ *
+ * Tasks that periodically execute their instances for less than their
+ * runtime won't miss any of their deadlines.
+ * Tasks that are not periodic or sporadic or that try to execute more
+ * than their reserved bandwidth will be slowed down (and may potentially
+ * miss some of their deadlines), and won't affect any other task.
+ *
+ * Copyright (C) 2012 Dario Faggioli <[email protected]>,
+ * Michael Trimarchi <[email protected]>,
+ * Fabio Checconi <[email protected]>
+ */
+#include <linux/math128.h>
+#include "sched.h"
+
+static inline int dl_time_before(u64 a, u64 b)
+{
+ return (s64)(a - b) < 0;
+}
+
+static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
+{
+ return container_of(dl_se, struct task_struct, dl);
+}
+
+static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
+{
+ return container_of(dl_rq, struct rq, dl);
+}
+
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+ struct task_struct *p = dl_task_of(dl_se);
+ struct rq *rq = task_rq(p);
+
+ return &rq->dl;
+}
+
+static inline int on_dl_rq(struct sched_dl_entity *dl_se)
+{
+ return !RB_EMPTY_NODE(&dl_se->rb_node);
+}
+
+static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
+{
+ struct sched_dl_entity *dl_se = &p->dl;
+
+ return dl_rq->rb_leftmost == &dl_se->rb_node;
+}
+
+void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
+{
+ dl_rq->rb_root = RB_ROOT;
+}
+
+static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
+ int flags);
+
+/*
+ * We are being explicitly informed that a new instance is starting,
+ * and this means that:
+ * - the absolute deadline of the entity has to be placed at
+ * current time + relative deadline;
+ * - the runtime of the entity has to be set to the maximum value.
+ *
+ * The capability of specifying such an event is useful whenever a
+ * -deadline entity wants to (try to!) synchronize its behaviour with
+ * that of the scheduler and to (try to!) reconcile itself with its
+ * own scheduling parameters.
+ */
+static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
+{
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_rq(dl_rq);
+
+ WARN_ON(!dl_se->dl_new || dl_se->dl_throttled);
+
+ /*
+ * We use the regular wall clock time to set deadlines in the
+ * future; in fact, we must consider execution overheads (time
+ * spent on hardirq context, etc.).
+ */
+ dl_se->deadline = rq->clock + dl_se->dl_deadline;
+ dl_se->runtime = dl_se->dl_runtime;
+ dl_se->dl_new = 0;
+}
+
+/*
+ * Pure Earliest Deadline First (EDF) scheduling does not deal with the
+ * possibility of an entity lasting more than what it declared, and thus
+ * exhausting its runtime.
+ *
+ * Here we are interested in making runtime overrun possible, but we do
+ * not want a misbehaving entity to affect the scheduling of all the
+ * other entities.
+ * Therefore, a budgeting strategy called Constant Bandwidth Server (CBS)
+ * is used, in order to confine each entity within its own bandwidth.
+ *
+ * This function deals exactly with that, and ensures that when the runtime
+ * of an entity is replenished, its deadline is also postponed. That ensures
+ * the overrunning entity can't interfere with other entities in the system
+ * and can't make them miss their deadlines. Reasons why this kind of overrun
+ * could happen are, typically, an entity voluntarily trying to overcome its
+ * runtime, or having underestimated it during sched_setscheduler2().
+ */
+static void replenish_dl_entity(struct sched_dl_entity *dl_se)
+{
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_rq(dl_rq);
+
+ /*
+ * We keep moving the deadline away until we get some
+ * available runtime for the entity. This ensures correct
+ * handling of situations where the runtime overrun is
+ * arbitrarily large.
+ */
+ while (dl_se->runtime <= 0) {
+ dl_se->deadline += dl_se->dl_deadline;
+ dl_se->runtime += dl_se->dl_runtime;
+ }
+
+ /*
+ * At this point, the deadline really should be "in
+ * the future" with respect to rq->clock. If it's
+ * not, we are, for some reason, lagging too much!
+ * Anyway, after having warned userspace about that,
+ * we still try to keep things running by
+ * resetting the deadline and the budget of the
+ * entity.
+ */
+ if (dl_time_before(dl_se->deadline, rq->clock)) {
+ static bool lag_once = false;
+
+ if (!lag_once) {
+ lag_once = true;
+ printk_sched("sched: DL replenish lagged too much\n");
+ }
+ dl_se->deadline = rq->clock + dl_se->dl_deadline;
+ dl_se->runtime = dl_se->dl_runtime;
+ }
+}
+
+/*
+ * Here we check if --at time t-- an entity (which is probably being
+ * [re]activated or, in general, enqueued) can use its remaining runtime
+ * and its current deadline _without_ exceeding the bandwidth it is
+ * assigned (function returns true if it can't). We are in fact applying
+ * one of the CBS rules: when a task wakes up, if the residual runtime
+ * over residual deadline fits within the allocated bandwidth, then we
+ * can keep the current (absolute) deadline and residual budget without
+ * disrupting the schedulability of the system. Otherwise, we should
+ * refill the runtime and set the deadline a period in the future,
+ * because keeping the current (absolute) deadline of the task would
+ * result in breaking guarantees promised to other tasks.
+ *
+ * This function returns true if:
+ *
+ * runtime / (deadline - t) > dl_runtime / dl_deadline ,
+ *
+ * IOW we can't recycle current parameters.
+ */
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
+{
+ u128 left, right;
+
+ /*
+ * left and right are the two sides of the equation above,
+ * after a bit of shuffling to use multiplications instead
+ * of divisions.
+ *
+ * Note that none of the time values involved in the two
+ * multiplications are absolute: dl_deadline and dl_runtime
+ * are the relative deadline and the maximum runtime of each
+ * instance, runtime is the runtime left for the last instance
+ * and (deadline - t), since t is rq->clock, is the time left
+ * to the (absolute) deadline. Therefore, overflowing the u64
+ * type is very unlikely to occur in both cases.
+ */
+ left = mul_u64_u64(dl_se->dl_deadline, dl_se->runtime);
+ right = mul_u64_u64((dl_se->deadline - t), dl_se->dl_runtime);
+
+ if (cmp_u128(left, right) > 0)
+ return true;
+
+ return false;
+}
+
+/*
+ * When a -deadline entity is queued back on the runqueue, its runtime and
+ * deadline might need updating.
+ *
+ * The policy here is that we update the deadline of the entity only if:
+ * - the current deadline is in the past,
+ * - using the remaining runtime with the current deadline would make
+ * the entity exceed its bandwidth.
+ */
+static void update_dl_entity(struct sched_dl_entity *dl_se)
+{
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_rq(dl_rq);
+
+ /*
+ * The arrival of a new instance needs special treatment, i.e.,
+ * the actual scheduling parameters have to be "renewed".
+ */
+ if (dl_se->dl_new) {
+ setup_new_dl_entity(dl_se);
+ return;
+ }
+
+ if (dl_time_before(dl_se->deadline, rq->clock) ||
+ dl_entity_overflow(dl_se, rq->clock)) {
+ dl_se->deadline = rq->clock + dl_se->dl_deadline;
+ dl_se->runtime = dl_se->dl_runtime;
+ }
+}
+
+/*
+ * If the entity depleted all its runtime, and if we want it to sleep
+ * while waiting for some new execution time to become available, we
+ * set the bandwidth enforcement timer to the replenishment instant
+ * and try to activate it.
+ *
+ * Notice that it is important for the caller to know if the timer
+ * actually started or not (i.e., the replenishment instant is in
+ * the future or in the past).
+ */
+static int start_dl_timer(struct sched_dl_entity *dl_se)
+{
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_rq(dl_rq);
+ ktime_t now, act;
+ ktime_t soft, hard;
+ unsigned long range;
+ s64 delta;
+
+ /*
+ * We want the timer to fire at the deadline, but we must
+ * consider that the deadline is expressed in rq->clock time
+ * and not in the hrtimer's time base.
+ */
+ act = ns_to_ktime(dl_se->deadline);
+ now = hrtimer_cb_get_time(&dl_se->dl_timer);
+ delta = ktime_to_ns(now) - rq->clock;
+ act = ktime_add_ns(act, delta);
+
+ /*
+ * If the expiry time already passed, e.g., because the value
+ * chosen as the deadline is too small, don't even try to
+ * start the timer in the past!
+ */
+ if (ktime_us_delta(act, now) < 0)
+ return 0;
+
+ hrtimer_set_expires(&dl_se->dl_timer, act);
+
+ soft = hrtimer_get_softexpires(&dl_se->dl_timer);
+ hard = hrtimer_get_expires(&dl_se->dl_timer);
+ range = ktime_to_ns(ktime_sub(hard, soft));
+ __hrtimer_start_range_ns(&dl_se->dl_timer, soft,
+ range, HRTIMER_MODE_ABS, 0);
+
+ return hrtimer_active(&dl_se->dl_timer);
+}
+
+/*
+ * This is the bandwidth enforcement timer callback. If here, we know
+ * a task is not on its dl_rq, since the fact that the timer was running
+ * means the task is throttled and needs a runtime replenishment.
+ *
+ * However, what we actually do depends on whether the task is still
+ * active (i.e., it is on its rq) or has been removed from there by a
+ * call to dequeue_task_dl(). In the former case we must issue the runtime
+ * replenishment and add the task back to the dl_rq; in the latter, we just
+ * do nothing but clear dl_throttled, so that runtime and deadline
+ * updating (and the queueing back to dl_rq) will be done by the
+ * next call to enqueue_task_dl().
+ */
+static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
+{
+ struct sched_dl_entity *dl_se = container_of(timer,
+ struct sched_dl_entity,
+ dl_timer);
+ struct task_struct *p = dl_task_of(dl_se);
+ struct rq *rq = task_rq(p);
+ raw_spin_lock(&rq->lock);
+
+ /*
+ * We need to take care of possible races here. In fact, the
+ * task might have changed its scheduling policy to something
+ * different from SCHED_DEADLINE or changed its reservation
+ * parameters (through sched_setscheduler()).
+ */
+ if (!dl_task(p) || dl_se->dl_new)
+ goto unlock;
+
+ dl_se->dl_throttled = 0;
+ if (p->on_rq) {
+ enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
+ if (task_has_dl_policy(rq->curr))
+ check_preempt_curr_dl(rq, p, 0);
+ else
+ resched_task(rq->curr);
+ }
+unlock:
+ raw_spin_unlock(&rq->lock);
+
+ return HRTIMER_NORESTART;
+}
+
+void init_dl_task_timer(struct sched_dl_entity *dl_se)
+{
+ struct hrtimer *timer = &dl_se->dl_timer;
+
+ if (hrtimer_active(timer)) {
+ hrtimer_try_to_cancel(timer);
+ return;
+ }
+
+ hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ timer->function = dl_task_timer;
+}
+
+static
+int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
+{
+ int dmiss = dl_time_before(dl_se->deadline, rq->clock);
+ int rorun = dl_se->runtime <= 0;
+
+ if (!rorun && !dmiss)
+ return 0;
+
+ /*
+ * If we are beyond our current deadline and we are still
+ * executing, then we have already used some of the runtime of
+ * the next instance. Thus, if we do not account that, we are
+ * stealing bandwidth from the system at each deadline miss!
+ */
+ if (dmiss) {
+ dl_se->runtime = rorun ? dl_se->runtime : 0;
+ dl_se->runtime -= rq->clock - dl_se->deadline;
+ }
+
+ return 1;
+}
+
+/*
+ * Update the current task's runtime statistics (provided it is still
+ * a -deadline task and has not been removed from the dl_rq).
+ */
+static void update_curr_dl(struct rq *rq)
+{
+ struct task_struct *curr = rq->curr;
+ struct sched_dl_entity *dl_se = &curr->dl;
+ u64 delta_exec;
+
+ if (!dl_task(curr) || !on_dl_rq(dl_se))
+ return;
+
+ /*
+ * Consumed budget is computed considering the time as
+ * observed by schedulable tasks (excluding time spent
+ * in hardirq context, etc.)
+ */
+ delta_exec = rq->clock_task - curr->se.exec_start;
+ if (unlikely((s64)delta_exec < 0))
+ delta_exec = 0;
+
+ schedstat_set(curr->se.statistics.exec_max,
+ max(curr->se.statistics.exec_max, delta_exec));
+
+ curr->se.sum_exec_runtime += delta_exec;
+ account_group_exec_runtime(curr, delta_exec);
+
+ curr->se.exec_start = rq->clock_task;
+ cpuacct_charge(curr, delta_exec);
+
+ dl_se->runtime -= delta_exec;
+ if (dl_runtime_exceeded(rq, dl_se)) {
+ __dequeue_task_dl(rq, curr, 0);
+ if (likely(start_dl_timer(dl_se)))
+ dl_se->dl_throttled = 1;
+ else
+ enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+
+ if (!is_leftmost(curr, &rq->dl))
+ resched_task(curr);
+ }
+}
+
+static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
+{
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rb_node **link = &dl_rq->rb_root.rb_node;
+ struct rb_node *parent = NULL;
+ struct sched_dl_entity *entry;
+ int leftmost = 1;
+
+ BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
+
+ while (*link) {
+ parent = *link;
+ entry = rb_entry(parent, struct sched_dl_entity, rb_node);
+ if (dl_time_before(dl_se->deadline, entry->deadline))
+ link = &parent->rb_left;
+ else {
+ link = &parent->rb_right;
+ leftmost = 0;
+ }
+ }
+
+ if (leftmost)
+ dl_rq->rb_leftmost = &dl_se->rb_node;
+
+ rb_link_node(&dl_se->rb_node, parent, link);
+ rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
+
+ dl_rq->dl_nr_running++;
+}
+
+static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
+{
+ struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+ if (RB_EMPTY_NODE(&dl_se->rb_node))
+ return;
+
+ if (dl_rq->rb_leftmost == &dl_se->rb_node) {
+ struct rb_node *next_node;
+
+ next_node = rb_next(&dl_se->rb_node);
+ dl_rq->rb_leftmost = next_node;
+ }
+
+ rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
+ RB_CLEAR_NODE(&dl_se->rb_node);
+
+ dl_rq->dl_nr_running--;
+}
+
+static void
+enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
+{
+ BUG_ON(on_dl_rq(dl_se));
+
+ /*
+ * If this is a wakeup or a new instance, the scheduling
+ * parameters of the task might need updating. Otherwise,
+ * we want a replenishment of its runtime.
+ */
+ if (!dl_se->dl_new && flags & ENQUEUE_REPLENISH)
+ replenish_dl_entity(dl_se);
+ else
+ update_dl_entity(dl_se);
+
+ __enqueue_dl_entity(dl_se);
+}
+
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
+{
+ __dequeue_dl_entity(dl_se);
+}
+
+static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+ /*
+ * If p is throttled, we do nothing. In fact, if it exhausted
+ * its budget it needs a replenishment and, since it now is on
+ * its rq, the bandwidth timer callback (which clearly has not
+ * run yet) will take care of this.
+ */
+ if (p->dl.dl_throttled)
+ return;
+
+ enqueue_dl_entity(&p->dl, flags);
+ inc_nr_running(rq);
+}
+
+static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+ dequeue_dl_entity(&p->dl);
+}
+
+static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+ update_curr_dl(rq);
+ __dequeue_task_dl(rq, p, flags);
+
+ dec_nr_running(rq);
+}
+
+/*
+ * Yield task semantic for -deadline tasks is:
+ *
+ * get off from the CPU until our next instance, with
+ * a new runtime.
+ */
+static void yield_task_dl(struct rq *rq)
+{
+ struct task_struct *p = rq->curr;
+
+ /*
+ * We make the task go to sleep until its current deadline by
+ * forcing its runtime to zero. This way, update_curr_dl() stops
+ * it and the bandwidth timer will wake it up and will give it
+ * new scheduling parameters (thanks to dl_new=1).
+ */
+ if (p->dl.runtime > 0) {
+ rq->curr->dl.dl_new = 1;
+ p->dl.runtime = 0;
+ }
+ update_curr_dl(rq);
+}
+
+/*
+ * Only called when both the current and waking task are -deadline
+ * tasks.
+ */
+static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
+ int flags)
+{
+ if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
+ resched_task(rq->curr);
+}
+
+#ifdef CONFIG_SCHED_HRTICK
+static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+{
+ s64 delta = p->dl.dl_runtime - p->dl.runtime;
+
+ if (delta > 10000)
+ hrtick_start(rq, delta);
+}
+#else
+static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+{
+}
+#endif
+
+static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
+ struct dl_rq *dl_rq)
+{
+ struct rb_node *left = dl_rq->rb_leftmost;
+
+ if (!left)
+ return NULL;
+
+ return rb_entry(left, struct sched_dl_entity, rb_node);
+}
+
+struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+ struct sched_dl_entity *dl_se;
+ struct task_struct *p;
+ struct dl_rq *dl_rq;
+
+ dl_rq = &rq->dl;
+
+ if (unlikely(!dl_rq->dl_nr_running))
+ return NULL;
+
+ dl_se = pick_next_dl_entity(rq, dl_rq);
+ BUG_ON(!dl_se);
+
+ p = dl_task_of(dl_se);
+ p->se.exec_start = rq->clock;
+#ifdef CONFIG_SCHED_HRTICK
+ if (hrtick_enabled(rq))
+ start_hrtick_dl(rq, p);
+#endif
+ return p;
+}
+
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+{
+ update_curr_dl(rq);
+}
+
+static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
+{
+ update_curr_dl(rq);
+
+#ifdef CONFIG_SCHED_HRTICK
+ if (hrtick_enabled(rq) && queued && p->dl.runtime > 0)
+ start_hrtick_dl(rq, p);
+#endif
+}
+
+static void task_fork_dl(struct task_struct *p)
+{
+ /*
+ * SCHED_DEADLINE tasks cannot fork and this is achieved through
+ * sched_fork()
+ */
+}
+
+static void task_dead_dl(struct task_struct *p)
+{
+ struct hrtimer *timer = &p->dl.dl_timer;
+
+ if (hrtimer_active(timer))
+ hrtimer_try_to_cancel(timer);
+}
+
+static void set_curr_task_dl(struct rq *rq)
+{
+ struct task_struct *p = rq->curr;
+
+ p->se.exec_start = rq->clock;
+}
+
+static void switched_from_dl(struct rq *rq, struct task_struct *p)
+{
+ if (hrtimer_active(&p->dl.dl_timer))
+ hrtimer_try_to_cancel(&p->dl.dl_timer);
+}
+
+static void switched_to_dl(struct rq *rq, struct task_struct *p)
+{
+ /*
+ * If p is throttled, don't consider the possibility
+ * of preempting rq->curr, the check will be done right
+ * after its runtime gets replenished.
+ */
+ if (unlikely(p->dl.dl_throttled))
+ return;
+
+ if (!p->on_rq || rq->curr != p) {
+ if (task_has_dl_policy(rq->curr))
+ check_preempt_curr_dl(rq, p, 0);
+ else
+ resched_task(rq->curr);
+ }
+}
+
+static void prio_changed_dl(struct rq *rq, struct task_struct *p,
+ int oldprio)
+{
+ switched_to_dl(rq, p);
+}
+
+#ifdef CONFIG_SMP
+static int
+select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
+{
+ return task_cpu(p);
+}
+
+static void set_cpus_allowed_dl(struct task_struct *p,
+ const struct cpumask *new_mask)
+{
+ int weight = cpumask_weight(new_mask);
+
+ BUG_ON(!dl_task(p));
+
+ cpumask_copy(&p->cpus_allowed, new_mask);
+ p->dl.nr_cpus_allowed = weight;
+}
+#endif
+
+const struct sched_class dl_sched_class = {
+ .next = &rt_sched_class,
+ .enqueue_task = enqueue_task_dl,
+ .dequeue_task = dequeue_task_dl,
+ .yield_task = yield_task_dl,
+
+ .check_preempt_curr = check_preempt_curr_dl,
+
+ .pick_next_task = pick_next_task_dl,
+ .put_prev_task = put_prev_task_dl,
+
+#ifdef CONFIG_SMP
+ .select_task_rq = select_task_rq_dl,
+
+ .set_cpus_allowed = set_cpus_allowed_dl,
+#endif
+
+ .set_curr_task = set_curr_task_dl,
+ .task_tick = task_tick_dl,
+ .task_fork = task_fork_dl,
+ .task_dead = task_dead_dl,
+
+ .prio_changed = prio_changed_dl,
+ .switched_from = switched_from_dl,
+ .switched_to = switched_to_dl,
+};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7a7db09..a76d210 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -50,11 +50,23 @@ static inline int rt_policy(int policy)
return 0;
}

+static inline int dl_policy(int policy)
+{
+ if (unlikely(policy == SCHED_DEADLINE))
+ return 1;
+ return 0;
+}
+
static inline int task_has_rt_policy(struct task_struct *p)
{
return rt_policy(p->policy);
}

+static inline int task_has_dl_policy(struct task_struct *p)
+{
+ return dl_policy(p->policy);
+}
+
/*
* This is the priority-queue data structure of the RT scheduling class:
*/
@@ -309,6 +321,15 @@ struct rt_rq {
#endif
};

+/* Deadline class' related fields in a runqueue */
+struct dl_rq {
+ /* runqueue is an rbtree, ordered by deadline */
+ struct rb_root rb_root;
+ struct rb_node *rb_leftmost;
+
+ unsigned long dl_nr_running;
+};
+
#ifdef CONFIG_SMP

/*
@@ -370,6 +391,7 @@ struct rq {

struct cfs_rq cfs;
struct rt_rq rt;
+ struct dl_rq dl;

#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this cpu: */
@@ -841,6 +863,7 @@ enum cpuacct_stat_index {
for (class = sched_class_highest; class; class = class->next)

extern const struct sched_class stop_sched_class;
+extern const struct sched_class dl_sched_class;
extern const struct sched_class rt_sched_class;
extern const struct sched_class fair_sched_class;
extern const struct sched_class idle_sched_class;
@@ -873,6 +896,8 @@ extern void resched_cpu(int cpu);
extern struct rt_bandwidth def_rt_bandwidth;
extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);

+extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
+
extern void update_idle_cpu_load(struct rq *this_rq);

#ifdef CONFIG_CGROUP_CPUACCT
@@ -1151,6 +1176,7 @@ extern void print_rt_stats(struct seq_file *m, int cpu);

extern void init_cfs_rq(struct cfs_rq *cfs_rq);
extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);
+extern void init_dl_rq(struct dl_rq *rt_rq, struct rq *rq);

extern void account_cfs_bandwidth_used(int enabled, int was_enabled);

diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index da5eb5b..da80047 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -103,7 +103,7 @@ get_rr_interval_stop(struct rq *rq, struct task_struct *task)
* Simple, special scheduling class for the per-CPU stop tasks:
*/
const struct sched_class stop_sched_class = {
- .next = &rt_sched_class,
+ .next = &dl_sched_class,

.enqueue_task = enqueue_task_stop,
.dequeue_task = dequeue_task_stop,
--
1.7.9.5

2012-10-24 21:56:33

by Juri Lelli

[permalink] [raw]
Subject: [PATCH 07/16] sched: SCHED_DEADLINE avg_update accounting.

From: Dario Faggioli <[email protected]>

Make the core scheduler and load balancer aware of the load
produced by -deadline tasks, by updating the moving average
like for sched_rt.
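
If I read the fair class code right, this means the time consumed by
-deadline tasks ends up in rq->rt_avg and is therefore discounted from
the CPU power the load balancer computes (via scale_rt_power()),
exactly as already happens for -rt tasks.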

Signed-off-by: Dario Faggioli <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/dl.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/sched/dl.c b/kernel/sched/dl.c
index bc8c310..38e6071 100644
--- a/kernel/sched/dl.c
+++ b/kernel/sched/dl.c
@@ -556,6 +556,8 @@ static void update_curr_dl(struct rq *rq)
curr->se.exec_start = rq->clock_task;
cpuacct_charge(curr, delta_exec);

+ sched_rt_avg_update(rq, delta_exec);
+
dl_se->runtime -= delta_exec;
if (dl_runtime_exceeded(rq, dl_se)) {
__dequeue_task_dl(rq, curr, 0);
--
1.7.9.5

2012-10-24 21:56:47

by Juri Lelli

[permalink] [raw]
Subject: [PATCH 08/16] sched: add period support for -deadline tasks.

From: Harald Gustafsson <[email protected]>

Make it possible to specify a period (different from, or equal to,
the deadline) for -deadline tasks.
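
For illustration, this is roughly how a task could ask for a period
larger than its deadline through the extended interface (a sketch
modelled on the ftrace selftest later in the series; the field names
are those of struct sched_param2, and the time values, in nanoseconds,
are just examples):

	struct sched_param2 dl_params = {
		.sched_priority	= 0,
		.sched_runtime	= 10000000ULL,	/* 10ms of budget...	*/
		.sched_deadline	= 50000000ULL,	/* ...within 50ms...	*/
		.sched_period	= 100000000ULL,	/* ...every 100ms	*/
		.sched_flags	= 0,
	};

	sched_setscheduler2(current, SCHED_DEADLINE, &dl_params);

If sched_period is left zero, __setparam_dl() below simply copies the
deadline into it.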

Signed-off-by: Harald Gustafsson <[email protected]>
Signed-off-by: Dario Faggioli <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 15 ++++++++++++---
kernel/sched/dl.c | 10 +++++++---
3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 92ae764..cfbb086 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1236,6 +1236,7 @@ struct sched_dl_entity {
*/
u64 dl_runtime; /* maximum runtime for each instance */
u64 dl_deadline; /* relative deadline of each instance */
+ u64 dl_period; /* separation of two instances (period) */

/*
* Actual scheduling parameters. Initialized with the values above,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 934d3c3..9b6b988 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1534,6 +1534,7 @@ static void __sched_fork(struct task_struct *p)
hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
p->dl.dl_runtime = p->dl.runtime = 0;
p->dl.dl_deadline = p->dl.deadline = 0;
+ p->dl.dl_period = 0;
p->dl.flags = 0;

INIT_LIST_HEAD(&p->rt.run_list);
@@ -3726,6 +3727,10 @@ __setparam_dl(struct task_struct *p, const struct sched_param2 *param2)
init_dl_task_timer(dl_se);
dl_se->dl_runtime = param2->sched_runtime;
dl_se->dl_deadline = param2->sched_deadline;
+ if (param2->sched_period != 0)
+ dl_se->dl_period = param2->sched_period;
+ else
+ dl_se->dl_period = dl_se->dl_deadline;
dl_se->flags = param2->sched_flags;
dl_se->dl_throttled = 0;
dl_se->dl_new = 1;
@@ -3739,19 +3744,23 @@ __getparam_dl(struct task_struct *p, struct sched_param2 *param2)
param2->sched_priority = p->rt_priority;
param2->sched_runtime = dl_se->dl_runtime;
param2->sched_deadline = dl_se->dl_deadline;
+ param2->sched_period = dl_se->dl_period;
param2->sched_flags = dl_se->flags;
}

/*
* This function validates the new parameters of a -deadline task.
* We ask for the deadline not being zero, and greater or equal
- * than the runtime.
+ * than the runtime, as well as the period being either zero or
+ * greater than or equal to the deadline.
*/
static bool
__checkparam_dl(const struct sched_param2 *prm)
{
- return prm && (&prm->sched_deadline) != 0 &&
- (s64)(&prm->sched_deadline - &prm->sched_runtime) >= 0;
+ return prm && prm->sched_deadline != 0 &&
+ (prm->sched_period == 0 ||
+ (s64)(prm->sched_period - prm->sched_deadline) >= 0) &&
+ (s64)(prm->sched_deadline - prm->sched_runtime) >= 0;
}

/*
diff --git a/kernel/sched/dl.c b/kernel/sched/dl.c
index 38e6071..0adbffb 100644
--- a/kernel/sched/dl.c
+++ b/kernel/sched/dl.c
@@ -288,7 +288,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
* arbitrary large.
*/
while (dl_se->runtime <= 0) {
- dl_se->deadline += dl_se->dl_deadline;
+ dl_se->deadline += dl_se->dl_period;
dl_se->runtime += dl_se->dl_runtime;
}

@@ -328,9 +328,13 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
*
* This function returns true if:
*
- * runtime / (deadline - t) > dl_runtime / dl_deadline ,
+ * runtime / (deadline - t) > dl_runtime / dl_period ,
*
* IOW we can't recycle current parameters.
+ *
+ * Notice that the bandwidth check is done against the period. For
+ * tasks with deadline equal to the period this is the same as using
+ * dl_deadline instead of dl_period in the equation above.
*/
static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
{
@@ -349,7 +353,7 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
* to the (absolute) deadline. Therefore, overflowing the u64
* type is very unlikely to occur in both cases.
*/
- left = mul_u64_u64(dl_se->dl_deadline, dl_se->runtime);
+ left = mul_u64_u64(dl_se->dl_period, dl_se->runtime);
right = mul_u64_u64((dl_se->deadline - t), dl_se->dl_runtime);

if (cmp_u128(left, right) > 0)
--
1.7.9.5

2012-10-24 21:57:04

by Juri Lelli

[permalink] [raw]
Subject: [PATCH 09/16] sched: add schedstats for -deadline tasks.

From: Dario Faggioli <[email protected]>

Add some typical sched-debug output to dl_rq(s) and some
schedstats to -deadline tasks.
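
With this applied, the new per-task fields (dl.stats.last_dmiss,
dl.stats.dmiss_max, dl.stats.last_rorun, dl.stats.rorun_max) should
show up in /proc/<pid>/sched for -deadline tasks, next to the usual
se.statistics entries, and a dl_rq[cpu] section is added to the
sched_debug output.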

Signed-off-by: Dario Faggioli <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/sched.h | 13 +++++++++++++
kernel/sched/debug.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/dl.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 9 ++++++++-
4 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cfbb086..6416517 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1226,6 +1226,15 @@ struct sched_rt_entity {
#endif
};

+#ifdef CONFIG_SCHEDSTATS
+struct sched_stats_dl {
+ u64 last_dmiss;
+ u64 last_rorun;
+ u64 dmiss_max;
+ u64 rorun_max;
+};
+#endif
+
struct sched_dl_entity {
struct rb_node rb_node;

@@ -1265,6 +1274,10 @@ struct sched_dl_entity {
* own bandwidth to be enforced, thus we need one timer per task.
*/
struct hrtimer dl_timer;
+
+#ifdef CONFIG_SCHEDSTATS
+ struct sched_stats_dl stats;
+#endif
};

/*
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 6f79596..df20c81 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -242,6 +242,45 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
#undef P
}

+extern struct sched_dl_entity *__pick_dl_last_entity(struct dl_rq *dl_rq);
+extern void print_dl_stats(struct seq_file *m, int cpu);
+
+void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
+{
+ s64 min_deadline = -1, max_deadline = -1;
+ struct rq *rq = cpu_rq(cpu);
+ struct sched_dl_entity *last;
+ unsigned long flags;
+
+ SEQ_printf(m, "\ndl_rq[%d]:\n", cpu);
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ if (dl_rq->rb_leftmost)
+ min_deadline = (rb_entry(dl_rq->rb_leftmost,
+ struct sched_dl_entity,
+ rb_node))->deadline;
+ last = __pick_dl_last_entity(dl_rq);
+ if (last)
+ max_deadline = last->deadline;
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+#define P(x) \
+ SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(dl_rq->x))
+#define __PN(x) \
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(x))
+#define PN(x) \
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(dl_rq->x))
+
+ P(dl_nr_running);
+ PN(exec_clock);
+ __PN(min_deadline);
+ __PN(max_deadline);
+
+#undef PN
+#undef __PN
+#undef P
+}
+
extern __read_mostly int sched_clock_running;

static void print_cpu(struct seq_file *m, int cpu)
@@ -309,6 +348,7 @@ do { \
spin_lock_irqsave(&sched_debug_lock, flags);
print_cfs_stats(m, cpu);
print_rt_stats(m, cpu);
+ print_dl_stats(m, cpu);

rcu_read_lock();
print_rq(m, rq, cpu);
@@ -460,6 +500,12 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
P(se.statistics.nr_wakeups_affine_attempts);
P(se.statistics.nr_wakeups_passive);
P(se.statistics.nr_wakeups_idle);
+ if (dl_task(p)) {
+ PN(dl.stats.last_dmiss);
+ PN(dl.stats.dmiss_max);
+ PN(dl.stats.last_rorun);
+ PN(dl.stats.rorun_max);
+ }

{
u64 avg_atom, avg_per_cpu;
diff --git a/kernel/sched/dl.c b/kernel/sched/dl.c
index 0adbffb..d881cc8 100644
--- a/kernel/sched/dl.c
+++ b/kernel/sched/dl.c
@@ -516,6 +516,25 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
return 0;

/*
+ * Record statistics about last and maximum deadline
+ * misses and runtime overruns.
+ */
+ if (dmiss) {
+ u64 damount = rq->clock - dl_se->deadline;
+
+ schedstat_set(dl_se->stats.last_dmiss, damount);
+ schedstat_set(dl_se->stats.dmiss_max,
+ max(dl_se->stats.dmiss_max, damount));
+ }
+ if (rorun) {
+ u64 ramount = -dl_se->runtime;
+
+ schedstat_set(dl_se->stats.last_rorun, ramount);
+ schedstat_set(dl_se->stats.rorun_max,
+ max(dl_se->stats.rorun_max, ramount));
+ }
+
+ /*
* If we are beyond our current deadline and we are still
* executing, then we have already used some of the runtime of
* the next instance. Thus, if we do not account that, we are
@@ -555,6 +574,7 @@ static void update_curr_dl(struct rq *rq)
max(curr->se.statistics.exec_max, delta_exec));

curr->se.sum_exec_runtime += delta_exec;
+ schedstat_add(&rq->dl, exec_clock, delta_exec);
account_group_exec_runtime(curr, delta_exec);

curr->se.exec_start = rq->clock_task;
@@ -906,6 +926,18 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
}
#endif

+#ifdef CONFIG_SCHED_DEBUG
+struct sched_dl_entity *__pick_dl_last_entity(struct dl_rq *dl_rq)
+{
+ struct rb_node *last = rb_last(&dl_rq->rb_root);
+
+ if (!last)
+ return NULL;
+
+ return rb_entry(last, struct sched_dl_entity, rb_node);
+}
+#endif /* CONFIG_SCHED_DEBUG */
+
static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
struct dl_rq *dl_rq)
{
@@ -1578,3 +1610,16 @@ const struct sched_class dl_sched_class = {
.switched_from = switched_from_dl,
.switched_to = switched_to_dl,
};
+
+#ifdef CONFIG_SCHED_DEBUG
+extern void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq);
+
+void print_dl_stats(struct seq_file *m, int cpu)
+{
+ struct dl_rq *dl_rq = &cpu_rq(cpu)->dl;
+
+ rcu_read_lock();
+ print_dl_rq(m, cpu, dl_rq);
+ rcu_read_unlock();
+}
+#endif /* CONFIG_SCHED_DEBUG */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2ca517d..6e3d095 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -329,6 +329,8 @@ struct dl_rq {

unsigned long dl_nr_running;

+ u64 exec_clock;
+
#ifdef CONFIG_SMP
/*
* Deadline values of the currently executing and the
@@ -355,6 +357,11 @@ struct dl_rq {
#endif
};

+#ifdef CONFIG_SCHED_DEBUG
+struct sched_dl_entity *__pick_dl_last_entity(struct dl_rq *dl_rq);
+void print_dl_stats(struct seq_file *m, int cpu);
+#endif
+
#ifdef CONFIG_SMP

/*
@@ -1209,7 +1216,7 @@ extern void print_rt_stats(struct seq_file *m, int cpu);

extern void init_cfs_rq(struct cfs_rq *cfs_rq);
extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);
-extern void init_dl_rq(struct dl_rq *rt_rq, struct rq *rq);
+extern void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq);

extern void account_cfs_bandwidth_used(int enabled, int was_enabled);

--
1.7.9.5

2012-10-24 21:57:13

by Juri Lelli

[permalink] [raw]
Subject: [PATCH 10/16] sched: add latency tracing for -deadline tasks.

From: Dario Faggioli <[email protected]>

It is very likely that systems that want/need to use the new
SCHED_DEADLINE policy also want to have the scheduling latency of
the -deadline tasks under control.

For this reason a new version of the scheduling wakeup latency
tracer, called "wakeup_dl", is introduced.

As a consequence of applying this patch there will be three wakeup
latency tracers:
* "wakeup", that deals with all tasks in the system;
* "wakeup_rt", that deals with -rt and -deadline tasks only;
* "wakeup_dl", that deals with -deadline tasks only.

Signed-off-by: Dario Faggioli <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
kernel/trace/trace_sched_wakeup.c | 44 +++++++++++++++++++++++++++++++++----
kernel/trace/trace_selftest.c | 28 +++++++++++++----------
2 files changed, 57 insertions(+), 15 deletions(-)

diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 02170c0..8050c80 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -27,6 +27,7 @@ static int wakeup_cpu;
static int wakeup_current_cpu;
static unsigned wakeup_prio = -1;
static int wakeup_rt;
+static int wakeup_dl;

static arch_spinlock_t wakeup_lock =
(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
@@ -429,9 +430,17 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
tracing_record_cmdline(p);
tracing_record_cmdline(current);

- if ((wakeup_rt && !rt_task(p)) ||
- p->prio >= wakeup_prio ||
- p->prio >= current->prio)
+ /*
+ * The semantics are as follows:
+ * - wakeup tracer handles all tasks in the system, independently
+ * from their scheduling class;
+ * - wakeup_rt tracer handles tasks belonging to sched_dl and
+ * sched_rt class;
+ * - wakeup_dl handles tasks belonging to sched_dl class only.
+ */
+ if ((wakeup_dl && !dl_task(p)) ||
+ (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
+ (p->prio >= wakeup_prio || p->prio >= current->prio))
return;

pc = preempt_count();
@@ -443,7 +452,7 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
arch_spin_lock(&wakeup_lock);

/* check for races. */
- if (!tracer_enabled || p->prio >= wakeup_prio)
+ if (!tracer_enabled || (!dl_task(p) && p->prio >= wakeup_prio))
goto out_locked;

/* reset the trace */
@@ -551,16 +560,25 @@ static int __wakeup_tracer_init(struct trace_array *tr)

static int wakeup_tracer_init(struct trace_array *tr)
{
+ wakeup_dl = 0;
wakeup_rt = 0;
return __wakeup_tracer_init(tr);
}

static int wakeup_rt_tracer_init(struct trace_array *tr)
{
+ wakeup_dl = 0;
wakeup_rt = 1;
return __wakeup_tracer_init(tr);
}

+static int wakeup_dl_tracer_init(struct trace_array *tr)
+{
+ wakeup_dl = 1;
+ wakeup_rt = 0;
+ return __wakeup_tracer_init(tr);
+}
+
static void wakeup_tracer_reset(struct trace_array *tr)
{
stop_wakeup_tracer(tr);
@@ -623,6 +641,20 @@ static struct tracer wakeup_rt_tracer __read_mostly =
.use_max_tr = 1,
};

+static struct tracer wakeup_dl_tracer __read_mostly =
+{
+ .name = "wakeup_dl",
+ .init = wakeup_dl_tracer_init,
+ .reset = wakeup_tracer_reset,
+ .start = wakeup_tracer_start,
+ .stop = wakeup_tracer_stop,
+ .wait_pipe = poll_wait_pipe,
+ .print_max = 1,
+#ifdef CONFIG_FTRACE_SELFTEST
+ .selftest = trace_selftest_startup_wakeup,
+#endif
+};
+
__init static int init_wakeup_tracer(void)
{
int ret;
@@ -635,6 +667,10 @@ __init static int init_wakeup_tracer(void)
if (ret)
return ret;

+ ret = register_tracer(&wakeup_dl_tracer);
+ if (ret)
+ return ret;
+
return 0;
}
device_initcall(init_wakeup_tracer);
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 2c00a69..45cd1e2 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -1028,11 +1028,17 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
#ifdef CONFIG_SCHED_TRACER
static int trace_wakeup_test_thread(void *data)
{
- /* Make this a RT thread, doesn't need to be too high */
- static const struct sched_param param = { .sched_priority = 5 };
+ /* Make this a -deadline thread */
+ struct sched_param2 paramx = {
+ .sched_priority = 0,
+ .sched_runtime = 100000ULL,
+ .sched_deadline = 10000000ULL,
+ .sched_period = 10000000ULL,
+ .sched_flags = 0
+ };
struct completion *x = data;

- sched_setscheduler(current, SCHED_FIFO, &param);
+ sched_setscheduler2(current, SCHED_DEADLINE, &paramx);

/* Make it know we have a new prio */
complete(x);
@@ -1046,8 +1052,8 @@ static int trace_wakeup_test_thread(void *data)
/* we are awake, now wait to disappear */
while (!kthread_should_stop()) {
/*
- * This is an RT task, do short sleeps to let
- * others run.
+ * This will likely be the system's top priority
+ * task, do short sleeps to let others run.
*/
msleep(100);
}
@@ -1060,21 +1066,21 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
{
unsigned long save_max = tracing_max_latency;
struct task_struct *p;
- struct completion isrt;
+ struct completion is_ready;
unsigned long count;
int ret;

- init_completion(&isrt);
+ init_completion(&is_ready);

- /* create a high prio thread */
- p = kthread_run(trace_wakeup_test_thread, &isrt, "ftrace-test");
+ /* create a -deadline thread */
+ p = kthread_run(trace_wakeup_test_thread, &is_ready, "ftrace-test");
if (IS_ERR(p)) {
printk(KERN_CONT "Failed to create ftrace wakeup test thread ");
return -1;
}

- /* make sure the thread is running at an RT prio */
- wait_for_completion(&isrt);
+ /* make sure the thread is running at -deadline policy */
+ wait_for_completion(&is_ready);

/* start the tracing */
ret = tracer_init(trace, tr);
--
1.7.9.5

2012-10-24 21:57:25

by Juri Lelli

[permalink] [raw]
Subject: [PATCH 11/16] rtmutex: turn the plist into an rb-tree.

From: Peter Zijlstra <[email protected]>

Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
and provide a proper comparison function for -deadline and
-priority tasks.

This is done mainly because:
- the classical prio field of the plist is just an int, which might
not be enough for representing a deadline;
- manipulating such a list would become O(nr_deadline_tasks),
which might be too much, as the number of -deadline tasks increases.

Therefore, an rb-tree is used, and tasks are queued in it according
to the following logic:
- among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
one with the higher (lower, actually!) prio wins;
- among a -priority and a -deadline task, the latter always wins;
- among two -deadline tasks, the one with the earliest deadline
wins.

Queueing and dequeueing functions are changed accordingly, for both
the list of a task's pi-waiters and the list of tasks blocked on
a pi-lock.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Dario Faggioli <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/init_task.h | 10 +++
include/linux/rtmutex.h | 18 ++----
include/linux/sched.h | 4 +-
kernel/fork.c | 3 +-
kernel/futex.c | 2 +
kernel/rtmutex-debug.c | 10 ++-
kernel/rtmutex.c | 152 ++++++++++++++++++++++++++++++++++++---------
kernel/rtmutex_common.h | 22 +++----
kernel/sched/core.c | 4 --
9 files changed, 159 insertions(+), 66 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6d087c5..7d2634b 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -10,6 +10,7 @@
#include <linux/pid_namespace.h>
#include <linux/user_namespace.h>
#include <linux/securebits.h>
+#include <linux/rbtree.h>
#include <net/net_namespace.h>

#ifdef CONFIG_SMP
@@ -143,6 +144,14 @@ extern struct task_group root_task_group;

#define INIT_TASK_COMM "swapper"

+#ifdef CONFIG_RT_MUTEXES
+# define INIT_RT_MUTEXES(tsk) \
+ .pi_waiters = RB_ROOT, \
+ .pi_waiters_leftmost = NULL,
+#else
+# define INIT_RT_MUTEXES(tsk)
+#endif
+
/*
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -210,6 +219,7 @@ extern struct task_group root_task_group;
INIT_TRACE_RECURSION \
INIT_TASK_RCU_PREEMPT(tsk) \
INIT_CPUSET_SEQ \
+ INIT_RT_MUTEXES(tsk) \
}


diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index de17134..3aed8d7 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -13,7 +13,7 @@
#define __LINUX_RT_MUTEX_H

#include <linux/linkage.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
#include <linux/spinlock_types.h>

extern int max_lock_depth; /* for sysctl */
@@ -22,12 +22,14 @@ extern int max_lock_depth; /* for sysctl */
* The rt_mutex structure
*
* @wait_lock: spinlock to protect the structure
- * @wait_list: pilist head to enqueue waiters in priority order
+ * @waiters: rbtree root to enqueue waiters in priority order
+ * @waiters_leftmost: top waiter
* @owner: the mutex owner
*/
struct rt_mutex {
raw_spinlock_t wait_lock;
- struct plist_head wait_list;
+ struct rb_root waiters;
+ struct rb_node *waiters_leftmost;
struct task_struct *owner;
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
@@ -66,7 +68,7 @@ struct hrtimer_sleeper;

#define __RT_MUTEX_INITIALIZER(mutexname) \
{ .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(mutexname.wait_lock) \
- , .wait_list = PLIST_HEAD_INIT(mutexname.wait_list) \
+ , .waiters = RB_ROOT \
, .owner = NULL \
__DEBUG_RT_MUTEX_INITIALIZER(mutexname)}

@@ -98,12 +100,4 @@ extern int rt_mutex_trylock(struct rt_mutex *lock);

extern void rt_mutex_unlock(struct rt_mutex *lock);

-#ifdef CONFIG_RT_MUTEXES
-# define INIT_RT_MUTEXES(tsk) \
- .pi_waiters = PLIST_HEAD_INIT(tsk.pi_waiters), \
- INIT_RT_MUTEX_DEBUG(tsk)
-#else
-# define INIT_RT_MUTEXES(tsk)
-#endif
-
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6416517..1f0f5de 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -16,6 +16,7 @@ struct sched_param {
#include <linux/types.h>
#include <linux/timex.h>
#include <linux/jiffies.h>
+#include <linux/plist.h>
#include <linux/rbtree.h>
#include <linux/thread_info.h>
#include <linux/cpumask.h>
@@ -1500,7 +1501,8 @@ struct task_struct {

#ifdef CONFIG_RT_MUTEXES
/* PI waiters blocked on a rt_mutex held by this task */
- struct plist_head pi_waiters;
+ struct rb_root pi_waiters;
+ struct rb_node *pi_waiters_leftmost;
/* Deadlock detection and priority inheritance handling */
struct rt_mutex_waiter *pi_blocked_on;
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index d34cc64..d8928dd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1092,7 +1092,8 @@ static void rt_mutex_init_task(struct task_struct *p)
{
raw_spin_lock_init(&p->pi_lock);
#ifdef CONFIG_RT_MUTEXES
- plist_head_init(&p->pi_waiters);
+ p->pi_waiters = RB_ROOT;
+ p->pi_waiters_leftmost = NULL;
p->pi_blocked_on = NULL;
#endif
}
diff --git a/kernel/futex.c b/kernel/futex.c
index 3717e7b..cdf5267 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2294,6 +2294,8 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
*/
debug_rt_mutex_init_waiter(&rt_waiter);
rt_waiter.task = NULL;
+ //rb_init_node(&rt_waiter.tree_entry);
+ //rb_init_node(&rt_waiter.pi_tree_entry);

ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
if (unlikely(ret != 0))
diff --git a/kernel/rtmutex-debug.c b/kernel/rtmutex-debug.c
index 16502d3..0f339ca 100644
--- a/kernel/rtmutex-debug.c
+++ b/kernel/rtmutex-debug.c
@@ -23,7 +23,7 @@
#include <linux/kallsyms.h>
#include <linux/syscalls.h>
#include <linux/interrupt.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
#include <linux/fs.h>
#include <linux/debug_locks.h>

@@ -56,7 +56,7 @@ static void printk_lock(struct rt_mutex *lock, int print_owner)

void rt_mutex_debug_task_free(struct task_struct *task)
{
- DEBUG_LOCKS_WARN_ON(!plist_head_empty(&task->pi_waiters));
+ DEBUG_LOCKS_WARN_ON(!RB_EMPTY_ROOT(&task->pi_waiters));
DEBUG_LOCKS_WARN_ON(task->pi_blocked_on);
}

@@ -153,16 +153,14 @@ void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock)
void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
{
memset(waiter, 0x11, sizeof(*waiter));
- plist_node_init(&waiter->list_entry, MAX_PRIO);
- plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
+ RB_CLEAR_NODE(&waiter->pi_tree_entry);
+ RB_CLEAR_NODE(&waiter->tree_entry);
waiter->deadlock_task_pid = NULL;
}

void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
{
put_pid(waiter->deadlock_task_pid);
- DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->list_entry));
- DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
memset(waiter, 0x22, sizeof(*waiter));
}

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a242e69..aca58e6 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -90,10 +90,104 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
}
#endif

+static inline int
+rt_mutex_waiter_less(struct rt_mutex_waiter *left,
+ struct rt_mutex_waiter *right)
+{
+ if (left->task->prio < right->task->prio)
+ return 1;
+
+ /*
+ * If both tasks are dl_task(), we check their deadlines.
+ */
+ if (dl_prio(left->task->prio) && dl_prio(right->task->prio))
+ return (left->task->dl.deadline < right->task->dl.deadline);
+
+ return 0;
+}
+
+static void
+rt_mutex_enqueue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
+{
+ struct rb_node **link = &lock->waiters.rb_node;
+ struct rb_node *parent = NULL;
+ struct rt_mutex_waiter *entry;
+ int leftmost = 1;
+
+ while (*link) {
+ parent = *link;
+ entry = rb_entry(parent, struct rt_mutex_waiter, tree_entry);
+ if (rt_mutex_waiter_less(waiter, entry)) {
+ link = &parent->rb_left;
+ } else {
+ link = &parent->rb_right;
+ leftmost = 0;
+ }
+ }
+
+ if (leftmost)
+ lock->waiters_leftmost = &waiter->tree_entry;
+
+ rb_link_node(&waiter->tree_entry, parent, link);
+ rb_insert_color(&waiter->tree_entry, &lock->waiters);
+}
+
+static void
+rt_mutex_dequeue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
+{
+ if (RB_EMPTY_NODE(&waiter->tree_entry))
+ return;
+
+ if (lock->waiters_leftmost == &waiter->tree_entry)
+ lock->waiters_leftmost = rb_next(&waiter->tree_entry);
+
+ rb_erase(&waiter->tree_entry, &lock->waiters);
+ RB_CLEAR_NODE(&waiter->tree_entry);
+}
+
+static void
+rt_mutex_enqueue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
+{
+ struct rb_node **link = &task->pi_waiters.rb_node;
+ struct rb_node *parent = NULL;
+ struct rt_mutex_waiter *entry;
+ int leftmost = 1;
+
+ while (*link) {
+ parent = *link;
+ entry = rb_entry(parent, struct rt_mutex_waiter, pi_tree_entry);
+ if (rt_mutex_waiter_less(waiter, entry)) {
+ link = &parent->rb_left;
+ } else {
+ link = &parent->rb_right;
+ leftmost = 0;
+ }
+ }
+
+ if (leftmost)
+ task->pi_waiters_leftmost = &waiter->pi_tree_entry;
+
+ rb_link_node(&waiter->pi_tree_entry, parent, link);
+ rb_insert_color(&waiter->pi_tree_entry, &task->pi_waiters);
+}
+
+static void
+rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
+{
+ if (RB_EMPTY_NODE(&waiter->pi_tree_entry))
+ return;
+
+ if (task->pi_waiters_leftmost == &waiter->pi_tree_entry)
+ task->pi_waiters_leftmost = rb_next(&waiter->pi_tree_entry);
+
+ rb_erase(&waiter->pi_tree_entry, &task->pi_waiters);
+ RB_CLEAR_NODE(&waiter->pi_tree_entry);
+}
+
/*
- * Calculate task priority from the waiter list priority
+ * Calculate task priority from the waiter tree priority
*
- * Return task->normal_prio when the waiter list is empty or when
+ * Return task->normal_prio when the waiter tree is empty or when
* the waiter is not allowed to do priority boosting
*/
int rt_mutex_getprio(struct task_struct *task)
@@ -101,7 +195,7 @@ int rt_mutex_getprio(struct task_struct *task)
if (likely(!task_has_pi_waiters(task)))
return task->normal_prio;

- return min(task_top_pi_waiter(task)->pi_list_entry.prio,
+ return min(task_top_pi_waiter(task)->task->prio,
task->normal_prio);
}

@@ -219,7 +313,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
* When deadlock detection is off then we check, if further
* priority adjustment is necessary.
*/
- if (!detect_deadlock && waiter->list_entry.prio == task->prio)
+ if (!detect_deadlock && waiter->task->prio == task->prio)
goto out_unlock_pi;

lock = waiter->lock;
@@ -240,9 +334,9 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
top_waiter = rt_mutex_top_waiter(lock);

/* Requeue the waiter */
- plist_del(&waiter->list_entry, &lock->wait_list);
- waiter->list_entry.prio = task->prio;
- plist_add(&waiter->list_entry, &lock->wait_list);
+ rt_mutex_dequeue(lock, waiter);
+ waiter->task->prio = task->prio;
+ rt_mutex_enqueue(lock, waiter);

/* Release the task */
raw_spin_unlock_irqrestore(&task->pi_lock, flags);
@@ -266,17 +360,15 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,

if (waiter == rt_mutex_top_waiter(lock)) {
/* Boost the owner */
- plist_del(&top_waiter->pi_list_entry, &task->pi_waiters);
- waiter->pi_list_entry.prio = waiter->list_entry.prio;
- plist_add(&waiter->pi_list_entry, &task->pi_waiters);
+ rt_mutex_dequeue_pi(task, top_waiter);
+ rt_mutex_enqueue_pi(task, waiter);
__rt_mutex_adjust_prio(task);

} else if (top_waiter == waiter) {
/* Deboost the owner */
- plist_del(&waiter->pi_list_entry, &task->pi_waiters);
+ rt_mutex_dequeue_pi(task, waiter);
waiter = rt_mutex_top_waiter(lock);
- waiter->pi_list_entry.prio = waiter->list_entry.prio;
- plist_add(&waiter->pi_list_entry, &task->pi_waiters);
+ rt_mutex_enqueue_pi(task, waiter);
__rt_mutex_adjust_prio(task);
}

@@ -341,7 +433,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
* 3) it is top waiter
*/
if (rt_mutex_has_waiters(lock)) {
- if (task->prio >= rt_mutex_top_waiter(lock)->list_entry.prio) {
+ if (task->prio >= rt_mutex_top_waiter(lock)->task->prio) {
if (!waiter || waiter != rt_mutex_top_waiter(lock))
return 0;
}
@@ -355,7 +447,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,

/* remove the queued waiter. */
if (waiter) {
- plist_del(&waiter->list_entry, &lock->wait_list);
+ rt_mutex_dequeue(lock, waiter);
task->pi_blocked_on = NULL;
}

@@ -365,8 +457,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
*/
if (rt_mutex_has_waiters(lock)) {
top = rt_mutex_top_waiter(lock);
- top->pi_list_entry.prio = top->list_entry.prio;
- plist_add(&top->pi_list_entry, &task->pi_waiters);
+ rt_mutex_enqueue_pi(task, top);
}
raw_spin_unlock_irqrestore(&task->pi_lock, flags);
}
@@ -402,13 +493,11 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
__rt_mutex_adjust_prio(task);
waiter->task = task;
waiter->lock = lock;
- plist_node_init(&waiter->list_entry, task->prio);
- plist_node_init(&waiter->pi_list_entry, task->prio);
-
+
/* Get the top priority waiter on the lock */
if (rt_mutex_has_waiters(lock))
top_waiter = rt_mutex_top_waiter(lock);
- plist_add(&waiter->list_entry, &lock->wait_list);
+ rt_mutex_enqueue(lock, waiter);

task->pi_blocked_on = waiter;

@@ -419,8 +508,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,

if (waiter == rt_mutex_top_waiter(lock)) {
raw_spin_lock_irqsave(&owner->pi_lock, flags);
- plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
- plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
+ rt_mutex_dequeue_pi(owner, top_waiter);
+ rt_mutex_enqueue_pi(owner, waiter);

__rt_mutex_adjust_prio(owner);
if (owner->pi_blocked_on)
@@ -472,7 +561,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
* boosted mode and go back to normal after releasing
* lock->wait_lock.
*/
- plist_del(&waiter->pi_list_entry, &current->pi_waiters);
+ rt_mutex_dequeue_pi(current, waiter);

rt_mutex_set_owner(lock, NULL);

@@ -496,7 +585,7 @@ static void remove_waiter(struct rt_mutex *lock,
int chain_walk = 0;

raw_spin_lock_irqsave(&current->pi_lock, flags);
- plist_del(&waiter->list_entry, &lock->wait_list);
+ rt_mutex_dequeue(lock, waiter);
current->pi_blocked_on = NULL;
raw_spin_unlock_irqrestore(&current->pi_lock, flags);

@@ -507,13 +596,13 @@ static void remove_waiter(struct rt_mutex *lock,

raw_spin_lock_irqsave(&owner->pi_lock, flags);

- plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
+ rt_mutex_dequeue_pi(owner, waiter);

if (rt_mutex_has_waiters(lock)) {
struct rt_mutex_waiter *next;

next = rt_mutex_top_waiter(lock);
- plist_add(&next->pi_list_entry, &owner->pi_waiters);
+ rt_mutex_enqueue_pi(owner, next);
}
__rt_mutex_adjust_prio(owner);

@@ -523,8 +612,6 @@ static void remove_waiter(struct rt_mutex *lock,
raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
}

- WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
-
if (!chain_walk)
return;

@@ -551,7 +638,7 @@ void rt_mutex_adjust_pi(struct task_struct *task)
raw_spin_lock_irqsave(&task->pi_lock, flags);

waiter = task->pi_blocked_on;
- if (!waiter || waiter->list_entry.prio == task->prio) {
+ if (!waiter || waiter->task->prio == task->prio) {
raw_spin_unlock_irqrestore(&task->pi_lock, flags);
return;
}
@@ -624,6 +711,8 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
int ret = 0;

debug_rt_mutex_init_waiter(&waiter);
+ //rb_init_node(&waiter.tree_entry);
+ //rb_init_node(&waiter.pi_tree_entry);

raw_spin_lock(&lock->wait_lock);

@@ -890,7 +979,8 @@ void __rt_mutex_init(struct rt_mutex *lock, const char *name)
{
lock->owner = NULL;
raw_spin_lock_init(&lock->wait_lock);
- plist_head_init(&lock->wait_list);
+ lock->waiters = RB_ROOT;
+ lock->waiters_leftmost = NULL;

debug_rt_mutex_init(lock, name);
}
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 53a66c8..b65442f 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -40,13 +40,13 @@ extern void schedule_rt_mutex_test(struct rt_mutex *lock);
* This is the control structure for tasks blocked on a rt_mutex,
* which is allocated on the kernel stack on of the blocked task.
*
- * @list_entry: pi node to enqueue into the mutex waiters list
- * @pi_list_entry: pi node to enqueue into the mutex owner waiters list
+ * @tree_entry: pi node to enqueue into the mutex waiters tree
+ * @pi_tree_entry: pi node to enqueue into the mutex owner waiters tree
* @task: task reference to the blocked task
*/
struct rt_mutex_waiter {
- struct plist_node list_entry;
- struct plist_node pi_list_entry;
+ struct rb_node tree_entry;
+ struct rb_node pi_tree_entry;
struct task_struct *task;
struct rt_mutex *lock;
#ifdef CONFIG_DEBUG_RT_MUTEXES
@@ -57,11 +57,11 @@ struct rt_mutex_waiter {
};

/*
- * Various helpers to access the waiters-plist:
+ * Various helpers to access the waiters-tree:
*/
static inline int rt_mutex_has_waiters(struct rt_mutex *lock)
{
- return !plist_head_empty(&lock->wait_list);
+ return !RB_EMPTY_ROOT(&lock->waiters);
}

static inline struct rt_mutex_waiter *
@@ -69,8 +69,8 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
{
struct rt_mutex_waiter *w;

- w = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter,
- list_entry);
+ w = rb_entry(lock->waiters_leftmost, struct rt_mutex_waiter,
+ tree_entry);
BUG_ON(w->lock != lock);

return w;
@@ -78,14 +78,14 @@ rt_mutex_top_waiter(struct rt_mutex *lock)

static inline int task_has_pi_waiters(struct task_struct *p)
{
- return !plist_head_empty(&p->pi_waiters);
+ return !RB_EMPTY_ROOT(&p->pi_waiters);
}

static inline struct rt_mutex_waiter *
task_top_pi_waiter(struct task_struct *p)
{
- return plist_first_entry(&p->pi_waiters, struct rt_mutex_waiter,
- pi_list_entry);
+ return rb_entry(p->pi_waiters_leftmost, struct rt_mutex_waiter,
+ pi_tree_entry);
}

/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9b6b988..4a96c44 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7084,10 +7084,6 @@ void __init sched_init(void)
INIT_HLIST_HEAD(&init_task.preempt_notifiers);
#endif

-#ifdef CONFIG_RT_MUTEXES
- plist_head_init(&init_task.pi_waiters);
-#endif
-
/*
* The boot idle thread does lazy MMU switching as well:
*/
--
1.7.9.5

2012-10-24 21:57:52

by Juri Lelli

[permalink] [raw]
Subject: [PATCH 13/16] sched: add bandwidth management for sched_dl.

From: Dario Faggioli <[email protected]>

In order for -deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the
available CPU bandwidth to tasks and task groups under control.
This is usually called "admission control", and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.

Since RT-throttling was introduced, each task group has had a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is used for controlling the
bandwidth distribution to -deadline tasks and task groups, i.e.,
new controls with similar names, equivalent meaning and the
same usage paradigm are added.

However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not exceeded on each root_domain (the single rq for !SMP
configurations).

Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth on their own
(while -rt ones don't!), and thus we don't need a higher level
throttling mechanism to enforce the desired bandwidth.

This patch, therefore:
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU of each root_domain for -deadline tasks;
- couples the RT and deadline bandwidth management, i.e., enforces
that the sum of the bandwidth devoted to -rt and -deadline tasks
stays below 100%.

This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created until the sum of their bandwidths stays below:

M * (sched_dl_runtime_us / sched_dl_period_us)
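
For example, with the default values set later in this patch
(sched_dl_runtime_us = 50000, sched_dl_period_us = 1000000, i.e. 5%),
a root_domain spanning 4 CPUs accepts new -deadline tasks as long as
the sum of their dl_runtime / dl_period ratios stays below
4 * (50000 / 1000000) = 0.20.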

It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system up to any arbitrary level.

Signed-off-by: Dario Faggioli <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/sched.h | 8 +
kernel/sched/core.c | 413 ++++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched/dl.c | 45 +++++-
kernel/sched/sched.h | 67 +++++++-
kernel/sysctl.c | 14 ++
5 files changed, 518 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bc452ae..4ad8dc1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1247,6 +1247,7 @@ struct sched_dl_entity {
u64 dl_runtime; /* maximum runtime for each instance */
u64 dl_deadline; /* relative deadline of each instance */
u64 dl_period; /* separation of two instances (period) */
+ u64 dl_bw; /* dl_runtime / dl_deadline */

/*
* Actual scheduling parameters. Initialized with the values above,
@@ -2155,6 +2156,13 @@ int sched_rt_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);

+extern unsigned int sysctl_sched_dl_period;
+extern int sysctl_sched_dl_runtime;
+
+int sched_dl_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+
#ifdef CONFIG_SCHED_AUTOGROUP
extern unsigned int sysctl_sched_autogroup_enabled;

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fb02515..b926969 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -287,6 +287,15 @@ __read_mostly int scheduler_running;
*/
int sysctl_sched_rt_runtime = 950000;

+/*
+ * Maximum bandwidth available for all -deadline tasks and groups
+ * (if group scheduling is configured) on each CPU.
+ *
+ * default: 5%
+ */
+unsigned int sysctl_sched_dl_period = 1000000;
+int sysctl_sched_dl_runtime = 50000;
+


/*
@@ -1629,6 +1638,96 @@ int sched_fork(struct task_struct *p)
return 0;
}

+unsigned long to_ratio(u64 period, u64 runtime)
+{
+ if (runtime == RUNTIME_INF)
+ return 1ULL << 20;
+
+ /*
+ * Doing this here saves a lot of checks in all
+ * the calling paths, and returning zero seems
+ * safe for them anyway.
+ */
+ if (period == 0)
+ return 0;
+
+ return div64_u64(runtime << 20, period);
+}
+
+static inline
+void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
+{
+ dl_b->total_bw -= tsk_bw;
+}
+
+static inline
+void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
+{
+ dl_b->total_bw += tsk_bw;
+}
+
+static inline
+bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
+{
+ return dl_b->bw != -1 &&
+ dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
+}
+
+/*
+ * We must be sure that accepting a new task (or allowing changing the
+ * parameters of an existing one) is consistent with the bandwidth
+ * constraints. If yes, this function also accordingly updates the currently
+ * allocated bandwidth to reflect the new situation.
+ *
+ * This function is called while holding p's rq->lock.
+ */
+static int dl_overflow(struct task_struct *p, int policy,
+ const struct sched_param2 *param2)
+{
+#ifdef CONFIG_SMP
+ struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
+#else
+ struct dl_bw *dl_b = &task_rq(p)->dl.dl_bw;
+#endif
+ u64 period = param2->sched_period;
+ u64 runtime = param2->sched_runtime;
+ u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
+#ifdef CONFIG_SMP
+ int cpus = cpumask_weight(task_rq(p)->rd->span);
+#else
+ int cpus = 1;
+#endif
+ int err = -1;
+
+ if (new_bw == p->dl.dl_bw)
+ return 0;
+
+ /*
+ * Whether a task enters, leaves, or stays -deadline but changes
+ * its parameters, we may need to update the total allocated
+ * bandwidth of the container accordingly.
+ */
+ raw_spin_lock(&dl_b->lock);
+ if (dl_policy(policy) && !task_has_dl_policy(p) &&
+ !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+ __dl_add(dl_b, new_bw);
+ err = 0;
+ } else if (dl_policy(policy) && task_has_dl_policy(p) &&
+ !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+ __dl_clear(dl_b, p->dl.dl_bw);
+ __dl_add(dl_b, new_bw);
+ err = 0;
+ } else if (!dl_policy(policy) && task_has_dl_policy(p)) {
+ __dl_clear(dl_b, p->dl.dl_bw);
+ err = 0;
+ }
+ raw_spin_unlock(&dl_b->lock);
+
+ return err;
+}
+
+extern void init_dl_bw(struct dl_bw *dl_b);
+
/*
* wake_up_new_task - wake up a newly created task for the first time.
*
@@ -3755,6 +3854,7 @@ __setparam_dl(struct task_struct *p, const struct sched_param2 *param2)
dl_se->dl_period = param2->sched_period;
else
dl_se->dl_period = dl_se->dl_deadline;
+ dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
dl_se->flags = param2->sched_flags;
dl_se->dl_throttled = 0;
dl_se->dl_new = 1;
@@ -3913,8 +4013,8 @@ recheck:
return 0;
}

-#ifdef CONFIG_RT_GROUP_SCHED
if (user) {
+#ifdef CONFIG_RT_GROUP_SCHED
/*
* Do not allow realtime tasks into groups that have no runtime
* assigned.
@@ -3925,8 +4025,25 @@ recheck:
task_rq_unlock(rq, p, &flags);
return -EPERM;
}
- }
#endif
+#ifdef CONFIG_SMP
+ if (dl_bandwidth_enabled() && dl_policy(policy)) {
+ const struct cpumask *span = rq->rd->span;
+
+ /*
+ * Don't allow tasks with an affinity mask smaller than
+ * the entire root_domain to become SCHED_DEADLINE. We
+ * will also fail if there's no bandwidth available.
+ */
+ if (!cpumask_equal(&p->cpus_allowed, span) ||
+ rq->rd->dl_bw.bw == 0) {
+ __task_rq_unlock(rq);
+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+ return -EPERM;
+ }
+ }
+#endif
+ }

/* recheck policy now with rq lock held */
if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
@@ -3934,6 +4051,19 @@ recheck:
task_rq_unlock(rq, p, &flags);
goto recheck;
}
+
+ /*
+ * If setscheduling to SCHED_DEADLINE (or changing the parameters
+ * of a SCHED_DEADLINE task) we need to check if enough bandwidth
+ * is available.
+ */
+ if ((dl_policy(policy) || dl_task(p)) &&
+ dl_overflow(p, policy, param)) {
+ __task_rq_unlock(rq);
+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+ return -EBUSY;
+ }
+
on_rq = p->on_rq;
running = task_current(rq, p);
if (on_rq)
@@ -4253,6 +4383,24 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
if (retval)
goto out_unlock;

+ /*
+ * Since bandwidth control happens on a root_domain basis,
+ * if admission test is enabled, we only admit -deadline
+ * tasks allowed to run on all the CPUs in the task's
+ * root_domain.
+ */
+#ifdef CONFIG_SMP
+ if (task_has_dl_policy(p)) {
+ const struct cpumask *span = task_rq(p)->rd->span;
+
+ if (dl_bandwidth_enabled() &&
+ !cpumask_equal(in_mask, span)) {
+ retval = -EBUSY;
+ goto out_unlock;
+ }
+ }
+#endif
+
cpuset_cpus_allowed(p, cpus_allowed);
cpumask_and(new_mask, in_mask, cpus_allowed);
again:
@@ -4891,6 +5039,42 @@ out:
EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);

/*
+ * When dealing with a -deadline task, we have to check whether moving
+ * it to a new CPU is possible. In fact, this is only true if there
+ * is enough bandwidth available on that CPU; otherwise we want the
+ * whole migration procedure to fail.
+ */
+static inline
+bool set_task_cpu_dl(struct task_struct *p, unsigned int cpu)
+{
+ struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
+ struct dl_bw *cpu_b = &cpu_rq(cpu)->rd->dl_bw;
+ int ret = 1;
+ u64 bw;
+
+ if (dl_b == cpu_b)
+ return 1;
+
+ raw_spin_lock(&dl_b->lock);
+ raw_spin_lock(&cpu_b->lock);
+
+ bw = cpu_b->bw * cpumask_weight(cpu_rq(cpu)->rd->span);
+ if (dl_bandwidth_enabled() &&
+ bw < cpu_b->total_bw + p->dl.dl_bw) {
+ ret = 0;
+ goto unlock;
+ }
+ dl_b->total_bw -= p->dl.dl_bw;
+ cpu_b->total_bw += p->dl.dl_bw;
+
+unlock:
+ raw_spin_unlock(&cpu_b->lock);
+ raw_spin_unlock(&dl_b->lock);
+
+ return ret;
+}
+
+/*
* Move (not current) task off this cpu, onto dest cpu. We're doing
* this because either it can't run here any more (set_cpus_allowed()
* away from this CPU, or CPU going down), or because we're
@@ -4922,6 +5106,13 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
goto fail;

/*
+ * If p is -deadline, proceed only if there is enough
+ * bandwidth available on dest_cpu
+ */
+ if (unlikely(dl_task(p)) && !set_task_cpu_dl(p, dest_cpu))
+ goto fail;
+
+ /*
* If we're not on a rq, the next wake-up will ensure we're
* placed properly.
*/
@@ -5605,6 +5796,8 @@ static int init_rootdomain(struct root_domain *rd)
if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
goto free_dlo_mask;

+ init_dl_bw(&rd->dl_bw);
+
if (cpupri_init(&rd->cpupri) != 0)
goto free_rto_mask;
return 0;
@@ -7010,6 +7203,8 @@ void __init sched_init(void)

init_rt_bandwidth(&def_rt_bandwidth,
global_rt_period(), global_rt_runtime());
+ init_dl_bandwidth(&def_dl_bandwidth,
+ global_dl_period(), global_dl_runtime());

#ifdef CONFIG_RT_GROUP_SCHED
init_rt_bandwidth(&root_task_group.rt_bandwidth,
@@ -7403,16 +7598,6 @@ void sched_move_task(struct task_struct *tsk)
}
#endif /* CONFIG_CGROUP_SCHED */

-#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
-static unsigned long to_ratio(u64 period, u64 runtime)
-{
- if (runtime == RUNTIME_INF)
- return 1ULL << 20;
-
- return div64_u64(runtime << 20, period);
-}
-#endif
-
#ifdef CONFIG_RT_GROUP_SCHED
/*
* Ensure that the real time constraints are schedulable.
@@ -7586,10 +7771,48 @@ long sched_group_rt_period(struct task_group *tg)
do_div(rt_period_us, NSEC_PER_USEC);
return rt_period_us;
}
+#endif /* CONFIG_RT_GROUP_SCHED */
+
+/*
+ * Coupling of -rt and -deadline bandwidth.
+ *
+ * Here we check if the new -rt bandwidth value is consistent
+ * with the system settings for the bandwidth available
+ * to -deadline tasks.
+ *
+ * IOW, we want to enforce that
+ *
+ * rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_rt_dl_global_constraints(u64 rt_bw)
+{
+ unsigned long flags;
+ u64 dl_bw;
+ bool ret;
+
+ raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock, flags);
+ if (global_rt_runtime() == RUNTIME_INF ||
+ global_dl_runtime() == RUNTIME_INF) {
+ ret = true;
+ goto unlock;
+ }
+
+ dl_bw = to_ratio(def_dl_bandwidth.dl_period,
+ def_dl_bandwidth.dl_runtime);
+
+ ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
+unlock:
+ raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock, flags);
+
+ return ret;
+}

+#ifdef CONFIG_RT_GROUP_SCHED
static int sched_rt_global_constraints(void)
{
- u64 runtime, period;
+ u64 runtime, period, bw;
int ret = 0;

if (sysctl_sched_rt_period <= 0)
@@ -7604,6 +7827,10 @@ static int sched_rt_global_constraints(void)
if (runtime > period && runtime != RUNTIME_INF)
return -EINVAL;

+ bw = to_ratio(period, runtime);
+ if (!__sched_rt_dl_global_constraints(bw))
+ return -EINVAL;
+
mutex_lock(&rt_constraints_mutex);
read_lock(&tasklist_lock);
ret = __rt_schedulable(NULL, 0, 0);
@@ -7626,19 +7853,19 @@ int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
static int sched_rt_global_constraints(void)
{
unsigned long flags;
- int i;
+ int i, ret = 0;
+ u64 bw;

if (sysctl_sched_rt_period <= 0)
return -EINVAL;

- /*
- * There's always some RT tasks in the root group
- * -- migration, kstopmachine etc..
- */
- if (sysctl_sched_rt_runtime == 0)
- return -EBUSY;
-
raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
+ bw = to_ratio(global_rt_period(), global_rt_runtime());
+ if (!__sched_rt_dl_global_constraints(bw)) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
for_each_possible_cpu(i) {
struct rt_rq *rt_rq = &cpu_rq(i)->rt;

@@ -7646,12 +7873,96 @@ static int sched_rt_global_constraints(void)
rt_rq->rt_runtime = global_rt_runtime();
raw_spin_unlock(&rt_rq->rt_runtime_lock);
}
+unlock:
raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);

- return 0;
+ return ret;
}
#endif /* CONFIG_RT_GROUP_SCHED */

+/*
+ * Coupling of -dl and -rt bandwidth.
+ *
+ * Here we check, while setting the system wide bandwidth available
+ * for -dl tasks and groups, if the new values are consistent with
+ * the system settings for the bandwidth available to -rt entities.
+ *
+ * IOW, we want to enforce that
+ *
+ * rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_dl_rt_global_constraints(u64 dl_bw)
+{
+ u64 rt_bw;
+ bool ret;
+
+ raw_spin_lock(&def_rt_bandwidth.rt_runtime_lock);
+ if (global_dl_runtime() == RUNTIME_INF ||
+ global_rt_runtime() == RUNTIME_INF) {
+ ret = true;
+ goto unlock;
+ }
+
+ rt_bw = to_ratio(ktime_to_ns(def_rt_bandwidth.rt_period),
+ def_rt_bandwidth.rt_runtime);
+
+ ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
+unlock:
+ raw_spin_unlock(&def_rt_bandwidth.rt_runtime_lock);
+
+ return ret;
+}
+
+static bool __sched_dl_global_constraints(u64 runtime, u64 period)
+{
+ if (!period || (runtime != RUNTIME_INF && runtime > period))
+ return -EINVAL;
+
+ return 0;
+}
+
+static int sched_dl_global_constraints(void)
+{
+ u64 runtime = global_dl_runtime();
+ u64 period = global_dl_period();
+ u64 new_bw = to_ratio(period, runtime);
+ int ret, i;
+
+ ret = __sched_dl_global_constraints(runtime, period);
+ if (ret)
+ return ret;
+
+ if (!__sched_dl_rt_global_constraints(new_bw))
+ return -EINVAL;
+
+ /*
+ * Here we want to check the bandwidth not being set to some
+ * value smaller than the currently allocated bandwidth in
+ * any of the root_domains.
+ *
+ * FIXME: Cycling on all the CPUs is overdoing, but simpler than
+ * cycling on root_domains... Discussion on different/better
+ * solutions is welcome!
+ */
+ for_each_possible_cpu(i) {
+#ifdef CONFIG_SMP
+ struct dl_bw *dl_b = &cpu_rq(i)->rd->dl_bw;
+#else
+ struct dl_bw *dl_b = &cpu_rq(i)->dl.dl_bw;
+#endif
+ raw_spin_lock(&dl_b->lock);
+ if (new_bw < dl_b->total_bw) {
+ raw_spin_unlock(&dl_b->lock);
+ return -EBUSY;
+ }
+ raw_spin_unlock(&dl_b->lock);
+ }
+
+ return 0;
+}
+
int sched_rt_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
@@ -7682,6 +7993,64 @@ int sched_rt_handler(struct ctl_table *table, int write,
return ret;
}

+int sched_dl_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ int ret;
+ int old_period, old_runtime;
+ static DEFINE_MUTEX(mutex);
+ unsigned long flags;
+
+ mutex_lock(&mutex);
+ old_period = sysctl_sched_dl_period;
+ old_runtime = sysctl_sched_dl_runtime;
+
+ ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+ if (!ret && write) {
+ raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock,
+ flags);
+
+ ret = sched_dl_global_constraints();
+ if (ret) {
+ sysctl_sched_dl_period = old_period;
+ sysctl_sched_dl_runtime = old_runtime;
+ } else {
+ u64 new_bw;
+ int i;
+
+ def_dl_bandwidth.dl_period = global_dl_period();
+ def_dl_bandwidth.dl_runtime = global_dl_runtime();
+ if (global_dl_runtime() == RUNTIME_INF)
+ new_bw = -1;
+ else
+ new_bw = to_ratio(global_dl_period(),
+ global_dl_runtime());
+ /*
+ * FIXME: As above...
+ */
+ for_each_possible_cpu(i) {
+#ifdef CONFIG_SMP
+ struct dl_bw *dl_b = &cpu_rq(i)->rd->dl_bw;
+#else
+ struct dl_bw *dl_b = &cpu_rq(i)->dl.dl_bw;
+#endif
+
+ raw_spin_lock(&dl_b->lock);
+ dl_b->bw = new_bw;
+ raw_spin_unlock(&dl_b->lock);
+ }
+ }
+
+ raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock,
+ flags);
+ }
+ mutex_unlock(&mutex);
+
+ return ret;
+}
+
#ifdef CONFIG_CGROUP_SCHED

/* return corresponding task_group object of a cgroup */
diff --git a/kernel/sched/dl.c b/kernel/sched/dl.c
index 1ad1a00..b345853 100644
--- a/kernel/sched/dl.c
+++ b/kernel/sched/dl.c
@@ -17,6 +17,8 @@
#include <linux/math128.h>
#include "sched.h"

+struct dl_bandwidth def_dl_bandwidth;
+
static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
{
return container_of(dl_se, struct task_struct, dl);
@@ -47,6 +49,27 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
return dl_rq->rb_leftmost == &dl_se->rb_node;
}

+void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
+{
+ raw_spin_lock_init(&dl_b->dl_runtime_lock);
+ dl_b->dl_period = period;
+ dl_b->dl_runtime = runtime;
+}
+
+extern unsigned long to_ratio(u64 period, u64 runtime);
+
+void init_dl_bw(struct dl_bw *dl_b)
+{
+ raw_spin_lock_init(&dl_b->lock);
+ raw_spin_lock(&def_dl_bandwidth.dl_runtime_lock);
+ if (global_dl_runtime() == RUNTIME_INF)
+ dl_b->bw = -1;
+ else
+ dl_b->bw = to_ratio(global_dl_period(), global_dl_runtime());
+ raw_spin_unlock(&def_dl_bandwidth.dl_runtime_lock);
+ dl_b->total_bw = 0;
+}
+
void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
{
dl_rq->rb_root = RB_ROOT;
@@ -58,6 +81,8 @@ void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
dl_rq->dl_nr_migratory = 0;
dl_rq->overloaded = 0;
dl_rq->pushable_dl_tasks_root = RB_ROOT;
+#else
+ init_dl_bw(&dl_rq->dl_bw);
#endif
}

@@ -922,8 +947,8 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
* In the unlikely case current and p have the same deadline
* let us try to decide what's the best thing to do...
*/
- if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
- !need_resched())
+ if ((p->dl.deadline == rq->curr->dl.deadline) &&
+ !test_tsk_need_resched(rq->curr))
check_preempt_equal_dl(rq, p);
#endif /* CONFIG_SMP */
}
@@ -1027,6 +1052,18 @@ static void task_fork_dl(struct task_struct *p)
static void task_dead_dl(struct task_struct *p)
{
struct hrtimer *timer = &p->dl.dl_timer;
+#ifdef CONFIG_SMP
+ struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
+#else
+ struct dl_bw *dl_b = &task_rq(p)->dl.dl_bw;
+#endif
+
+ /*
+ * Since we are TASK_DEAD we won't slip out of the domain!
+ */
+ raw_spin_lock_irq(&dl_b->lock);
+ dl_b->total_bw -= p->dl.dl_bw;
+ raw_spin_unlock_irq(&dl_b->lock);

hrtimer_cancel(timer);
}
@@ -1249,7 +1286,7 @@ static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
BUG_ON(task_current(rq, p));
BUG_ON(p->nr_cpus_allowed <= 1);

- BUG_ON(!p->se.on_rq);
+ BUG_ON(!p->on_rq);
BUG_ON(!dl_task(p));

return p;
@@ -1390,7 +1427,7 @@ static int pull_dl_task(struct rq *this_rq)
dl_time_before(p->dl.deadline,
this_rq->dl.earliest_dl.curr))) {
WARN_ON(p == src_rq->curr);
- WARN_ON(!p->se.on_rq);
+ WARN_ON(!p->on_rq);

/*
* Then we pull iff p has actually an earlier
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6304bef..85fe8a0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -67,7 +67,7 @@ static inline int task_has_dl_policy(struct task_struct *p)
return dl_policy(p->policy);
}

-static inline int dl_time_before(u64 a, u64 b)
+static inline bool dl_time_before(u64 a, u64 b)
{
return (s64)(a - b) < 0;
}
@@ -75,8 +75,8 @@ static inline int dl_time_before(u64 a, u64 b)
/*
* Tells if entity @a should preempt entity @b.
*/
-static inline
-int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
+static inline bool
+dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
{
return dl_time_before(a->deadline, b->deadline);
}
@@ -96,6 +96,48 @@ struct rt_bandwidth {
u64 rt_runtime;
struct hrtimer rt_period_timer;
};
+/*
+ * To keep the bandwidth of -deadline tasks and groups under control
+ * we need some place where:
+ * - store the maximum -deadline bandwidth of the system (the group);
+ * - cache the fraction of that bandwidth that is currently allocated.
+ *
+ * This is all done in the data structure below. It is similar to the
+ * one used for RT-throttling (rt_bandwidth), with the main difference
+ * that, since here we are only interested in admission control, we
+ * do not decrease any runtime while the group "executes", neither we
+ * need a timer to replenish it.
+ *
+ * With respect to SMP, the bandwidth is given on a per-CPU basis,
+ * meaning that:
+ * - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
+ * - dl_total_bw array contains, in the i-th element, the currently
+ * allocated bandwidth on the i-th CPU.
+ * Moreover, groups consume bandwidth on each CPU, while tasks only
+ * consume bandwidth on the CPU they're running on.
+ * Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw
+ * that will be shown the next time the proc or cgroup controls are
+ * read. It in turn can be changed by writing to its own
+ * control.
+ */
+struct dl_bandwidth {
+ raw_spinlock_t dl_runtime_lock;
+ u64 dl_runtime;
+ u64 dl_period;
+};
+
+static inline int dl_bandwidth_enabled(void)
+{
+ return sysctl_sched_dl_runtime >= 0;
+}
+
+struct dl_bw {
+ raw_spinlock_t lock;
+ u64 bw, total_bw;
+};
+
+static inline u64 global_dl_period(void);
+static inline u64 global_dl_runtime(void);

extern struct mutex sched_domains_mutex;

@@ -368,6 +410,8 @@ struct dl_rq {
*/
struct rb_root pushable_dl_tasks_root;
struct rb_node *pushable_dl_tasks_leftmost;
+#else
+ struct dl_bw dl_bw;
#endif
};

@@ -399,6 +443,7 @@ struct root_domain {
*/
cpumask_var_t dlo_mask;
atomic_t dlo_count;
+ struct dl_bw dl_bw;

/*
* The "RT overload" flag: it gets set if a CPU has more than
@@ -736,7 +781,18 @@ static inline u64 global_rt_runtime(void)
return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
}

+static inline u64 global_dl_period(void)
+{
+ return (u64)sysctl_sched_dl_period * NSEC_PER_USEC;
+}
+
+static inline u64 global_dl_runtime(void)
+{
+ if (sysctl_sched_dl_runtime < 0)
+ return RUNTIME_INF;

+ return (u64)sysctl_sched_dl_runtime * NSEC_PER_USEC;
+}

static inline int task_current(struct rq *rq, struct task_struct *p)
{
@@ -943,6 +999,7 @@ extern int update_runtime(struct notifier_block *nfb, unsigned long action, void
extern void init_sched_dl_class(void);
extern void init_sched_rt_class(void);
extern void init_sched_fair_class(void);
+extern void init_sched_dl_class(void);

extern void resched_task(struct task_struct *p);
extern void resched_cpu(int cpu);
@@ -950,8 +1007,12 @@ extern void resched_cpu(int cpu);
extern struct rt_bandwidth def_rt_bandwidth;
extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);

+extern struct dl_bandwidth def_dl_bandwidth;
+extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
extern void init_dl_task_timer(struct sched_dl_entity *dl_se);

+unsigned long to_ratio(u64 period, u64 runtime);
+
extern void update_idle_cpu_load(struct rq *this_rq);

#ifdef CONFIG_CGROUP_CPUACCT
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 26f65ea..9731aab 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -362,6 +362,20 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = sched_rt_handler,
},
+ {
+ .procname = "sched_dl_period_us",
+ .data = &sysctl_sched_dl_period,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_dl_handler,
+ },
+ {
+ .procname = "sched_dl_runtime_us",
+ .data = &sysctl_sched_dl_runtime,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = sched_dl_handler,
+ },
#ifdef CONFIG_SCHED_AUTOGROUP
{
.procname = "sched_autogroup_enabled",
--
1.7.9.5

2012-10-24 21:57:55

by Juri Lelli

[permalink] [raw]
Subject: [PATCH 12/16] sched: drafted deadline inheritance logic.

From: Dario Faggioli <[email protected]>

Some method to deal with rt-mutexes and make sched_dl interact with
the current PI code is needed; this raises all but trivial issues that
need (according to us) to be solved with some restructuring of
the PI code (i.e., going toward a proxy execution-ish implementation).

This is under development. In the meanwhile, as a temporary solution,
what this commit does is:
- ensure a pi-lock owner with waiters is never throttled down. Instead,
when it runs out of runtime, it immediately gets replenished and its
deadline is postponed;
- the scheduling parameters (relative deadline and default runtime)
used for those replenishments --during the whole period it holds the
pi-lock-- are the ones of the waiting task with the earliest deadline.

Acting this way, we provide some kind of boosting to the lock owner,
still by using the existing (actually, slightly modified by the previous
commit) PI architecture.

We would like to stress that this is a surely needed, but far from
clean, solution to the problem. In the end it's only a way to re-start
discussion within the community. So, as always, comments, ideas, rants,
etc. are welcome! :-)
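
Just to make the rule above explicit, here is a simplified sketch (not the
patch itself; pick_replenish_params() is a made-up helper) of how the
replenishment parameters are selected; the real logic lives in the
enqueue_task_dl() and rt_mutex_setprio() changes below:

/*
 * A boosted owner is replenished with the top pi-waiter's -deadline
 * parameters, otherwise it keeps using its own runtime and deadline.
 */
static struct sched_dl_entity *
pick_replenish_params(struct task_struct *owner)
{
	struct task_struct *top = owner->pi_top_task;

	if (top && owner->dl.dl_boosted && dl_prio(top->normal_prio))
		return &top->dl;

	return &owner->dl;
}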

Signed-off-by: Dario Faggioli <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/sched.h | 9 ++++-
kernel/fork.c | 1 +
kernel/rtmutex.c | 13 +++++--
kernel/sched/core.c | 34 +++++++++++++++---
kernel/sched/dl.c | 91 ++++++++++++++++++++++++++++---------------------
kernel/sched/sched.h | 14 ++++++++
6 files changed, 116 insertions(+), 46 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1f0f5de..bc452ae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1267,8 +1267,12 @@ struct sched_dl_entity {
* @dl_new tells if a new instance arrived. If so we must
* start executing it with full runtime and reset its absolute
* deadline;
+ *
+ * @dl_boosted tells if we are boosted due to DI. If so we are
+ * outside bandwidth enforcement mechanism (but only until we
+ * exit the critical section).
*/
- int dl_throttled, dl_new;
+ int dl_throttled, dl_new, dl_boosted;

/*
* Bandwidth enforcement timer. Each -deadline task has its
@@ -1505,6 +1509,8 @@ struct task_struct {
struct rb_node *pi_waiters_leftmost;
/* Deadlock detection and priority inheritance handling */
struct rt_mutex_waiter *pi_blocked_on;
+ /* Top pi_waiters task */
+ struct task_struct *pi_top_task;
#endif

#ifdef CONFIG_DEBUG_MUTEXES
@@ -2174,6 +2180,7 @@ extern unsigned int sysctl_sched_cfs_bandwidth_slice;
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
extern void rt_mutex_setprio(struct task_struct *p, int prio);
+extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
extern void rt_mutex_adjust_pi(struct task_struct *p);
static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index d8928dd..3213173 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1095,6 +1095,7 @@ static void rt_mutex_init_task(struct task_struct *p)
p->pi_waiters = RB_ROOT;
p->pi_waiters_leftmost = NULL;
p->pi_blocked_on = NULL;
+ p->pi_top_task = NULL;
#endif
}

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index aca58e6..f6a9074 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -199,6 +199,14 @@ int rt_mutex_getprio(struct task_struct *task)
task->normal_prio);
}

+struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
+{
+ if (likely(!task_has_pi_waiters(task)))
+ return NULL;
+
+ return task_top_pi_waiter(task)->task;
+}
+
/*
* Adjust the priority of a task, after its pi_waiters got modified.
*
@@ -208,7 +216,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
{
int prio = rt_mutex_getprio(task);

- if (task->prio != prio)
+ if (task->prio != prio || dl_prio(prio))
rt_mutex_setprio(task, prio);
}

@@ -638,7 +646,8 @@ void rt_mutex_adjust_pi(struct task_struct *task)
raw_spin_lock_irqsave(&task->pi_lock, flags);

waiter = task->pi_blocked_on;
- if (!waiter || waiter->task->prio == task->prio) {
+ if (!waiter || (waiter->task->prio == task->prio &&
+ !dl_prio(task->prio))) {
raw_spin_unlock_irqrestore(&task->pi_lock, flags);
return;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4a96c44..fb02515 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3472,7 +3472,7 @@ EXPORT_SYMBOL(sleep_on_timeout);
*/
void rt_mutex_setprio(struct task_struct *p, int prio)
{
- int oldprio, on_rq, running;
+ int oldprio, on_rq, running, enqueue_flag = 0;
struct rq *rq;
const struct sched_class *prev_class;

@@ -3499,6 +3499,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
}

trace_sched_pi_setprio(p, prio);
+ p->pi_top_task = rt_mutex_get_top_task(p);
oldprio = p->prio;
prev_class = p->sched_class;
on_rq = p->on_rq;
@@ -3508,19 +3509,42 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
if (running)
p->sched_class->put_prev_task(rq, p);

- if (dl_prio(prio))
+ /*
+ * Boosting conditions are:
+ * 1. -rt task is running and holds mutex A
+ * --> -dl task blocks on mutex A
+ *
+ * 2. -dl task is running and holds mutex A
+ * --> -dl task blocks on mutex A and could preempt the
+ * running task
+ */
+ if (dl_prio(prio)) {
+ if (!dl_prio(p->normal_prio) || (p->pi_top_task &&
+ dl_entity_preempt(&p->pi_top_task->dl, &p->dl))) {
+ p->dl.dl_boosted = 1;
+ p->dl.dl_throttled = 0;
+ enqueue_flag = ENQUEUE_REPLENISH;
+ } else
+ p->dl.dl_boosted = 0;
p->sched_class = &dl_sched_class;
- else if (rt_prio(prio))
+ } else if (rt_prio(prio)) {
+ if (dl_prio(oldprio))
+ p->dl.dl_boosted = 0;
+ if (oldprio < prio)
+ enqueue_flag = ENQUEUE_HEAD;
p->sched_class = &rt_sched_class;
- else
+ } else {
+ if (dl_prio(oldprio))
+ p->dl.dl_boosted = 0;
p->sched_class = &fair_sched_class;
+ }

p->prio = prio;

if (running)
p->sched_class->set_curr_task(rq);
if (on_rq)
- enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
+ enqueue_task(rq, p, enqueue_flag);

check_class_changed(rq, p, prev_class, oldprio);
out_unlock:
diff --git a/kernel/sched/dl.c b/kernel/sched/dl.c
index d881cc8..1ad1a00 100644
--- a/kernel/sched/dl.c
+++ b/kernel/sched/dl.c
@@ -17,20 +17,6 @@
#include <linux/math128.h>
#include "sched.h"

-static inline int dl_time_before(u64 a, u64 b)
-{
- return (s64)(a - b) < 0;
-}
-
-/*
- * Tells if entity @a should preempt entity @b.
- */
-static inline
-int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
-{
- return dl_time_before(a->deadline, b->deadline);
-}
-
static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
{
return container_of(dl_se, struct task_struct, dl);
@@ -241,7 +227,8 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
* one, and to (try to!) reconcile itself with its own scheduling
* parameters.
*/
-static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
+static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se,
+ struct sched_dl_entity *pi_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -253,8 +240,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
* future; in fact, we must consider execution overheads (time
* spent on hardirq context, etc.).
*/
- dl_se->deadline = rq->clock + dl_se->dl_deadline;
- dl_se->runtime = dl_se->dl_runtime;
+ dl_se->deadline = rq->clock + pi_se->dl_deadline;
+ dl_se->runtime = pi_se->dl_runtime;
dl_se->dl_new = 0;
}

@@ -276,11 +263,23 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
* could happen are, typically, a entity voluntarily trying to overcome its
* runtime, or it just underestimated it during sched_setscheduler_ex().
*/
-static void replenish_dl_entity(struct sched_dl_entity *dl_se)
+static void replenish_dl_entity(struct sched_dl_entity *dl_se,
+ struct sched_dl_entity *pi_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
struct rq *rq = rq_of_dl_rq(dl_rq);

+ BUG_ON(pi_se->dl_runtime <= 0);
+
+ /*
+ * This could be the case for a !-dl task that is boosted.
+ * Just go with full inherited parameters.
+ */
+ if (dl_se->dl_deadline == 0) {
+ dl_se->deadline = rq->clock + pi_se->dl_deadline;
+ dl_se->runtime = pi_se->dl_runtime;
+ }
+
/*
* We keep moving the deadline away until we get some
* available runtime for the entity. This ensures correct
@@ -288,8 +287,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
* arbitrary large.
*/
while (dl_se->runtime <= 0) {
- dl_se->deadline += dl_se->dl_period;
- dl_se->runtime += dl_se->dl_runtime;
+ dl_se->deadline += pi_se->dl_period;
+ dl_se->runtime += pi_se->dl_runtime;
}

/*
@@ -308,8 +307,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
lag_once = true;
printk_sched("sched: DL replenish lagged to much\n");
}
- dl_se->deadline = rq->clock + dl_se->dl_deadline;
- dl_se->runtime = dl_se->dl_runtime;
+ dl_se->deadline = rq->clock + pi_se->dl_deadline;
+ dl_se->runtime = pi_se->dl_runtime;
}
}

@@ -336,7 +335,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
* task with deadline equal to period this is the same of using
* dl_deadline instead of dl_period in the equation above.
*/
-static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se,
+ struct sched_dl_entity *pi_se, u64 t)
{
u128 left, right;

@@ -353,8 +353,8 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
* to the (absolute) deadline. Therefore, overflowing the u64
* type is very unlikely to occur in both cases.
*/
- left = mul_u64_u64(dl_se->dl_period, dl_se->runtime);
- right = mul_u64_u64((dl_se->deadline - t), dl_se->dl_runtime);
+ left = mul_u64_u64(pi_se->dl_period, dl_se->runtime);
+ right = mul_u64_u64((dl_se->deadline - t), pi_se->dl_runtime);

if (cmp_u128(left, right) > 0)
return true;
@@ -371,7 +371,8 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
* - using the remaining runtime with the current deadline would make
* the entity exceed its bandwidth.
*/
-static void update_dl_entity(struct sched_dl_entity *dl_se)
+static void update_dl_entity(struct sched_dl_entity *dl_se,
+ struct sched_dl_entity *pi_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -381,14 +382,14 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
* the actual scheduling parameters have to be "renewed".
*/
if (dl_se->dl_new) {
- setup_new_dl_entity(dl_se);
+ setup_new_dl_entity(dl_se, pi_se);
return;
}

if (dl_time_before(dl_se->deadline, rq->clock) ||
- dl_entity_overflow(dl_se, rq->clock)) {
- dl_se->deadline = rq->clock + dl_se->dl_deadline;
- dl_se->runtime = dl_se->dl_runtime;
+ dl_entity_overflow(dl_se, pi_se, rq->clock)) {
+ dl_se->deadline = rq->clock + pi_se->dl_deadline;
+ dl_se->runtime = pi_se->dl_runtime;
}
}

@@ -402,7 +403,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
* actually started or not (i.e., the replenishment instant is in
* the future or in the past).
*/
-static int start_dl_timer(struct sched_dl_entity *dl_se)
+static int start_dl_timer(struct sched_dl_entity *dl_se, bool boosted)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -411,6 +412,8 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
unsigned long range;
s64 delta;

+ if (boosted)
+ return 0;
/*
* We want the timer to fire at the deadline, but considering
* that it is actually coming from rq->clock and not from
@@ -585,7 +588,7 @@ static void update_curr_dl(struct rq *rq)
dl_se->runtime -= delta_exec;
if (dl_runtime_exceeded(rq, dl_se)) {
__dequeue_task_dl(rq, curr, 0);
- if (likely(start_dl_timer(dl_se)))
+ if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted)))
dl_se->dl_throttled = 1;
else
enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
@@ -740,7 +743,8 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
}

static void
-enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
+enqueue_dl_entity(struct sched_dl_entity *dl_se,
+ struct sched_dl_entity *pi_se, int flags)
{
BUG_ON(on_dl_rq(dl_se));

@@ -750,9 +754,9 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
* we want a replenishment of its runtime.
*/
if (!dl_se->dl_new && flags & ENQUEUE_REPLENISH)
- replenish_dl_entity(dl_se);
+ replenish_dl_entity(dl_se, pi_se);
else
- update_dl_entity(dl_se);
+ update_dl_entity(dl_se, pi_se);

__enqueue_dl_entity(dl_se);
}
@@ -764,6 +768,18 @@ static void dequeue_dl_entity(struct sched_dl_entity *dl_se)

static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
+ struct task_struct *pi_task = p->pi_top_task;
+ struct sched_dl_entity *pi_se = &p->dl;
+
+ /*
+ * Use the scheduling parameters of the top pi-waiter
+ * task if we have one and its (relative) deadline is
+ * smaller than ours... otherwise we keep our runtime and
+ * deadline.
+ */
+ if (pi_task && p->dl.dl_boosted && dl_prio(pi_task->normal_prio))
+ pi_se = &pi_task->dl;
+
/*
* If p is throttled, we do nothing. In fact, if it exhausted
* its budget it needs a replenishment and, since it now is on
@@ -773,7 +789,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
if (p->dl.dl_throttled)
return;

- enqueue_dl_entity(&p->dl, flags);
+ enqueue_dl_entity(&p->dl, pi_se, flags);

if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
enqueue_pushable_dl_task(rq, p);
@@ -1012,8 +1028,7 @@ static void task_dead_dl(struct task_struct *p)
{
struct hrtimer *timer = &p->dl.dl_timer;

- if (hrtimer_active(timer))
- hrtimer_try_to_cancel(timer);
+ hrtimer_cancel(timer);
}

static void set_curr_task_dl(struct rq *rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6e3d095..6304bef 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -67,6 +67,20 @@ static inline int task_has_dl_policy(struct task_struct *p)
return dl_policy(p->policy);
}

+static inline int dl_time_before(u64 a, u64 b)
+{
+ return (s64)(a - b) < 0;
+}
+
+/*
+ * Tells if entity @a should preempt entity @b.
+ */
+static inline
+int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
+{
+ return dl_time_before(a->deadline, b->deadline);
+}
+
/*
* This is the priority-queue data structure of the RT scheduling class:
*/
--
1.7.9.5

2012-10-24 21:58:01

by Juri Lelli

[permalink] [raw]
Subject: [PATCH 14/16] sched: make dl_bw a sub-quota of rt_bw

Change real-time bandwidth management so as to make dl_bw a sub-quota
of rt_bw. This patch leaves rt_bw at its default value and sets
dl_bw to 40% of rt_bw. It also removes the sched_dl_period_us control
knob, using sched_rt_period_us as the common period for both rt_bw and
dl_bw.

Checks are made when the user tries to change the dl_bw sub-quota so
that it does not fall below what is currently allocated. Since dl_bw
now depends upon rt_bw, similar checks are performed when the user
modifies rt_bw, and dl_bw is changed accordingly. Setting the rt_bw
sysctl variable to -1 (which actually disables rt throttling) disables
dl_bw checks as well.
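
As a quick sanity check of the numbers above (illustration only, it simply
mirrors what actual_dl_runtime() in the diff computes): with the default
rt_bw of 950000us every 1000000us and a 40% dl sub-quota, the bandwidth
actually available to -deadline tasks is

    (400000 * 950000) / 1000000 = 380000us every 1000000us

i.e. 38% of each CPU, which is the utilization cap then enforced by the
admission test.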

Signed-off-by: Juri Lelli <[email protected]>
---
include/linux/sched.h | 1 -
kernel/sched/core.c | 282 +++++++++++++++++++++++--------------------------
kernel/sched/dl.c | 3 +-
kernel/sched/sched.h | 22 ++--
kernel/sysctl.c | 7 --
5 files changed, 143 insertions(+), 172 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4ad8dc1..3bce12f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2156,7 +2156,6 @@ int sched_rt_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);

-extern unsigned int sysctl_sched_dl_period;
extern int sysctl_sched_dl_runtime;

int sched_dl_handler(struct ctl_table *table, int write,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b926969..3003a4e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -288,13 +288,12 @@ __read_mostly int scheduler_running;
int sysctl_sched_rt_runtime = 950000;

/*
- * Maximum bandwidth available for all -deadline tasks and groups
- * (if group scheduling is configured) on each CPU.
+ * Sub-quota of rt bandwidth available for all -deadline tasks
+ * on each CPU.
*
- * default: 5%
+ * default: 40%
*/
-unsigned int sysctl_sched_dl_period = 1000000;
-int sysctl_sched_dl_runtime = 50000;
+int sysctl_sched_dl_runtime = 400000;



@@ -7204,7 +7203,7 @@ void __init sched_init(void)
init_rt_bandwidth(&def_rt_bandwidth,
global_rt_period(), global_rt_runtime());
init_dl_bandwidth(&def_dl_bandwidth,
- global_dl_period(), global_dl_runtime());
+ global_rt_period(), global_dl_runtime());

#ifdef CONFIG_RT_GROUP_SCHED
init_rt_bandwidth(&root_task_group.rt_bandwidth,
@@ -7598,6 +7597,93 @@ void sched_move_task(struct task_struct *tsk)
}
#endif /* CONFIG_CGROUP_SCHED */

+static u64 actual_dl_runtime(void)
+{
+ u64 dl_runtime = global_dl_runtime();
+ u64 rt_runtime = global_rt_runtime();
+ u64 period = global_rt_period();
+
+ /*
+ * We want to calculate the sub-quota of rt_bw actually available
+ * for -dl tasks. It is a percentage of a percentage. By default 95%
+ * of system bandwidth is allocated to -rt tasks; among this, a 40%
+ * quota is reserved for -dl tasks. To have the actual quota a simple
+ * multiplication is needed: .95 * .40 = .38 (38% of system bandwidth
+ * for deadline tasks).
+ * What follows is basically the same, but using unsigned integers.
+ *
+ * dl_runtime rt_runtime
+ * actual_runtime = ---------- * ---------- * period
+ * period period
+ */
+ if (dl_runtime == RUNTIME_INF)
+ return RUNTIME_INF;
+
+ return div64_u64 (dl_runtime * rt_runtime, period);
+}
+
+static int check_dl_bw(void)
+{
+ int i;
+ u64 period = global_rt_period();
+ u64 dl_actual_runtime = actual_dl_runtime();
+ u64 new_bw = to_ratio(period, dl_actual_runtime);
+
+ /*
+ * Here we want to check the bandwidth not being set to some
+ * value smaller than the currently allocated bandwidth in
+ * any of the root_domains.
+ *
+ * FIXME: Cycling on all the CPUs is overdoing, but simpler than
+ * cycling on root_domains... Discussion on different/better
+ * solutions is welcome!
+ */
+ for_each_possible_cpu(i) {
+#ifdef CONFIG_SMP
+ struct dl_bw *dl_b = &cpu_rq(i)->rd->dl_bw;
+#else
+ struct dl_bw *dl_b = &cpu_rq(i)->dl.dl_bw;
+#endif
+ raw_spin_lock(&dl_b->lock);
+ if (new_bw < dl_b->total_bw) {
+ raw_spin_unlock(&dl_b->lock);
+ return -EBUSY;
+ }
+ raw_spin_unlock(&dl_b->lock);
+ }
+
+ return 0;
+}
+
+static void update_dl_bw(void)
+{
+ u64 new_bw;
+ int i;
+
+ def_dl_bandwidth.dl_runtime = global_dl_runtime();
+ if (global_dl_runtime() == RUNTIME_INF ||
+ global_rt_runtime() == RUNTIME_INF)
+ new_bw = -1;
+ else {
+ new_bw = to_ratio(global_rt_period(),
+ actual_dl_runtime());
+ }
+ /*
+ * FIXME: As above...
+ */
+ for_each_possible_cpu(i) {
+#ifdef CONFIG_SMP
+ struct dl_bw *dl_b = &cpu_rq(i)->rd->dl_bw;
+#else
+ struct dl_bw *dl_b = &cpu_rq(i)->dl.dl_bw;
+#endif
+
+ raw_spin_lock(&dl_b->lock);
+ dl_b->bw = new_bw;
+ raw_spin_unlock(&dl_b->lock);
+ }
+}
+
#ifdef CONFIG_RT_GROUP_SCHED
/*
* Ensure that the real time constraints are schedulable.
@@ -7771,48 +7857,10 @@ long sched_group_rt_period(struct task_group *tg)
do_div(rt_period_us, NSEC_PER_USEC);
return rt_period_us;
}
-#endif /* CONFIG_RT_GROUP_SCHED */

-/*
- * Coupling of -rt and -deadline bandwidth.
- *
- * Here we check if the new -rt bandwidth value is consistent
- * with the system settings for the bandwidth available
- * to -deadline tasks.
- *
- * IOW, we want to enforce that
- *
- * rt_bandwidth + dl_bandwidth <= 100%
- *
- * is always true.
- */
-static bool __sched_rt_dl_global_constraints(u64 rt_bw)
-{
- unsigned long flags;
- u64 dl_bw;
- bool ret;
-
- raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock, flags);
- if (global_rt_runtime() == RUNTIME_INF ||
- global_dl_runtime() == RUNTIME_INF) {
- ret = true;
- goto unlock;
- }
-
- dl_bw = to_ratio(def_dl_bandwidth.dl_period,
- def_dl_bandwidth.dl_runtime);
-
- ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
-unlock:
- raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock, flags);
-
- return ret;
-}
-
-#ifdef CONFIG_RT_GROUP_SCHED
static int sched_rt_global_constraints(void)
{
- u64 runtime, period, bw;
+ u64 runtime, period;
int ret = 0;

if (sysctl_sched_rt_period <= 0)
@@ -7827,9 +7875,13 @@ static int sched_rt_global_constraints(void)
if (runtime > period && runtime != RUNTIME_INF)
return -EINVAL;

- bw = to_ratio(period, runtime);
- if (!__sched_rt_dl_global_constraints(bw))
- return -EINVAL;
+ /*
+ * Check if changing rt_bw could have negative effects
+ * on dl_bw
+ */
+ ret = check_dl_bw();
+ if (ret)
+ return ret;

mutex_lock(&rt_constraints_mutex);
read_lock(&tasklist_lock);
@@ -7853,18 +7905,27 @@ int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
static int sched_rt_global_constraints(void)
{
unsigned long flags;
- int i, ret = 0;
- u64 bw;
+ int i, ret;

if (sysctl_sched_rt_period <= 0)
return -EINVAL;

+ /*
+ * There's always some RT tasks in the root group
+ * -- migration, kstopmachine etc..
+ */
+ if (sysctl_sched_rt_runtime == 0)
+ return -EBUSY;
+
+ /*
+ * Check if changing rt_bw could have negative effects
+ * on dl_bw
+ */
+ ret = check_dl_bw();
+ if (ret)
+ return ret;
+
raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
- bw = to_ratio(global_rt_period(), global_rt_runtime());
- if (!__sched_rt_dl_global_constraints(bw)) {
- ret = -EINVAL;
- goto unlock;
- }

for_each_possible_cpu(i) {
struct rt_rq *rt_rq = &cpu_rq(i)->rt;
@@ -7873,48 +7934,12 @@ static int sched_rt_global_constraints(void)
rt_rq->rt_runtime = global_rt_runtime();
raw_spin_unlock(&rt_rq->rt_runtime_lock);
}
-unlock:
raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);

- return ret;
+ return 0;
}
#endif /* CONFIG_RT_GROUP_SCHED */

-/*
- * Coupling of -dl and -rt bandwidth.
- *
- * Here we check, while setting the system wide bandwidth available
- * for -dl tasks and groups, if the new values are consistent with
- * the system settings for the bandwidth available to -rt entities.
- *
- * IOW, we want to enforce that
- *
- * rt_bandwidth + dl_bandwidth <= 100%
- *
- * is always true.
- */
-static bool __sched_dl_rt_global_constraints(u64 dl_bw)
-{
- u64 rt_bw;
- bool ret;
-
- raw_spin_lock(&def_rt_bandwidth.rt_runtime_lock);
- if (global_dl_runtime() == RUNTIME_INF ||
- global_rt_runtime() == RUNTIME_INF) {
- ret = true;
- goto unlock;
- }
-
- rt_bw = to_ratio(ktime_to_ns(def_rt_bandwidth.rt_period),
- def_rt_bandwidth.rt_runtime);
-
- ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
-unlock:
- raw_spin_unlock(&def_rt_bandwidth.rt_runtime_lock);
-
- return ret;
-}
-
static bool __sched_dl_global_constraints(u64 runtime, u64 period)
{
if (!period || (runtime != RUNTIME_INF && runtime > period))
@@ -7925,40 +7950,17 @@ static bool __sched_dl_global_constraints(u64 runtime, u64 period)

static int sched_dl_global_constraints(void)
{
- u64 runtime = global_dl_runtime();
- u64 period = global_dl_period();
- u64 new_bw = to_ratio(period, runtime);
- int ret, i;
+ u64 period = global_rt_period();
+ u64 dl_actual_runtime = actual_dl_runtime();
+ int ret;

- ret = __sched_dl_global_constraints(runtime, period);
+ ret = __sched_dl_global_constraints(dl_actual_runtime, period);
if (ret)
return ret;

- if (!__sched_dl_rt_global_constraints(new_bw))
- return -EINVAL;
-
- /*
- * Here we want to check the bandwidth not being set to some
- * value smaller than the currently allocated bandwidth in
- * any of the root_domains.
- *
- * FIXME: Cycling on all the CPUs is overdoing, but simpler than
- * cycling on root_domains... Discussion on different/better
- * solutions is welcome!
- */
- for_each_possible_cpu(i) {
-#ifdef CONFIG_SMP
- struct dl_bw *dl_b = &cpu_rq(i)->rd->dl_bw;
-#else
- struct dl_bw *dl_b = &cpu_rq(i)->dl.dl_bw;
-#endif
- raw_spin_lock(&dl_b->lock);
- if (new_bw < dl_b->total_bw) {
- raw_spin_unlock(&dl_b->lock);
- return -EBUSY;
- }
- raw_spin_unlock(&dl_b->lock);
- }
+ ret = check_dl_bw();
+ if (ret)
+ return ret;

return 0;
}
@@ -7970,6 +7972,7 @@ int sched_rt_handler(struct ctl_table *table, int write,
int ret;
int old_period, old_runtime;
static DEFINE_MUTEX(mutex);
+ unsigned long flags;

mutex_lock(&mutex);
old_period = sysctl_sched_rt_period;
@@ -7978,6 +7981,8 @@ int sched_rt_handler(struct ctl_table *table, int write,
ret = proc_dointvec(table, write, buffer, lenp, ppos);

if (!ret && write) {
+ raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock,
+ flags);
ret = sched_rt_global_constraints();
if (ret) {
sysctl_sched_rt_period = old_period;
@@ -7986,7 +7991,11 @@ int sched_rt_handler(struct ctl_table *table, int write,
def_rt_bandwidth.rt_runtime = global_rt_runtime();
def_rt_bandwidth.rt_period =
ns_to_ktime(global_rt_period());
+
+ update_dl_bw();
}
+ raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock,
+ flags);
}
mutex_unlock(&mutex);

@@ -7998,12 +8007,11 @@ int sched_dl_handler(struct ctl_table *table, int write,
loff_t *ppos)
{
int ret;
- int old_period, old_runtime;
+ int old_runtime;
static DEFINE_MUTEX(mutex);
unsigned long flags;

mutex_lock(&mutex);
- old_period = sysctl_sched_dl_period;
old_runtime = sysctl_sched_dl_runtime;

ret = proc_dointvec(table, write, buffer, lenp, ppos);
@@ -8014,33 +8022,9 @@ int sched_dl_handler(struct ctl_table *table, int write,

ret = sched_dl_global_constraints();
if (ret) {
- sysctl_sched_dl_period = old_period;
sysctl_sched_dl_runtime = old_runtime;
} else {
- u64 new_bw;
- int i;
-
- def_dl_bandwidth.dl_period = global_dl_period();
- def_dl_bandwidth.dl_runtime = global_dl_runtime();
- if (global_dl_runtime() == RUNTIME_INF)
- new_bw = -1;
- else
- new_bw = to_ratio(global_dl_period(),
- global_dl_runtime());
- /*
- * FIXME: As above...
- */
- for_each_possible_cpu(i) {
-#ifdef CONFIG_SMP
- struct dl_bw *dl_b = &cpu_rq(i)->rd->dl_bw;
-#else
- struct dl_bw *dl_b = &cpu_rq(i)->dl.dl_bw;
-#endif
-
- raw_spin_lock(&dl_b->lock);
- dl_b->bw = new_bw;
- raw_spin_unlock(&dl_b->lock);
- }
+ update_dl_bw();
}

raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock,
diff --git a/kernel/sched/dl.c b/kernel/sched/dl.c
index b345853..4176b8c 100644
--- a/kernel/sched/dl.c
+++ b/kernel/sched/dl.c
@@ -52,7 +52,6 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
{
raw_spin_lock_init(&dl_b->dl_runtime_lock);
- dl_b->dl_period = period;
dl_b->dl_runtime = runtime;
}

@@ -65,7 +64,7 @@ void init_dl_bw(struct dl_bw *dl_b)
if (global_dl_runtime() == RUNTIME_INF)
dl_b->bw = -1;
else
- dl_b->bw = to_ratio(global_dl_period(), global_dl_runtime());
+ dl_b->bw = to_ratio(global_rt_period(), global_dl_runtime());
raw_spin_unlock(&def_dl_bandwidth.dl_runtime_lock);
dl_b->total_bw = 0;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 85fe8a0..1e0d5b1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -97,20 +97,20 @@ struct rt_bandwidth {
struct hrtimer rt_period_timer;
};
/*
- * To keep the bandwidth of -deadline tasks and groups under control
- * we need some place where:
- * - store the maximum -deadline bandwidth of the system (the group);
+ * To keep the bandwidth of -deadline tasks under control we need some
+ * place where:
+ * - store the maximum -deadline bandwidth of the system;
* - cache the fraction of that bandwidth that is currently allocated.
*
* This is all done in the data structure below. It is similar to the
* one used for RT-throttling (rt_bandwidth), with the main difference
* that, since here we are only interested in admission control, we
- * do not decrease any runtime while the group "executes", neither we
+ * do not decrease any runtime while the task "executes", neither we
* need a timer to replenish it.
*
* With respect to SMP, the bandwidth is given on a per-CPU basis,
* meaning that:
- * - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
+ * - dl_bw (< 100%) is the bandwidth of the system on each CPU;
* - dl_total_bw array contains, in the i-th element, the currently
* allocated bandwidth on the i-th CPU.
* Moreover, groups consume bandwidth on each CPU, while tasks only
@@ -123,7 +123,6 @@ struct rt_bandwidth {
struct dl_bandwidth {
raw_spinlock_t dl_runtime_lock;
u64 dl_runtime;
- u64 dl_period;
};

static inline int dl_bandwidth_enabled(void)
@@ -133,10 +132,12 @@ static inline int dl_bandwidth_enabled(void)

struct dl_bw {
raw_spinlock_t lock;
- u64 bw, total_bw;
+ /* default value */
+ u64 bw;
+ /* allocated */
+ u64 total_bw;
};

-static inline u64 global_dl_period(void);
static inline u64 global_dl_runtime(void);

extern struct mutex sched_domains_mutex;
@@ -781,11 +782,6 @@ static inline u64 global_rt_runtime(void)
return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
}

-static inline u64 global_dl_period(void)
-{
- return (u64)sysctl_sched_dl_period * NSEC_PER_USEC;
-}
-
static inline u64 global_dl_runtime(void)
{
if (sysctl_sched_dl_runtime < 0)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9731aab..2938473 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -363,13 +363,6 @@ static struct ctl_table kern_table[] = {
.proc_handler = sched_rt_handler,
},
{
- .procname = "sched_dl_period_us",
- .data = &sysctl_sched_dl_period,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = sched_dl_handler,
- },
- {
.procname = "sched_dl_runtime_us",
.data = &sysctl_sched_dl_runtime,
.maxlen = sizeof(int),
--
1.7.9.5

2012-10-24 21:58:13

by Juri Lelli

[permalink] [raw]
Subject: [PATCH 15/16] sched: speed up -dl pushes with a push-heap.

Data from tests confirmed that the original active load balancing
logic didn't scale either in the number of CPUs or in the number of
tasks (as sched_rt does).

Here we provide a global data structure to keep track of the deadlines
of the running tasks in the system. The structure is composed of
a bitmask showing the free CPUs and a max-heap, needed when the system
is heavily loaded.

The implementation and concurrent access scheme are kept simple by
design. However, our measurements show that we can compete with sched_rt
on large multi-CPU machines [1].

Only the push path is addressed; the extension to use this structure
also for pull decisions is straightforward. However, we are currently
evaluating different data structures (in order to decrease/avoid
contention) to possibly solve both problems. We are also going to re-run
tests considering recent changes inside cpupri [2].

[1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf
[2] http://www.spinics.net/lists/linux-rt-users/msg06778.html
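
For reference, the intended use on the push path is sketched below (this is
only an illustration of the decision rule, using the cpudl API introduced by
this patch; pick_push_target() is a made-up wrapper):

/*
 * cpudl_find() first tries a free CPU (from the bitmask) that is also
 * in p's affinity mask; failing that, it falls back to the heap
 * maximum, i.e. the CPU running the latest-deadline task, but only if
 * p's deadline is earlier than that one. -1 means no suitable CPU.
 */
static int pick_push_target(struct cpudl *cp, struct task_struct *p)
{
	return cpudl_find(cp, p, NULL);
}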

Signed-off-by: Juri Lelli <[email protected]>
---
kernel/sched/Makefile | 2 +-
kernel/sched/core.c | 3 +
kernel/sched/cpudl.c | 208 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/cpudl.h | 33 ++++++++
kernel/sched/dl.c | 51 +++---------
kernel/sched/sched.h | 2 +
6 files changed, 259 insertions(+), 40 deletions(-)
create mode 100644 kernel/sched/cpudl.c
create mode 100644 kernel/sched/cpudl.h

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 622046c..1c788ab 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -12,7 +12,7 @@ CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
endif

obj-y += core.o clock.o cputime.o idle_task.o fair.o rt.o dl.o stop_task.o
-obj-$(CONFIG_SMP) += cpupri.o
+obj-$(CONFIG_SMP) += cpupri.o cpudl.o
obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3003a4e..499c23d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5738,6 +5738,7 @@ static void free_rootdomain(struct rcu_head *rcu)
struct root_domain *rd = container_of(rcu, struct root_domain, rcu);

cpupri_cleanup(&rd->cpupri);
+ cpudl_cleanup(&rd->cpudl);
free_cpumask_var(rd->dlo_mask);
free_cpumask_var(rd->rto_mask);
free_cpumask_var(rd->online);
@@ -5796,6 +5797,8 @@ static int init_rootdomain(struct root_domain *rd)
goto free_dlo_mask;

init_dl_bw(&rd->dl_bw);
+ if (cpudl_init(&rd->cpudl) != 0)
+ goto free_dlo_mask;

if (cpupri_init(&rd->cpupri) != 0)
goto free_rto_mask;
diff --git a/kernel/sched/cpudl.c b/kernel/sched/cpudl.c
new file mode 100644
index 0000000..ac4f746
--- /dev/null
+++ b/kernel/sched/cpudl.c
@@ -0,0 +1,208 @@
+/*
+ * kernel/sched/cpudl.c
+ *
+ * Global CPU deadline management
+ *
+ * Author: Juri Lelli <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/gfp.h>
+#include <linux/kernel.h>
+#include "cpudl.h"
+
+static inline int parent(int i)
+{
+ return (i - 1) >> 1;
+}
+
+static inline int left_child(int i)
+{
+ return (i << 1) + 1;
+}
+
+static inline int right_child(int i)
+{
+ return (i << 1) + 2;
+}
+
+static inline int dl_time_before(u64 a, u64 b)
+{
+ return (s64)(a - b) < 0;
+}
+
+void cpudl_exchange(struct cpudl *cp, int a, int b)
+{
+ int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
+
+ swap(cp->elements[a], cp->elements[b]);
+ swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
+}
+
+void cpudl_heapify(struct cpudl *cp, int idx)
+{
+ int l, r, largest;
+
+ /* adapted from lib/prio_heap.c */
+ while(1) {
+ l = left_child(idx);
+ r = right_child(idx);
+ largest = idx;
+
+ if ((l < cp->size) && dl_time_before(cp->elements[idx].dl,
+ cp->elements[l].dl))
+ largest = l;
+ if ((r < cp->size) && dl_time_before(cp->elements[largest].dl,
+ cp->elements[r].dl))
+ largest = r;
+ if (largest == idx)
+ break;
+
+ /* Push idx down the heap one level and bump one up */
+ cpudl_exchange(cp, largest, idx);
+ idx = largest;
+ }
+}
+
+void cpudl_change_key(struct cpudl *cp, int idx, u64 new_dl)
+{
+ WARN_ON(idx > num_present_cpus() && idx != -1);
+
+ if (dl_time_before(new_dl, cp->elements[idx].dl)) {
+ cp->elements[idx].dl = new_dl;
+ cpudl_heapify(cp, idx);
+ } else {
+ cp->elements[idx].dl = new_dl;
+ while (idx > 0 && dl_time_before(cp->elements[parent(idx)].dl,
+ cp->elements[idx].dl)) {
+ cpudl_exchange(cp, idx, parent(idx));
+ idx = parent(idx);
+ }
+ }
+}
+
+static inline int cpudl_maximum(struct cpudl *cp)
+{
+ return cp->elements[0].cpu;
+}
+
+/*
+ * cpudl_find - find the best (later-dl) CPU in the system
+ * @cp: the cpudl max-heap context
+ * @p: the task
+ * @later_mask: a mask to fill in with the selected CPUs (or NULL)
+ *
+ * Returns: int - best CPU (heap maximum if suitable)
+ */
+int cpudl_find(struct cpudl *cp, struct task_struct *p,
+ struct cpumask *later_mask)
+{
+ int best_cpu = -1;
+ const struct sched_dl_entity *dl_se = &p->dl;
+
+ if (later_mask && cpumask_and(later_mask, cp->free_cpus,
+ &p->cpus_allowed) && cpumask_and(later_mask,
+ later_mask, cpu_active_mask)) {
+ best_cpu = cpumask_any(later_mask);
+ goto out;
+ } else if (cpumask_test_cpu(cpudl_maximum(cp), &p->cpus_allowed) &&
+ dl_time_before(dl_se->deadline, cp->elements[0].dl)) {
+ best_cpu = cpudl_maximum(cp);
+ if (later_mask)
+ cpumask_set_cpu(best_cpu, later_mask);
+ }
+
+out:
+ WARN_ON(best_cpu > num_present_cpus() && best_cpu != -1);
+
+ return best_cpu;
+}
+
+/*
+ * cpudl_set - update the cpudl max-heap
+ * @cp: the cpudl max-heap context
+ * @cpu: the target cpu
+ * @dl: the new earliest deadline for this cpu
+ * @is_valid: non-zero if @dl is valid; zero removes @cpu from the heap
+ *
+ * Notes: assumes cpu_rq(cpu)->lock is locked
+ *
+ * Returns: (void)
+ */
+void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
+{
+ int old_idx, new_cpu;
+ unsigned long flags;
+
+ WARN_ON(cpu > num_present_cpus());
+
+ raw_spin_lock_irqsave(&cp->lock, flags);
+ old_idx = cp->cpu_to_idx[cpu];
+ if (!is_valid) {
+ /* remove item */
+ new_cpu = cp->elements[cp->size - 1].cpu;
+ cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
+ cp->elements[old_idx].cpu = new_cpu;
+ cp->size--;
+ cp->cpu_to_idx[new_cpu] = old_idx;
+ cp->cpu_to_idx[cpu] = IDX_INVALID;
+ while (old_idx > 0 && dl_time_before(
+ cp->elements[parent(old_idx)].dl,
+ cp->elements[old_idx].dl)) {
+ cpudl_exchange(cp, old_idx, parent(old_idx));
+ old_idx = parent(old_idx);
+ }
+ cpumask_set_cpu(cpu, cp->free_cpus);
+ cpudl_heapify(cp, old_idx);
+
+ goto out;
+ }
+
+ if (old_idx == IDX_INVALID) {
+ cp->size++;
+ cp->elements[cp->size - 1].dl = 0;
+ cp->elements[cp->size - 1].cpu = cpu;
+ cp->cpu_to_idx[cpu] = cp->size - 1;
+ cpudl_change_key(cp, cp->size - 1, dl);
+ cpumask_clear_cpu(cpu, cp->free_cpus);
+ } else {
+ cpudl_change_key(cp, old_idx, dl);
+ }
+
+out:
+ raw_spin_unlock_irqrestore(&cp->lock, flags);
+}
+
+/*
+ * cpudl_init - initialize the cpudl structure
+ * @cp: the cpudl max-heap context
+ */
+int cpudl_init(struct cpudl *cp)
+{
+ int i;
+
+ memset(cp, 0, sizeof(*cp));
+ raw_spin_lock_init(&cp->lock);
+ cp->size = 0;
+ for (i = 0; i < NR_CPUS; i++)
+ cp->cpu_to_idx[i] = IDX_INVALID;
+ if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
+ return -ENOMEM;
+ cpumask_setall(cp->free_cpus);
+
+ return 0;
+}
+
+/*
+ * cpudl_cleanup - clean up the cpudl structure
+ * @cp: the cpudl max-heap context
+ */
+void cpudl_cleanup(struct cpudl *cp)
+{
+ /*
+ * nothing to do for the moment
+ */
+}
diff --git a/kernel/sched/cpudl.h b/kernel/sched/cpudl.h
new file mode 100644
index 0000000..a202789
--- /dev/null
+++ b/kernel/sched/cpudl.h
@@ -0,0 +1,33 @@
+#ifndef _LINUX_CPUDL_H
+#define _LINUX_CPUDL_H
+
+#include <linux/sched.h>
+
+#define IDX_INVALID -1
+
+struct array_item {
+ u64 dl;
+ int cpu;
+};
+
+struct cpudl {
+ raw_spinlock_t lock;
+ int size;
+ int cpu_to_idx[NR_CPUS];
+ struct array_item elements[NR_CPUS];
+ cpumask_var_t free_cpus;
+};
+
+
+#ifdef CONFIG_SMP
+int cpudl_find(struct cpudl *cp, struct task_struct *p,
+ struct cpumask *later_mask);
+void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid);
+int cpudl_init(struct cpudl *cp);
+void cpudl_cleanup(struct cpudl *cp);
+#else
+#define cpudl_set(cp, cpu, dl, is_valid) do { } while (0)
+#define cpudl_init(cp) (0)
+#define cpudl_cleanup(cp) do { } while (0)
+#endif /* CONFIG_SMP */
+
+#endif /* _LINUX_CPUDL_H */
diff --git a/kernel/sched/dl.c b/kernel/sched/dl.c
index 4176b8c..1002955 100644
--- a/kernel/sched/dl.c
+++ b/kernel/sched/dl.c
@@ -650,6 +650,7 @@ static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
*/
dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
dl_rq->earliest_dl.curr = deadline;
+ cpudl_set(&rq->rd->cpudl, rq->cpu, deadline, 1);
} else if (dl_rq->earliest_dl.next == 0 ||
dl_time_before(deadline, dl_rq->earliest_dl.next)) {
/*
@@ -673,6 +674,7 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
if (!dl_rq->dl_nr_running) {
dl_rq->earliest_dl.curr = 0;
dl_rq->earliest_dl.next = 0;
+ cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
} else {
struct rb_node *leftmost = dl_rq->rb_leftmost;
struct sched_dl_entity *entry;
@@ -680,6 +682,7 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
dl_rq->earliest_dl.curr = entry->deadline;
dl_rq->earliest_dl.next = next_deadline(rq);
+ cpudl_set(&rq->rd->cpudl, rq->cpu, entry->deadline, 1);
}
}

@@ -861,9 +864,6 @@ static void yield_task_dl(struct rq *rq)
#ifdef CONFIG_SMP

static int find_later_rq(struct task_struct *task);
-static int latest_cpu_find(struct cpumask *span,
- struct task_struct *task,
- struct cpumask *later_mask);

static int
select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
@@ -913,7 +913,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
* let's hope p can move out.
*/
if (rq->curr->nr_cpus_allowed == 1 ||
- latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
+ cpudl_find(&rq->rd->cpudl, rq->curr, NULL) == -1)
return;

/*
@@ -921,7 +921,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
* see if it is pushed or pulled somewhere else.
*/
if (p->nr_cpus_allowed != 1 &&
- latest_cpu_find(rq->rd->span, p, NULL) != -1)
+ cpudl_find(&rq->rd->cpudl, p, NULL) != -1)
return;

resched_task(rq->curr);
@@ -1114,39 +1114,6 @@ next_node:
return NULL;
}

-static int latest_cpu_find(struct cpumask *span,
- struct task_struct *task,
- struct cpumask *later_mask)
-{
- const struct sched_dl_entity *dl_se = &task->dl;
- int cpu, found = -1, best = 0;
- u64 max_dl = 0;
-
- for_each_cpu(cpu, span) {
- struct rq *rq = cpu_rq(cpu);
- struct dl_rq *dl_rq = &rq->dl;
-
- if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
- (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
- dl_rq->earliest_dl.curr))) {
- if (later_mask)
- cpumask_set_cpu(cpu, later_mask);
- if (!best && !dl_rq->dl_nr_running) {
- best = 1;
- found = cpu;
- } else if (!best &&
- dl_time_before(max_dl,
- dl_rq->earliest_dl.curr)) {
- max_dl = dl_rq->earliest_dl.curr;
- found = cpu;
- }
- } else if (later_mask)
- cpumask_clear_cpu(cpu, later_mask);
- }
-
- return found;
-}
-
static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);

static int find_later_rq(struct task_struct *task)
@@ -1163,7 +1130,8 @@ static int find_later_rq(struct task_struct *task)
if (task->nr_cpus_allowed == 1)
return -1;

- best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
+ best_cpu = cpudl_find(&task_rq(task)->rd->cpudl,
+ task, later_mask);
if (best_cpu == -1)
return -1;

@@ -1529,6 +1497,9 @@ static void rq_online_dl(struct rq *rq)
{
if (rq->dl.overloaded)
dl_set_overload(rq);
+
+ if (rq->dl.dl_nr_running > 0)
+ cpudl_set(&rq->rd->cpudl, rq->cpu, rq->dl.earliest_dl.curr, 1);
}

/* Assumes rq->lock is held */
@@ -1536,6 +1507,8 @@ static void rq_offline_dl(struct rq *rq)
{
if (rq->dl.overloaded)
dl_clear_overload(rq);
+
+ cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
}

void init_sched_dl_class(void)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1e0d5b1..3450dee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -5,6 +5,7 @@
#include <linux/stop_machine.h>

#include "cpupri.h"
+#include "cpudl.h"

extern __read_mostly int scheduler_running;

@@ -445,6 +446,7 @@ struct root_domain {
cpumask_var_t dlo_mask;
atomic_t dlo_count;
struct dl_bw dl_bw;
+ struct cpudl cpudl;

/*
* The "RT overload" flag: it gets set if a CPU has more than
--
1.7.9.5

2012-10-24 21:58:28

by Juri Lelli

[permalink] [raw]
Subject: [PATCH 16/16] sched: add sched_dl documentation.

From: Dario Faggioli <[email protected]>

Add in Documentation/scheduler/ some hints about the design
choices, the usage and the possible future developments of the
sched_dl scheduling class and of the SCHED_DEADLINE policy.

Signed-off-by: Dario Faggioli <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
---
Documentation/scheduler/sched-deadline.txt | 164 ++++++++++++++++++++++++++++
kernel/sched/dl.c | 3 +-
2 files changed, 166 insertions(+), 1 deletion(-)
create mode 100644 Documentation/scheduler/sched-deadline.txt

diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
new file mode 100644
index 0000000..d4dcfc7
--- /dev/null
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -0,0 +1,164 @@
+ Deadline Task and Group Scheduling
+ ----------------------------------
+
+CONTENTS
+========
+
+0. WARNING
+1. Overview
+2. Task scheduling
+3. Bandwidth management
+ 3.1 System-wide settings
+ 3.2 Task interface
+ 3.3 Default behavior
+4. Future plans
+
+
+0. WARNING
+==========
+
+ Fiddling with these settings can result in unpredictable or even unstable
+ system behavior. As for -rt (group) scheduling, it is assumed that root
+ users know what they're doing.
+
+
+1. Overview
+===========
+
+ The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
+ basically an implementation of the Earliest Deadline First (EDF) scheduling
+ algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
+ that makes it possible to isolate the behavior of tasks from one another.
+
+
+2. Task scheduling
+==================
+
+ The typical -deadline task is composed of a computation phase (instance)
+ which is activated in a periodic or sporadic fashion. The expected (maximum)
+ duration of such a computation is called the task's runtime; the time
+ interval within which each instance needs to be completed is called the
+ task's relative deadline. The task's absolute deadline is dynamically
+ calculated as the time instant a task (or, more properly, one of its
+ instances) activates plus the relative deadline.
+
+ The EDF[1] algorithm selects the task with the smallest absolute deadline as
+ the one to be executed first, while the CBS[2,3] ensures that each task runs
+ for at most its runtime every period, avoiding any interference between
+ different tasks (bandwidth isolation).
+ Thanks to this feature, even tasks that do not strictly comply with the
+ computational model described above can effectively use the new policy.
+ IOW, there are no limitations on what kind of task can exploit this new
+ scheduling discipline, even if it must be said that it is particularly
+ suited for periodic or sporadic tasks that need guarantees on their
+ timing behavior, e.g., multimedia, streaming, control applications, etc.
+ (a toy model of the EDF + CBS interaction is sketched after the references
+ below).
+
+ References:
+ 1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
+ ming in a hard-real-time environment. Journal of the Association for
+ Computing Machinery, 20(1), 1973.
+ 2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
+ Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
+ Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
+ 3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
+ Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
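+
+ As a purely illustrative, user-space toy model (names such as toy_task,
+ edf_pick() and cbs_replenish() are invented for this sketch and are not
+ part of the kernel implementation), the interplay between EDF selection
+ and CBS budget enforcement can be pictured as follows:
+
+	struct toy_task {
+		unsigned long long runtime;	/* budget per period (ns)      */
+		unsigned long long period;	/* replenishment period (ns)   */
+		unsigned long long deadline;	/* current absolute deadline   */
+		unsigned long long budget;	/* remaining budget (ns)       */
+	};
+
+	/* EDF: among the ready tasks with budget left, run the one with
+	 * the earliest absolute deadline. */
+	static struct toy_task *edf_pick(struct toy_task *t, int n)
+	{
+		struct toy_task *best = NULL;
+		int i;
+
+		for (i = 0; i < n; i++)
+			if (t[i].budget > 0 &&
+			    (!best || t[i].deadline < best->deadline))
+				best = &t[i];
+		return best;
+	}
+
+	/* CBS: when a task exhausts its budget it is throttled; at its
+	 * current deadline the budget is refilled and the deadline is
+	 * postponed by one period, so the task never consumes more than
+	 * runtime/period of a CPU. */
+	static void cbs_replenish(struct toy_task *t)
+	{
+		t->budget = t->runtime;
+		t->deadline += t->period;
+	}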
+
+3. Bandwidth management
+=======================
+
+ In order for the -deadline scheduling to be effective and useful, it is
+ important to have some method to keep the allocation of the available CPU
+ bandwidth to the tasks under control.
+ This is usually called "admission control" and if it is not performed at all,
+ no guarantee can be given on the actual scheduling of the -deadline tasks.
+
+ Since RT-throttling was introduced, each task group has had an associated
+ bandwidth, calculated as a certain amount of runtime over a period.
+ Moreover, to make it possible to manipulate such bandwidth, readable/writable
+ controls have been added to both procfs (for system wide settings) and cgroupfs
+ (for per-group settings).
+ Therefore, the same interface is used for controlling the bandwidth
+ distribution to -deadline tasks and task groups, i.e., new controls with
+ similar names, equivalent meaning and the same usage paradigm are added.
+
+ However, more discussion is needed in order to figure out how we want to manage
+ SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
+ uses (for now) a less sophisticated, but actually very sensible, mechanism to
+ ensure that a certain utilization cap is not exceeded in each root_domain.
+
+ Another main difference between deadline bandwidth management and RT-throttling
+ is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
+ and thus we don't need a higher-level throttling mechanism to enforce the
+ desired bandwidth.
+
+3.1 System-wide settings
+------------------------
+
+The system-wide settings are configured under the /proc virtual file system:
+
+ * /proc/sys/kernel/sched_dl_runtime_us
+ * /proc/sys/kernel/sched_dl_period_us
+
+ They accept (if written) and provide (if read) the new runtime and period,
+ respectively, for each CPU in each root_domain.
+
+ This means that, for a root_domain comprising M CPUs, -deadline tasks
+ can be created until the sum of their bandwidths stays below:
+
+ M * (sched_dl_runtime_us / sched_dl_period_us)
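+
+ For example, for a root_domain of M = 4 CPUs and the default values of
+ Section 3.3 (sched_dl_runtime_us = 500000, sched_dl_period_us = 1000000),
+ -deadline tasks can be admitted until the sum of their bandwidths reaches
+ 4 * (500000 / 1000000) = 2.0, i.e., the equivalent of two fully utilized
+ CPUs.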
+
+ It is also possible to disable this bandwidth management logic, and thus
+ be free to oversubscribe the system up to any arbitrary level.
+ This is done by writing -1 to /proc/sys/kernel/sched_dl_runtime_us.
+
+
+3.2 Task interface
+------------------
+
+ Specifying a periodic/sporadic task that executes for a given amount of
+ runtime at each instance, and that is scheduled according to the urgency of
+ its own timing constraints, needs, in general, a way of declaring:
+ - a (maximum/typical) instance execution time,
+ - a minimum interval between consecutive instances,
+ - a time constraint by which each instance must be completed.
+
+ Therefore:
+ * a new struct sched_param2, containing all the necessary fields, is
+ provided;
+ * the new scheduling-related syscalls that manipulate it, i.e.,
+ sched_setscheduler2(), sched_setparam2() and sched_getparam2(),
+ are implemented (a rough usage sketch follows below).
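+
+ As a rough usage sketch, the snippet below shows how a periodic task with
+ 10ms of runtime every 100ms could ask to be scheduled under the new policy.
+ It is only illustrative: the exact layout of struct sched_param2, the value
+ of the SCHED_DEADLINE policy constant and the syscall number come from the
+ earlier patches in this series and from the target architecture, so the
+ definitions marked as placeholders must be taken from the real headers.
+
+	#include <string.h>
+	#include <unistd.h>
+	#include <sys/syscall.h>
+
+	#define SCHED_DEADLINE	6	/* placeholder: check sched.h	   */
+
+	#ifndef __NR_sched_setscheduler2
+	#define __NR_sched_setscheduler2 -1 /* placeholder: arch-specific */
+	#endif
+
+	struct sched_param2 {	/* simplified mirror of the uapi structure */
+		int sched_priority;
+		unsigned int sched_flags;
+		unsigned long long sched_runtime;	/* ns */
+		unsigned long long sched_deadline;	/* ns */
+		unsigned long long sched_period;	/* ns */
+		unsigned long long __unused[12];	/* padding may differ */
+	};
+
+	static int become_deadline_task(void)
+	{
+		struct sched_param2 p;
+
+		memset(&p, 0, sizeof(p));
+		p.sched_runtime  =  10 * 1000 * 1000ULL;	/*  10 ms */
+		p.sched_deadline = 100 * 1000 * 1000ULL;	/* 100 ms */
+		p.sched_period   = 100 * 1000 * 1000ULL;	/* 100 ms */
+
+		/* pid 0 means "the calling task" */
+		return syscall(__NR_sched_setscheduler2, 0, SCHED_DEADLINE, &p);
+	}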
+
+
+3.3 Default behavior
+--------------------
+
+The default values for SCHED_DEADLINE bandwidth are to have dl_runtime and
+dl_period equal to 500000 and 1000000, respectively. This means -deadline
+tasks can use at most 50% (i.e., 500000 / 1000000), multiplied by the number
+of CPUs that compose the root_domain, for each root_domain.
+
+When a -deadline task forks a child, the child's dl_runtime is set to 0, which
+means someone must call sched_setscheduler2() on it, or it won't even start.
+
+
+4. Future plans
+===============
+
+Still missing:
+
+ - refinements to deadline inheritance, especially regarding the possibility
+ of retaining bandwidth isolation among non-interacting tasks. This is
+ being studied from both theoretical and practical points of view, and
+ hopefully we should be able to produce some demonstrative code soon;
+ - (c)group based bandwidth management, and maybe scheduling;
+ - access control for non-root users (and the related security concerns to
+ address): what is the best way to allow unprivileged use of the mechanisms,
+ and how do we prevent non-root users from "cheating" the system?
+
+As already discussed, we also plan to merge this work with the EDF
+throttling patches [https://lkml.org/lkml/2010/2/23/239], but we are still in
+the preliminary phases of the merge and we really seek feedback that would
+help us decide on the direction it should take.
diff --git a/kernel/sched/dl.c b/kernel/sched/dl.c
index 1002955..69d9a54 100644
--- a/kernel/sched/dl.c
+++ b/kernel/sched/dl.c
@@ -347,7 +347,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
* disrupting the schedulability of the system. Otherwise, we should
* refill the runtime and set the deadline a period in the future,
* because keeping the current (absolute) deadline of the task would
- * result in breaking guarantees promised to other tasks.
+ * result in breaking guarantees promised to other tasks (refer to
+ * Documentation/scheduler/sched-deadline.txt for more information).
*
* This function returns true if:
*
--
1.7.9.5

2012-10-24 21:59:53

by Juri Lelli

[permalink] [raw]
Subject: [PATCH 06/16] sched: SCHED_DEADLINE SMP-related data structures & logic.

Introduces the data structures relevant for implementing dynamic
migration of -deadline tasks, the logic for checking if runqueues
are overloaded with -deadline tasks, and the logic for choosing
where a task should migrate when that is the case.

Also adds dynamic migrations to SCHED_DEADLINE, so that tasks can
be moved among CPUs when necessary. It is also possible to bind a
task to a (set of) CPU(s), thus restricting its ability to migrate,
or forbidding migrations altogether.

The very same approach used in sched_rt is utilised:
- -deadline tasks are kept into CPU-specific runqueues,
- -deadline tasks are migrated among runqueues to achieve the
following:
* on an M-CPU system the M earliest deadline ready tasks
are always running;
 * affinity/cpusets settings of all the -deadline tasks are
 always respected.

Therefore, this very special form of "load balancing" is done with
an active method, i.e., the scheduler pushes or pulls tasks between
runqueues when they are woken up and/or (de)scheduled.
IOW, every time a preemption occurs, the descheduled task might be sent
to some other CPU (depending on its deadline) to continue executing
(push). On the other hand, every time a CPU becomes idle, it might pull
the second earliest deadline ready task from some other CPU.

To enforce this, a pull operation is always attempted before taking any
scheduling decision (pre_schedule()), as well as a push one after each
scheduling decision (post_schedule()). In addition, when a task arrives
or wakes up, the best CPU on which to resume it is selected taking into
account its affinity mask, the system topology, and also its deadline.
E.g., from the scheduling point of view, the best CPU on which to wake
up (and also to push) a task is the one which is running the task
with the latest deadline among the M executing ones.

In order to facilitate these decisions, per-runqueue "caching" of the
deadlines of the currently running and of the first ready task is used.
Queued but not running tasks are also parked in another rb-tree to
speed up pushes.

Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Dario Faggioli <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/sched/core.c | 10 +-
kernel/sched/dl.c | 939 +++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 33 ++
5 files changed, 955 insertions(+), 31 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 85d33f5..92ae764 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1228,7 +1228,6 @@ struct sched_rt_entity {

struct sched_dl_entity {
struct rb_node rb_node;
- int nr_cpus_allowed;

/*
* Original scheduling parameters. Copied here from sched_param2
@@ -1346,6 +1345,7 @@ struct task_struct {
struct list_head tasks;
#ifdef CONFIG_SMP
struct plist_node pushable_tasks;
+ struct rb_node pushable_dl_tasks;
#endif

struct mm_struct *mm, *active_mm;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9e2d26d..934d3c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1621,6 +1621,7 @@ int sched_fork(struct task_struct *p)
#endif
#ifdef CONFIG_SMP
plist_node_init(&p->pushable_tasks, MAX_PRIO);
+ RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif

put_cpu();
@@ -4786,6 +4787,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)

cpumask_copy(&p->cpus_allowed, new_mask);
p->nr_cpus_allowed = cpumask_weight(new_mask);
+ p->nr_cpus_allowed = cpumask_weight(new_mask);
}

/*
@@ -5513,6 +5515,7 @@ static void free_rootdomain(struct rcu_head *rcu)
struct root_domain *rd = container_of(rcu, struct root_domain, rcu);

cpupri_cleanup(&rd->cpupri);
+ free_cpumask_var(rd->dlo_mask);
free_cpumask_var(rd->rto_mask);
free_cpumask_var(rd->online);
free_cpumask_var(rd->span);
@@ -5564,8 +5567,10 @@ static int init_rootdomain(struct root_domain *rd)
goto out;
if (!alloc_cpumask_var(&rd->online, GFP_KERNEL))
goto free_span;
- if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
+ if (!alloc_cpumask_var(&rd->dlo_mask, GFP_KERNEL))
goto free_online;
+ if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
+ goto free_dlo_mask;

if (cpupri_init(&rd->cpupri) != 0)
goto free_rto_mask;
@@ -5573,6 +5578,8 @@ static int init_rootdomain(struct root_domain *rd)

free_rto_mask:
free_cpumask_var(rd->rto_mask);
+free_dlo_mask:
+ free_cpumask_var(rd->dlo_mask);
free_online:
free_cpumask_var(rd->online);
free_span:
@@ -6898,6 +6905,7 @@ void __init sched_init_smp(void)
free_cpumask_var(non_isolated_cpus);

init_sched_rt_class();
+ init_sched_dl_class();
}
#else
void __init sched_init_smp(void)
diff --git a/kernel/sched/dl.c b/kernel/sched/dl.c
index 7e12ceb..bc8c310 100644
--- a/kernel/sched/dl.c
+++ b/kernel/sched/dl.c
@@ -10,6 +10,7 @@
* miss some of their deadlines), and won't affect any other task.
*
* Copyright (C) 2012 Dario Faggioli <[email protected]>,
+ * Juri Lelli <[email protected]>,
* Michael Trimarchi <[email protected]>,
* Fabio Checconi <[email protected]>
*/
@@ -21,6 +22,15 @@ static inline int dl_time_before(u64 a, u64 b)
return (s64)(a - b) < 0;
}

+/*
+ * Tells if entity @a should preempt entity @b.
+ */
+static inline
+int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
+{
+ return dl_time_before(a->deadline, b->deadline);
+}
+
static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
{
return container_of(dl_se, struct task_struct, dl);
@@ -54,8 +64,166 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
{
dl_rq->rb_root = RB_ROOT;
+
+#ifdef CONFIG_SMP
+ /* zero means no -deadline tasks */
+ dl_rq->earliest_dl.curr = dl_rq->earliest_dl.next = 0;
+
+ dl_rq->dl_nr_migratory = 0;
+ dl_rq->overloaded = 0;
+ dl_rq->pushable_dl_tasks_root = RB_ROOT;
+#endif
+}
+
+#ifdef CONFIG_SMP
+
+static inline int dl_overloaded(struct rq *rq)
+{
+ return atomic_read(&rq->rd->dlo_count);
+}
+
+static inline void dl_set_overload(struct rq *rq)
+{
+ if (!rq->online)
+ return;
+
+ cpumask_set_cpu(rq->cpu, rq->rd->dlo_mask);
+ /*
+ * Must be visible before the overload count is
+ * set (as in sched_rt.c).
+ */
+ wmb();
+ atomic_inc(&rq->rd->dlo_count);
+}
+
+static inline void dl_clear_overload(struct rq *rq)
+{
+ if (!rq->online)
+ return;
+
+ atomic_dec(&rq->rd->dlo_count);
+ cpumask_clear_cpu(rq->cpu, rq->rd->dlo_mask);
+}
+
+static void update_dl_migration(struct dl_rq *dl_rq)
+{
+ if (dl_rq->dl_nr_migratory && dl_rq->dl_nr_total > 1) {
+ if (!dl_rq->overloaded) {
+ dl_set_overload(rq_of_dl_rq(dl_rq));
+ dl_rq->overloaded = 1;
+ }
+ } else if (dl_rq->overloaded) {
+ dl_clear_overload(rq_of_dl_rq(dl_rq));
+ dl_rq->overloaded = 0;
+ }
+}
+
+static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ struct task_struct *p = dl_task_of(dl_se);
+ dl_rq = &rq_of_dl_rq(dl_rq)->dl;
+
+ dl_rq->dl_nr_total++;
+ if (p->nr_cpus_allowed > 1)
+ dl_rq->dl_nr_migratory++;
+
+ update_dl_migration(dl_rq);
+}
+
+static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ struct task_struct *p = dl_task_of(dl_se);
+ dl_rq = &rq_of_dl_rq(dl_rq)->dl;
+
+ dl_rq->dl_nr_total--;
+ if (p->nr_cpus_allowed > 1)
+ dl_rq->dl_nr_migratory--;
+
+ update_dl_migration(dl_rq);
+}
+
+/*
+ * The list of pushable -deadline task is not a plist, like in
+ * sched_rt.c, it is an rb-tree with tasks ordered by deadline.
+ */
+static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+ struct dl_rq *dl_rq = &rq->dl;
+ struct rb_node **link = &dl_rq->pushable_dl_tasks_root.rb_node;
+ struct rb_node *parent = NULL;
+ struct task_struct *entry;
+ int leftmost = 1;
+
+ BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
+
+ while (*link) {
+ parent = *link;
+ entry = rb_entry(parent, struct task_struct,
+ pushable_dl_tasks);
+ if (dl_entity_preempt(&p->dl, &entry->dl))
+ link = &parent->rb_left;
+ else {
+ link = &parent->rb_right;
+ leftmost = 0;
+ }
+ }
+
+ if (leftmost)
+ dl_rq->pushable_dl_tasks_leftmost = &p->pushable_dl_tasks;
+
+ rb_link_node(&p->pushable_dl_tasks, parent, link);
+ rb_insert_color(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
+}
+
+static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+ struct dl_rq *dl_rq = &rq->dl;
+
+ if (RB_EMPTY_NODE(&p->pushable_dl_tasks))
+ return;
+
+ if (dl_rq->pushable_dl_tasks_leftmost == &p->pushable_dl_tasks) {
+ struct rb_node *next_node;
+
+ next_node = rb_next(&p->pushable_dl_tasks);
+ dl_rq->pushable_dl_tasks_leftmost = next_node;
+ }
+
+ rb_erase(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
+ RB_CLEAR_NODE(&p->pushable_dl_tasks);
+}
+
+static inline int has_pushable_dl_tasks(struct rq *rq)
+{
+ return !RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root);
+}
+
+static int push_dl_task(struct rq *rq);
+
+#else
+
+static inline
+void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline
+void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline
+void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+}
+
+static inline
+void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
}

+#endif /* CONFIG_SMP */
+
static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
@@ -306,6 +474,14 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
check_preempt_curr_dl(rq, p, 0);
else
resched_task(rq->curr);
+#ifdef CONFIG_SMP
+ /*
+ * Queueing this task back might have overloaded rq,
+ * check if we need to kick someone away.
+ */
+ if (has_pushable_dl_tasks(rq))
+ push_dl_task(rq);
+#endif
}
unlock:
raw_spin_unlock(&rq->lock);
@@ -393,6 +569,100 @@ static void update_curr_dl(struct rq *rq)
}
}

+#ifdef CONFIG_SMP
+
+static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu);
+
+static inline u64 next_deadline(struct rq *rq)
+{
+ struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);
+
+ if (next && dl_prio(next->prio))
+ return next->dl.deadline;
+ else
+ return 0;
+}
+
+static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
+{
+ struct rq *rq = rq_of_dl_rq(dl_rq);
+
+ if (dl_rq->earliest_dl.curr == 0 ||
+ dl_time_before(deadline, dl_rq->earliest_dl.curr)) {
+ /*
+ * If the dl_rq had no -deadline tasks, or if the new task
+ * has shorter deadline than the current one on dl_rq, we
+ * know that the previous earliest becomes our next earliest,
+ * as the new task becomes the earliest itself.
+ */
+ dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
+ dl_rq->earliest_dl.curr = deadline;
+ } else if (dl_rq->earliest_dl.next == 0 ||
+ dl_time_before(deadline, dl_rq->earliest_dl.next)) {
+ /*
+ * On the other hand, if the new -deadline task has a
+ * later deadline than the earliest one on dl_rq, but
+ * it is earlier than the next (if any), we must
+ * recompute the next-earliest.
+ */
+ dl_rq->earliest_dl.next = next_deadline(rq);
+ }
+}
+
+static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
+{
+ struct rq *rq = rq_of_dl_rq(dl_rq);
+
+ /*
+ * Since we may have removed our earliest (and/or next earliest)
+ * task we must recompute them.
+ */
+ if (!dl_rq->dl_nr_running) {
+ dl_rq->earliest_dl.curr = 0;
+ dl_rq->earliest_dl.next = 0;
+ } else {
+ struct rb_node *leftmost = dl_rq->rb_leftmost;
+ struct sched_dl_entity *entry;
+
+ entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
+ dl_rq->earliest_dl.curr = entry->deadline;
+ dl_rq->earliest_dl.next = next_deadline(rq);
+ }
+}
+
+#else
+
+static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
+static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
+
+#endif /* CONFIG_SMP */
+
+static inline
+void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ int prio = dl_task_of(dl_se)->prio;
+ u64 deadline = dl_se->deadline;
+
+ WARN_ON(!dl_prio(prio));
+ dl_rq->dl_nr_running++;
+
+ inc_dl_deadline(dl_rq, deadline);
+ inc_dl_migration(dl_se, dl_rq);
+}
+
+static inline
+void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ int prio = dl_task_of(dl_se)->prio;
+
+ WARN_ON(!dl_prio(prio));
+ WARN_ON(!dl_rq->dl_nr_running);
+ dl_rq->dl_nr_running--;
+
+ dec_dl_deadline(dl_rq, dl_se->deadline);
+ dec_dl_migration(dl_se, dl_rq);
+}
+
static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
@@ -420,7 +690,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
rb_link_node(&dl_se->rb_node, parent, link);
rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);

- dl_rq->dl_nr_running++;
+ inc_dl_tasks(dl_se, dl_rq);
}

static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
@@ -440,7 +710,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
RB_CLEAR_NODE(&dl_se->rb_node);

- dl_rq->dl_nr_running--;
+ dec_dl_tasks(dl_se, dl_rq);
}

static void
@@ -478,12 +748,17 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
return;

enqueue_dl_entity(&p->dl, flags);
+
+ if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+ enqueue_pushable_dl_task(rq, p);
+
inc_nr_running(rq);
}

static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
dequeue_dl_entity(&p->dl);
+ dequeue_pushable_dl_task(rq, p);
}

static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -517,6 +792,77 @@ static void yield_task_dl(struct rq *rq)
update_curr_dl(rq);
}

+#ifdef CONFIG_SMP
+
+static int find_later_rq(struct task_struct *task);
+static int latest_cpu_find(struct cpumask *span,
+ struct task_struct *task,
+ struct cpumask *later_mask);
+
+static int
+select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
+{
+ struct task_struct *curr;
+ struct rq *rq;
+ int cpu;
+
+ cpu = task_cpu(p);
+
+ if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
+ goto out;
+
+ rq = cpu_rq(cpu);
+
+ rcu_read_lock();
+ curr = ACCESS_ONCE(rq->curr); /* unlocked access */
+
+ /*
+ * If we are dealing with a -deadline task, we must
+ * decide where to wake it up.
+ * If it has a later deadline and the current task
+ * on this rq can't move (provided the waking task
+ * can!) we prefer to send it somewhere else. On the
+ * other hand, if it has a shorter deadline, we
+ * try to make it stay here, it might be important.
+ */
+ if (unlikely(dl_task(curr)) &&
+ (curr->nr_cpus_allowed < 2 ||
+ !dl_entity_preempt(&p->dl, &curr->dl)) &&
+ (p->nr_cpus_allowed > 1)) {
+ int target = find_later_rq(p);
+
+ if (target != -1)
+ cpu = target;
+ }
+ rcu_read_unlock();
+
+out:
+ return cpu;
+}
+
+static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
+{
+ /*
+ * Current can't be migrated, useless to reschedule,
+ * let's hope p can move out.
+ */
+ if (rq->curr->nr_cpus_allowed == 1 ||
+ latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
+ return;
+
+ /*
+ * p is migratable, so let's not schedule it and
+ * see if it is pushed or pulled somewhere else.
+ */
+ if (p->nr_cpus_allowed != 1 &&
+ latest_cpu_find(rq->rd->span, p, NULL) != -1)
+ return;
+
+ resched_task(rq->curr);
+}
+
+#endif /* CONFIG_SMP */
+
/*
* Only called when both the current and waking task are -deadline
* tasks.
@@ -524,8 +870,20 @@ static void yield_task_dl(struct rq *rq)
static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
int flags)
{
- if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
+ if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
resched_task(rq->curr);
+ return;
+ }
+
+#ifdef CONFIG_SMP
+ /*
+ * In the unlikely case current and p have the same deadline
+ * let us try to decide what's the best thing to do...
+ */
+ if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
+ !need_resched())
+ check_preempt_equal_dl(rq, p);
+#endif /* CONFIG_SMP */
}

#ifdef CONFIG_SCHED_HRTICK
@@ -568,17 +926,30 @@ struct task_struct *pick_next_task_dl(struct rq *rq)
BUG_ON(!dl_se);

p = dl_task_of(dl_se);
- p->se.exec_start = rq->clock;
+ p->se.exec_start = rq->clock_task;
+
+ /* Running task will never be pushed. */
+ if (p)
+ dequeue_pushable_dl_task(rq, p);
+
#ifdef CONFIG_SCHED_HRTICK
if (hrtick_enabled(rq))
start_hrtick_dl(rq, p);
#endif
+
+#ifdef CONFIG_SMP
+ rq->post_schedule = has_pushable_dl_tasks(rq);
+#endif /* CONFIG_SMP */
+
return p;
}

static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
{
update_curr_dl(rq);
+
+ if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
+ enqueue_pushable_dl_task(rq, p);
}

static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
@@ -611,17 +982,508 @@ static void set_curr_task_dl(struct rq *rq)
{
struct task_struct *p = rq->curr;

- p->se.exec_start = rq->clock;
+ p->se.exec_start = rq->clock_task;
+
+ /* You can't push away the running task */
+ dequeue_pushable_dl_task(rq, p);
+}
+
+#ifdef CONFIG_SMP
+
+/* Only try algorithms three times */
+#define DL_MAX_TRIES 3
+
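+/*
+ * A task is a candidate for being picked (and pushed/pulled away) only
+ * if it is not currently running, it is allowed to run on @cpu (or @cpu
+ * is negative, meaning "anywhere") and it can run on more than one CPU.
+ */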
+static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
+{
+ if (!task_running(rq, p) &&
+ (cpu < 0 || cpumask_test_cpu(cpu, &p->cpus_allowed)) &&
+ (p->nr_cpus_allowed > 1))
+ return 1;
+
+ return 0;
}

+/* Returns the second earliest -deadline task, NULL otherwise */
+static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu)
+{
+ struct rb_node *next_node = rq->dl.rb_leftmost;
+ struct sched_dl_entity *dl_se;
+ struct task_struct *p = NULL;
+
+next_node:
+ next_node = rb_next(next_node);
+ if (next_node) {
+ dl_se = rb_entry(next_node, struct sched_dl_entity, rb_node);
+ p = dl_task_of(dl_se);
+
+ if (pick_dl_task(rq, p, cpu))
+ return p;
+
+ goto next_node;
+ }
+
+ return NULL;
+}
+
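+/*
+ * Scan @span looking for a CPU where @task could go: a CPU with no
+ * -deadline tasks is preferred; failing that, the CPU whose earliest
+ * deadline is the latest one (provided it is later than @task's own
+ * deadline). If @later_mask is non-NULL, it is filled with all the
+ * suitable CPUs found.
+ */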
+static int latest_cpu_find(struct cpumask *span,
+ struct task_struct *task,
+ struct cpumask *later_mask)
+{
+ const struct sched_dl_entity *dl_se = &task->dl;
+ int cpu, found = -1, best = 0;
+ u64 max_dl = 0;
+
+ for_each_cpu(cpu, span) {
+ struct rq *rq = cpu_rq(cpu);
+ struct dl_rq *dl_rq = &rq->dl;
+
+ if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
+ (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
+ dl_rq->earliest_dl.curr))) {
+ if (later_mask)
+ cpumask_set_cpu(cpu, later_mask);
+ if (!best && !dl_rq->dl_nr_running) {
+ best = 1;
+ found = cpu;
+ } else if (!best &&
+ dl_time_before(max_dl,
+ dl_rq->earliest_dl.curr)) {
+ max_dl = dl_rq->earliest_dl.curr;
+ found = cpu;
+ }
+ } else if (later_mask)
+ cpumask_clear_cpu(cpu, later_mask);
+ }
+
+ return found;
+}
+
+static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
+
+static int find_later_rq(struct task_struct *task)
+{
+ struct sched_domain *sd;
+ struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
+ int this_cpu = smp_processor_id();
+ int best_cpu, cpu = task_cpu(task);
+
+ /* Make sure the mask is initialized first */
+ if (unlikely(!later_mask))
+ return -1;
+
+ if (task->nr_cpus_allowed == 1)
+ return -1;
+
+ best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
+ if (best_cpu == -1)
+ return -1;
+
+ /*
+ * If we are here, some target has been found,
+ * the most suitable of which is cached in best_cpu.
+ * This is, among the runqueues where the current tasks
+ * have later deadlines than the task's one, the rq
+ * with the latest possible one.
+ *
+ * Now we check how well this matches with task's
+ * affinity and system topology.
+ *
+ * The last cpu where the task ran is our first
+ * guess, since it is most likely cache-hot there.
+ */
+ if (cpumask_test_cpu(cpu, later_mask))
+ return cpu;
+ /*
+ * Check if this_cpu is to be skipped (i.e., it is
+ * not in the mask) or not.
+ */
+ if (!cpumask_test_cpu(this_cpu, later_mask))
+ this_cpu = -1;
+
+ rcu_read_lock();
+ for_each_domain(cpu, sd) {
+ if (sd->flags & SD_WAKE_AFFINE) {
+
+ /*
+ * If possible, preempting this_cpu is
+ * cheaper than migrating.
+ */
+ if (this_cpu != -1 &&
+ cpumask_test_cpu(this_cpu, sched_domain_span(sd)))
+ return this_cpu;
+
+ /*
+ * Last chance: if best_cpu is valid and is
+ * in the mask, that becomes our choice.
+ */
+ if (best_cpu < nr_cpu_ids &&
+ cpumask_test_cpu(best_cpu, sched_domain_span(sd)))
+ return best_cpu;
+ }
+ }
+ rcu_read_unlock();
+
+ /*
+ * At this point, all our guesses failed, we just return
+ * 'something', and let the caller sort the things out.
+ */
+ if (this_cpu != -1)
+ return this_cpu;
+
+ cpu = cpumask_any(later_mask);
+ if (cpu < nr_cpu_ids)
+ return cpu;
+
+ return -1;
+}
+
+/* Locks the rq it finds */
+static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
+{
+ struct rq *later_rq = NULL;
+ int tries;
+ int cpu;
+
+ for (tries = 0; tries < DL_MAX_TRIES; tries++) {
+ cpu = find_later_rq(task);
+
+ if ((cpu == -1) || (cpu == rq->cpu))
+ break;
+
+ later_rq = cpu_rq(cpu);
+
+ /* Retry if something changed. */
+ if (double_lock_balance(rq, later_rq)) {
+ if (unlikely(task_rq(task) != rq ||
+ !cpumask_test_cpu(later_rq->cpu,
+ &task->cpus_allowed) ||
+ task_running(rq, task) || !task->on_rq)) {
+ double_unlock_balance(rq, later_rq);
+ later_rq = NULL;
+ break;
+ }
+ }
+
+ /*
+ * If the rq we found has no -deadline task, or
+ * its earliest one has a later deadline than our
+ * task, the rq is a good one.
+ */
+ if (!later_rq->dl.dl_nr_running ||
+ dl_time_before(task->dl.deadline,
+ later_rq->dl.earliest_dl.curr))
+ break;
+
+ /* Otherwise we try again. */
+ double_unlock_balance(rq, later_rq);
+ later_rq = NULL;
+ }
+
+ return later_rq;
+}
+
+static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
+{
+ struct task_struct *p;
+
+ if (!has_pushable_dl_tasks(rq))
+ return NULL;
+
+ p = rb_entry(rq->dl.pushable_dl_tasks_leftmost,
+ struct task_struct, pushable_dl_tasks);
+
+ BUG_ON(rq->cpu != task_cpu(p));
+ BUG_ON(task_current(rq, p));
+ BUG_ON(p->nr_cpus_allowed <= 1);
+
+ BUG_ON(!p->se.on_rq);
+ BUG_ON(!dl_task(p));
+
+ return p;
+}
+
+/*
+ * See if the non running -deadline tasks on this rq
+ * can be sent to some other CPU where they can preempt
+ * and start executing.
+ */
+static int push_dl_task(struct rq *rq)
+{
+ struct task_struct *next_task;
+ struct rq *later_rq;
+
+ if (!rq->dl.overloaded)
+ return 0;
+
+ next_task = pick_next_pushable_dl_task(rq);
+ if (!next_task)
+ return 0;
+
+retry:
+ if (unlikely(next_task == rq->curr)) {
+ WARN_ON(1);
+ return 0;
+ }
+
+ /*
+ * If next_task preempts rq->curr, and rq->curr
+ * can move away, it makes sense to just reschedule
+ * without going further in pushing next_task.
+ */
+ if (dl_task(rq->curr) &&
+ dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+ rq->curr->nr_cpus_allowed > 1) {
+ resched_task(rq->curr);
+ return 0;
+ }
+
+ /* We might release rq lock */
+ get_task_struct(next_task);
+
+ /* Will lock the rq it'll find */
+ later_rq = find_lock_later_rq(next_task, rq);
+ if (!later_rq) {
+ struct task_struct *task;
+
+ /*
+ * We must check all this again, since
+ * find_lock_later_rq releases rq->lock and it is
+ * then possible that next_task has migrated.
+ */
+ task = pick_next_pushable_dl_task(rq);
+ if (task_cpu(next_task) == rq->cpu && task == next_task) {
+ /*
+ * The task is still there. We don't try
+ * again, some other cpu will pull it when ready.
+ */
+ dequeue_pushable_dl_task(rq, next_task);
+ goto out;
+ }
+
+ if (!task)
+ /* No more tasks */
+ goto out;
+
+ put_task_struct(next_task);
+ next_task = task;
+ goto retry;
+ }
+
+ deactivate_task(rq, next_task, 0);
+ set_task_cpu(next_task, later_rq->cpu);
+ activate_task(later_rq, next_task, 0);
+
+ resched_task(later_rq->curr);
+
+ double_unlock_balance(rq, later_rq);
+
+out:
+ put_task_struct(next_task);
+
+ return 1;
+}
+
+static void push_dl_tasks(struct rq *rq)
+{
+ /* Terminates as it moves a -deadline task */
+ while (push_dl_task(rq))
+ ;
+}
+
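+/*
+ * Look at the runqueues flagged in the root_domain's dlo_mask and try
+ * to pull from one of them a queued -deadline task with an earlier
+ * deadline than our current earliest one (if any).
+ */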
+static int pull_dl_task(struct rq *this_rq)
+{
+ int this_cpu = this_rq->cpu, ret = 0, cpu;
+ struct task_struct *p;
+ struct rq *src_rq;
+ u64 dmin = LONG_MAX;
+
+ if (likely(!dl_overloaded(this_rq)))
+ return 0;
+
+ for_each_cpu(cpu, this_rq->rd->dlo_mask) {
+ if (this_cpu == cpu)
+ continue;
+
+ src_rq = cpu_rq(cpu);
+
+ /*
+ * It looks racy, and it is! However, as in sched_rt.c,
+ * we are fine with this.
+ */
+ if (this_rq->dl.dl_nr_running &&
+ dl_time_before(this_rq->dl.earliest_dl.curr,
+ src_rq->dl.earliest_dl.next))
+ continue;
+
+ /* Might drop this_rq->lock */
+ double_lock_balance(this_rq, src_rq);
+
+ /*
+ * If there are no more pullable tasks on the
+ * rq, we're done with it.
+ */
+ if (src_rq->dl.dl_nr_running <= 1)
+ goto skip;
+
+ p = pick_next_earliest_dl_task(src_rq, this_cpu);
+
+ /*
+ * We found a task to be pulled if:
+ * - it preempts our current (if there's one),
+ * - it will preempt the last one we pulled (if any).
+ */
+ if (p && dl_time_before(p->dl.deadline, dmin) &&
+ (!this_rq->dl.dl_nr_running ||
+ dl_time_before(p->dl.deadline,
+ this_rq->dl.earliest_dl.curr))) {
+ WARN_ON(p == src_rq->curr);
+ WARN_ON(!p->se.on_rq);
+
+ /*
+ * Then we pull iff p has actually an earlier
+ * deadline than the current task of its runqueue.
+ */
+ if (dl_time_before(p->dl.deadline,
+ src_rq->curr->dl.deadline))
+ goto skip;
+
+ ret = 1;
+
+ deactivate_task(src_rq, p, 0);
+ set_task_cpu(p, this_cpu);
+ activate_task(this_rq, p, 0);
+ dmin = p->dl.deadline;
+
+ /* Is there any other task even earlier? */
+ }
+skip:
+ double_unlock_balance(this_rq, src_rq);
+ }
+
+ return ret;
+}
+
+static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
+{
+ /* Try to pull other tasks here */
+ if (dl_task(prev))
+ pull_dl_task(rq);
+}
+
+static void post_schedule_dl(struct rq *rq)
+{
+ push_dl_tasks(rq);
+}
+
+/*
+ * Since the task is not running and a reschedule is not going to happen
+ * anytime soon on its runqueue, we try pushing it away now.
+ */
+static void task_woken_dl(struct rq *rq, struct task_struct *p)
+{
+ if (!task_running(rq, p) &&
+ !test_tsk_need_resched(rq->curr) &&
+ has_pushable_dl_tasks(rq) &&
+ p->nr_cpus_allowed > 1 &&
+ dl_task(rq->curr) &&
+ (rq->curr->nr_cpus_allowed < 2 ||
+ dl_entity_preempt(&rq->curr->dl, &p->dl))) {
+ push_dl_tasks(rq);
+ }
+}
+
+static void set_cpus_allowed_dl(struct task_struct *p,
+ const struct cpumask *new_mask)
+{
+ struct rq *rq;
+ int weight;
+
+ BUG_ON(!dl_task(p));
+
+ /*
+ * Update only if the task is actually running (i.e.,
+ * it is on the rq AND it is not throttled).
+ */
+ if (!on_dl_rq(&p->dl))
+ return;
+
+ weight = cpumask_weight(new_mask);
+
+ /*
+ * Only update if the process changes whether or not it
+ * can migrate.
+ */
+ if ((p->nr_cpus_allowed > 1) == (weight > 1))
+ return;
+
+ rq = task_rq(p);
+
+ /*
+ * The process used to be able to migrate OR it can now migrate
+ */
+ if (weight <= 1) {
+ if (!task_current(rq, p))
+ dequeue_pushable_dl_task(rq, p);
+ BUG_ON(!rq->dl.dl_nr_migratory);
+ rq->dl.dl_nr_migratory--;
+ } else {
+ if (!task_current(rq, p))
+ enqueue_pushable_dl_task(rq, p);
+ rq->dl.dl_nr_migratory++;
+ }
+
+ update_dl_migration(&rq->dl);
+}
+
+/* Assumes rq->lock is held */
+static void rq_online_dl(struct rq *rq)
+{
+ if (rq->dl.overloaded)
+ dl_set_overload(rq);
+}
+
+/* Assumes rq->lock is held */
+static void rq_offline_dl(struct rq *rq)
+{
+ if (rq->dl.overloaded)
+ dl_clear_overload(rq);
+}
+
+void init_sched_dl_class(void)
+{
+ unsigned int i;
+
+ for_each_possible_cpu(i)
+ zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
+ GFP_KERNEL, cpu_to_node(i));
+}
+
+#endif /* CONFIG_SMP */
+
static void switched_from_dl(struct rq *rq, struct task_struct *p)
{
- if (hrtimer_active(&p->dl.dl_timer))
+ if (hrtimer_active(&p->dl.dl_timer) && !dl_policy(p->policy))
hrtimer_try_to_cancel(&p->dl.dl_timer);
+
+#ifdef CONFIG_SMP
+ /*
+ * Since this might be the only -deadline task on the rq,
+ * this is the right place to try to pull some other one
+ * from an overloaded cpu, if any.
+ */
+ if (!rq->dl.dl_nr_running)
+ pull_dl_task(rq);
+#endif
}

+/*
+ * When switching to -deadline, we may overload the rq, then
+ * we try to push someone off, if possible.
+ */
static void switched_to_dl(struct rq *rq, struct task_struct *p)
{
+ int check_resched = 1;
+
/*
* If p is throttled, don't consider the possibility
* of preempting rq->curr, the check will be done right
@@ -631,37 +1493,53 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
return;

if (!p->on_rq || rq->curr != p) {
- if (task_has_dl_policy(rq->curr))
+#ifdef CONFIG_SMP
+ if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p))
+ /* Only reschedule if pushing failed */
+ check_resched = 0;
+#endif /* CONFIG_SMP */
+ if (check_resched && task_has_dl_policy(rq->curr))
check_preempt_curr_dl(rq, p, 0);
- else
- resched_task(rq->curr);
}
}

+/*
+ * If the scheduling parameters of a -deadline task changed,
+ * a push or pull operation might be needed.
+ */
static void prio_changed_dl(struct rq *rq, struct task_struct *p,
int oldprio)
{
- switched_to_dl(rq, p);
-}
-
+ if (p->on_rq || rq->curr == p) {
#ifdef CONFIG_SMP
-static int
-select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
-{
- return task_cpu(p);
-}
-
-static void set_cpus_allowed_dl(struct task_struct *p,
- const struct cpumask *new_mask)
-{
- int weight = cpumask_weight(new_mask);
-
- BUG_ON(!dl_task(p));
-
- cpumask_copy(&p->cpus_allowed, new_mask);
- p->dl.nr_cpus_allowed = weight;
+ /*
+ * This might be too much, but unfortunately
+ * we don't have the old deadline value, and
+ * we can't argue if the task is increasing
+ * or lowering its prio, so...
+ */
+ if (!rq->dl.overloaded)
+ pull_dl_task(rq);
+
+ /*
+ * If we now have an earlier deadline task than p,
+ * then reschedule, provided p is still on this
+ * runqueue.
+ */
+ if (dl_time_before(rq->dl.earliest_dl.curr, p->dl.deadline) &&
+ rq->curr == p)
+ resched_task(p);
+#else
+ /*
+ * Again, we don't know if p has an earlier
+ * or later deadline, so let's blindly set a
+ * (maybe not needed) rescheduling point.
+ */
+ resched_task(p);
+#endif /* CONFIG_SMP */
+ } else
+ switched_to_dl(rq, p);
}
-#endif

const struct sched_class dl_sched_class = {
.next = &rt_sched_class,
@@ -678,6 +1556,11 @@ const struct sched_class dl_sched_class = {
.select_task_rq = select_task_rq_dl,

.set_cpus_allowed = set_cpus_allowed_dl,
+ .rq_online = rq_online_dl,
+ .rq_offline = rq_offline_dl,
+ .pre_schedule = pre_schedule_dl,
+ .post_schedule = post_schedule_dl,
+ .task_woken = task_woken_dl,
#endif

.set_curr_task = set_curr_task_dl,
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 418feb0..5f96559 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1809,7 +1809,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
!test_tsk_need_resched(rq->curr) &&
has_pushable_tasks(rq) &&
p->nr_cpus_allowed > 1 &&
- rt_task(rq->curr) &&
+ (dl_task(rq->curr) || rt_task(rq->curr)) &&
(rq->curr->nr_cpus_allowed < 2 ||
rq->curr->prio <= p->prio))
push_rt_tasks(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a76d210..2ca517d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -328,6 +328,31 @@ struct dl_rq {
struct rb_node *rb_leftmost;

unsigned long dl_nr_running;
+
+#ifdef CONFIG_SMP
+ /*
+ * Deadline values of the currently executing and the
+ * earliest ready task on this rq. Caching these facilitates
+ * the decision whether or not a ready but not running task
+ * should migrate somewhere else.
+ */
+ struct {
+ u64 curr;
+ u64 next;
+ } earliest_dl;
+
+ unsigned long dl_nr_migratory;
+ unsigned long dl_nr_total;
+ int overloaded;
+
+ /*
+ * Tasks on this rq that can be pushed away. They are kept in
+ * an rb-tree, ordered by tasks' deadlines, with caching
+ * of the leftmost (earliest deadline) element.
+ */
+ struct rb_root pushable_dl_tasks_root;
+ struct rb_node *pushable_dl_tasks_leftmost;
+#endif
};

#ifdef CONFIG_SMP
@@ -348,6 +373,13 @@ struct root_domain {
cpumask_var_t online;

/*
+ * The bit corresponding to a CPU gets set here if such CPU has more
+ * than one runnable -deadline task (as it is below for RT tasks).
+ */
+ cpumask_var_t dlo_mask;
+ atomic_t dlo_count;
+
+ /*
* The "RT overload" flag: it gets set if a CPU has more than
* one runnable RT task.
*/
@@ -887,6 +919,7 @@ extern void sched_init_granularity(void);
extern void update_max_interval(void);
extern void update_group_power(struct sched_domain *sd, int cpu);
extern int update_runtime(struct notifier_block *nfb, unsigned long action, void *hcpu);
+extern void init_sched_dl_class(void);
extern void init_sched_rt_class(void);
extern void init_sched_fair_class(void);

--
1.7.9.5

2012-10-24 22:29:48

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 02/16] math128, x86_64: Implement {mul,add}_u128 in 64bit asm

On 10/24/2012 02:53 PM, Juri Lelli wrote:
> diff --git a/arch/x86/include/asm/math128.h b/arch/x86/include/asm/math128.h
> new file mode 100644
> index 0000000..c0e2a6c
> --- /dev/null
> +++ b/arch/x86/include/asm/math128.h
> @@ -0,0 +1,39 @@
> +#ifndef _ASM_MATH128_H
> +#define _ASM_MATH128_H
> +
> +#ifdef CONFIG_X86_64
> +
> +#ifdef __SIZEOF_INT128__
> +#define ARCH_HAS_INT128
> +#endif
> +
> +#ifndef ARCH_HAS_INT128
> +
> +static inline u128 mul_u64_u64(u64 a, u64 b)
> +{
> + u128 res;
> +
> + asm("mulq %2"
> + : "=a" (res.lo), "=d" (res.hi)
> + : "rm" (b), "0" (a));
> +
> + return res;
> +}
> +#define mul_u64_u64 mul_u64_u64
> +
> +static inline u128 add_u128(u128 a, u128 b)
> +{
> + u128 res;
> +
> + asm("addq %2,%0;\n"
> + "adcq %3,%1;\n"
> + : "=rm" (res.lo), "=rm" (res.hi)
> + : "r" (b.lo), "r" (b.hi), "0" (a.lo), "1" (a.hi));
> +
> + return res;
> +}
> +#define add_u128 add_u128
> +

How could this work since u128 presumably has not yet been defined as a
structure? After all, isn't it the absence of ARCH_HAS_INT128 which
makes that happen?

-hpa

2012-10-24 22:47:28

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 02/16] math128, x86_64: Implement {mul,add}_u128 in 64bit asm

On 10/24/2012 03:27 PM, H. Peter Anvin wrote:
> On 10/24/2012 02:53 PM, Juri Lelli wrote:
>> diff --git a/arch/x86/include/asm/math128.h b/arch/x86/include/asm/math128.h
>> new file mode 100644
>> index 0000000..c0e2a6c
>> --- /dev/null
>> +++ b/arch/x86/include/asm/math128.h
>> @@ -0,0 +1,39 @@
>> +#ifndef _ASM_MATH128_H
>> +#define _ASM_MATH128_H
>> +
>> +#ifdef CONFIG_X86_64
>> +
>> +#ifdef __SIZEOF_INT128__
>> +#define ARCH_HAS_INT128
>> +#endif
>> +
>> +#ifndef ARCH_HAS_INT128
>> +
>> +static inline u128 mul_u64_u64(u64 a, u64 b)
>> +{
>> + u128 res;
>> +
>> + asm("mulq %2"
>> + : "=a" (res.lo), "=d" (res.hi)
>> + : "rm" (b), "0" (a));
>> +
>> + return res;
>> +}
>> +#define mul_u64_u64 mul_u64_u64
>> +
>> +static inline u128 add_u128(u128 a, u128 b)
>> +{
>> + u128 res;
>> +
>> + asm("addq %2,%0;\n"
>> + "adcq %3,%1;\n"
>> + : "=rm" (res.lo), "=rm" (res.hi)
>> + : "r" (b.lo), "r" (b.hi), "0" (a.lo), "1" (a.hi));
>> +
>> + return res;
>> +}
>> +#define add_u128 add_u128
>> +
>
> How could this work since u128 presumably has not yet been defined as a
> structure? After all, isn't it the absence of ARCH_HAS_INT128 which
> makes that happen?

Sorry, you were not in the Cc list of the previous patch in the patchset,
so you probably missed that. I should have triple-checked the git send-email
Cc list. Sorry about that.

I'll add you there.

Thanks and Regards,

- Juri

2012-10-24 22:48:50

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

Adding H. Peter Anvin to the Cc list.

Best,

- Juri

On 10/24/2012 02:53 PM, Juri Lelli wrote:
> From: Peter Zijlstra <[email protected]>
>
> Grow rudimentary u128 support without relying on gcc/libgcc.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Link: http://lkml.kernel.org/n/[email protected]
> ---
> arch/alpha/include/asm/Kbuild | 1 +
> arch/arm/include/asm/Kbuild | 1 +
> arch/avr32/include/asm/Kbuild | 2 +
> arch/blackfin/include/asm/Kbuild | 1 +
> arch/c6x/include/asm/Kbuild | 1 +
> arch/cris/include/asm/Kbuild | 1 +
> arch/frv/include/asm/Kbuild | 3 +
> arch/h8300/include/asm/Kbuild | 1 +
> arch/hexagon/include/asm/Kbuild | 1 +
> arch/ia64/include/asm/Kbuild | 1 +
> arch/m32r/include/asm/Kbuild | 1 +
> arch/m68k/include/asm/Kbuild | 1 +
> arch/microblaze/include/asm/Kbuild | 1 +
> arch/mips/include/asm/Kbuild | 1 +
> arch/mn10300/include/asm/Kbuild | 1 +
> arch/openrisc/include/asm/Kbuild | 1 +
> arch/parisc/include/asm/Kbuild | 2 +-
> arch/powerpc/include/asm/Kbuild | 1 +
> arch/s390/include/asm/Kbuild | 2 +-
> arch/score/include/asm/Kbuild | 1 +
> arch/sh/include/asm/Kbuild | 1 +
> arch/sparc/include/asm/Kbuild | 1 +
> arch/tile/include/asm/Kbuild | 1 +
> arch/um/include/asm/Kbuild | 2 +-
> arch/unicore32/include/asm/Kbuild | 1 +
> arch/x86/include/asm/Kbuild | 1 +
> arch/xtensa/include/asm/Kbuild | 1 +
> include/asm-generic/math128.h | 4 +
> include/linux/math128.h | 180 ++++++++++++++++++++++++++++++++++++
> lib/Makefile | 2 +-
> lib/math128.c | 40 ++++++++
> 31 files changed, 255 insertions(+), 4 deletions(-)
> create mode 100644 include/asm-generic/math128.h
> create mode 100644 include/linux/math128.h
> create mode 100644 lib/math128.c
>
> diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
> index 64ffc9e..e012ed5 100644
> --- a/arch/alpha/include/asm/Kbuild
> +++ b/arch/alpha/include/asm/Kbuild
> @@ -11,3 +11,4 @@ header-y += reg.h
> header-y += regdef.h
> header-y += sysinfo.h
> generic-y += exec.h
> +generic-y += math128.h
> diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
> index f70ae17..07023d4 100644
> --- a/arch/arm/include/asm/Kbuild
> +++ b/arch/arm/include/asm/Kbuild
> @@ -33,3 +33,4 @@ generic-y += termios.h
> generic-y += timex.h
> generic-y += types.h
> generic-y += unaligned.h
> +generic-y += math128.h
> diff --git a/arch/avr32/include/asm/Kbuild b/arch/avr32/include/asm/Kbuild
> index 4807ded..4384224 100644
> --- a/arch/avr32/include/asm/Kbuild
> +++ b/arch/avr32/include/asm/Kbuild
> @@ -1,3 +1,5 @@
>
> generic-y += clkdev.h
> generic-y += exec.h
> +generic-y += math128.h
> +header-y += cachectl.h
> diff --git a/arch/blackfin/include/asm/Kbuild b/arch/blackfin/include/asm/Kbuild
> index 5a0625a..6836e68 100644
> --- a/arch/blackfin/include/asm/Kbuild
> +++ b/arch/blackfin/include/asm/Kbuild
> @@ -47,3 +47,4 @@ generic-y += xor.h
> header-y += bfin_sport.h
> header-y += cachectl.h
> header-y += fixed_code.h
> +generic-y += math128.h
> diff --git a/arch/c6x/include/asm/Kbuild b/arch/c6x/include/asm/Kbuild
> index 112a496..ab11744 100644
> --- a/arch/c6x/include/asm/Kbuild
> +++ b/arch/c6x/include/asm/Kbuild
> @@ -53,3 +53,4 @@ generic-y += types.h
> generic-y += ucontext.h
> generic-y += user.h
> generic-y += vga.h
> +generic-y += math128.h
> diff --git a/arch/cris/include/asm/Kbuild b/arch/cris/include/asm/Kbuild
> index 6d43a95..7674e82 100644
> --- a/arch/cris/include/asm/Kbuild
> +++ b/arch/cris/include/asm/Kbuild
> @@ -11,3 +11,4 @@ header-y += sync_serial.h
> generic-y += clkdev.h
> generic-y += exec.h
> generic-y += module.h
> +generic-y += math128.h
> diff --git a/arch/frv/include/asm/Kbuild b/arch/frv/include/asm/Kbuild
> index 4a159da..732d864 100644
> --- a/arch/frv/include/asm/Kbuild
> +++ b/arch/frv/include/asm/Kbuild
> @@ -1,3 +1,6 @@
>
> generic-y += clkdev.h
> generic-y += exec.h
> +generic-y += math128.h
> +header-y += registers.h
> +header-y += termios.h
> diff --git a/arch/h8300/include/asm/Kbuild b/arch/h8300/include/asm/Kbuild
> index 50bbf38..1270ae0 100644
> --- a/arch/h8300/include/asm/Kbuild
> +++ b/arch/h8300/include/asm/Kbuild
> @@ -3,3 +3,4 @@ include include/asm-generic/Kbuild.asm
> generic-y += clkdev.h
> generic-y += exec.h
> generic-y += module.h
> +generic-y += math128.h
> diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
> index 3bfa9b3..8c179f4 100644
> --- a/arch/hexagon/include/asm/Kbuild
> +++ b/arch/hexagon/include/asm/Kbuild
> @@ -52,3 +52,4 @@ generic-y += types.h
> generic-y += ucontext.h
> generic-y += unaligned.h
> generic-y += xor.h
> +generic-y += math128.h
> diff --git a/arch/ia64/include/asm/Kbuild b/arch/ia64/include/asm/Kbuild
> index dd02f09..f10618b 100644
> --- a/arch/ia64/include/asm/Kbuild
> +++ b/arch/ia64/include/asm/Kbuild
> @@ -2,3 +2,4 @@
> generic-y += clkdev.h
> generic-y += exec.h
> generic-y += kvm_para.h
> +generic-y += math128.h
> diff --git a/arch/m32r/include/asm/Kbuild b/arch/m32r/include/asm/Kbuild
> index 50bbf38..1270ae0 100644
> --- a/arch/m32r/include/asm/Kbuild
> +++ b/arch/m32r/include/asm/Kbuild
> @@ -3,3 +3,4 @@ include include/asm-generic/Kbuild.asm
> generic-y += clkdev.h
> generic-y += exec.h
> generic-y += module.h
> +generic-y += math128.h
> diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
> index 88fa3ac..46d4b99 100644
> --- a/arch/m68k/include/asm/Kbuild
> +++ b/arch/m68k/include/asm/Kbuild
> @@ -27,3 +27,4 @@ generic-y += topology.h
> generic-y += types.h
> generic-y += word-at-a-time.h
> generic-y += xor.h
> +generic-y += math128.h
> diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
> index 8653072..4809e13 100644
> --- a/arch/microblaze/include/asm/Kbuild
> +++ b/arch/microblaze/include/asm/Kbuild
> @@ -3,3 +3,4 @@ include include/asm-generic/Kbuild.asm
> header-y += elf.h
> generic-y += clkdev.h
> generic-y += exec.h
> +generic-y += math128.h
> diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
> index 533053d..0de09e8 100644
> --- a/arch/mips/include/asm/Kbuild
> +++ b/arch/mips/include/asm/Kbuild
> @@ -1 +1,2 @@
> # MIPS headers
> +generic-y += math128.h
> diff --git a/arch/mn10300/include/asm/Kbuild b/arch/mn10300/include/asm/Kbuild
> index 4a159da..6b54375 100644
> --- a/arch/mn10300/include/asm/Kbuild
> +++ b/arch/mn10300/include/asm/Kbuild
> @@ -1,3 +1,4 @@
>
> generic-y += clkdev.h
> generic-y += exec.h
> +generic-y += math128.h
> diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
> index 78de680..fa6fa87 100644
> --- a/arch/openrisc/include/asm/Kbuild
> +++ b/arch/openrisc/include/asm/Kbuild
> @@ -64,3 +64,4 @@ generic-y += types.h
> generic-y += ucontext.h
> generic-y += user.h
> generic-y += word-at-a-time.h
> +generic-y += math128.h
> diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
> index bac8deb..cab5ff7 100644
> --- a/arch/parisc/include/asm/Kbuild
> +++ b/arch/parisc/include/asm/Kbuild
> @@ -2,4 +2,4 @@
> generic-y += word-at-a-time.h auxvec.h user.h cputime.h emergency-restart.h \
> segment.h topology.h vga.h device.h percpu.h hw_irq.h mutex.h \
> div64.h irq_regs.h kdebug.h kvm_para.h local64.h local.h param.h \
> - poll.h xor.h clkdev.h exec.h
> + poll.h xor.h clkdev.h exec.h math128.h
> diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
> index a4fe15e..61d8f6e 100644
> --- a/arch/powerpc/include/asm/Kbuild
> +++ b/arch/powerpc/include/asm/Kbuild
> @@ -2,3 +2,4 @@
>
> generic-y += clkdev.h
> generic-y += rwsem.h
> +generic-y += math128.h
> diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
> index 0633dc6..daa2d19 100644
> --- a/arch/s390/include/asm/Kbuild
> +++ b/arch/s390/include/asm/Kbuild
> @@ -1,3 +1,3 @@
>
> -
> generic-y += clkdev.h
> +generic-y += math128.h
> diff --git a/arch/score/include/asm/Kbuild b/arch/score/include/asm/Kbuild
> index ec697ae..e14c1ed 100644
> --- a/arch/score/include/asm/Kbuild
> +++ b/arch/score/include/asm/Kbuild
> @@ -3,3 +3,4 @@ include include/asm-generic/Kbuild.asm
> header-y +=
>
> generic-y += clkdev.h
> +generic-y += math128.h
> diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
> index 29f83be..2cf354a 100644
> --- a/arch/sh/include/asm/Kbuild
> +++ b/arch/sh/include/asm/Kbuild
> @@ -33,3 +33,4 @@ generic-y += termbits.h
> generic-y += termios.h
> generic-y += ucontext.h
> generic-y += xor.h
> +generic-y += math128.h
> diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
> index 645a58d..ba284f9 100644
> --- a/arch/sparc/include/asm/Kbuild
> +++ b/arch/sparc/include/asm/Kbuild
> @@ -9,3 +9,4 @@ generic-y += irq_regs.h
> generic-y += local.h
> generic-y += module.h
> generic-y += word-at-a-time.h
> +generic-y += math128.h
> diff --git a/arch/tile/include/asm/Kbuild b/arch/tile/include/asm/Kbuild
> index 6948015..e3a37ac 100644
> --- a/arch/tile/include/asm/Kbuild
> +++ b/arch/tile/include/asm/Kbuild
> @@ -36,3 +36,4 @@ generic-y += termbits.h
> generic-y += termios.h
> generic-y += types.h
> generic-y += xor.h
> +generic-y += math128.h
> diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
> index 0f6e7b3..f1a5a8f 100644
> --- a/arch/um/include/asm/Kbuild
> +++ b/arch/um/include/asm/Kbuild
> @@ -1,4 +1,4 @@
> generic-y += bug.h cputime.h device.h emergency-restart.h futex.h hardirq.h
> generic-y += hw_irq.h irq_regs.h kdebug.h percpu.h sections.h topology.h xor.h
> generic-y += ftrace.h pci.h io.h param.h delay.h mutex.h current.h exec.h
> -generic-y += switch_to.h clkdev.h
> +generic-y += switch_to.h clkdev.h math128.h
> diff --git a/arch/unicore32/include/asm/Kbuild b/arch/unicore32/include/asm/Kbuild
> index c910c98..3a5e70e 100644
> --- a/arch/unicore32/include/asm/Kbuild
> +++ b/arch/unicore32/include/asm/Kbuild
> @@ -60,3 +60,4 @@ generic-y += unaligned.h
> generic-y += user.h
> generic-y += vga.h
> generic-y += xor.h
> +generic-y += math128.h
> diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
> index 66e5f0e..0a34aef 100644
> --- a/arch/x86/include/asm/Kbuild
> +++ b/arch/x86/include/asm/Kbuild
> @@ -28,3 +28,4 @@ genhdr-y += unistd_64.h
> genhdr-y += unistd_x32.h
>
> generic-y += clkdev.h
> +generic-y += math128.h
> diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
> index 6d13027..edb183d 100644
> --- a/arch/xtensa/include/asm/Kbuild
> +++ b/arch/xtensa/include/asm/Kbuild
> @@ -26,3 +26,4 @@ generic-y += statfs.h
> generic-y += termios.h
> generic-y += topology.h
> generic-y += xor.h
> +generic-y += math128.h
> diff --git a/include/asm-generic/math128.h b/include/asm-generic/math128.h
> new file mode 100644
> index 0000000..3582691
> --- /dev/null
> +++ b/include/asm-generic/math128.h
> @@ -0,0 +1,4 @@
> +#ifndef _ASM_GENERIC_MATH128_H
> +#define _ASM_GENERIC_MATH128_H
> +
> +#endif /*_ASM_GENERIC_MATH128_H */
> diff --git a/include/linux/math128.h b/include/linux/math128.h
> new file mode 100644
> index 0000000..5b0eef6
> --- /dev/null
> +++ b/include/linux/math128.h
> @@ -0,0 +1,180 @@
> +#ifndef _LINUX_MATH128_H
> +#define _LINUX_MATH128_H
> +
> +#include <linux/types.h>
> +
> +typedef union {
> + struct {
> +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> + u64 lo, hi;
> +#else
> + u64 hi, lo;
> +#endif
> + };
> +#ifdef __SIZEOF_INT128__ /* gcc-4.6+ */
> + unsigned __int128 val;
> +#endif
> +} u128;
> +
> +#define U128_INIT(_hi, _lo) (u128){{ .hi = (_hi), .lo = (_lo) }}
> +
> +#include <asm/math128.h>
> +
> +/*
> + * Make usage of __int128 dependent on arch code so they can
> + * judge if gcc is doing the right thing for them and can over-ride
> + * any funnies.
> + */
> +
> +#ifndef ARCH_HAS_INT128
> +
> +#ifndef add_u128
> +static inline u128 add_u128(u128 a, u128 b)
> +{
> + a.hi += b.hi;
> + a.lo += b.lo;
> + if (a.lo < b.lo)
> + a.hi++;
> +
> + return a;
> +}
> +#endif /* add_u128 */
> +
> +#ifndef mul_u64_u64
> +extern u128 mul_u64_u64(u64 a, u64 b);
> +#endif
> +
> +#ifndef mul_u64_u32_shr
> +static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
> +{
> + u32 ah, al;
> + u64 t1, t2;
> +
> + ah = a >> 32;
> + al = a;
> +
> + t1 = ((u64)al * mul) >> shift;
> + t2 = ((u64)ah * mul) << (32 - shift);
> +
> + return t1 + t2;
> +}
> +#endif /* mul_u64_u32_shr */
> +
> +#ifndef shl_u128
> +static inline u128 shl_u128(u128 x, unsigned int n)
> +{
> + u128 res;
> +
> + if (!n)
> + return x;
> +
> + if (n < 64) {
> + res.hi = x.hi << n;
> + res.hi |= x.lo >> (64 - n);
> + res.lo = x.lo << n;
> + } else {
> + res.lo = 0;
> + res.hi = x.lo << (n - 64);
> + }
> +
> + return res;
> +}
> +#endif /* shl_u128 */
> +
> +#ifndef shr_u128
> +static inline u128 shr_u128(u128 x, unsigned int n)
> +{
> + u128 res;
> +
> + if (!n)
> + return x;
> +
> + if (n < 64) {
> + res.lo = x.lo >> n;
> + res.lo |= x.hi << (64 - n);
> + res.hi = x.hi >> n;
> + } else {
> + res.hi = 0;
> + res.lo = x.hi >> (n - 64);
> + }
> +
> + return res;
> +}
> +#endif /* shr_u128 */
> +
> +#ifndef cmp_u128
> +static inline int cmp_u128(u128 a, u128 b)
> +{
> + if (a.hi > b.hi)
> + return 1;
> + if (a.hi < b.hi)
> + return -1;
> + if (a.lo > b.lo)
> + return 1;
> + if (a.lo < b.lo)
> + return -1;
> +
> + return 0;
> +}
> +#endif /* cmp_u128 */
> +
> +#else /* ARCH_HAS_INT128 */
> +
> +#ifndef add_u128
> +static inline u128 add_u128(u128 a, u128 b)
> +{
> + a.val += b.val;
> + return a;
> +}
> +#endif /* add_u128 */
> +
> +#ifndef mul_u64_u64
> +static inline u128 mul_u64_u64(u64 a, u64 b)
> +{
> + u128 res;
> +
> + res.val = a;
> + res.val *= b;
> +
> + return res;
> +}
> +#define mul_u64_u64 mul_u64_u64
> +#endif
> +
> +#ifndef mul_u64_u32_shr
> +static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
> +{
> + return (u64)(((unsigned __int128)a * mul) >> shift);
> +}
> +#endif /* mul_u64_u32_shr */
> +
> +#ifndef shl_u128
> +static inline u128 shl_u128(u128 x, unsigned int n)
> +{
> + x.val <<= n;
> + return x;
> +}
> +#endif /* shl_u128 */
> +
> +#ifndef shr_u128
> +static inline u128 shr_u128(u128 x, unsigned int n)
> +{
> + x.val >>= n;
> + return x;
> +}
> +#endif /* shr_u128 */
> +
> +#ifndef cmp_u128
> +static inline int cmp_u128(u128 a, u128 b)
> +{
> + if (a.val < b.val)
> + return -1;
> + if (a.val > b.val)
> + return 1;
> + return 0;
> +}
> +#endif /* cmp_u128 */
> +
> +#endif /* ARCH_HAS_INT128 */
> +
> +#endif /* _LINUX_MATH128_H */
> diff --git a/lib/Makefile b/lib/Makefile
> index 821a162..367c62c 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
> idr.o int_sqrt.o extable.o \
> sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
> proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> - is_single_threaded.o plist.o decompress.o
> + is_single_threaded.o plist.o decompress.o math128.o
>
> lib-$(CONFIG_MMU) += ioremap.o
> lib-$(CONFIG_SMP) += cpumask.o
> diff --git a/lib/math128.c b/lib/math128.c
> new file mode 100644
> index 0000000..55b123a
> --- /dev/null
> +++ b/lib/math128.c
> @@ -0,0 +1,40 @@
> +#include <linux/math128.h>
> +
> +#ifndef mul_u64_u64
> +/*
> + * a * b = (ah * 2^32 + al) * (bh * 2^32 + bl) =
> + * ah*bh * 2^64 + (ah*bl + bh*al) * 2^32 + al*bl
> + */
> +u128 mul_u64_u64(u64 a, u64 b)
> +{
> + u128 t1, t2, t3, t4;
> + u32 ah, al;
> + u32 bh, bl;
> +
> + ah = a >> 32;
> + al = a;
> +
> + bh = b >> 32;
> + bl = b;
> +
> + t1.lo = 0;
> + t1.hi = (u64)ah * bh;
> +
> + t2.lo = (u64)ah * bl;
> + t2.hi = t2.lo >> 32;
> + t2.lo <<= 32;
> +
> + t3.lo = (u64)al * bh;
> + t3.hi = t3.lo >> 32;
> + t3.lo <<= 32;
> +
> + t4.lo = (u64)al * bl;
> + t4.hi = 0;
> +
> + t1 = add_u128(t1, t2);
> + t1 = add_u128(t1, t3);
> + t1 = add_u128(t1, t4);
> +
> + return t1;
> +}
> +#endif /* mul_u64_u64 */
>

2012-10-24 22:53:01

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 02/16] math128, x86_64: Implement {mul,add}_u128 in 64bit asm

On 10/24/2012 03:47 PM, Juri Lelli wrote:
>>
>> How could this work since u128 presumably has not yet been defined as a
>> structure? After all, isn't it the absence of ARCH_HAS_INT128 which
>> makes that happen?
>
> Sorry, you were not in the Cc list of the previous patch in the patchset,
> so you probably missed that. I should have triple-checked the git send-email
> Cc list. Sorry about that.
>
> I'll add you there.
>

Hmm... you realize that at least on some platforms the u128 as a union is
going to perform worse than the plain __int128, right? As such IMO it
would be better if the union was only defined if needed.

-hpa

2012-10-24 23:18:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Wed, Oct 24, 2012 at 2:53 PM, Juri Lelli <[email protected]> wrote:
> From: Peter Zijlstra <[email protected]>
>
> Grow rudimentary u128 support without relying on gcc/libgcc.

I missed the part where somebody explains why and what needs this?
It's going to be very expensive indeed on some platforms, so the fact
that it is *sometimes* cheap doesn't necessarily imply it should ever
be used.

So please, explain what the pressing need is that is so worthwhile
that this is worth it. Maybe it was in a 00/16 cover letter, but not
only was that not sent out to the people who got 01, you'd still want
it in the commit message.

> +typedef union {
> + struct {
> +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> + u64 lo, hi;
> +#else
> + u64 hi, lo;
> +#endif
> + };
> +#ifdef __SIZEOF_INT128__ /* gcc-4.6+ */
> + unsigned __int128 val;
> +#endif
> +} u128;

This also looks totally wrong.

If gcc has native support for __int128, then the union is pointless.
Don't do it. Just do

#ifdef __SIZEOF_INT128__
typedef unsigned __int128 u128;
#else
typedef struct { ... u64 hi/lo in the right order } u128;
#endif

because it's possible that using the native bare type will make gcc
able to do better for various things.
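
For illustration, a minimal sketch of the conditional typedef being described
here, assuming only the compiler-provided __SIZEOF_INT128__ and __BYTE_ORDER__
predefines; this is not the patch's actual definition, and the fallback struct
simply mirrors the hi/lo layout of the union quoted above:

#include <linux/types.h>

#ifdef __SIZEOF_INT128__
/* gcc-4.6+: let the compiler carry 128-bit values natively. */
typedef unsigned __int128 u128;
#else
/* Fallback: two 64-bit halves, ordered to match memory layout. */
typedef struct {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        u64 lo, hi;
#else
        u64 hi, lo;
#endif
} u128;
#endif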

Sure, it's possible that you want to use a union in low-level
architecture code that implements the actual math, BUT EVEN THEN the
above union is pure and utter garbage. On 32-bit machines, you'd want
to make it a union of 4 32-bit entities etc. So putting it like this
in a generic file looks wrong. In fact, your very own generic
mul_u64_u64() would seem to want to use the "4 32-bit words" kind of
model.

Also, the union isn't used for generic code anyway, since the generic
code has that same __SIZEOF_INT128__ test for which generic version it
should include (and I wonder if it should just be

#ifdef __SIZEOF_INT128__
#include <linux/native-128bit.h>
#elif CONFIG_64BIT
#include <linux/generic64bit-128bit.h>
#else
#include <linux/generic64bit-128bit.h>
#endif

and then have separate files entirely for the "gcc handles the common
operations" vs "64-bit architecture needs two words for most things"
vs "32-bit architectures need 4 words for most things".

I dunno. But I think this is wrong.

Linus

2012-10-25 00:08:29

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Wed, 2012-10-24 at 16:18 -0700, Linus Torvalds wrote:

> #ifdef __SIZEOF_INT128__
> #include <linux/native-128bit.h>
> #elif CONFIG_64BIT
> #include <linux/generic64bit-128bit.h>
> #else
> #include <linux/generic64bit-128bit.h>
> #endif
>

I'm assuming you meant the last include to be:

#include <linux/generic32bit-128bit.h>

Cut and paste should be a federal crime.

-- Steve

2012-10-25 00:09:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Wed, Oct 24, 2012 at 5:08 PM, Steven Rostedt <[email protected]> wrote:
>
> I'm assuming you meant the last include to be:
>
> #include <linux/generic32bit-128bit.h>

That's a safe assumption.

> Cut and paste should be a federal crime.

We'd all be serving some hard time, I'm afraid.

Linus

2012-10-25 05:21:40

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Wed, Oct 24, 2012 at 11:53 PM, Juri Lelli <[email protected]> wrote:
> +#ifdef __SIZEOF_INT128__ /* gcc-4.6+ */
> + unsigned __int128 val;
> +#endif

So the definition of val depends on (gcc) __SIZEOF_INT128__...

> +/*
> + * Make usage of __int128 dependent on arch code so they can
> + * judge if gcc is doing the right thing for them and can over-ride
> + * any funnies.
> + */
> +
> +#ifndef ARCH_HAS_INT128

... but all generic users depend on (Kconfig) ARCH_HAS_INT128?

How can Kconfig know if gcc supports this?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2012-10-25 07:18:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: SCHED_DEADLINE v6


* Juri Lelli <[email protected]> wrote:

> kernel/sched/dl.c | 1650 ++++++++++++++++++++++++++++

I've got a stupid nit here: please make that deadline.c. Same
for cpudl.c.

(Just to stop future generations from wondering why the Linux
scheduler has a downloading module.)

Thanks,

Ingo

2012-10-25 09:53:39

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: SCHED_DEADLINE v6

On Thu, Oct 25, 2012 at 09:18:01AM +0200, Ingo Molnar wrote:
>
> * Juri Lelli <[email protected]> wrote:
>
> > kernel/sched/dl.c | 1650 ++++++++++++++++++++++++++++
>
> I've got a stupid nit here: please make that deadline.c. Same
> for cpudl.c.
>
> (Just to stop future generations from wondering why the Linux
> scheduler has a downloading module.)

Why not?

In case it needs to download vendor- or system-specific scheduling
policies.

:-)

--
Regards/Gruss,
Boris.

2012-10-25 13:39:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Thu, 2012-10-25 at 07:21 +0200, Geert Uytterhoeven wrote:
> On Wed, Oct 24, 2012 at 11:53 PM, Juri Lelli <[email protected]> wrote:
> > +#ifdef __SIZEOF_INT128__ /* gcc-4.6+ */
> > + unsigned __int128 val;
> > +#endif
>
> So the definition of val depends on (gcc) __SIZEOF_INT128__...
>
> > +/*
> > + * Make usage of __int128 dependent on arch code so they can
> > + * judge if gcc is doing the right thing for them and can over-ride
> > + * any funnies.
> > + */
> > +
> > +#ifndef ARCH_HAS_INT128
>
> ... but all generic users depend on (Kconfig) ARCH_HAS_INT128?

Ah, you're saying both should depend on the same thing. I fear there's a
chicken-and-egg problem in the code as it is now: the asm/math128.h header
needs the data structure but is also the one setting ARCH_HAS_INT128.

So it's not Kconfig.

> How can Kconfig know if gcc supports this?

It cannot; it's up to the asm/math128.h header to opt in to using it. This is
so archs can make sure gcc doesn't generate broken code or rely on libgcc
for its __int128 implementation.

Now, if we do as Linus suggests and push the data structure definition
into a separate header we could possibly avoid this.
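
For illustration, an arch opt-in header might look roughly like the sketch
below (a hypothetical arch/x86/include/asm/math128.h; the patchset's real
x86_64 header is the subject of patch 02/16 and may well differ):

#ifndef _ASM_X86_MATH128_H
#define _ASM_X86_MATH128_H

/* Only claim native 128-bit support where gcc actually provides it. */
#if defined(CONFIG_X86_64) && defined(__SIZEOF_INT128__)
#define ARCH_HAS_INT128
#endif

#endif /* _ASM_X86_MATH128_H */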

2012-10-25 13:47:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Wed, 2012-10-24 at 16:18 -0700, Linus Torvalds wrote:
> On Wed, Oct 24, 2012 at 2:53 PM, Juri Lelli <[email protected]> wrote:
> > From: Peter Zijlstra <[email protected]>
> >
> > Grow rudimentary u128 support without relying on gcc/libgcc.
>
> I missed the part where somebody explains why and what needs this?
> It's going to be very expensive indeed on some platforms, so the fact
> that it is *sometimes* cheap doesn't necessarily imply it should ever
> be used.
>
> So please, explain what the pressing need is that is so worthwhile
> that this is worth it. Maybe it was in a 00/16 cover letter, but not
> only was that not sent out to the people who got 01, you'd still want
> it in the commit message.

There's two use cases:

1) the proposed SCHED_DEADLINE needs to do some u64xu64 math, it
ends up having to multiply a deadline (in usec) with runtime (also
in usec).

2) the infrastructure adds mul_u64_u32_shr(), which is something we
do a lot of with all the time manipulation, apply a multiplier to
some u64 clock value.

We can do better on some archs than we can in generic, so this
interface could give a win there.


But yes, in general people should be very very reluctant to use this.
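
As a concrete (if hypothetical) illustration of use case 2, this is the
clocksource-style "apply a multiplier to a u64 clock value" pattern; the
wrapper name and the mult/shift values a caller would pass are made up:

#include <linux/math128.h>
#include <linux/types.h>

/* Illustrative only: scale a raw cycle count by a fixed-point
 * mult/shift pair, as the timekeeping code does all the time. */
static u64 cycles_to_ns_example(u64 cycles, u32 mult, u32 shift)
{
        return mul_u64_u32_shr(cycles, mult, shift);
}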

2012-10-25 14:14:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Thu, 2012-10-25 at 15:47 +0200, Peter Zijlstra wrote:
> 1) the proposed SCHED_DEADLINE needs to do some u64xu64 math, it
> ends up having to multiply a deadline (in usec) with runtime (also
> in usec).

s/usec/nsec/g

2012-10-25 16:58:44

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/16] sched: SCHED_DEADLINE v6

On 10/25/2012 12:18 AM, Ingo Molnar wrote:
>
> * Juri Lelli <[email protected]> wrote:
>
>> kernel/sched/dl.c | 1650 ++++++++++++++++++++++++++++
>
> I've got a stupid nit here: please make that deadline.c. Same
> for cpudl.c.
>

Sure, no problems with that.

Thanks and Regards,

- Juri

2012-10-25 22:26:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Thu, Oct 25, 2012 at 6:47 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2012-10-24 at 16:18 -0700, Linus Torvalds wrote:
>>
>> So please, explain what the pressing need is that is so worthwhile
>> that this is worth it. Maybe it was in a 00/16 cover letter, but not
>> only was that not sent out to the people who got 01, you'd still want
>> it in the commit message.
>
> There's two use cases:
>
> 1) the proposed SCHED_DEADLINE needs to do some u64xu64 math, it
> ends up having to multiply a deadline (in usec) with runtime (also
> in usec).
>
> 2) the infrastructure adds mul_u64_u32_shr(), which is something we
> do a lot of with all the time manipulation, apply a multiplier to
> some u64 clock value.
>
> We can do better on some archs than we can in generic, so this
> interface could give a win there.

So I have no objection to the mul_u64_u32_shr() model, exactly because

- it doesn't actually use u128 anywhere (except perhaps internally,
but that is totally about the implementation, not visible anywhere
else).

- it is fundamentally optimizable especially on 32-bit architectures
where it doesn't need to do a full 64x64 multiply.

it's the *rest* of the "u128" math I really object to. I also wonder
about the u64xu64 math case for SCHED_DEADLINE, because I assume that
it doesn't actually end up using the 128-bit result in that form, but
scales it down again some way?

In other words, the thing I really object to is exactly the whole
"generic 128-bit math". That's the part that can easily get very
expensive in 32-bit environments. Even for the "u64xu64" multiply for
SCHED_DEADLINE, how could those possibly be true 64-bit values (even if
your "usec" was wrong and it's actually "nsec")?

At what point does the scheduler talk/think about billions of seconds
in nanoseconds? Seriously?

That's a perfect example of where "true 128-bit math" is potentially
stupidly expensive on 32-bit platforms, when a 48x48->96 bit multiply
might be cheaper. And if we're talking about some fixed-point
arithmetic, and the thing actually gets shifted down again (like the
mul_u64_u32_shr) so that the final result is actually guaranteed to
fit in (say) 64 bits, then that would be cheaper yet.

I realize that some people seem to think that being "generic" is
superior, and think that maybe somebody wants to do 128-bit arithmetic
for other things. And I think that is exactly the wrong way to think,
because it just encourages people to do exactly the wrong thing,
because "look, 128-bit arithmetic is easily available so I can do
fancy things", and then it just happens to go really fast on x86-64,
and then sucks everywhere else.

Linus

2012-10-26 08:50:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Thu, 2012-10-25 at 15:26 -0700, Linus Torvalds wrote:
> it's the *rest* of the "u128" math I really object to. I also wonder
> about the u64xu64 math case for SCHED_DEADLINE, because I assume that
> it doesn't actually end up using the 128-bit result in that form, but
> scales it down again some way?

No, it does a compare on two u128, so it doesn't lose any precision.
If it were to scale down again and lose precision I'd agree with you
that introducing the u128 stuff is pointless.

The point is (as mentioned in the comments below) overflowing an actual
u64 is rare, however since some of this (specifically the
dl_{runtime,deadline} parameters) is user specified, we have to assume
we will overflow.

---

+/*
+ * Here we check if --at time t-- an entity (which is probably being
+ * [re]activated or, in general, enqueued) can use its remaining runtime
+ * and its current deadline _without_ exceeding the bandwidth it is
+ * assigned (function returns true if it can't). We are in fact applying
+ * one of the CBS rules: when a task wakes up, if the residual runtime
+ * over residual deadline fits within the allocated bandwidth, then we
+ * can keep the current (absolute) deadline and residual budget without
+ * disrupting the schedulability of the system. Otherwise, we should
+ * refill the runtime and set the deadline a period in the future,
+ * because keeping the current (absolute) deadline of the task would
+ * result in breaking guarantees promised to other tasks.
+ *
+ * This function returns true if:
+ *
+ * runtime / (deadline - t) > dl_runtime / dl_deadline ,
+ *
+ * IOW we can't recycle current parameters.
+ */
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
+{
+ u128 left, right;
+
+ /*
+ * left and right are the two sides of the equation above,
+ * after a bit of shuffling to use multiplications instead
+ * of divisions.
+ *
+ * Note that none of the time values involved in the two
+ * multiplications are absolute: dl_deadline and dl_runtime
+ * are the relative deadline and the maximum runtime of each
+ * instance, runtime is the runtime left for the last instance
+ * and (deadline - t), since t is rq->clock, is the time left
+ * to the (absolute) deadline. Therefore, overflowing the u64
+ * type is very unlikely to occur in both cases.
+ */
+ left = mul_u64_u64(dl_se->dl_deadline, dl_se->runtime);
+ right = mul_u64_u64((dl_se->deadline - t), dl_se->dl_runtime);
+
+ if (cmp_u128(left, right) > 0)
+ return true;
+
+ return false;
+}

2012-10-26 09:24:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives


* Peter Zijlstra <[email protected]> wrote:

> On Thu, 2012-10-25 at 15:26 -0700, Linus Torvalds wrote:
> > it's the *rest* of the "u128" math I really object to. I also wonder
> > about the u64xu64 math case for SCHED_DEADLINE, because I assume that
> > it doesn't actually end up using the 128-bit result in that form, but
> > scales it down again some way?
>
> No, it does a compare on two u128, so it doesn't lose any
> precision. If it were to scale down again and lose precision
> I'd agree with you that introducing the u128 stuff is
> pointless.
>
> The point is (as mentioned in the comments below) overflowing
> an actual u64 is rare, however since some of this
> (specifically the dl_{runtime,deadline} parameters) is user
> specified, we have to assume we will overflow.

So can we control this by restricting the users and avoiding the
overflow?

A 2^64 result should be a *huge* amount of space already for
just about anything.

Thanks,

Ingo

2012-10-26 09:36:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Fri, 2012-10-26 at 11:24 +0200, Ingo Molnar wrote:

> So can we control this by restricting the users and avoiding the
> overflow?
>
> A 2^64 result should be a *huge* amount of space already for
> just about anything.

I _think_ something like: dl_runtime * dl_deadline < U64_MAX, might do
that. The question is, is this constraint usable? Simplified that boils
down to about 4 seconds each, which sounds pretty much ok for most
people -- but such statements usually come back to bite you (640kb
anybody...).

Hmm, patch 8 (which adds period support) changes this slightly again.
Would it then end up being something like:

dl_period * dl_runtime < U64_MAX && dl_deadline * dl_runtime < U64_MAX

?

Juri, did I get that constraint right and do you know about use-cases
where this would be prohibitive?
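
For reference, the "about 4 seconds each" figure comes from taking the
square root of the u64 range: sqrt(2^64) ns ~= 4.29 * 10^9 ns ~= 4.3 s, so
if both dl_runtime and dl_deadline stay below roughly that value their
product cannot overflow a u64.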

2012-10-26 09:42:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives


* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2012-10-26 at 11:24 +0200, Ingo Molnar wrote:
>
> > So can we control this by restricting the users and avoiding
> > the overflow?
> >
> > A 2^64 result should be a *huge* amount of space already for
> > just about anything.
>
> I _think_ something like: dl_runtime * dl_deadline < U64_MAX,
> might do that. The question is, is this constraint usable?
> Simplified that boils down to about 4 seconds each, which
> sounds pretty much ok for most people -- but such statements
> usually come back to bite you (640kb anybody...).

We could constrain the precision, not the maximum value.

Having a 4 seconds hard limit is one thing, only having 10 nsecs
precision at 40 seconds is another.

Then the introduction of 128 bit math would be purely optional
and would address *that* limitation of precision, and only that
limitation. That way we could gladly skip 128 bit math.

Thanks,

Ingo

2012-10-26 09:52:29

by Harald Gustafsson

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Fri, Oct 26, 2012 at 11:35 AM, Peter Zijlstra <[email protected]> wrote:
> dl_period * dl_runtime < U64_MAX && dl_deadline * dl_runtime < U64_MAX

I think it makes sense to put the limitation on the product, since IMO if
you need a period longer than 4 seconds then the runtime is much shorter,
because you typically want to express something like executing for 100ms
every 10s. It is much less likely that an application would need to execute
for 7.8s out of a period of 8s and would not be happy with 3.9s out of 4s.

/Harald

2012-10-26 09:55:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Fri, 2012-10-26 at 11:42 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, 2012-10-26 at 11:24 +0200, Ingo Molnar wrote:
> >
> > > So can we control this by restricting the users and avoiding
> > > the overflow?
> > >
> > > A 2^64 result should be a *huge* amount of space already for
> > > just about anything.
> >
> > I _think_ something like: dl_runtime * dl_deadline < U64_MAX,
> > might do that. The question is, is this constraint usable?
> > Simplified that boils down to about 4 seconds each, which
> > sounds pretty much ok for most people -- but such statements
> > usually come back to bite you (640kb anybody...).
>
> We could constrain the precision, not the maximum value.
>
> Having a 4 seconds hard limit is one thing, only having 10 nsecs
> precision at 40 seconds is another.

That gets to be rather ugly I think.. for one it might surprise people,
secondly you get to have a bunch of conditionals and shifts in that code
path.

Personally I'd prefer to do the simple thing, esp. for a new interface.
So either do the hard limit or the u128 thing.

If we go with the hard limit, we can always address things when people
run into it and complain, at such a time we also have a better view of
people's uses and expectations methinks.

2012-10-26 10:04:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives


* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2012-10-26 at 11:42 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Fri, 2012-10-26 at 11:24 +0200, Ingo Molnar wrote:
> > >
> > > > So can we control this by restricting the users and avoiding
> > > > the overflow?
> > > >
> > > > A 2^64 result should be a *huge* amount of space already for
> > > > just about anything.
> > >
> > > I _think_ something like: dl_runtime * dl_deadline < U64_MAX,
> > > might do that. The question is, is this constraint usable?
> > > Simplified that boils down to about 4 seconds each, which
> > > sounds pretty much ok for most people -- but such statements
> > > usually come back to bite you (640kb anybody...).
> >
> > We could constrain the precision, not the maximum value.
> >
> > Having a 4 seconds hard limit is one thing, only having 10 nsecs
> > precision at 40 seconds is another.
>
> That gets to be rather ugly I think.. for one it might
> surprise people, secondly you get to have a bunch of
> conditionals and shifts in that code path.

I don't think a limitation of precision to about 64 bits is a
"surprise": it's high grade precision of 0.00000005 parts per
trillion...

( As a comparison, there's ~13 parts per trillion amount of pure
gold dissolved in ocean water. )

> Personally I'd prefer to do the simple thing, esp. for a new
> interface. So either do the hard limit or the u128 thing.

Given that the u128 thing, once it gets converted to machine
instructions, is not simple *at all*, that leaves us with the
hard limit.

> If we go with the hard limit, we can always address things
> when people run into it and complain, at such a time we also
> have a better view of people's uses and expectations methinks.

Indeed.

Thanks,

Ingo

2012-10-26 10:37:12

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Fri, 26 Oct 2012, Peter Zijlstra wrote:
> On Fri, 2012-10-26 at 11:42 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Fri, 2012-10-26 at 11:24 +0200, Ingo Molnar wrote:
> > >
> > > > So can we control this by restricting the users and avoiding
> > > > the overflow?
> > > >
> > > > A 2^64 result should be a *huge* amount of space already for
> > > > just about anything.
> > >
> > > I _think_ something like: dl_runtime * dl_deadline < U64_MAX,
> > > might do that. The question is, is this constraint usable?
> > > Simplified that boils down to about 4 seconds each, which
> > > sounds pretty much ok for most people -- but such statements
> > > usually come back to bite you (640kb anybody...).
> >
> > We could constrain the precision, not the maximum value.
> >
> > Having a 4 seconds hard limit is one thing, only having 10 nsecs
> > precision at 40 seconds is another.
>
> That gets to be rather ugly I think.. for one it might surprise people,
> secondly you get to have a bunch of conditionals and shifts in that code
> path.
>
> Personally I'd prefer to do the simple thing, esp. for a new interface.
> So either do the hard limit or the u128 thing.
>
> If we go with the hard limit, we can always address things when people
> run into it and complain, at such a time we also have a better view of
> people's uses and expectations methinks.

By all means. nsec precision is a completely academic thought
exercise. It's really pointless to even think about anything below
microseconds resolution.

We can still have the user space interface handing in the information
in nsec resolution, but it's reasonable to scale it down to something
useful. Just shift the incoming information right by 10, so you're in
the 1us resolution for all the internal math and all your limitation
problems are gone. A shift by ten for converting back and forth to
nsecs is not a real performance issue.

Thanks,

tglx
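
A tiny sketch of the scaling described above, with made-up helper and macro
names; the point is simply that the conversion is a pair of shifts, not a
division:

#include <linux/types.h>

#define DL_SCALE        10      /* internal unit is 1024 ns, ~1 us */

static inline u64 dl_ns_to_internal(u64 ns)
{
        return ns >> DL_SCALE;  /* user-supplied nsec -> internal units */
}

static inline u64 dl_internal_to_ns(u64 val)
{
        return val << DL_SCALE; /* internal units -> nsec (rounded down) */
}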




2012-10-26 10:44:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Fri, 2012-10-26 at 12:36 +0200, Thomas Gleixner wrote:
> By all means. nsec precision is a completely academic thought
> exercise. It's really pointless to even think about anything below
> microseconds resolution.
>
> We can still have the user space interface handing in the information
> in nsec resolution, but it's reasonable to scale it down to something
> useful. Just shift the incoming information right by 10, so you're in
> the 1us resolution for all the internal math and all your limitation
> problems are gone. A shift by ten for converting back and forth to
> nsecs is not a real performance issue.

I'm fine with that.. all I wanted was to not have the undefined overflow
we initially had.

I had hoped the u128 stuff might be elsewise useful, but if we don't
want to go there, that's fine.

2012-10-26 11:11:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives


* Peter Zijlstra <[email protected]> wrote:

> I had hoped the u128 stuff might be elsewise useful, but if we
> don't want to go there, that's fine.

I think it needs a clearer usecase - and even then the 32-bit
behavior still looks rather horrible ...

So if we can escape all that with reasonable restrictions then
that's far better than taking on this kind of overhead for
32-bit systems. 32-bit still matters, we do the ktime_t
complications for 32-bit systems and that's for a far smaller
effect.

[ Would be nice to also stick in a WARN_ONCE() in the key
place(s) just in case, to make sure the overflow cannot happen
silently in the future. ]

Thanks,

Ingo

2012-10-26 12:39:21

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Fri, 2012-10-26 at 12:36 +0200, Thomas Gleixner wrote:

> By all means. nsec precision is a completely academic thought
> exercise. It's really pointless to even think about anything below
> microseconds resolution.
>
> We can still have the user space interface handing in the information
> in nsec resolution, but it's reasonable to scale it down to something
> useful. Just shift the incoming information right by 10, so you're in
> the 1us resolution for all the internal math and all your limitation
> problems are gone. A shift by ten for converting back and forth to
> nsecs is not a real performance issue.

Just make sure this is well documented in the man pages, and that should
eliminate any "surprises". This is a new interface, we can just make
this part of the ABI. "The units are in nanoseconds, but all
calculations are performed to the nearest microsecond. Take this into
account for error analysis". People should be fine with this.

-- Steve


2012-10-26 12:57:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Fri, 2012-10-26 at 12:44 +0200, Peter Zijlstra wrote:
> > We can still have the user space interface handing in the information
> > in nsec resolution, but it's reasonable to scale it down to something
> > useful. Just shift the incoming information right by 10, so you're in
> > the 1us resolution for all the internal math and all your limitation
> > problems are gone. A shift by ten for converting back and forth to
> > nsecs is not a real performance issue.
>
> I'm fine with that.. all I wanted was to not have the undefined overflow
> we initially had.

Note that we still need the constraint checking with this, although with
both values shifted right 10 bits the range is now much bigger and
shouldn't be a practical limit anymore.

2012-10-26 13:09:31

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Fri, 2012-10-26 at 08:39 -0400, Steven Rostedt wrote:
> On Fri, 2012-10-26 at 12:36 +0200, Thomas Gleixner wrote:
>
> > By all means. nsec precision is a completely academic thought
> > exercise. It's really pointless to even think about anything below
> > microseconds resolution.
> >
> > We can still have the user space interface handing in the information
> > in nsec resolution, but it's reasonable to scale it down to something
> > useful. Just shift the incoming information right by 10, so you're in
> > the 1us resolution for all the internal math and all your limitation
> > problems are gone. A shift by ten for converting back and forth to
> > nsecs is not a real performance issue.
>
> Just make sure this is well documented in the man pages, and that should
> eliminate any "surprises". This is a new interface, we can just make
> this part of the ABI. "The units are in nanoseconds, but all
> calculations are performed to the nearest microsecond. Take this into
> account for error analysis". People should be fine with this.

Actually, a shift by 10 is a division by 1024, which is not truly down
to a microsecond. Would just a shift by 9 work as well? This would make
the resolution closer to half a microsecond. Otherwise things will
probably get screwy if the user passes in 1000 ns, and gets a zero
result.

-- Steve
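
To put numbers on that: 2^10 = 1024, so 1000 ns >> 10 = 0, whereas
2^9 = 512, so 1000 ns >> 9 = 1. A shift of 9 keeps a 1 us request non-zero,
at the cost of ~0.5 us granularity instead of ~1 us.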

2012-10-26 15:17:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Fri, Oct 26, 2012 at 1:49 AM, Peter Zijlstra <[email protected]> wrote:
>
> No, it does a compare on two u128

Actually, it apparently compares two multiplications.

That might be optimizable in itself.

> The point is (as mentioned in the comments below) overflowing an actual
> u64 is rare, however since some of this (specifically the
> dl_{runtime,deadline} parameters) is user specified, we have to assume
> we will overflow.

Any chance we could just limit them?

> + u128 left, right;
> +
> + /*
> + * left and right are the two sides of the equation above,
> + * after a bit of shuffling to use multiplications instead
> + * of divisions.
> + *
> + * Note that none of the time values involved in the two
> + * multiplications are absolute: dl_deadline and dl_runtime
> + * are the relative deadline and the maximum runtime of each
> + * instance, runtime is the runtime left for the last instance
> + * and (deadline - t), since t is rq->clock, is the time left
> + * to the (absolute) deadline. Therefore, overflowing the u64
> + * type is very unlikely to occur in both cases.
> + */
> + left = mul_u64_u64(dl_se->dl_deadline, dl_se->runtime);
> + right = mul_u64_u64((dl_se->deadline - t), dl_se->dl_runtime);
> +
> + if (cmp_u128(left, right) > 0)
> + return true;
> +
> + return false;

So how often could we do this without doing the multiplication at all?

It's trivial to see that 'right > left' if the individual
multiplicands are both bigger, for example. Maybe that is common?

And even if it overflows in 64-bit, does it overflow in 96? For 32-bit
machines, the difference there is quite noticeable.

So the above might actually be better written as a
"compare_64bit_multiply(a,b,c,d)". At the same time, are we
*seriously* ever talking about multi-second runtimes or deadlines?
Because even in nanoseconds, I assume that the common case *by far* in
scheduling would be about values smaller than four seconds, in which
case all of the above values are 32-bit, making the compares *much*
cheaper.

So on a 32-bit machine (say, x86-32), you might just have:

- or all the high words together, jump to slow case if the result is non-zero
- otherwise, do just two 32x32 multiplies and check which of the two is bigger.

That's a *huge* reduction in expensive multiplications.

And *THAT* is why generic 128-bit math is stupid. Don't do it.

Linus
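
A minimal sketch of the fast-path compare outlined above; the helper name is
made up, and the slow path reuses mul_u64_u64()/cmp_u128() from the patch
quoted earlier in the thread:

#include <linux/math128.h>
#include <linux/types.h>

/*
 * Return 1 if a*b > c*d, -1 if a*b < c*d, 0 if equal.
 * Fast path: when all four high 32-bit halves are zero, two 32x32->64
 * multiplies and one u64 compare are enough; no 128-bit math needed.
 */
static int cmp_mul_u64_u64(u64 a, u64 b, u64 c, u64 d)
{
        u128 left, right;

        if (!((a | b | c | d) >> 32)) {
                u64 l = (u64)(u32)a * (u32)b;
                u64 r = (u64)(u32)c * (u32)d;

                return (l > r) - (l < r);
        }

        left = mul_u64_u64(a, b);
        right = mul_u64_u64(c, d);

        return cmp_u128(left, right);
}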

2012-10-26 18:12:38

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

Hi,
first of all thanks to everybody for all this comments!

On 10/26/2012 05:56 AM, Peter Zijlstra wrote:
> On Fri, 2012-10-26 at 12:44 +0200, Peter Zijlstra wrote:
>>> We can still have the user space interface handing in the information
>>> in nsec resolution, but it's reasonable to scale it down to something
>>> useful. Just shift the incoming information right by 10, so you're in
>>> the 1us resolution for all the internal math and all your limitation
>>> problems are gone. A shift by ten for converting back and forth to
>>> nsecs is not a real performance issue.
>>
>> I'm fine with that.. all I wanted was to not have the undefined overflow
>> we initially had.
>
> Note that we still need the constraint checking with this, although with
> both values shifted right 10 bits the range is now much bigger and
> shouldn't be a practical limit anymore.
>

I'll try to recap what it seems to me you agreed on and what the changes
will be for the next iteration.

- remove first two patches (u128 math) [and keep them in a safe place
just in case the following constraints annoy future generations of users
:P]

- scale down (right by 10) incoming parameters so as to do internal
math with ~1us resolution (and scale up outgoing params)

- insert new constraints on -dl entities parameters:

o since we have - dl_period >= dl_deadline >= dl_runtime - the
only constraint we have to add for the overflow problem should
be dl_period * dl_runtime < U64_MAX

o to rule out problems with <= 1000ns parameters just force the user
to pass > 1000ns parameters (in the end it's our real resolution)

- WARN_ONCE() in proper places

- properly document all this (comments and Documentation)

What do you think?

Thanks a lot and Regards,

- Juri
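
To make the recap above concrete, a rough admission-time check could look
like the sketch below; the function and its exact limits are hypothetical,
not part of the patchset, and ~0ULL stands in for U64_MAX:

#include <linux/math64.h>
#include <linux/types.h>

/*
 * Illustrative check of the constraints listed above:
 * runtime <= deadline <= period, all strictly above 1000 ns, and
 * period * runtime must not overflow a u64.
 */
static bool dl_params_valid(u64 runtime, u64 deadline, u64 period)
{
        if (runtime <= 1000)
                return false;
        if (runtime > deadline || deadline > period)
                return false;
        if (period > div64_u64(~0ULL, runtime))
                return false;
        return true;
}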

2012-10-26 18:29:00

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Fri, 2012-10-26 at 11:12 -0700, Juri Lelli wrote:

> - scale down (right by 10) incoming parameters so as to do internal
> math with ~1us resolution (and scale up outgoing params)

Would scaling down by 9 be sufficient? That way the resolution is still
just less than 1us.

-- Steve

2012-10-26 18:34:29

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On Fri, 26 Oct 2012, Steven Rostedt wrote:

> On Fri, 2012-10-26 at 11:12 -0700, Juri Lelli wrote:
>
> > - scale down (right by 10) incoming parameters so as to do internal
> > math with ~1us resolution (and scale up outgoing params)
>
> Would scaling down by 9 be sufficient? That way the resolution is still
> just less than 1us.

You have to do rounding anyway. So it does not matter whether you
choose 10 or 9.

2012-10-26 18:41:45

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 01/16] math128: Introduce various 128bit primitives

On 10/26/2012 11:28 AM, Steven Rostedt wrote:
> On Fri, 2012-10-26 at 11:12 -0700, Juri Lelli wrote:
>
>> - scale down (right by 10) incoming parameters so as to do internal
>> math with ~1us resolution (and scale up outgoing params)
>
> Would scaling down by 9 be sufficient? That way the resolution is still
> just less than 1us.
>

I don't see any practical issue. Anyway, if - as Thomas said - "It's
really pointless to even think about anything below microseconds
resolution", then having an internal resolution just above a microsecond
sounds easier to justify to users. But I'm pretty open about
this :).

Thanks and Regards,

- Juri