2009-11-15 19:04:52

by Stijn Devriendt

[permalink] [raw]
Subject: [RFC][PATCH] sched_wait_block: wait for blocked threads

Hi Ingo, Peter, all,

The attached patch is a prototype for a new system call which
allows threads to wait for other threads becoming blocked.

Its main use is to allow threading libraries to resume
executing more CPU-bound work when one of their threads is
blocked, without having to over-allocate threads in the
normal situation.

The benefit over asynchronous I/O is that a threadpool
thread performing asynchronous I/O might not have
enough work in one item to keep the CPU busy during the
whole asynchronous operation, and that not all operations
are async-capable.
Giving control back to the library through a thread
waiting for the blocked one allows new work items to be
executed as long as the former is blocked.

Code performing this wait could look like:
pid_t parent = ...;
while (waitpid(parent, NULL, WNOHANG) != 0)
{
        if (sched_wait_block(parent, NULL) == 0)
        {
                // do work, possibly notify threadpool manager
                // to start another thread blocked on this one
                // first
        }
}

Any feedback on the concept is much appreciated.

Regards,
Stijn Devriendt


2009-11-15 19:05:01

by Stijn Devriendt

[permalink] [raw]
Subject: [PATCH] Initial prototype version of sched_wait_block


Signed-off-by: Stijn Devriendt <[email protected]>
---
arch/alpha/include/asm/unistd.h | 3 +-
arch/arm/include/asm/unistd.h | 1 +
arch/avr32/include/asm/unistd.h | 3 +-
arch/blackfin/include/asm/unistd.h | 3 +-
arch/cris/include/asm/unistd.h | 3 +-
arch/frv/include/asm/unistd.h | 3 +-
arch/h8300/include/asm/unistd.h | 3 +-
arch/ia64/include/asm/unistd.h | 3 +-
arch/m32r/include/asm/unistd.h | 3 +-
arch/m68k/include/asm/unistd.h | 3 +-
arch/microblaze/include/asm/unistd.h | 3 +-
arch/mips/include/asm/unistd.h | 4 +-
arch/mn10300/include/asm/unistd.h | 3 +-
arch/parisc/include/asm/unistd.h | 3 +-
arch/powerpc/include/asm/unistd.h | 3 +-
arch/s390/include/asm/unistd.h | 4 ++-
arch/sh/include/asm/unistd_32.h | 3 +-
arch/sh/include/asm/unistd_64.h | 3 +-
arch/sparc/include/asm/unistd.h | 3 +-
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/include/asm/unistd_64.h | 2 +
arch/x86/kernel/syscall_table_32.S | 1 +
arch/xtensa/include/asm/unistd.h | 3 +-
include/asm-generic/unistd.h | 5 ++-
include/linux/init_task.h | 1 +
include/linux/sched.h | 3 +-
include/linux/syscalls.h | 1 +
include/linux/wait.h | 1 +
kernel/fork.c | 2 +
kernel/sched.c | 64 ++++++++++++++++++++++++++++++++--
31 files changed, 117 insertions(+), 25 deletions(-)

diff --git a/arch/alpha/include/asm/unistd.h b/arch/alpha/include/asm/unistd.h
index 5b5c174..3b1b200 100644
--- a/arch/alpha/include/asm/unistd.h
+++ b/arch/alpha/include/asm/unistd.h
@@ -433,10 +433,11 @@
#define __NR_signalfd 476
#define __NR_timerfd 477
#define __NR_eventfd 478
+#define __NR_sched_wait_block 479

#ifdef __KERNEL__

-#define NR_SYSCALLS 479
+#define NR_SYSCALLS 480

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 4e506d0..76364ce 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -391,6 +391,7 @@
#define __NR_pwritev (__NR_SYSCALL_BASE+362)
#define __NR_rt_tgsigqueueinfo (__NR_SYSCALL_BASE+363)
#define __NR_perf_event_open (__NR_SYSCALL_BASE+364)
+#define __NR_sched_wait_block (__NR_SYSCALL_BASE+365)

/*
* The following SWIs are ARM private.
diff --git a/arch/avr32/include/asm/unistd.h b/arch/avr32/include/asm/unistd.h
index 89861a2..eb892f6 100644
--- a/arch/avr32/include/asm/unistd.h
+++ b/arch/avr32/include/asm/unistd.h
@@ -299,9 +299,10 @@
#define __NR_signalfd 279
/* 280 was __NR_timerfd */
#define __NR_eventfd 281
+#define __NR_sched_wait_block 282

#ifdef __KERNEL__
-#define NR_syscalls 282
+#define NR_syscalls 283

/* Old stuff */
#define __IGNORE_uselib
diff --git a/arch/blackfin/include/asm/unistd.h b/arch/blackfin/include/asm/unistd.h
index 779be02..8535d69 100644
--- a/arch/blackfin/include/asm/unistd.h
+++ b/arch/blackfin/include/asm/unistd.h
@@ -388,8 +388,9 @@
#define __NR_pwritev 367
#define __NR_rt_tgsigqueueinfo 368
#define __NR_perf_event_open 369
+#define __NR_sched_wait_block 370

-#define __NR_syscall 370
+#define __NR_syscall 371
#define NR_syscalls __NR_syscall

/* Old optional stuff no one actually uses */
diff --git a/arch/cris/include/asm/unistd.h b/arch/cris/include/asm/unistd.h
index c170793..8360734 100644
--- a/arch/cris/include/asm/unistd.h
+++ b/arch/cris/include/asm/unistd.h
@@ -339,10 +339,11 @@
#define __NR_inotify_init1 332
#define __NR_preadv 333
#define __NR_pwritev 334
+#define __NR_sched_wait_block 335

#ifdef __KERNEL__

-#define NR_syscalls 335
+#define NR_syscalls 336

#include <arch/unistd.h>

diff --git a/arch/frv/include/asm/unistd.h b/arch/frv/include/asm/unistd.h
index be6ef0f..4c90687 100644
--- a/arch/frv/include/asm/unistd.h
+++ b/arch/frv/include/asm/unistd.h
@@ -343,10 +343,11 @@
#define __NR_pwritev 334
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_event_open 336
+#define __NR_sched_wait_block 337

#ifdef __KERNEL__

-#define NR_syscalls 337
+#define NR_syscalls 338

#define __ARCH_WANT_IPC_PARSE_VERSION
/* #define __ARCH_WANT_OLD_READDIR */
diff --git a/arch/h8300/include/asm/unistd.h b/arch/h8300/include/asm/unistd.h
index 99f3c35..eb73a9c 100644
--- a/arch/h8300/include/asm/unistd.h
+++ b/arch/h8300/include/asm/unistd.h
@@ -325,10 +325,11 @@
#define __NR_move_pages 317
#define __NR_getcpu 318
#define __NR_epoll_pwait 319
+#define __NR_sched_wait_block 320

#ifdef __KERNEL__

-#define NR_syscalls 320
+#define NR_syscalls 321

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/ia64/include/asm/unistd.h b/arch/ia64/include/asm/unistd.h
index 5a5347f..ccb2154 100644
--- a/arch/ia64/include/asm/unistd.h
+++ b/arch/ia64/include/asm/unistd.h
@@ -311,11 +311,12 @@
#define __NR_preadv 1319
#define __NR_pwritev 1320
#define __NR_rt_tgsigqueueinfo 1321
+#define __NR_sched_wait_block 1322

#ifdef __KERNEL__


-#define NR_syscalls 298 /* length of syscall table */
+#define NR_syscalls 299 /* length of syscall table */

/*
* The following defines stop scripts/checksyscalls.sh from complaining about
diff --git a/arch/m32r/include/asm/unistd.h b/arch/m32r/include/asm/unistd.h
index cf701c9..f4a0030 100644
--- a/arch/m32r/include/asm/unistd.h
+++ b/arch/m32r/include/asm/unistd.h
@@ -330,10 +330,11 @@
/* #define __NR_timerfd 322 removed */
#define __NR_eventfd 323
#define __NR_fallocate 324
+#define __NR_sched_wait_block 325

#ifdef __KERNEL__

-#define NR_syscalls 325
+#define NR_syscalls 326

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_STAT64
diff --git a/arch/m68k/include/asm/unistd.h b/arch/m68k/include/asm/unistd.h
index 48b87f5..e792350 100644
--- a/arch/m68k/include/asm/unistd.h
+++ b/arch/m68k/include/asm/unistd.h
@@ -336,10 +336,11 @@
#define __NR_pwritev 330
#define __NR_rt_tgsigqueueinfo 331
#define __NR_perf_event_open 332
+#define __NR_sched_wait_block 333

#ifdef __KERNEL__

-#define NR_syscalls 333
+#define NR_syscalls 334

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/microblaze/include/asm/unistd.h b/arch/microblaze/include/asm/unistd.h
index cb05a07..df787b2 100644
--- a/arch/microblaze/include/asm/unistd.h
+++ b/arch/microblaze/include/asm/unistd.h
@@ -382,8 +382,9 @@
#define __NR_pwritev 364 /* new */
#define __NR_rt_tgsigqueueinfo 365 /* new */
#define __NR_perf_event_open 366 /* new */
+#define __NR_sched_wait_block 367

-#define __NR_syscalls 367
+#define __NR_syscalls 368

#ifdef __KERNEL__
#ifndef __ASSEMBLY__
diff --git a/arch/mips/include/asm/unistd.h b/arch/mips/include/asm/unistd.h
index 8c9dfa9..ac3086f 100644
--- a/arch/mips/include/asm/unistd.h
+++ b/arch/mips/include/asm/unistd.h
@@ -355,11 +355,11 @@
#define __NR_rt_tgsigqueueinfo (__NR_Linux + 332)
#define __NR_perf_event_open (__NR_Linux + 333)
#define __NR_accept4 (__NR_Linux + 334)
-
+#define __NR_sched_wait_block (__NR_Linux + 335)
/*
* Offset of the last Linux o32 flavoured syscall
*/
-#define __NR_Linux_syscalls 334
+#define __NR_Linux_syscalls 335

#endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */

diff --git a/arch/mn10300/include/asm/unistd.h b/arch/mn10300/include/asm/unistd.h
index 2a98393..3d0ec30 100644
--- a/arch/mn10300/include/asm/unistd.h
+++ b/arch/mn10300/include/asm/unistd.h
@@ -348,10 +348,11 @@
#define __NR_pwritev 335
#define __NR_rt_tgsigqueueinfo 336
#define __NR_perf_event_open 337
+#define __NR_sched_wait_block 338

#ifdef __KERNEL__

-#define NR_syscalls 338
+#define NR_syscalls 339

/*
* specify the deprecated syscalls we want to support on this arch
diff --git a/arch/parisc/include/asm/unistd.h b/arch/parisc/include/asm/unistd.h
index cda1583..383332b 100644
--- a/arch/parisc/include/asm/unistd.h
+++ b/arch/parisc/include/asm/unistd.h
@@ -811,8 +811,9 @@
#define __NR_pwritev (__NR_Linux + 316)
#define __NR_rt_tgsigqueueinfo (__NR_Linux + 317)
#define __NR_perf_event_open (__NR_Linux + 318)
+#define __NR_sched_wait_block (__NR_Linux + 319)

-#define __NR_Linux_syscalls (__NR_perf_event_open + 1)
+#define __NR_Linux_syscalls (__NR_sched_wait_block + 1)


#define __IGNORE_select /* newselect */
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index f6ca761..466aa45 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -345,10 +345,11 @@
#define __NR_preadv 320
#define __NR_pwritev 321
#define __NR_rt_tgsigqueueinfo 322
+#define __NR_sched_wait_block 323

#ifdef __KERNEL__

-#define __NR_syscalls 323
+#define __NR_syscalls 324

#define __NR__exit __NR_exit
#define NR_syscalls __NR_syscalls
diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index cb5232d..c0dabc5 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -269,7 +269,9 @@
#define __NR_pwritev 329
#define __NR_rt_tgsigqueueinfo 330
#define __NR_perf_event_open 331
-#define NR_syscalls 332
+#define __NR_sched_wait_block 332
+
+#define NR_syscalls 333

/*
* There are some system calls that are not present on 64 bit, some
diff --git a/arch/sh/include/asm/unistd_32.h b/arch/sh/include/asm/unistd_32.h
index f3fd1b9..16516d3 100644
--- a/arch/sh/include/asm/unistd_32.h
+++ b/arch/sh/include/asm/unistd_32.h
@@ -345,8 +345,9 @@
#define __NR_pwritev 334
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_event_open 336
+#define __NR_sched_wait_block 337

-#define NR_syscalls 337
+#define NR_syscalls 338

#ifdef __KERNEL__

diff --git a/arch/sh/include/asm/unistd_64.h b/arch/sh/include/asm/unistd_64.h
index 343ce8f..6bf4ab4 100644
--- a/arch/sh/include/asm/unistd_64.h
+++ b/arch/sh/include/asm/unistd_64.h
@@ -385,10 +385,11 @@
#define __NR_pwritev 362
#define __NR_rt_tgsigqueueinfo 363
#define __NR_perf_event_open 364
+#define __NR_sched_wait_block 365

#ifdef __KERNEL__

-#define NR_syscalls 365
+#define NR_syscalls 366

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/sparc/include/asm/unistd.h b/arch/sparc/include/asm/unistd.h
index 42f2316..482664d 100644
--- a/arch/sparc/include/asm/unistd.h
+++ b/arch/sparc/include/asm/unistd.h
@@ -396,8 +396,9 @@
#define __NR_pwritev 325
#define __NR_rt_tgsigqueueinfo 326
#define __NR_perf_event_open 327
+#define __NR_sched_wait_block 328

-#define NR_SYSCALLS 328
+#define NR_SYSCALLS 329

#ifdef __32bit_syscall_numbers__
/* Sparc 32-bit only has the "setresuid32", "getresuid32" variants,
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 581b056..a63001a 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -841,4 +841,5 @@ ia32_sys_call_table:
.quad compat_sys_pwritev
.quad compat_sys_rt_tgsigqueueinfo /* 335 */
.quad sys_perf_event_open
+ .quad sys_sched_wait_block
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6fb3c20..91a9e56 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,6 +342,7 @@
#define __NR_pwritev 334
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_event_open 336
+#define __NR_sched_wait_block 337

#ifdef __KERNEL__

diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 8d3ad0a..11e943e 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
__SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
#define __NR_perf_event_open 298
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)
+#define __NR_sched_wait_block 299
+__SYSCALL(__NR_sched_wait_block, sys_sched_wait_block)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 0157cd2..2f93cef 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
.long sys_pwritev
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_event_open
+ .long sys_sched_wait_block
diff --git a/arch/xtensa/include/asm/unistd.h b/arch/xtensa/include/asm/unistd.h
index c092c8f..834aa00 100644
--- a/arch/xtensa/include/asm/unistd.h
+++ b/arch/xtensa/include/asm/unistd.h
@@ -681,8 +681,9 @@ __SYSCALL(304, sys_signalfd, 3)
__SYSCALL(305, sys_ni_syscall, 0)
#define __NR_eventfd 306
__SYSCALL(306, sys_eventfd, 1)
+#define __NR_sched_wait_block 307

-#define __NR_syscall_count 307
+#define __NR_syscall_count 308

/*
* sysxtensa syscall handler
diff --git a/include/asm-generic/unistd.h b/include/asm-generic/unistd.h
index d76b66a..ff21a1b 100644
--- a/include/asm-generic/unistd.h
+++ b/include/asm-generic/unistd.h
@@ -623,8 +623,11 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
#define __NR_perf_event_open 241
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)

+#define __NR_sched_wait_block 242
+__SYSCALL(__NR_sched_wait_block, sys_sched_wait_block)
+
#undef __NR_syscalls
-#define __NR_syscalls 242
+#define __NR_syscalls 243

/*
* All syscalls below here should go away really,
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 21a6f5d..21ed548 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -145,6 +145,7 @@ extern struct cred init_cred;
.pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO), \
.ptraced = LIST_HEAD_INIT(tsk.ptraced), \
.ptrace_entry = LIST_HEAD_INIT(tsk.ptrace_entry), \
+ .task_waiters = __WAIT_QUEUE_HEAD_INITIALIZER(tsk.task_waiters), \
.real_parent = &tsk, \
.parent = &tsk, \
.children = LIST_HEAD_INIT(tsk.children), \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c46023..d4639b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1280,7 +1280,7 @@ struct task_struct {
unsigned in_execve:1; /* Tell the LSMs that the process is doing an
* execve */
unsigned in_iowait:1;
-
+ unsigned in_taskwait:1;

/* Revert to default priority/policy when forking */
unsigned sched_reset_on_fork:1;
@@ -1315,6 +1315,7 @@ struct task_struct {
struct list_head ptraced;
struct list_head ptrace_entry;

+ wait_queue_head_t task_waiters;
/*
* This is the tracer handle for the ptrace BTS extension.
* This field actually belongs to the ptracer task.
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b50974a..e18d7e8 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -406,6 +406,7 @@ asmlinkage long sys_sched_rr_get_interval(pid_t pid,
struct timespec __user *interval);
asmlinkage long sys_setpriority(int which, int who, int niceval);
asmlinkage long sys_getpriority(int which, int who);
+asmlinkage long sys_sched_wait_block(pid_t pid, struct timespec __user *uts);

asmlinkage long sys_shutdown(int, int);
asmlinkage long sys_reboot(int magic1, int magic2, unsigned int cmd,
diff --git a/include/linux/wait.h b/include/linux/wait.h
index a48e16b..4afb340 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -157,6 +157,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
#define wake_up_nr(x, nr) __wake_up(x, TASK_NORMAL, nr, NULL)
#define wake_up_all(x) __wake_up(x, TASK_NORMAL, 0, NULL)
#define wake_up_locked(x) __wake_up_locked((x), TASK_NORMAL)
+#define wake_up_sync_all(x) __wake_up_sync((x), TASK_NORMAL, 0)

#define wake_up_interruptible(x) __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL)
#define wake_up_interruptible_nr(x, nr) __wake_up(x, TASK_INTERRUPTIBLE, nr, NULL)
diff --git a/kernel/fork.c b/kernel/fork.c
index 166b8c4..e664442 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1121,6 +1121,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->blocked_on = NULL; /* not blocked yet */
#endif

+ init_waitqueue_head(&p->task_waiters);
+
p->bts = NULL;

p->stack_start = stack_start;
diff --git a/kernel/sched.c b/kernel/sched.c
index 7179c80..98f5480 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5440,6 +5440,7 @@ asmlinkage void __sched schedule(void)
struct rq *rq;
int cpu;

+
need_resched:
preempt_disable();
cpu = smp_processor_id();
@@ -5450,6 +5451,8 @@ need_resched:

release_kernel_lock(prev);
need_resched_nonpreemptible:
+ if (prev->state && !unlikely(prev->in_taskwait) && unlikely(waitqueue_active(&prev->task_waiters)))
+ wake_up_sync_all(&prev->task_waiters);

schedule_debug(prev);

@@ -5467,7 +5470,7 @@ need_resched_nonpreemptible:
deactivate_task(rq, prev, 1);
switch_count = &prev->nvcsw;
}
-
+
pre_schedule(rq, prev);

if (unlikely(!rq->nr_running))
@@ -5717,8 +5720,8 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode,
if (unlikely(!q))
return;

- if (unlikely(!nr_exclusive))
- wake_flags = 0;
+ /*if (unlikely(!nr_exclusive))
+ wake_flags = 0;*/

spin_lock_irqsave(&q->lock, flags);
__wake_up_common(q, mode, nr_exclusive, wake_flags, key);
@@ -6889,6 +6892,61 @@ out_unlock:
return retval;
}

+SYSCALL_DEFINE2(sched_wait_block, pid_t, pid, struct timespec __user *, uts)
+{
+ struct task_struct* p;
+ long retval;
+ struct timespec ts;
+ long timeout = MAX_SCHEDULE_TIMEOUT;
+
+ if (pid < 0)
+ {
+ printk(KERN_INFO "Unknown PID: %d\n", pid);
+ return -EINVAL;
+ }
+
+ if (uts)
+ {
+ if (copy_from_user(&ts, uts, sizeof(struct timespec)))
+ return -EINVAL;
+
+ timeout = timespec_to_jiffies(&ts);
+ printk(KERN_INFO "Timeout: %d\n", timeout);
+
+ if (timeout < 0)
+ return -EINVAL;
+ }
+
+ read_lock(&tasklist_lock);
+ p = find_process_by_pid(pid);
+ if (!p)
+ {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ get_task_struct(p);
+ read_unlock(&tasklist_lock);
+
+ printk(KERN_INFO "Waiting...");
+ // logic
+ current->in_taskwait = 1;
+ retval = wait_event_interruptible_timeout(p->task_waiters, p->state, timeout);
+ current->in_taskwait = 0;
+ // endlogic
+ printk("Done: %d\n", retval);
+
+ if (retval == -ERESTARTSYS)
+ retval = -EINTR;
+ else if (p->state)
+ retval = 0;
+ else
+ retval = -EAGAIN;
+
+ put_task_struct(p);
+
+ return retval;
+}
+
static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;

void sched_show_task(struct task_struct *p)
--
1.6.3.3

2009-11-16 08:36:54

by Ingo Molnar

[permalink] [raw]
Subject: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)


(Cc:-ed more interested parties)

* Stijn Devriendt <[email protected]> wrote:

> Hi Ingo, Peter, all,
>
> The attached patch is a prototype for a new system call which allows
> threads to wait for other threads being blocked.
>
> Its main use is to allow threading libraries to resume executing more
> CPU-bound work when one of its threads is blocked while not having to
> over-allocating threads in a normal situation.
>
> Benefit over asynchronous I/O is that a threadpool thread that
> performs asynchronous I/O might not have work enough in one item to
> keep the CPU busy during the whole asynchronous operation and that not
> all operations are async capable. Giving control back to the library
> through a thread waiting for the blocked one allows new workitems to
> be executed as long as the former is blocked.
>
> Code performing this wait could look like:
> pid_t parent = ...;
> while (waitpid(parent, NULL, WNOHANG) != 0)
> {
> if (sched_wait_block(parent, NULL) == 0)
> {
> // do work, possibly notify threadpool manager
> // to start another thread blocked on this one
> // first
> }
> }
>
> Any feedback on the concept is much appreciated.

That is a ... rather interesting idea IMO.

Regarding the API and your patch, i think we can and should do something
different and more capable - while still keeping your basic idea:

Lets turn it all around on its head and add the capability to user-space
to observe the 'parallelism' of a workload (not limit it to the blocking
of a single task) and allow the poll()ing of that quantity - without
affecting workloads.

It should not be limited to a single task, and it should work with
existing syscall APIs - i.e. be fd based.

Incidentally we already have a syscall and a kernel subsystem that is
best suited to deal with such types of issues: perf events. I think we
can create a new, special performance event type that observes
task/workload (or CPU) parallelism:

PERF_TYPE_PARALLELISM

With a 'parallelism_threshold' attribute. (which is '1' for a single
task. See below.)

And then we can use poll() in the thread manager task to observe PIDs,
workloads or full CPUs. The poll() implementation of perf events is fast
and scalable.

( Note: there's no need to actually _capture_ the events into the
ring-buffer - this is done by not mmap()-ing the fd. I.e. we'll just
have a pure poll() wakeup overhead and no tracing overhead. )

The semantics are basically that we are observing task
schedule/unschedule events and keep a count and a threshold - and can
poll() on that. perf_event_attr can be used to inject a 'minimum
parallelism' threshold value (and maybe a 'max parallelism' value as
well).

Events are emitted (and poll() returns) if the observed workload gets
'blocked' according to the parallelism threshold - i.e. if the number of
runnable tasks drops below the threshold.

This fits very nicely into the existing perf events API and we wouldnt
have to add a new syscall.

Usage is very simple and straightforward, and can happen on various
levels of 'observation detail':

- the new fd can be attached to a specific PID (like your syscall).
perf_event_attr::threshold == 1 means we get the semantics of your
sched_wait_block() system call. Note that poll() wont have to do a
PID lookup (as it is already attached) so it will be much faster than
sched_wait_block().

- the new fd can be attached to a hierarchy of tasks and observe _all_
of the parallelism there. This has the advantage of not having to
track each thread in a pool of threads. (this is done via inherited
events, see include/linux/perf_event.h:perf_event_attr::inherit) In
this case a parallelism threshold value larger than 1 makes sense
too, to allow the workload to spread to a number of CPUs. On a 4-CPU
system if we set threshold==4 it means that we'll return from poll()
if the number of runnable tasks drops below 4.

- the new fd can be attached to a CPU - observing parallelism of a full
CPU without having to track all workloads. In this case threshold==1
means that we'll return from poll() if the last task on that CPU
schedules out - i.e. if the CPU becomes idle.

etc.
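
To make the PID-attach case concrete, the user-space side could look
roughly like the sketch below. PERF_TYPE_PARALLELISM and its threshold
attribute are of course hypothetical at this point - the sketch simply
abuses attr.config to carry the threshold:

        #include <linux/perf_event.h>
        #include <sys/syscall.h>
        #include <poll.h>
        #include <string.h>
        #include <unistd.h>

        static int open_parallelism_fd(pid_t pid, unsigned long threshold)
        {
                struct perf_event_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.size = sizeof(attr);
                attr.type = PERF_TYPE_PARALLELISM;      /* hypothetical type */
                attr.config = threshold;                /* hypothetical threshold */
                attr.inherit = 1;                       /* observe children too */

                /* never mmap() the fd: no ring-buffer, no tracing overhead,
                 * we only want the poll() wakeups */
                return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
        }

        static void wait_for_less_parallelism(pid_t pid)
        {
                struct pollfd pfd = { .fd = open_parallelism_fd(pid, 1), .events = POLLIN };

                while (poll(&pfd, 1, -1) > 0) {
                        /* the observed workload dropped below the threshold:
                         * hand out another work item here */
                }
        }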

This would make a very powerful task queueing framework. It basically
allows a 'lazy' user-space scheduler, which only activates if the kernel
scheduler has run out of work.

What do you think?

Ingo

2009-11-16 18:03:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)



On Mon, 16 Nov 2009, Ingo Molnar wrote:
>
> Regarding the API and your patch, i think we can and should do something
> different and more capable - while still keeping your basic idea:

Actually, I'd suggest exactly the reverse.

Yes, do something different, but _less_ capable, and much simpler:
introduce the notion of "grouped thread scheduling", where a _group_ of
threads gets scheduled as one thread.

Think of it like a classic user-level threading package, where one process
implements multiple threads entirely in user space, and switches between
them. Except we'd do the exact reverse: create multiple threads in the
kernel, but only run _one_ of them at a time. So as far as the scheduler
is concerned, it acts as just a single thread - except it's a single
thread that has multiple instances associated with it.

And every time the "currently active" thread in that group runs out of CPU
time - or any time it sleeps - we'd just go on to the next thread in the
group.

There are potentially lots of cases where you want to use multiple threads
not because you want multiple CPU's, but because you want to have "another
thread ready" for when one thread sleeps on IO. Or you may use threads as
a container - again, you may not need a lot of CPU, but you split your
single load up into multiple execution contexts just because you had some
independent things going on (think UI threads).

As far as I can tell, that is pretty much what Stijn Devriendt wanted: he
may have lots of threads, but he effectively really just wants "one CPU"
worth of processing.

It's also what we often want with AIO-like threads: it's not that we want
CPU parallelism, and if the data is in caches, we'd like to run the IO
thread immediately and not switch CPU's at all, and actually do it all
synchronously. It's just that _if_ the AIO thread blocks, we'd like to
resume the original thread that may have better things to do.

No "observe CPU parallelism" or anything fancy at all. Just a "don't
consider these X threads to be parallel" flag to clone (or a separate
system call).

Imagine doing async system calls with the equivalent of

- create such an "affine" thread in kernel space
- run the IO in that affine thread - if it runs to completion without
blocking and in a timeslice, never schedule at all.

where these "affine" threads would be even more lightweight than regular
threads, because they don't even act as real scheduling entities, they are
grouped together with the original one.
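
Purely as an illustration of that shape of interface (the CLONE_GROUPED
flag below does not exist - it just stands in for whatever the real knob
would end up being, and is defined as 0 so the sketch degrades to a plain
clone):

        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdlib.h>
        #include <unistd.h>

        #define CLONE_GROUPED 0         /* placeholder for the hypothetical flag */

        struct io_req {
                int fd;
                void *buf;
                size_t len;
                ssize_t ret;
        };

        static int io_thread(void *arg)
        {
                struct io_req *req = arg;

                /* if the data is cached this runs to completion within the
                 * caller's timeslice; if it blocks, the original thread in
                 * the group gets to run instead */
                req->ret = read(req->fd, req->buf, req->len);
                return 0;
        }

        static void submit_read(struct io_req *req)
        {
                char *stack = malloc(64 * 1024);

                clone(io_thread, stack + 64 * 1024,
                      CLONE_VM | CLONE_FILES | CLONE_GROUPED, req);
        }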

Linus

2009-11-16 18:19:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)



On Mon, 16 Nov 2009, Linus Torvalds wrote:
>
> Think of it like a classic user-level threading package, where one process
> implements multiple threads entirely in user space, and switches between
> them. Except we'd do the exact reverse: create multiple threads in the
> kernel, but only run _one_ of them at a time. So as far as the scheduler
> is concerned, it acts as just a single thread - except it's a single
> thread that has multiple instances associated with it.

Side note: before anybody asks why not do threads in user space to begin
with, it's simple: IO, exceptions, and timers. All need kernel support.

User-level threading works very well, and is usually efficient as hell.
It's not that long since people constantly claimed that thread libraries
in user space were much better, and tried to use complex NxM models to do
them, exactly because user threads are so great.

But almost nobody does user-level threading now, except for specific
languages that are built around threading. Why? It's not because they
don't work, it's because they have a few very annoying problems that make
them not work at all for enough situations that it gets very painful. And
it's usually about IO and system calls, but sometimes it's about page
faults, and sometimes it's about the pain of multiplexing that annoying
timer signal and all the crap that involves.

So I'm suggesting that maybe we could look at doing kernel threads that do
what user-level threading does. It sounds idiotic, and maybe it is - after
all, traditionally one of the reasons for user-level threads is that you
can avoid all the kernel overheads. But if we can make some really
low-overhead "thread within a thread" model, maybe we could have close to
the best of both worlds.

Some people really want threads for multi-CPU workloads and spreading
things out. That's what we have now. But other loads want threads for
entirely different reasons, like just hiding IO latencies.

Linus

2009-11-16 18:30:06

by Alan

[permalink] [raw]
Subject: Re: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)

On Mon, 16 Nov 2009 10:02:50 -0800 (PST)
Linus Torvalds <[email protected]> wrote:

>
>
> On Mon, 16 Nov 2009, Ingo Molnar wrote:
> >
> > Regarding the API and your patch, i think we can and should do something
> > different and more capable - while still keeping your basic idea:
>
> Actually, I'd suggest exactly the reverse.
>
> Yes, do something different, but _less_ capable, and much simpler:
> introduce the notion of "grouped thread scheduling", where a _group_ of
> threads gets scheduled as one thread.

And preferably expose it so it can be used by groups of processes as well
as just "threads" in the userspace shared pid etc sense. There are quite
a few applications where both "only run one of us" and "don't pre-empt
within the group" are a useful semantic (you often don't want the
pre-empting within the group of threads because it messes up all the
improved locking opportunities). In particular the usual buzz locks don't
mix well with "only run one of us".

Alan

2009-11-16 19:13:17

by Stijn Devriendt

[permalink] [raw]
Subject: Re: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)

> It should not be limited to a single task, and it should work with
> existing syscall APIs - i.e. be fd based.
>
> Incidentally we already have a syscall and a kernel subsystem that is
> best suited to deal with such types of issues: perf events. I think we
> can create a new, special performance event type that observes
> task/workload (or CPU) parallelism:
>
>        PERF_TYPE_PARALLELISM
>
> With a 'parallelism_threshold' attribute. (which is '1' for a single
> task. See below.)

On the one hand this looks like exactly where it belongs, as you're
monitoring performance to keep it up to speed, but it does make
the userspace component depend on a profiling-oriented, optional
kernel interface.

>
> And then we can use poll() in the thread manager task to observe PIDs,
> workloads or full CPUs. The poll() implementation of perf events is fast
> and scalable.

I've had a quick peek at the perf code and how it currently hooks into
the scheduler, and at first glance it looks like 2 additional context switches
are required when using perf. The scheduler will first schedule the idle
thread, only to later find out that the schedule tail woke up another process
to run. My initial solution woke up the process before making a
scheduling decision. Depending on context switch times, the original
blocking operation may already have been unblocked (especially on SMP);
e.g. a blocked user-space mutex that was held only briefly.
Feel free to correct me here, as it was merely a quick peek.

>
> This would make a very powerful task queueing framework. It basically
> allows a 'lazy' user-space scheduler, which only activates if the kernel
> scheduler has run out of work.
>
> What do you think?
>
>        Ingo

I definitely like the way this approach can also work globally.

Stijn

2009-11-16 19:49:19

by Stijn Devriendt

[permalink] [raw]
Subject: Re: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)

> Think of it like a classic user-level threading package, where one process
> implements multiple threads entirely in user space, and switches between
> them. Except we'd do the exact reverse: create multiple threads in the
> kernel, but only run _one_ of them at a time. So as far as the scheduler
> is concerned, it acts as just a single thread - except it's a single
> thread that has multiple instances associated with it.
>
> And every time the "currently active" thread in that group runs out of CPU
> time - or any time it sleeps - we'd just go on to the next thread in the
> group.

It almost makes me think of, excuse me, fibers ;)
One of the problems with the original approach is that the waiting threads
are woken even if the CPU load is high enough to keep the CPU going.
I had been thinking of inheriting the timeslice for the woken thread, but
this approach does way more than that. I also suspect the original approach
of introducing unfairness.

> No "observe CPU parallelism" or anything fancy at all. Just a "don't
> consider these X threads to be parallel" flag to clone (or a separate
> system call).

I wouldn't rule it out completely, as in a sense it's comparable to a JVM
monitoring performance counters to see if it needs to run the JIT optimizer.

> Imagine doing async system calls with the equivalent of
>
>  - create such an "affine" thread in kernel space
>  - run the IO in that affine thread - if it runs to completion without
>    blocking and in a timeslice, never schedule at all.
>
> where these "affine" threads would be even more lightweight than regular
> threads, because they don't even act as real scheduling entities, they are
> grouped together with the original one.
>
>                        Linus
>

One extra catch I didn't even think of in the original approach is
that you still need a way of saying to the kernel: no more work here.

My original approach fails bluntly and I will happily take credit for that ;)
The perf-approach perfectly allows for this, by waking up the "controller"
thread which does exactly nothing as there's no work left.
The grouped thread approach can end up with all threads blocked to
indicate no more work, but I wonder how the lightweight threads will
end up being scheduled; when a threadpool thread runs, you'll want
to try and run as much work as possible. This would mean that when
one thread blocks and another takes over, the latter will want to run
to completion (empty thread pool queue), starving the blocked thread.

Unless sched_yield is (ab?)used by these extra-threads to let the
scheduler consider the blocked thread again? With the risk that
the scheduler will schedule to the next extra-thread of the same group.

Basically, you're right about what I envisioned: threadpools are meant
to always have some extra work handy, so why not continue work when
one work item is blocked.

Stijn

2009-11-16 20:13:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)


* Stijn Devriendt <[email protected]> wrote:

> One extra catch, I didn't even think of in the original approach is
> that you still need a way of saying to the kernel: no more work here.
>
> My original approach fails bluntly and I will happily take credit for
> that ;) The perf-approach perfectly allows for this, by waking up the
> "controller" thread which does exactly nothing as there's no work
> left.

Note, the perf approach does not require a 'controller thread'.

The most efficient approach using perf-events would be:

- have the pool threads block in poll(perf_event_fd). (all threads
block in poll() on the same fd).

- blocking threads wake_up() the pool and cause them to drop out of
poll() (with no intermediary). [if there's less than
perf_event::min_concurrency tasks running.]

- waking threads observe the event state and only run if we are still
below perf_event::max_concurrency - otherwise they re-queue to the
poll() waitqueue.

Basically the perf-event fd creates the 'group of tasks'. This can be
created voluntarily by cooperating threads - or involuntarily as well
via PID attach or CPU attach.

There's no 'tracing' overhead or notification overhead: we maintain a
shared state and the 'notifications' are straight wakeups that bring the
pool members out of poll(), to drive the workload further.
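
A pool thread's main loop could then be as simple as the sketch below
(event_fd and run_one_work_item() are made-up names, and the sw-event
itself is hypothetical - the concurrency accounting all lives on the
kernel side):

        #include <poll.h>

        extern int event_fd;                    /* the shared sw-event fd */
        extern int run_one_work_item(void);     /* pops and runs one queued item */

        static void *pool_worker(void *unused)
        {
                struct pollfd pfd = { .fd = event_fd, .events = POLLIN };

                for (;;) {
                        /* the kernel only wakes us while the number of running
                         * tasks in the group is below max_concurrency */
                        if (poll(&pfd, 1, -1) <= 0)
                                continue;

                        /* may block in the kernel - and that blocking is what
                         * lets the next worker out of poll() */
                        run_one_work_item();
                }
        }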

Such a special sw-event, with min_concurrency==max_concurrency==1 would
implement Linus's interface - using standard facilities like poll().
(The only 'special' act is the set up of the group itself.)

So various concurrency controls could be implemented that way -
including the one Linus suggests - even an HPC workload-queueing daemon
could be done as well, which shepherds 100% uncooperative tasks.

I dont think this 'fancy' approach is actually a performance drag: it
would really do precisely the same thing Linus's facility does (unless
i'm missing something subtle - or something less subtle about Linus's
scheme), with the two parameters set to '1'.

( It would also enable a lot of other things, and it would not tie the
queueing implementation into the scheduler. )

Only trying would tell us for sure though - maybe i'm wrong.

Thanks,

Ingo

2009-11-16 20:22:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)


* Stijn Devriendt <[email protected]> wrote:

> > And then we can use poll() in the thread manager task to observe
> > PIDs, workloads or full CPUs. The poll() implementation of perf
> > events is fast and scalable.
>
> I've had a quick peek at the perf code and how it currently hooks into
> the scheduler and at first glance it looks like 2 additional context
> switches are required when using perf. The scheduler will first
> schedule the idle thread to later find out that the schedule tail woke
> up another process to run. My initial solution woke up the process
> before making a scheduling decision. Depending on context switch times
> the original blocking operation may have been unblocked (especially on
> SMP); e.g. a blocked user-space mutex which was held shortly. Feel
> free to correct me here as it was merely a quick peek.

( Btw., the PERF_TYPE_PARALLELISM name sucks. A better name would be
PERF_COUNT_SW_TASKS or PERF_COUNT_SW_THREAD_POOL or so. )

I'd definitely not advocate a 'controller thread' approach: it's an
unnecessary extra intermediary and it doubles the context switch cost
and tears cache footprint apart.

We want any such scheme to schedule 'naturally' and optimally: i.e. a
blocking thread will schedule an available thread - no ifs and when.

The only limit we want is on concurrency - and we can do that by waking
tasks from the poll() waitqueue if a task blocks - and by requeueing
woken tasks to the poll() waitqueue if a task wakes (and if the
concurrency threshold does not allow it to run)..

In a sense the poll() waitqueue becomes a mini-runqueue for 'ready'
tasks - and the 'number of tasks running' value of the sw event object a
rq->nr_running value. It does not make the tasks available to the real
scheduler - but it's a list of tasks that are willing to run.

This would be a perfect and suitable use of poll() concepts i think -
and well-optimized one as well. It could even be plugged into epoll().

Ingo

2009-11-18 09:30:28

by Stijn Devriendt

[permalink] [raw]
Subject: Re: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)

> This would be a perfect and suitable use of poll() concepts i think -
> and well-optimized one as well. It could even be plugged into epoll().
>
>        Ingo

Sorry for not replying earlier, but I was knocked out on the couch
(and still am). I'll have a more thorough look later on when my head
stops spinning 'round and 'round.

I must say that even without applying this perf_event to userspace
scheduling, it's probably a nice event to have anyway in order
to measure concurrency in existing applications. That it would
be suitable to actually control parallelism is probably just a nice
extra. But let's look at things again when the clouds have left
the brain ;)

Stijn

2009-11-21 11:26:43

by Stijn Devriendt

[permalink] [raw]
Subject: Re: [RFC] observe and act upon workload parallelism: PERF_TYPE_PARALLELISM (Was: [RFC][PATCH] sched_wait_block: wait for blocked threads)

> Think of it like a classic user-level threading package, where one process
> implements multiple threads entirely in user space, and switches between
> them. Except we'd do the exact reverse: create multiple threads in the
> kernel, but only run _one_ of them at a time. So as far as the scheduler
> is concerned, it acts as just a single thread - except it's a single
> thread that has multiple instances associated with it.
>
> And every time the "currently active" thread in that group runs out of CPU
> time - or any time it sleeps - we'd just go on to the next thread in the
> group.

Without trying to sound selfish: after some thinking I can't see how this
solves my problem. This is fine for the case you mentioned later on,
like UI threads, but it's not powerful enough for what I'm trying to achieve.

Let's make the round-trip for the thread pool case and start with an empty
thread pool queue:
- All threads are blocked on the queue condition variable until new work
is queued.
- Thread 1 happily wakes up and runs the work item until it's blocked.
- A new work item arrives and Thread 2 is woken to handle the new work
item.
- As long as new work arrives and Thread 2 is not blocked (regardless
of preemption, because the deal was that they will not preempt each
other) Thread 2 keeps running this work.
Even when Thread 1 is woken, it will not preempt Thread 2.

One solution would be to let Thread 2 call sched_yield, but the
question then is "when" and "how much". Every time a lightweight
thread yields, you'll incur context switches. Because you don't
know when or how much, you'll be penalized for context switching
even when not needed. (Consider 1 blocked thread and 4 extra threads
sched_yield'ing every 5 work items)
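
In code that heuristic boils down to something like this (run_work_item()
and the yield interval are made up, just to show where the cost sits):

        #include <sched.h>

        extern int run_work_item(void); /* runs one item, returns 0 when the queue is empty */

        static void extra_worker(void)
        {
                int since_yield = 0;

                while (run_work_item()) {
                        /* arbitrary interval: every yield is a potential
                         * context switch, whether one was needed or not */
                        if (++since_yield == 5) {
                                since_yield = 0;
                                sched_yield();
                        }
                }
        }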

Another option is to have a group-leader. Non-leader threads will call
sched_yield once in a while in order to try and jump back to the group-leader.
The group-leader will always continue work without sched_yield'ing.
There's no preemption between these threads.
The down-side is that in case multiple of these threads are waiting for
an event, wake-ups must wake the group leader rather than the other
coop-scheduled threads for this model to work.
Another down-side is that when a non-leader thread is blocked and the
group-leader is run, the non-leader thread is treated unfairly.

Either solution's end result is a very unfair threadpool, where one cannot
guarantee even a loose FIFO model in which items are handled more or
less in the order they are queued, and a library that needs to make
performance trade-offs to get this behaviour back.

The solution is great when the threads are blocked most of the time
and have little CPU processing to do (like UI threads), but doesn't
scale beyond that.

As ever, enlighten me when you have a great solution to this problem.

Stijn