Subject: sched: Introduce TASK_NOLOAD and TASK_IDLE
From: Peter Zijlstra <[email protected]>
Date: Fri May 8 14:23:45 CEST 2015
Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
all idle kthreads contribute to the loadavg is somewhat silly.
Now mostly this works OK, because kthreads have all their signals
masked. However there's a few sites where this is causing problems and
TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.
This patch adds TASK_NOLOAD which, when combined with
TASK_UNINTERRUPTIBLE avoids the loadavg accounting.
As most of imagined usage sites are loops where a thread wants to
idle, waiting for work, a helper TASK_IDLE is introduced.
Cc: Julian Anastasov <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/sched.h | 10 +++++++---
include/trace/events/sched.h | 3 ++-
2 files changed, 9 insertions(+), 4 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -213,9 +213,10 @@ print_cfs_rq(struct seq_file *m, int cpu
#define TASK_WAKEKILL 128
#define TASK_WAKING 256
#define TASK_PARKED 512
-#define TASK_STATE_MAX 1024
+#define TASK_NOLOAD 1024
+#define TASK_STATE_MAX 2048
-#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWP"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPN"
extern char ___assert_task_state[1 - 2*!!(
sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
@@ -225,6 +226,8 @@ extern char ___assert_task_state[1 - 2*!
#define TASK_STOPPED (TASK_WAKEKILL | __TASK_STOPPED)
#define TASK_TRACED (TASK_WAKEKILL | __TASK_TRACED)
+#define TASK_IDLE (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)
+
/* Convenience macros for the sake of wake_up */
#define TASK_NORMAL (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
#define TASK_ALL (TASK_NORMAL | __TASK_STOPPED | __TASK_TRACED)
@@ -240,7 +243,8 @@ extern char ___assert_task_state[1 - 2*!
((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
#define task_contributes_to_load(task) \
((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
- (task->flags & PF_FROZEN) == 0)
+ (task->flags & PF_FROZEN) == 0 && \
+ (task->state & TASK_NOLOAD) == 0)
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -147,7 +147,8 @@ TRACE_EVENT(sched_switch,
__print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
{ 16, "Z" }, { 32, "X" }, { 64, "x" },
- { 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
+ { 128, "K" }, { 256, "W" }, { 512, "P" },
+ { 1024, "N" }) : "R",
__entry->prev_state & TASK_STATE_MAX ? "+" : "",
__entry->next_comm, __entry->next_pid, __entry->next_prio)
);
Hello,
On Fri, 8 May 2015, Peter Zijlstra wrote:
> Subject: sched: Introduce TASK_NOLOAD and TASK_IDLE
> From: Peter Zijlstra <[email protected]>
> Date: Fri May 8 14:23:45 CEST 2015
>
> Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
> 'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
> all idle kthreads contribute to the loadavg is somewhat silly.
>
> Now mostly this works OK, because kthreads have all their signals
> masked. However there's a few sites where this is causing problems and
> TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.
>
> This patch adds TASK_NOLOAD which, when combined with
> TASK_UNINTERRUPTIBLE avoids the loadavg accounting.
>
> As most of imagined usage sites are loops where a thread wants to
> idle, waiting for work, a helper TASK_IDLE is introduced.
After checking our code in net/netfilter/ipvs/ip_vs_sync.c,
sync_thread_master(), we may also need some wrappers:
- schedule_timeout_idle (instead of schedule_timeout call):
__set_current_state(TASK_IDLE);
return schedule_timeout(timeout);
- we here are really idle, so "N" looks ok
- pair of __wait_event_idle(wq, condition) and
wait_event_idle(wq, condition) macros
- we here are write-blocked for socket, not sure
if this blocked vs idle difference is useful to
represent, in new bit for this blocked state "B"=2048,
with 2 TASK_NOLOAD variants: N(idle) and B(blocked,
2|1024|2048, eg. for read-blocked or write-blocked).
It will need additional argument 'state'/'blocked' for
*wait_event_idle().
Regards
--
Julian Anastasov <[email protected]>
On Sat, 9 May 2015 11:49:01 +0300 (EEST) Julian Anastasov <[email protected]> wrote:
>
> Hello,
>
> On Fri, 8 May 2015, Peter Zijlstra wrote:
>
> > Subject: sched: Introduce TASK_NOLOAD and TASK_IDLE
> > From: Peter Zijlstra <[email protected]>
> > Date: Fri May 8 14:23:45 CEST 2015
> >
> > Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
> > 'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
> > all idle kthreads contribute to the loadavg is somewhat silly.
> >
> > Now mostly this works OK, because kthreads have all their signals
> > masked. However there's a few sites where this is causing problems and
> > TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.
> >
> > This patch adds TASK_NOLOAD which, when combined with
> > TASK_UNINTERRUPTIBLE avoids the loadavg accounting.
> >
> > As most of imagined usage sites are loops where a thread wants to
> > idle, waiting for work, a helper TASK_IDLE is introduced.
>
> After checking our code in net/netfilter/ipvs/ip_vs_sync.c,
> sync_thread_master(), we may also need some wrappers:
>
> - schedule_timeout_idle (instead of schedule_timeout call):
> __set_current_state(TASK_IDLE);
> return schedule_timeout(timeout);
>
> - we here are really idle, so "N" looks ok
yes, I would want
wait_event_idle_timeout() and wait_event_idle()
but I'll be happy to take whatever I can get :-)
Thanks,
NeilBrown
>
> - pair of __wait_event_idle(wq, condition) and
> wait_event_idle(wq, condition) macros
>
> - we here are write-blocked for socket, not sure
> if this blocked vs idle difference is useful to
> represent, in new bit for this blocked state "B"=2048,
> with 2 TASK_NOLOAD variants: N(idle) and B(blocked,
> 2|1024|2048, eg. for read-blocked or write-blocked).
> It will need additional argument 'state'/'blocked' for
> *wait_event_idle().
>
> Regards
>
> --
> Julian Anastasov <[email protected]>
On Sat, May 09, 2015 at 11:49:01AM +0300, Julian Anastasov wrote:
> After checking our code in net/netfilter/ipvs/ip_vs_sync.c,
> sync_thread_master(), we may also need some wrappers:
>
> - schedule_timeout_idle (instead of schedule_timeout call):
> __set_current_state(TASK_IDLE);
> return schedule_timeout(timeout);
>
> - we here are really idle, so "N" looks ok
So I don't get the point of the schedule_timeout_*() stubs. What are
they for? Why would one use an unconditional schedule_timeout() call?
Isn't that what msleep() is for?
> - pair of __wait_event_idle(wq, condition) and
> wait_event_idle(wq, condition) macros
>
> - we here are write-blocked for socket, not sure
> if this blocked vs idle difference is useful to
> represent, in new bit for this blocked state "B"=2048,
> with 2 TASK_NOLOAD variants: N(idle) and B(blocked,
> 2|1024|2048, eg. for read-blocked or write-blocked).
> It will need additional argument 'state'/'blocked' for
> *wait_event_idle().
Completely untested.
---
include/linux/wait.h | 41 +++++++++++++++++++++++++++++++++++++++--
1 file changed, 39 insertions(+), 2 deletions(-)
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2db83349865b..b9ef6bacf633 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -307,6 +307,26 @@ do { \
__ret; \
})
+#define __wait_event_idle(wq, condition) \
+ (void)___wait_event(wq, condition, TASK_IDLE, 0, 0, \
+ schedule())
+
+
+/**
+ * wait_event_idle - sleep until a condition get true; do not contribute to the loadavg
+ * @wq: the waitqueue to wait on
+ * @condition: a C expression for the event to wait for
+ *
+ * See wait_event()
+ */
+#define wait_event_idle(wq, condition) \
+do { \
+ might_sleep(); \
+ if (condition) \
+ break; \
+ __wait_event_idle(wq, condition); \
+} while (0)
+
#define __wait_event_timeout(wq, condition, timeout) \
___wait_event(wq, ___wait_cond_timeout(condition), \
TASK_UNINTERRUPTIBLE, 0, timeout, \
@@ -346,8 +366,8 @@ do { \
__ret = schedule_timeout(__ret); try_to_freeze())
/*
- * like wait_event_timeout() -- except it uses TASK_INTERRUPTIBLE to avoid
- * increasing load and is freezable.
+ * like wait_event_timeout() -- except it uses TASK_INTERRUPTIBLE
+ * and is freezable.
*/
#define wait_event_freezable_timeout(wq, condition, timeout) \
({ \
@@ -358,6 +378,23 @@ do { \
__ret; \
})
+#define __wait_event_idle_timeout(wq, condition, timeout) \
+ ___wait_event(wq, ___wait_cond_timeout(condition), \
+ TASK_IDLE, 0, timeout, \
+ __ret = schedule_timeout(__ret))
+
+/*
+ * like wait_event_timeout() -- except it uses TASK_IDLE to avoid loadavg
+ */
+#define wait_event_idle_timeout(wq, condition, timeout) \
+({ \
+ long __ret = timeout; \
+ might_sleep(); \
+ if (!___wait_cond_timeout(condition)) \
+ ret = __wait_event_idle_timeout(wq, condition, timeout);\
+ __ret; \
+})
+
#define __wait_event_cmd(wq, condition, cmd1, cmd2) \
(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0, \
cmd1; schedule(); cmd2)
Hello,
On Mon, 11 May 2015, Peter Zijlstra wrote:
> > - schedule_timeout_idle (instead of schedule_timeout call):
> > __set_current_state(TASK_IDLE);
> > return schedule_timeout(timeout);
> >
> > - we here are really idle, so "N" looks ok
>
> So I don't get the point of the schedule_timeout_*() stubs. What are
> they for? Why would one use an unconditional schedule_timeout() call?
> Isn't that what msleep() is for?
msleep will not return until timeout has expired.
Instead, we want to notice the kthread_should_stop() event
immediately. Additionally, TASK_UNINTERRUPTIBLE will increase
the load average. We can do it with extra wait queue
and the new __wait_event_idle_timeout but I guess
schedule_timeout_idle will be a good replacement for
schedule_timeout_interruptible calls when used for kthreads.
> + * like wait_event_timeout() -- except it uses TASK_IDLE to avoid loadavg
> + */
> +#define wait_event_idle_timeout(wq, condition, timeout) \
> +({ \
> + long __ret = timeout; \
> + might_sleep(); \
> + if (!___wait_cond_timeout(condition)) \
> + ret = __wait_event_idle_timeout(wq, condition, timeout);\
ret may need underscores here...
Regards
--
Julian Anastasov <[email protected]>
On Mon, May 11, 2015 at 10:34:11PM +0300, Julian Anastasov wrote:
> On Mon, 11 May 2015, Peter Zijlstra wrote:
>
> > > - schedule_timeout_idle (instead of schedule_timeout call):
> > > __set_current_state(TASK_IDLE);
> > > return schedule_timeout(timeout);
> > >
> > > - we here are really idle, so "N" looks ok
> >
> > So I don't get the point of the schedule_timeout_*() stubs. What are
> > they for? Why would one use an unconditional schedule_timeout() call?
> > Isn't that what msleep() is for?
>
> msleep will not return until timeout has expired.
> Instead, we want to notice the kthread_should_stop() event
> immediately. Additionally, TASK_UNINTERRUPTIBLE will increase
> the load average. We can do it with extra wait queue
> and the new __wait_event_idle_timeout but I guess
> schedule_timeout_idle will be a good replacement for
> schedule_timeout_interruptible calls when used for kthreads.
Fair enough I suppose, but then calling it schedule*() is just plain
wrong, it does not behave/act like a normal schedule() call.
Lemme go look at how widely abused that is.
*sigh*, its all over the place :/
$ git grep "schedule_timeout_\(interruptible\|killable\|uninterruptible\)" | wc -l
392
That said; I still don't see the point of schedule_timeout_idle(), we
should not sleep poll for state like that. We should only use TASK_IDLE
when we are in fact _IDLE_ and do not have work to do, at which point
one should use an wait_event() like construct to wait for new work.
> > + * like wait_event_timeout() -- except it uses TASK_IDLE to avoid loadavg
> > + */
> > +#define wait_event_idle_timeout(wq, condition, timeout) \
> > +({ \
> > + long __ret = timeout; \
> > + might_sleep(); \
> > + if (!___wait_cond_timeout(condition)) \
> > + ret = __wait_event_idle_timeout(wq, condition, timeout);\
>
> ret may need underscores here...
I'm fairly sure that might aid in compilation indeed :-)
Hello,
On Tue, 12 May 2015, Peter Zijlstra wrote:
> > msleep will not return until timeout has expired.
> > Instead, we want to notice the kthread_should_stop() event
> > immediately. Additionally, TASK_UNINTERRUPTIBLE will increase
> > the load average. We can do it with extra wait queue
> > and the new __wait_event_idle_timeout but I guess
> > schedule_timeout_idle will be a good replacement for
> > schedule_timeout_interruptible calls when used for kthreads.
>
> Fair enough I suppose, but then calling it schedule*() is just plain
> wrong, it does not behave/act like a normal schedule() call.
>
> Lemme go look at how widely abused that is.
>
> *sigh*, its all over the place :/
>
> $ git grep "schedule_timeout_\(interruptible\|killable\|uninterruptible\)" | wc -l
> 392
>
> That said; I still don't see the point of schedule_timeout_idle(), we
> should not sleep poll for state like that. We should only use TASK_IDLE
> when we are in fact _IDLE_ and do not have work to do, at which point
> one should use an wait_event() like construct to wait for new work.
Probably. But some kthreads may want to sleep,
like in the IPVS case where there is a more complex
mechanism to wake up the kthread which is a socket writer
and does not poll the socket all time.
But I see that kthreads always need to check with
kthread_should_stop(), so if we add schedule_timeout_idle()
it should not be so simple, may be something like that
is race free on kthread_stop() event, if needed at all:
/* state: TASK_IDLE (idle) or TASK_UNINTERRUPTIBLE (busy) */
kthread_schedule_timeout(timeout, state)
{
/* note: no underscores => set_mb */
set_current_state(state);
/* test_bit after memory barrier */
if (kthread_should_stop())
return timeout;
return schedule_timeout(timeout);
}
Regards
--
Julian Anastasov <[email protected]>