2010-11-11 15:26:56

by Mike Galbraith

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Greetings from sunny Arizona!

On Tue, 2010-10-26 at 08:47 -0700, Linus Torvalds wrote:

> So I have a suggestion that may not be popular with you, because it
> does end up changing the approach of your patch a lot.
>
> And I have to say, I like how your last patch looked. It was
> surprisingly small, simple, and clean. So I hate saying "I think it
> should perhaps do things a bit differently". That said, I would
> suggest:
>
> - don't depend on "tsk->signal->tty" at all.
>
> - INSTEAD, introduce a "tsk->signal->sched_group" pointer that points
> to whatever the current auto-task_group is. Remember, long-term, we'd
> want to maybe have other heuristics than just the tty groups, so we'd
> want this separate from the tty logic _anyway_
>
> - at fork time, just copy the task_group pointer in copy_signal() if
> it is non-NULL, and increment the refcount (I don't think struct
> task_group is refcounted now, but this would require it).
>
> - at free_signal_struct(), just do a
> "put_task_group(sig->task_group);" before freeing it.
>
> - make the scheduler use the "tsk->signal->sched_group" as the
> default group if nothing else exists.
>
> Now, all the basic logic is _entirely_ unaware of any tty logic, and
> it's generic. And none of it has any races with some odd tty release
> logic or anything like that.
>
> Now, after this, the only thing you'd need to do is hook into
> __proc_set_tty(), which already holds the sighand lock, and _there_
> you would attach the task_group to the process. Notice how it would
> never be attached to a tty at all, so tty_release etc would never be
> involved in any taskgroup thing - it's not really the tty that owns
> the taskgroup, it's simply the act of becoming a tty task group leader
> that attaches the task to a new scheduling group.
>
> It also means, for example, that if a process loses its tty (and
> doesn't get a new one - think hangup), it still remains in whatever
> scheduling group it started out with. The tty really is immaterial.
>
> And the nice thing about this is that it should be trivial to make
> other things than tty's trigger this same thing, if we find a pattern
> (or create some new interface to let people ask for it) for something
> that should create a new group (like perhaps spawning a graphical
> application from the window manager rather than from a tty).
>
> Comments?

I _finally_ got back to this yesterday, and implemented your suggestion,
though with a couple minor variations. Putting the autogroup pointer in
the signal struct didn't look right to me, so I plugged it into the task
struct instead. I also didn't refcount taskgroups, wanted the patchlet
to be as self-contained as possible, so refcounted the autogroup struct
instead. I also left group movement on tty disassociation in place, but
may nuke it.

The below has withstood an all night thrashing in my laptop with a
PREEMPT_RT kernel, and looks kinda presentable to me, so...

A recurring complaint from CFS users is that parallel kbuild has a negative
impact on desktop interactivity. This patch implements an idea from Linus,
to automatically create task groups. This patch only implements Linus' per
tty task group suggestion, and only for fair class tasks, but leaves the way
open for enhancement.

Implementation: each task struct contains an inherited pointer to a refcounted
autogroup struct containing a task group pointer, the default for all tasks
pointing to the init_task_group. When a task calls __proc_set_tty(), the
task's reference to the default group is dropped, a new task group is created,
and the task is moved out of the old group and into the new. Children thereafter
inherit this task group, and increase its refcount. Calls to __tty_hangup()
and proc_clear_tty() move the caller back to the init_task_group, and possibly
destroy the task group. On exit, the reference to the current task group is dropped,
and the task group is potentially destroyed. At runqueue selection time, iff
a task has no cgroup assignment, its current autogroup is used.
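
For illustration, here is the reference flow above condensed into a runnable
userspace model (a sketch only: the plain counter stands in for the kref, and
the helpers merely mirror the patch's names):

#include <stdio.h>
#include <stdlib.h>

struct autogroup { int refs; };			/* stands in for the kref */

static struct autogroup autogroup_default = { .refs = 1 };

static struct autogroup *ag_get(struct autogroup *ag)
{
	ag->refs++;
	return ag;
}

static void ag_put(struct autogroup *ag)
{
	if (--ag->refs)
		return;
	printf("task group destroyed\n");	/* sched_destroy_group() */
	free(ag);
}

struct task { struct autogroup *ag; };

/* fork: child inherits the parent's group, bumping the count */
static void fork_task(struct task *child, const struct task *parent)
{
	child->ag = ag_get(parent->ag);
}

/* __proc_set_tty(): drop the old group, attach a freshly created one */
static void set_tty(struct task *p)
{
	struct autogroup *ag = malloc(sizeof(*ag));

	if (!ag)		/* the patch falls back to the default group */
		abort();
	ag->refs = 1;		/* autogroup_create() */
	ag_put(p->ag);
	p->ag = ag;
}

int main(void)
{
	struct task shell = { ag_get(&autogroup_default) }, job;

	set_tty(&shell);		/* becomes tty task group leader */
	fork_task(&job, &shell);	/* child lands in the same group */
	ag_put(shell.ag);		/* shell exits; group survives... */
	ag_put(job.ag);			/* ...until its last member exits */
	return 0;
}

The fork and exit corners of exactly this flow are what the review below
pokes at.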

The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is
selected, but can be disabled via the boot option noautogroup, and can
also be turned on/off on the fly via..
echo [01] > /proc/sys/kernel/sched_autogroup_enabled.
..which will automatically move tasks to/from the root task group.

Some numbers.

A 100% hog overhead measurement proggy pinned to the same CPU as a make -j10

About measurement proggy:
pert/sec = perturbations/sec
min/max/avg = scheduler service latencies in usecs
sum/s = time accrued by the competition per sample period (1 sec here)
overhead = %CPU received by the competition per sample period

pert/s: 31 >40475.37us: 3 min: 0.37 max:48103.60 avg:29573.74 sum/s:916786us overhead:90.24%
pert/s: 23 >41237.70us: 12 min: 0.36 max:56010.39 avg:40187.01 sum/s:924301us overhead:91.99%
pert/s: 24 >42150.22us: 12 min: 8.86 max:61265.91 avg:39459.91 sum/s:947038us overhead:92.20%
pert/s: 26 >42344.91us: 11 min: 3.83 max:52029.60 avg:36164.70 sum/s:940282us overhead:91.12%
pert/s: 24 >44262.90us: 14 min: 5.05 max:82735.15 avg:40314.33 sum/s:967544us overhead:92.22%

Same load with this patch applied.

pert/s: 229 >5484.43us: 41 min: 0.15 max:12069.42 avg:2193.81 sum/s:502382us overhead:50.24%
pert/s: 222 >5652.28us: 43 min: 0.46 max:12077.31 avg:2248.56 sum/s:499181us overhead:49.92%
pert/s: 211 >5809.38us: 43 min: 0.16 max:12064.78 avg:2381.70 sum/s:502538us overhead:50.25%
pert/s: 223 >6147.92us: 43 min: 0.15 max:16107.46 avg:2282.17 sum/s:508925us overhead:50.49%
pert/s: 218 >6252.64us: 43 min: 0.16 max:12066.13 avg:2324.11 sum/s:506656us overhead:50.27%

Average service latency is an order of magnitude better with autogroup.
(Imagine that pert were Xorg or whatnot instead)

Using Mathieu Desnoyers' wakeup-latency testcase:

With taskset -c 3 make -j 10 running..

taskset -c 3 ./wakeup-latency& sleep 30;killall wakeup-latency

without:
maximum latency: 42963.2 µs
average latency: 9077.0 µs
missed timer events: 0

with:
maximum latency: 4160.7 µs
average latency: 149.4 µs
missed timer events: 0

Signed-off-by: Mike Galbraith <[email protected]>
---
Documentation/kernel-parameters.txt | 2
drivers/char/tty_io.c | 4
include/linux/sched.h | 20 ++++
init/Kconfig | 12 ++
kernel/exit.c | 1
kernel/sched.c | 28 ++++--
kernel/sched_autogroup.c | 161 ++++++++++++++++++++++++++++++++++++
kernel/sched_autogroup.h | 10 ++
kernel/sysctl.c | 11 ++
9 files changed, 241 insertions(+), 8 deletions(-)

Index: linux-2.6.36.git/include/linux/sched.h
===================================================================
--- linux-2.6.36.git.orig/include/linux/sched.h
+++ linux-2.6.36.git/include/linux/sched.h
@@ -1159,6 +1159,7 @@ struct sched_rt_entity {
};

struct rcu_node;
+struct autogroup;

struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
@@ -1181,6 +1182,10 @@ struct task_struct {
struct sched_entity se;
struct sched_rt_entity rt;

+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* list of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
@@ -1900,6 +1905,21 @@ int sched_rt_handler(struct ctl_table *t

extern unsigned int sysctl_sched_compat_yield;

+#ifdef CONFIG_SCHED_AUTOGROUP
+extern unsigned int sysctl_sched_autogroup_enabled;
+
+int sched_autogroup_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
+
+extern void sched_autogroup_create_attach(struct task_struct *p);
+extern void sched_autogroup_detatch(struct task_struct *p);
+extern void sched_autogroup_exit(struct task_struct *p);
+#else
+static inline void sched_autogroup_create_attach(struct task_struct *p) { }
+static inline void sched_autogroup_detatch(struct task_struct *p) { }
+static inline void sched_autogroup_exit(struct task_struct *p) { }
+#endif
+
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
extern void rt_mutex_setprio(struct task_struct *p, int prio);
Index: linux-2.6.36.git/kernel/sched.c
===================================================================
--- linux-2.6.36.git.orig/kernel/sched.c
+++ linux-2.6.36.git/kernel/sched.c
@@ -78,6 +78,7 @@

#include "sched_cpupri.h"
#include "workqueue_sched.h"
+#include "sched_autogroup.h"

#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
@@ -612,11 +613,16 @@ static inline int cpu_of(struct rq *rq)
*/
static inline struct task_group *task_group(struct task_struct *p)
{
+ struct task_group *tg;
struct cgroup_subsys_state *css;

css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
- return container_of(css, struct task_group, css);
+ tg = container_of(css, struct task_group, css);
+
+ autogroup_task_group(p, &tg);
+
+ return tg;
}

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -1920,6 +1926,7 @@ static void deactivate_task(struct rq *r
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
+#include "sched_autogroup.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
#endif
@@ -2569,6 +2576,7 @@ void sched_fork(struct task_struct *p, i
* Silence PROVE_RCU.
*/
rcu_read_lock();
+ autogroup_fork(p);
set_task_cpu(p, cpu);
rcu_read_unlock();

@@ -7749,7 +7757,7 @@ void __init sched_init(void)
#ifdef CONFIG_CGROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
-
+ autogroup_init(&init_task);
#endif /* CONFIG_CGROUP_SCHED */

#if defined CONFIG_FAIR_GROUP_SCHED && defined CONFIG_SMP
@@ -8279,15 +8287,11 @@ void sched_destroy_group(struct task_gro
/* change task's runqueue when it moves between groups.
* The caller of this function should have put the task in its new group
* by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
- * reflect its new group.
+ * reflect its new group. Called with the runqueue lock held.
*/
-void sched_move_task(struct task_struct *tsk)
+void __sched_move_task(struct task_struct *tsk, struct rq *rq)
{
int on_rq, running;
- unsigned long flags;
- struct rq *rq;
-
- rq = task_rq_lock(tsk, &flags);

running = task_current(rq, tsk);
on_rq = tsk->se.on_rq;
@@ -8308,7 +8312,15 @@ void sched_move_task(struct task_struct
tsk->sched_class->set_curr_task(rq);
if (on_rq)
enqueue_task(rq, tsk, 0);
+}
+
+void sched_move_task(struct task_struct *tsk)
+{
+ struct rq *rq;
+ unsigned long flags;

+ rq = task_rq_lock(tsk, &flags);
+ __sched_move_task(tsk, rq);
task_rq_unlock(rq, &flags);
}
#endif /* CONFIG_CGROUP_SCHED */
Index: linux-2.6.36.git/drivers/char/tty_io.c
===================================================================
--- linux-2.6.36.git.orig/drivers/char/tty_io.c
+++ linux-2.6.36.git/drivers/char/tty_io.c
@@ -580,6 +580,7 @@ void __tty_hangup(struct tty_struct *tty
spin_lock_irq(&p->sighand->siglock);
if (p->signal->tty == tty) {
p->signal->tty = NULL;
+ sched_autogroup_detatch(p);
/* We defer the dereferences outside fo
the tasklist lock */
refs++;
@@ -3070,6 +3071,7 @@ void proc_clear_tty(struct task_struct *
spin_lock_irqsave(&p->sighand->siglock, flags);
tty = p->signal->tty;
p->signal->tty = NULL;
+ sched_autogroup_detatch(p);
spin_unlock_irqrestore(&p->sighand->siglock, flags);
tty_kref_put(tty);
}
@@ -3089,12 +3091,14 @@ static void __proc_set_tty(struct task_s
tty->session = get_pid(task_session(tsk));
if (tsk->signal->tty) {
printk(KERN_DEBUG "tty not NULL!!\n");
+ sched_autogroup_detatch(tsk);
tty_kref_put(tsk->signal->tty);
}
}
put_pid(tsk->signal->tty_old_pgrp);
tsk->signal->tty = tty_kref_get(tty);
tsk->signal->tty_old_pgrp = NULL;
+ sched_autogroup_create_attach(tsk);
}

static void proc_set_tty(struct task_struct *tsk, struct tty_struct *tty)
Index: linux-2.6.36.git/kernel/exit.c
===================================================================
--- linux-2.6.36.git.orig/kernel/exit.c
+++ linux-2.6.36.git/kernel/exit.c
@@ -174,6 +174,7 @@ repeat:
write_lock_irq(&tasklist_lock);
tracehook_finish_release_task(p);
__exit_signal(p);
+ sched_autogroup_exit(p);

/*
* If we are the last non-leader member of the thread
Index: linux-2.6.36.git/kernel/sched_autogroup.h
===================================================================
--- /dev/null
+++ linux-2.6.36.git/kernel/sched_autogroup.h
@@ -0,0 +1,10 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+static inline void
+autogroup_task_group(struct task_struct *p, struct task_group **tg);
+static void __sched_move_task(struct task_struct *tsk, struct rq *rq);
+#else /* !CONFIG_SCHED_AUTOGROUP */
+static inline void autogroup_init(struct task_struct *init_task) { }
+static inline void autogroup_fork(struct task_struct *p) { }
+static inline void
+autogroup_task_group(struct task_struct *p, struct task_group **tg) { }
+#endif /* CONFIG_SCHED_AUTOGROUP */
Index: linux-2.6.36.git/kernel/sched_autogroup.c
===================================================================
--- /dev/null
+++ linux-2.6.36.git/kernel/sched_autogroup.c
@@ -0,0 +1,161 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+unsigned int __read_mostly sysctl_sched_autogroup_enabled = 1;
+
+struct autogroup {
+ struct kref kref;
+ struct task_group *tg;
+};
+
+static struct autogroup autogroup_default;
+
+static void autogroup_init(struct task_struct *init_task)
+{
+ autogroup_default.tg = &init_task_group;
+ kref_init(&autogroup_default.kref);
+ init_task->autogroup = &autogroup_default;
+}
+
+static inline void autogroup_destroy(struct kref *kref)
+{
+ struct autogroup *ag = container_of(kref, struct autogroup, kref);
+
+ sched_destroy_group(ag->tg);
+ kfree(ag);
+}
+
+static inline void autogroup_kref_put(struct autogroup *ag)
+{
+ kref_put(&ag->kref, autogroup_destroy);
+}
+
+static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
+{
+ kref_get(&ag->kref);
+ return ag;
+}
+
+static inline struct autogroup *autogroup_create(void)
+{
+ struct autogroup *ag = kmalloc(sizeof(*ag), GFP_KERNEL);
+
+ if (!ag)
+ goto out_fail;
+
+ ag->tg = sched_create_group(&init_task_group);
+ kref_init(&ag->kref);
+
+ if (!(IS_ERR(ag->tg)))
+ return ag;
+
+out_fail:
+ if (ag) {
+ kfree(ag);
+ WARN_ON(1);
+ } else
+ WARN_ON(1);
+
+ return autogroup_kref_get(&autogroup_default);
+}
+
+static void autogroup_fork(struct task_struct *p)
+{
+ p->autogroup = autogroup_kref_get(current->autogroup);
+}
+
+static inline void
+autogroup_task_group(struct task_struct *p, struct task_group **tg)
+{
+ int enabled = sysctl_sched_autogroup_enabled;
+
+ enabled &= (*tg == &root_task_group);
+ enabled &= (p->sched_class == &fair_sched_class);
+ enabled &= (!(p->flags & PF_EXITING));
+
+ if (enabled)
+ *tg = p->autogroup->tg;
+}
+
+static void
+autogroup_move_task(struct task_struct *p, struct autogroup *ag)
+{
+ struct autogroup *prev;
+ struct rq *rq;
+ unsigned long flags;
+
+ rq = task_rq_lock(p, &flags);
+ prev = p->autogroup;
+ if (prev == ag) {
+ task_rq_unlock(rq, &flags);
+ return;
+ }
+
+ p->autogroup = autogroup_kref_get(ag);
+ __sched_move_task(p, rq);
+ task_rq_unlock(rq, &flags);
+
+ autogroup_kref_put(prev);
+}
+
+void sched_autogroup_create_attach(struct task_struct *p)
+{
+ autogroup_move_task(p, autogroup_create());
+
+ /*
+ * Correct freshly allocated group's refcount.
+ * Move takes a reference on destination, but
+ * create already initialized refcount to 1.
+ */
+ if (p->autogroup != &autogroup_default)
+ autogroup_kref_put(p->autogroup);
+}
+EXPORT_SYMBOL(sched_autogroup_create_attach);
+
+void sched_autogroup_detatch(struct task_struct *p)
+{
+ autogroup_move_task(p, &autogroup_default);
+}
+EXPORT_SYMBOL(sched_autogroup_detatch);
+
+void sched_autogroup_exit(struct task_struct *p)
+{
+ autogroup_kref_put(p->autogroup);
+}
+
+int sched_autogroup_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct task_struct *p, *t;
+ int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+ if (ret || !write)
+ return ret;
+
+ /*
+ * Exclude cgroup, task group and task create/destroy
+ * during global classification.
+ */
+ cgroup_lock();
+ spin_lock(&task_group_lock);
+ read_lock(&tasklist_lock);
+
+ do_each_thread(p, t) {
+ sched_move_task(t);
+ } while_each_thread(p, t);
+
+ read_unlock(&tasklist_lock);
+ spin_unlock(&task_group_lock);
+ cgroup_unlock();
+
+ return 0;
+}
+
+static int __init setup_autogroup(char *str)
+{
+ sysctl_sched_autogroup_enabled = 0;
+
+ return 1;
+}
+
+__setup("noautogroup", setup_autogroup);
+#endif
Index: linux-2.6.36.git/kernel/sysctl.c
===================================================================
--- linux-2.6.36.git.orig/kernel/sysctl.c
+++ linux-2.6.36.git/kernel/sysctl.c
@@ -384,6 +384,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SCHED_AUTOGROUP
+ {
+ .procname = "sched_autogroup_enabled",
+ .data = &sysctl_sched_autogroup_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_autogroup_handler,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
Index: linux-2.6.36.git/init/Kconfig
===================================================================
--- linux-2.6.36.git.orig/init/Kconfig
+++ linux-2.6.36.git/init/Kconfig
@@ -652,6 +652,18 @@ config DEBUG_BLK_CGROUP

endif # CGROUPS

+config SCHED_AUTOGROUP
+ bool "Automatic process group scheduling"
+ select CGROUPS
+ select CGROUP_SCHED
+ select FAIR_GROUP_SCHED
+ help
+ This option optimizes the scheduler for common desktop workloads by
+ automatically creating and populating task groups. This separation
+ of workloads isolates aggressive CPU burners (like build jobs) from
+ desktop applications. Task group autogeneration is currently based
+ upon task tty association.
+
config MM_OWNER
bool

Index: linux-2.6.36.git/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.36.git.orig/Documentation/kernel-parameters.txt
+++ linux-2.6.36.git/Documentation/kernel-parameters.txt
@@ -1610,6 +1610,8 @@ and is between 256 and 4096 characters.
noapic [SMP,APIC] Tells the kernel to not make use of any
IOAPICs that may be present in the system.

+ noautogroup Disable scheduler automatic task group creation.
+
nobats [PPC] Do not use BATs for mapping kernel lowmem
on "Classic" PPC cores.



2010-11-11 18:05:17

by Ingo Molnar

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups


* Mike Galbraith <[email protected]> wrote:

> I _finally_ got back to this yesterday, and implemented your suggestion, though
> with a couple minor variations. Putting the autogroup pointer in the signal
> struct didn't look right to me, so I plugged it into the task struct instead. I
> also didn't refcount taskgroups, wanted the patchlet to be as self-contained as
> possible, so refcounted the autogroup struct instead. I also left group movement
> on tty disassociation in place, but may nuke it.
>
> The below has withstood an all night thrashing in my laptop with a PREEMPT_RT
> kernel, and looks kinda presentable to me, so...

The patch and the diffstat give me warm fuzzy feelings:

> ---
> Documentation/kernel-parameters.txt | 2
> drivers/char/tty_io.c | 4
> include/linux/sched.h | 20 ++++
> init/Kconfig | 12 ++
> kernel/exit.c | 1
> kernel/sched.c | 28 ++++--
> kernel/sched_autogroup.c | 161 ++++++++++++++++++++++++++++++++++++
> kernel/sched_autogroup.h | 10 ++
> kernel/sysctl.c | 11 ++
> 9 files changed, 241 insertions(+), 8 deletions(-)

Very well contained, minimally invasive to anything else!

( Noticed only one very small detail: sched_autogroup.h has an illness of lack of
newlines which makes it a bit hard to read - but this is cured easily. )

I'll test and apply this patch to the scheduler tree, so if anyone has objections
please holler now :-)

Linus, does this look OK to you too, can i add your Acked-by?

Thanks,

Ingo

2010-11-11 18:40:54

by Linus Torvalds

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Thu, Nov 11, 2010 at 7:26 AM, Mike Galbraith <[email protected]> wrote:
>
> I _finally_ got back to this yesterday, and implemented your suggestion,
> though with a couple minor variations. Putting the autogroup pointer in
> the signal struct didn't look right to me, so I plugged it into the task
> struct instead. I also didn't refcount taskgroups, wanted the patchlet
> to be as self-contained as possible, so refcounted the autogroup struct
> instead. I also left group movement on tty disassociation in place, but
> may nuke it.

Ok, the patch looks fine, but I do have a few comments:

- the reason I suggested the signal struct was really that I thought
it would avoid extra (unnecessary) cost in thread creation/teardown.

Maybe I should have made that clear, but this seems to
unnecessarily do the whole atomic_inc/dec for each thread. That seems
a bit sad.

That said, if not having to dereference ->signal simplifies the
scheduler interaction, I guess the extra atomic ref at thread
creation/deletion is fine. So I don't think this is wrong, it's just
something I wanted to bring up.

- You misspelled "detach". That just drives me wild. Please fix.

- What I _do_ think is wrong is how I think you're trying to be "too
precise". I think that's fundamentally wrong, because I think we
should make it very clear that it's a heuristic. So I dislike seeing
these functions: sched_autogroup_handler() - we shouldn't care about
old state, sched_autogroup_detach() - even with the fixed spelling I
don't really see why a tty hangup should cause the process to go back
to the default group, for example.

IOW, I think you tried a bit _too_ hard to make it a 1:1 relationship
with the tty. I don't think it needs to be. Just because a process
loses its tty because of a hangup, I don't think that that should have
any real implications for the auto-group scheduling. Or maybe it
should, but that decision should be based on "does it help scheduling
behavior" rather than on "it always matches the tty". See what I'm
saying?

That said, I do love how the patch looks. I think this is absolutely
the right thing to do. My issues are small details. I'd Ack it even in
this form (well, as long as spelling is fixed, that really does rub me
the wrong way), and the other things are more details that are about
how I'm thinking about it rather than "you need to do it this way".

Linus
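
To put a number of atomics on the first point: with the autogroup pointer in
task_struct every clone, thread or not, takes and drops a reference, while a
signal_struct pointer confines that traffic to process creation, because
CLONE_THREAD children share current->signal. A sketch of the two hook points
(simplified from the patches; the v4 patch later in the thread adopts the
second form):

/* v3 (this patch): pointer in task_struct -- one atomic op per *thread* */
static void autogroup_fork(struct task_struct *p)
{
	p->autogroup = autogroup_kref_get(current->autogroup);
}

/* suggested: pointer in signal_struct -- one atomic op per *process*,
 * since CLONE_THREAD children share current->signal and take no new
 * reference; this is the shape v4 adopts as sched_autogroup_fork() */
static void sched_autogroup_fork(struct signal_struct *sig)
{
	sig->autogroup = autogroup_kref_get(current->signal->autogroup);
}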

2010-11-11 19:08:46

by Mike Galbraith

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Thu, 2010-11-11 at 10:34 -0800, Linus Torvalds wrote:
> On Thu, Nov 11, 2010 at 7:26 AM, Mike Galbraith <[email protected]> wrote:
> >
> > I _finally_ got back to this yesterday, and implemented your suggestion,
> > though with a couple minor variations. Putting the autogroup pointer in
> > the signal struct didn't look right to me, so I plugged it into the task
> > struct instead. I also didn't refcount taskgroups, wanted the patchlet
> > to be as self-contained as possible, so refcounted the autogroup struct
> > instead. I also left group movement on tty disassociation in place, but
> > may nuke it.
>
> Ok, the patch looks fine, but I do have a few comments:
>
> - the reason I suggested the signal struct was really that I thought
> it would avoid extra (unnecessary) cost in thread creation/teardown.
>
> Maybe I should have made that clear, but this seems to
> unnecessarily do the whole atomic_inc/dec for each thread. That seems
> a bit sad.
>
> That said, if not having to dereference ->signal simplifies the
> scheduler interaction, I guess the extra atomic ref at thread
> creation/deletion is fine. So I don't think this is wrong, it's just
> something I wanted to bring up.

Ah, ok. Anything that cuts overhead is worth doing.

> - You misspelled "detach". That just drives me wild. Please fix.

(well, _somebody_ has to keep the speeling police occupied;)

> - What I _do_ think is wrong is how I think you're trying to be "too
> precise". I think that's fundamentally wrong, because I think we
> should make it very clear that it's a heuristic. So I dislike seeing
> these functions: sched_autogroup_handler() - we shouldn't care about
> old state, sched_autogroup_detach() - even with the fixed spelling I
> don't really see why a tty hangup should cause the process to go back
> to the default group, for example.
>
> IOW, I think you tried a bit _too_ hard to make it a 1:1 relationship
> with the tty. I don't think it needs to be. Just because a process
> loses its tty because of a hangup, I don't think that that should have
> any real implications for the auto-group scheduling. Or maybe it
> should, but that decision should be based on "does it help scheduling
> behavior" rather than on "it always matches the tty". See what I'm
> saying?

Yeah, and it doesn't in the common case at least. The handler's
classifier was because a 100% pinned hog would never obey the user's
wishes, but I can whack it along with the hangup. Less is more in the
scheduler.

Thanks for the comments.

-Mike

2010-11-11 19:16:03

by Markus Trippelsdorf

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 2010.11.11 at 08:26 -0700, Mike Galbraith wrote:
> I _finally_ got back to this yesterday, and implemented your suggestion,
> though with a couple minor variations. Putting the autogroup pointer in
> the signal struct didn't look right to me, so I plugged it into the task
> struct instead. I also didn't refcount taskgroups, wanted the patchlet
> to be as self-contained as possible, so refcounted the autogroup struct
> instead. I also left group movement on tty disassociation in place, but
> may nuke it.
...
>
> With taskset -c 3 make -j 10 running..
>
> taskset -c 3 ./wakeup-latency& sleep 30;killall wakeup-latency
>
> without:
> maximum latency: 42963.2 µs
> average latency: 9077.0 µs
> missed timer events: 0
>
> with:
> maximum latency: 4160.7 µs
> average latency: 149.4 µs
> missed timer events: 0

Just to add some data; here are the results from my machine (AMD 4
cores) running a -j4 kernel build, while I browsed the web:

1) perf sched record sleep 30

without:
total_wakeups: 44306
avg_wakeup_latency (ns): 36784
min_wakeup_latency (ns): 0
max_wakeup_latency (ns): 9378852

with:
total_wakeups: 43836
avg_wakeup_latency (ns): 67607
min_wakeup_latency (ns): 0
max_wakeup_latency (ns): 8983036

2) perf record -a -e sched:sched_switch -e sched:sched_wakeup sleep 10

without:
total_wakeups: 13195
avg_wakeup_latency (ns): 48484
min_wakeup_latency (ns): 0
max_wakeup_latency (ns): 8722497

with:
total_wakeups: 14106
avg_wakeup_latency (ns): 92532
min_wakeup_latency (ns): 20
max_wakeup_latency (ns): 5642393

So the avg_wakeup_latency nearly doubled with your patch, while the
max_wakeup_latency is lowered by a good amount.

--
Markus

2010-11-11 19:35:26

by Mike Galbraith

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Thu, 2010-11-11 at 20:15 +0100, Markus Trippelsdorf wrote:

> Just to add some data; here are the results from my machine (AMD 4
> cores) running a -j4 kernel build, while I browsed the web:
>
> 1) perf sched record sleep 30
>
> without:
> total_wakeups: 44306
> avg_wakeup_latency (ns): 36784
> min_wakeup_latency (ns): 0
> max_wakeup_latency (ns): 9378852
>
> with:
> total_wakeups: 43836
> avg_wakeup_latency (ns): 67607
> min_wakeup_latency (ns): 0
> max_wakeup_latency (ns): 8983036
>
> 2) perf record -a -e sched:sched_switch -e sched:sched_wakeup sleep 10
>
> without:
> total_wakeups: 13195
> avg_wakeup_latency (ns): 48484
> min_wakeup_latency (ns): 0
> max_wakeup_latency (ns): 8722497
>
> with:
> total_wakeups: 14106
> avg_wakeup_latency (ns): 92532
> min_wakeup_latency (ns): 20
> max_wakeup_latency (ns): 5642393
>
> So the avg_wakeup_latency nearly doubled with your patch, while the
> max_wakeup_latency is lowered by a good amount.

When you say with/without, does that mean enabled/disabled, or
patched/virgin and/or cgroups/nocgroups?

-Mike

2010-11-11 19:38:00

by Linus Torvalds

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Thu, Nov 11, 2010 at 11:08 AM, Mike Galbraith <[email protected]> wrote:
>
>> - the reason I suggested the signal struct was really that I thought
>> it would avoid extra (unnecessary) cost in thread creation/teardown.
>>
>>   Maybe I should have made that clear, but this seems to
>> unnecessarily do the whole atomic_inc/dec for each thread. That seems
>> a bit sad.
>>
>>   That said, if not having to dereference ->signal simplifies the
>> scheduler interaction, I guess the extra atomic ref at thread
>> creation/deletion is fine. So I don't think this is wrong, it's just
>> something I wanted to bring up.
>
> Ah, ok. Anything that cuts overhead is worth doing.

Well, it cuts both ways. Maybe your approach is simpler and avoids
overhead at scheduling time. And "tsk->signal" may not be reliable due
to races with exit etc, so it may well be that going through the
signal struct could end up being a source of nasty races. I didn't
look whether the scheduler already dereferenced ->signal for some other
reason, for example.

So your patch may well have done the exact right thing.

Linus

2010-11-11 19:38:15

by Markus Trippelsdorf

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 2010.11.11 at 12:35 -0700, Mike Galbraith wrote:
> On Thu, 2010-11-11 at 20:15 +0100, Markus Trippelsdorf wrote:
>
> > Just to add some data; here are the results from my machine (AMD 4
> > cores) running a -j4 kernel build, while I browsed the web:
> >
> > 1) perf sched record sleep 30
> >
> > without:
> > total_wakeups: 44306
> > avg_wakeup_latency (ns): 36784
> > min_wakeup_latency (ns): 0
> > max_wakeup_latency (ns): 9378852
> >
> > with:
> > total_wakeups: 43836
> > avg_wakeup_latency (ns): 67607
> > min_wakeup_latency (ns): 0
> > max_wakeup_latency (ns): 8983036
> >
> > 2) perf record -a -e sched:sched_switch -e sched:sched_wakeup sleep 10
> >
> > without:
> > total_wakeups: 13195
> > avg_wakeup_latency (ns): 48484
> > min_wakeup_latency (ns): 0
> > max_wakeup_latency (ns): 8722497
> >
> > with:
> > total_wakeups: 14106
> > avg_wakeup_latency (ns): 92532
> > min_wakeup_latency (ns): 20
> > max_wakeup_latency (ns): 5642393
> >
> > So the avg_wakeup_latency nearly doubled with your patch, while the
> > max_wakeup_latency is lowered by a good amount.
>
> When you say with/without, does that mean enabled/disabled, or
> patched/virgin and/or cgroups/nocgroups?

Patched/virgin and nocgroups
--
Markus

2010-11-11 19:58:20

by Mike Galbraith

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Thu, 2010-11-11 at 20:38 +0100, Markus Trippelsdorf wrote:
> On 2010.11.11 at 12:35 -0700, Mike Galbraith wrote:

> > When you say with/without, does that mean enabled/disabled, or
> > patched/virgin and/or cgroups/nocgroups?
>
> Patched/virgin and nocgroups

Figures. As you can see, group scheduling is not wonderful for extreme
switchers. Fortunately, most apps do a bit of work in between.

-Mike

2010-11-11 20:34:17

by Oleg Nesterov

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

I didn't read this patch carefully (yet) but,

On 11/11, Mike Galbraith wrote:
>
> @@ -2569,6 +2576,7 @@ void sched_fork(struct task_struct *p, i
> * Silence PROVE_RCU.
> */
> rcu_read_lock();
> + autogroup_fork(p);

Surely this doesn't need rcu.

But the real problem is that copy_process() can fail after that,
and in this case we have the unbalanced kref_get().

> +++ linux-2.6.36.git/kernel/exit.c
> @@ -174,6 +174,7 @@ repeat:
> write_lock_irq(&tasklist_lock);
> tracehook_finish_release_task(p);
> __exit_signal(p);
> + sched_autogroup_exit(p);

This doesn't look right. Note that "p" can run/sleep after that
(or in parallel), set_task_rq() can use the freed ->autogroup.

Btw, I can't apply this patch...

Oleg.
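
A userspace model of the unbalanced kref_get() Oleg describes: the fork path
takes its reference early, and a later failure bails out without the matching
put (names are illustrative, not kernel code):

#include <stdio.h>

static int refs = 1;			/* models the autogroup kref */

static void ag_get(void) { refs++; }
static void ag_put(void) { refs--; }

static int copy_process(void)
{
	ag_get();			/* autogroup_fork(p) */
	return -1;			/* a later copy step fails: no ag_put() */
}

int main(void)
{
	copy_process();			/* one failed fork */
	printf("refs = %d, expected 1\n", refs);  /* prints 2: a reference leaked */
	ag_put();			/* what the bad_fork_* path should do */
	return 0;
}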

2010-11-11 20:36:05

by Oleg Nesterov

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 11/11, Linus Torvalds wrote:
>
> Well, it cuts both ways. Maybe your approach is simpler and avoids
> overhead at scheduling time. And "tsk->signal" may not be reliable due
> to races with exit etc, so it may well be that going through the
> signal struct could end up being a source of nasty races. I didn't
> look whether the scheduler already dereferenced ->signal for some other
> reason, for example.

Just in case, starting from 2.6.35 tsk->signal is reliable.

Oleg.

2010-11-11 22:20:15

by Mike Galbraith

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Thu, 2010-11-11 at 21:27 +0100, Oleg Nesterov wrote:
> I didn't read this patch carefully (yet) but,
>
> On 11/11, Mike Galbraith wrote:
> >
> > @@ -2569,6 +2576,7 @@ void sched_fork(struct task_struct *p, i
> > * Silence PROVE_RCU.
> > */
> > rcu_read_lock();
> > + autogroup_fork(p);
>
> Surely this doesn't need rcu.

No, it was just a convenient spot.

> But the real problem is that copy_process() can fail after that,
> and in this case we have the unbalanced kref_get().

Memory leak, will fix.

> > +++ linux-2.6.36.git/kernel/exit.c
> > @@ -174,6 +174,7 @@ repeat:
> > write_lock_irq(&tasklist_lock);
> > tracehook_finish_release_task(p);
> > __exit_signal(p);
> > + sched_autogroup_exit(p);
>
> This doesn't look right. Note that "p" can run/sleep after that
> (or in parallel), set_task_rq() can use the freed ->autogroup.

So avoiding refcounting rcu released task_group backfired. Crud.

> Btw, I can't apply this patch...

It depends on the patch below from Peter, or manual fixup.

Subject: sched, cgroup: Fixup broken cgroup movement
From: Peter Zijlstra <[email protected]>
Date: Fri Oct 15 15:24:15 CEST 2010


Reported-by: Dima Zavin <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/sched.c | 8 ++++----
kernel/sched_fair.c | 25 +++++++++++++++++++------
3 files changed, 24 insertions(+), 11 deletions(-)

Index: linux-2.6.36.git/kernel/sched.c
===================================================================
--- linux-2.6.36.git.orig/kernel/sched.c
+++ linux-2.6.36.git/kernel/sched.c
@@ -8297,12 +8297,12 @@ void sched_move_task(struct task_struct
if (unlikely(running))
tsk->sched_class->put_prev_task(rq, tsk);

- set_task_rq(tsk, task_cpu(tsk));
-
#ifdef CONFIG_FAIR_GROUP_SCHED
- if (tsk->sched_class->moved_group)
- tsk->sched_class->moved_group(tsk, on_rq);
+ if (tsk->sched_class->task_move_group)
+ tsk->sched_class->task_move_group(tsk, on_rq);
+ else
#endif
+ set_task_rq(tsk, task_cpu(tsk));

if (unlikely(running))
tsk->sched_class->set_curr_task(rq);
Index: linux-2.6.36.git/include/linux/sched.h
===================================================================
--- linux-2.6.36.git.orig/include/linux/sched.h
+++ linux-2.6.36.git/include/linux/sched.h
@@ -1072,7 +1072,7 @@ struct sched_class {
struct task_struct *task);

#ifdef CONFIG_FAIR_GROUP_SCHED
- void (*moved_group) (struct task_struct *p, int on_rq);
+ void (*task_move_group) (struct task_struct *p, int on_rq);
#endif
};

Index: linux-2.6.36.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.36.git.orig/kernel/sched_fair.c
+++ linux-2.6.36.git/kernel/sched_fair.c
@@ -3824,13 +3824,26 @@ static void set_curr_task_fair(struct rq
}

#ifdef CONFIG_FAIR_GROUP_SCHED
-static void moved_group_fair(struct task_struct *p, int on_rq)
+static void task_move_group_fair(struct task_struct *p, int on_rq)
{
- struct cfs_rq *cfs_rq = task_cfs_rq(p);
-
- update_curr(cfs_rq);
+ /*
+ * If the task was not on the rq at the time of this cgroup movement
+ * it must have been asleep, sleeping tasks keep their ->vruntime
+ * absolute on their old rq until wakeup (needed for the fair sleeper
+ * bonus in place_entity()).
+ *
+ * If it was on the rq, we've just 'preempted' it, which does convert
+ * ->vruntime to a relative base.
+ *
+ * Make sure both cases convert their relative position when migrating
+ * to another cgroup's rq. This does somewhat interfere with the
+ * fair sleeper stuff for the first placement, but who cares.
+ */
+ if (!on_rq)
+ p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
+ set_task_rq(p, task_cpu(p));
if (!on_rq)
- place_entity(cfs_rq, &p->se, 1);
+ p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime;
}
#endif

@@ -3882,7 +3895,7 @@ static const struct sched_class fair_sch
.get_rr_interval = get_rr_interval_fair,

#ifdef CONFIG_FAIR_GROUP_SCHED
- .moved_group = moved_group_fair,
+ .task_move_group = task_move_group_fair,
#endif
};
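
To illustrate the conversion in task_move_group_fair() above with invented
numbers: a sleeper's vruntime is absolute on its old queue, so it is made
relative, the queue is switched, and it is re-absolutized:

#include <stdio.h>

int main(void)
{
	double old_min = 1000.0, new_min = 40.0; /* per-cfs_rq min_vruntime */
	double vruntime = 1003.5;	/* sleeping task, absolute on old rq */

	vruntime -= old_min;	/* 3.5: position relative to the old queue */
	/* set_task_rq(p, task_cpu(p)) switches cfs_rq here */
	vruntime += new_min;	/* 43.5: same relative position, new queue */

	printf("vruntime = %.1f\n", vruntime);
	return 0;
}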


2010-11-12 18:20:01

by Oleg Nesterov

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 11/11, Mike Galbraith wrote:
>
> On Thu, 2010-11-11 at 21:27 +0100, Oleg Nesterov wrote:
>
> > But the real problem is that copy_process() can fail after that,
> > and in this case we have the unbalanced kref_get().
>
> Memory leak, will fix.
>
> > > +++ linux-2.6.36.git/kernel/exit.c
> > > @@ -174,6 +174,7 @@ repeat:
> > > write_lock_irq(&tasklist_lock);
> > > tracehook_finish_release_task(p);
> > > __exit_signal(p);
> > > + sched_autogroup_exit(p);
> >
> > This doesn't look right. Note that "p" can run/sleep after that
> > (or in parallel), set_task_rq() can use the freed ->autogroup.
>
> So avoiding refcounting rcu released task_group backfired. Crud.

Just in case, the lock order may be wrong. sched_autogroup_exit()
takes task_group_lock under write_lock(tasklist), while
sched_autogroup_handler() takes them in reverse order.
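
The two orders side by side (sched_autogroup_exit() reaches task_group_lock
through sched_destroy_group() when the last reference dies):

	release_task():                     sched_autogroup_handler():
	  write_lock_irq(&tasklist_lock);     spin_lock(&task_group_lock);
	  __exit_signal(p);                   read_lock(&tasklist_lock);
	  sched_autogroup_exit(p);
	    autogroup_kref_put()
	      autogroup_destroy()
	        sched_destroy_group()
	          spin_lock(&task_group_lock);

Each side can end up holding one lock while waiting for the other: ABBA.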


I am not sure, but perhaps this can be simpler?
wake_up_new_task() does autogroup_fork(), and do_exit() does
sched_autogroup_exit() before the last schedule. Possible?


> > Btw, I can't apply this patch...
>
> It depends on the patch below from Peter, or manual fixup.

Thanks. It also applies cleanly to 2.6.36.


Very basic question. Currently sched_autogroup_create_attach()
has the only caller, __proc_set_tty(). It is a bit strange that
signal->tty change is process-wide, but sched_autogroup_create_attach()
moves the single thread, the caller. What about other threads in
this thread group? The same for proc_clear_tty().


> +void sched_autogroup_create_attach(struct task_struct *p)
> +{
> + autogroup_move_task(p, autogroup_create());
> +
> + /*
> + * Correct freshly allocated group's refcount.
> + * Move takes a reference on destination, but
> + * create already initialized refcount to 1.
> + */
> + if (p->autogroup != &autogroup_default)
> + autogroup_kref_put(p->autogroup);
> +}

OK, but I don't understand "p->autogroup != &autogroup_default"
check. This is true if autogroup_create() succeeds. Otherwise
autogroup_create() does autogroup_kref_get(autogroup_default),
doesn't this mean we need unconditional _put ?

And can't resist, minor cosmetic nit,

> static inline struct task_group *task_group(struct task_struct *p)
> {
> + struct task_group *tg;
> struct cgroup_subsys_state *css;
>
> css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
> lockdep_is_held(&task_rq(p)->lock));
> - return container_of(css, struct task_group, css);
> + tg = container_of(css, struct task_group, css);
> +
> + autogroup_task_group(p, &tg);

Fell free to ignore, but imho

return autogroup_task_group(p, tg);

looks a bit better. Why does autogroup_task_group() return its
result via a pointer?

Oleg.

2010-11-13 11:42:20

by Mike Galbraith

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-12 at 19:12 +0100, Oleg Nesterov wrote:
> On 11/11, Mike Galbraith wrote:
> >
> > On Thu, 2010-11-11 at 21:27 +0100, Oleg Nesterov wrote:
> >
> > > But the real problem is that copy_process() can fail after that,
> > > and in this case we have the unbalanced kref_get().
> >
> > Memory leak, will fix.
> >
> > > > +++ linux-2.6.36.git/kernel/exit.c
> > > > @@ -174,6 +174,7 @@ repeat:
> > > > write_lock_irq(&tasklist_lock);
> > > > tracehook_finish_release_task(p);
> > > > __exit_signal(p);
> > > > + sched_autogroup_exit(p);
> > >
> > > This doesn't look right. Note that "p" can run/sleep after that
> > > (or in parallel), set_task_rq() can use the freed ->autogroup.
> >
> > So avoiding refcounting rcu released task_group backfired. Crud.
>
> Just in case, the lock order may be wrong. sched_autogroup_exit()
> takes task_group_lock under write_lock(tasklist), while
> sched_autogroup_handler() takes them in reverse order.

Bug self destructs when global classifier goes away.

> I am not sure, but perhaps this can be simpler?
> wake_up_new_task() does autogroup_fork(), and do_exit() does
> sched_autogroup_exit() before the last schedule. Possible?

That's what I was going to do. That said, I couldn't have had the
problem if I'd tied final put directly to life of container, and am
thinking I should do that instead when I go back to p->signal.

> Very basic question. Currently sched_autogroup_create_attach()
> has the only caller, __proc_set_tty(). It is a bit strange that
> signal->tty change is process-wide, but sched_autogroup_create_attach()
> moves the single thread, the caller. What about other threads in
> this thread group? The same for proc_clear_tty().

Yeah, I really should (will) move all on the spot, though it doesn't
seem to matter in general practice; forks afterward land in the right
bucket. With per tty or p->signal, migration will pick up stragglers
lazily.. unless they're pinned.

> > +void sched_autogroup_create_attach(struct task_struct *p)
> > +{
> > + autogroup_move_task(p, autogroup_create());
> > +
> > + /*
> > + * Correct freshly allocated group's refcount.
> > + * Move takes a reference on destination, but
> > + * create already initialized refcount to 1.
> > + */
> > + if (p->autogroup != &autogroup_default)
> > + autogroup_kref_put(p->autogroup);
> > +}
>
> OK, but I don't understand "p->autogroup != &autogroup_default"
> check. This is true if autogroup_create() succeeds. Otherwise
> autogroup_create() does autogroup_kref_get(autogroup_default),
> doesn't this mean we need unconditional _put ?

D'oh, target fixation :) Thanks.

> And can't resist, minor cosmetic nit,
>
> > static inline struct task_group *task_group(struct task_struct *p)
> > {
> > + struct task_group *tg;
> > struct cgroup_subsys_state *css;
> >
> > css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
> > lockdep_is_held(&task_rq(p)->lock));
> > - return container_of(css, struct task_group, css);
> > + tg = container_of(css, struct task_group, css);
> > +
> > + autogroup_task_group(p, &tg);
>
> Fell free to ignore, but imho
>
> return autogroup_task_group(p, tg);
>
> looks a bit better. Why does autogroup_task_group() return its
> result via a pointer?

No particularly good reason, I'll do the cosmetic change.

Thanks,

-Mike

2010-11-14 17:19:29

by Mike Galbraith

Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sat, 2010-11-13 at 04:42 -0700, Mike Galbraith wrote:
> On Fri, 2010-11-12 at 19:12 +0100, Oleg Nesterov wrote:
> > On 11/11, Mike Galbraith wrote:
> > >
> > > On Thu, 2010-11-11 at 21:27 +0100, Oleg Nesterov wrote:
> > >
> > > > But the real problem is that copy_process() can fail after that,
> > > > and in this case we have the unbalanced kref_get().
> > >
> > > Memory leak, will fix.
> > >
> > > > > +++ linux-2.6.36.git/kernel/exit.c
> > > > > @@ -174,6 +174,7 @@ repeat:
> > > > > write_lock_irq(&tasklist_lock);
> > > > > tracehook_finish_release_task(p);
> > > > > __exit_signal(p);
> > > > > + sched_autogroup_exit(p);
> > > >
> > > > This doesn't look right. Note that "p" can run/sleep after that
> > > > (or in parallel), set_task_rq() can use the freed ->autogroup.
> > >
> > > So avoiding refcounting rcu released task_group backfired. Crud.
> >
> > Just in case, the lock order may be wrong. sched_autogroup_exit()
> > takes task_group_lock under write_lock(tasklist), while
> > sched_autogroup_handler() takes them in reverse order.
>
> Bug self destructs when global classifier goes away.

I didn't nuke the handler, but did hide it under a debug option since it
is useful for testing. If the user enables it, and turns autogroup off,
imho off should mean off NOW, so I stuck with it as is. I coded up a
lazy (tick time check) move to handle pinned tasks not otherwise being
moved, but that was too much for even my (lack of) taste to handle.

The locking should be fine as it is now, since autogroup_exit() isn't
under the tasklist lock any more. (surprising i didn't hit any problems
with this or use-after-free in the rt kernel given how hard i beat on it)

Pondering adding some debug bits to identify autogroup tasks, maybe
in /proc/N/cgroup or such.

> > I am not sure, but perhaps this can be simpler?
> > wake_up_new_task() does autogroup_fork(), and do_exit() does
> > sched_autogroup_exit() before the last schedule. Possible?
>
> That's what I was going to do. That said, I couldn't have had the
> problem if I'd tied final put directly to life of container, and am
> thinking I should do that instead when I go back to p->signal.

I ended up tying it directly to p->signal's life, and beat on it with
CONFIG_PREEMPT. I wanted to give it a thrashing in PREEMPT_RT, but
when I snagged your signal patches, I apparently didn't snag quite
enough, as the rt kernel with those patches is an early boot doorstop.

> > Very basic question. Currently sched_autogroup_create_attach()
> > has the only caller, __proc_set_tty(). It is a bit strange that
> > signal->tty change is process-wide, but sched_autogroup_create_attach()
> > moves the single thread, the caller. What about other threads in
> > this thread group? The same for proc_clear_tty().
>
> Yeah, I really should (will) move all on the spot...

Did that, and the rest. This patch will apply to tip or .git.

patchlet:

A recurring complaint from CFS users is that parallel kbuild has a negative
impact on desktop interactivity. This patch implements an idea from Linus,
to automatically create task groups. This patch only implements Linus' per
tty task group suggestion, and only for fair class tasks, but leaves the way
open for enhancement.

Implementation: each task's signal struct contains an inherited pointer to a
refcounted autogroup struct containing a task group pointer, the default for
all tasks pointing to the init_task_group. When a task calls __proc_set_tty(),
the process wide reference to the default group is dropped, a new task group is
created, and the process is moved into the new task group. Children thereafter
inherit this task group, and increase its refcount. On exit, a reference to the
current task group is dropped when the last reference to each signal struct is
dropped. The task group is destroyed when the last signal struct referencing
it is freed. At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.

The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is
selected, but can be disabled via the boot option noautogroup, and can
also be turned on/off on the fly if CONFIG_SCHED_AUTOGROUP_DEBUG is enabled via..
echo [01] > /proc/sys/kernel/sched_autogroup_enabled.
..which will automatically move tasks to/from the root task group.

Some numbers.

A 100% hog overhead measurement proggy pinned to the same CPU as a make -j10

About measurement proggy:
pert/sec = perturbations/sec
min/max/avg = scheduler service latencies in usecs
sum/s = time accrued by the competition per sample period (1 sec here)
overhead = %CPU received by the competition per sample period

pert/s: 31 >40475.37us: 3 min: 0.37 max:48103.60 avg:29573.74 sum/s:916786us overhead:90.24%
pert/s: 23 >41237.70us: 12 min: 0.36 max:56010.39 avg:40187.01 sum/s:924301us overhead:91.99%
pert/s: 24 >42150.22us: 12 min: 8.86 max:61265.91 avg:39459.91 sum/s:947038us overhead:92.20%
pert/s: 26 >42344.91us: 11 min: 3.83 max:52029.60 avg:36164.70 sum/s:940282us overhead:91.12%
pert/s: 24 >44262.90us: 14 min: 5.05 max:82735.15 avg:40314.33 sum/s:967544us overhead:92.22%

Same load with this patch applied.

pert/s: 229 >5484.43us: 41 min: 0.15 max:12069.42 avg:2193.81 sum/s:502382us overhead:50.24%
pert/s: 222 >5652.28us: 43 min: 0.46 max:12077.31 avg:2248.56 sum/s:499181us overhead:49.92%
pert/s: 211 >5809.38us: 43 min: 0.16 max:12064.78 avg:2381.70 sum/s:502538us overhead:50.25%
pert/s: 223 >6147.92us: 43 min: 0.15 max:16107.46 avg:2282.17 sum/s:508925us overhead:50.49%
pert/s: 218 >6252.64us: 43 min: 0.16 max:12066.13 avg:2324.11 sum/s:506656us overhead:50.27%

Average service latency is an order of magnitude better with autogroup.
(Imagine that pert were Xorg or whatnot instead)

Using Mathieu Desnoyers' wakeup-latency testcase:

With taskset -c 3 make -j 10 running..

taskset -c 3 ./wakeup-latency& sleep 30;killall wakeup-latency

without:
maximum latency: 42963.2 µs
average latency: 9077.0 µs
missed timer events: 0

with:
maximum latency: 4160.7 µs
average latency: 149.4 µs
missed timer events: 0

Signed-off-by: Mike Galbraith <[email protected]>

---
Documentation/kernel-parameters.txt | 2
drivers/tty/tty_io.c | 1
include/linux/sched.h | 22 ++++
init/Kconfig | 20 ++++
kernel/fork.c | 5 -
kernel/sched.c | 25 +++--
kernel/sched_autogroup.c | 170 ++++++++++++++++++++++++++++++++++++
kernel/sched_autogroup.h | 18 +++
kernel/sysctl.c | 11 ++
9 files changed, 265 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -509,6 +509,8 @@ struct thread_group_cputimer {
spinlock_t lock;
};

+struct autogroup;
+
/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
@@ -576,6 +578,9 @@ struct signal_struct {

struct tty_struct *tty; /* NULL if no tty */

+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
/*
* Cumulative resource counters for dead threads in the group,
* and for reaped dead child processes forked by this group.
@@ -1931,6 +1936,23 @@ int sched_rt_handler(struct ctl_table *t

extern unsigned int sysctl_sched_compat_yield;

+#ifdef CONFIG_SCHED_AUTOGROUP
+#ifdef CONFIG_SCHED_AUTOGROUP_DEBUG
+extern unsigned int sysctl_sched_autogroup_enabled;
+int sched_autogroup_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
+#endif
+extern void sched_autogroup_create_attach(struct task_struct *p);
+extern void sched_autogroup_detach(struct task_struct *p);
+extern void sched_autogroup_fork(struct signal_struct *sig);
+extern void sched_autogroup_exit(struct signal_struct *sig);
+#else
+static inline void sched_autogroup_create_attach(struct task_struct *p) { }
+static inline void sched_autogroup_detach(struct task_struct *p) { }
+static inline void sched_autogroup_fork(struct signal_struct *sig) { }
+static inline void sched_autogroup_exit(struct signal_struct *sig) { }
+#endif
+
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
extern void rt_mutex_setprio(struct task_struct *p, int prio);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -78,6 +78,7 @@

#include "sched_cpupri.h"
#include "workqueue_sched.h"
+#include "sched_autogroup.h"

#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
@@ -605,11 +606,14 @@ static inline int cpu_of(struct rq *rq)
*/
static inline struct task_group *task_group(struct task_struct *p)
{
+ struct task_group *tg;
struct cgroup_subsys_state *css;

css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
- return container_of(css, struct task_group, css);
+ tg = container_of(css, struct task_group, css);
+
+ return autogroup_task_group(p, tg);
}

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -2006,6 +2010,7 @@ static void sched_irq_time_avg_update(st
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
+#include "sched_autogroup.c"
#include "sched_stoptask.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
@@ -7979,7 +7984,7 @@ void __init sched_init(void)
#ifdef CONFIG_CGROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
-
+ autogroup_init(&init_task);
#endif /* CONFIG_CGROUP_SCHED */

#if defined CONFIG_FAIR_GROUP_SCHED && defined CONFIG_SMP
@@ -8509,15 +8514,11 @@ void sched_destroy_group(struct task_gro
/* change task's runqueue when it moves between groups.
* The caller of this function should have put the task in its new group
* by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
- * reflect its new group.
+ * reflect its new group. Called with the runqueue lock held.
*/
-void sched_move_task(struct task_struct *tsk)
+void __sched_move_task(struct task_struct *tsk, struct rq *rq)
{
int on_rq, running;
- unsigned long flags;
- struct rq *rq;
-
- rq = task_rq_lock(tsk, &flags);

running = task_current(rq, tsk);
on_rq = tsk->se.on_rq;
@@ -8538,7 +8539,15 @@ void sched_move_task(struct task_struct
tsk->sched_class->set_curr_task(rq);
if (on_rq)
enqueue_task(rq, tsk, 0);
+}

+void sched_move_task(struct task_struct *tsk)
+{
+ struct rq *rq;
+ unsigned long flags;
+
+ rq = task_rq_lock(tsk, &flags);
+ __sched_move_task(tsk, rq);
task_rq_unlock(rq, &flags);
}
#endif /* CONFIG_CGROUP_SCHED */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -174,8 +174,10 @@ static inline void free_signal_struct(st

static inline void put_signal_struct(struct signal_struct *sig)
{
- if (atomic_dec_and_test(&sig->sigcnt))
+ if (atomic_dec_and_test(&sig->sigcnt)) {
+ sched_autogroup_exit(sig);
free_signal_struct(sig);
+ }
}

void __put_task_struct(struct task_struct *tsk)
@@ -904,6 +906,7 @@ static int copy_signal(unsigned long clo
posix_cpu_timers_init_group(sig);

tty_audit_fork(sig);
+ sched_autogroup_fork(sig);

sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
Index: linux-2.6/drivers/tty/tty_io.c
===================================================================
--- linux-2.6.orig/drivers/tty/tty_io.c
+++ linux-2.6/drivers/tty/tty_io.c
@@ -3160,6 +3160,7 @@ static void __proc_set_tty(struct task_s
put_pid(tsk->signal->tty_old_pgrp);
tsk->signal->tty = tty_kref_get(tty);
tsk->signal->tty_old_pgrp = NULL;
+ sched_autogroup_create_attach(tsk);
}

static void proc_set_tty(struct task_struct *tsk, struct tty_struct *tty)
Index: linux-2.6/kernel/sched_autogroup.h
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_autogroup.h
@@ -0,0 +1,18 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+static void __sched_move_task(struct task_struct *tsk, struct rq *rq);
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg);
+
+#else /* !CONFIG_SCHED_AUTOGROUP */
+
+static inline void autogroup_init(struct task_struct *init_task) { }
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ return tg;
+}
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
Index: linux-2.6/kernel/sched_autogroup.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_autogroup.c
@@ -0,0 +1,170 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+unsigned int __read_mostly sysctl_sched_autogroup_enabled = 1;
+
+struct autogroup {
+ struct kref kref;
+ struct task_group *tg;
+};
+
+static struct autogroup autogroup_default;
+
+static void autogroup_init(struct task_struct *init_task)
+{
+ autogroup_default.tg = &init_task_group;
+ kref_init(&autogroup_default.kref);
+ init_task->signal->autogroup = &autogroup_default;
+}
+
+static inline void autogroup_destroy(struct kref *kref)
+{
+ struct autogroup *ag = container_of(kref, struct autogroup, kref);
+ struct task_group *tg = ag->tg;
+
+ kfree(ag);
+ sched_destroy_group(tg);
+}
+
+static inline void autogroup_kref_put(struct autogroup *ag)
+{
+ kref_put(&ag->kref, autogroup_destroy);
+}
+
+static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
+{
+ kref_get(&ag->kref);
+ return ag;
+}
+
+static inline struct autogroup *autogroup_create(void)
+{
+ struct autogroup *ag = kmalloc(sizeof(*ag), GFP_KERNEL);
+
+ if (!ag)
+ goto out_fail;
+
+ ag->tg = sched_create_group(&init_task_group);
+ kref_init(&ag->kref);
+
+ if (!(IS_ERR(ag->tg)))
+ return ag;
+
+out_fail:
+ if (ag) {
+ kfree(ag);
+ WARN_ON(1);
+ } else
+ WARN_ON(1);
+
+ return autogroup_kref_get(&autogroup_default);
+}
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled);
+
+ enabled &= (tg == &root_task_group);
+ enabled &= (p->sched_class == &fair_sched_class);
+ enabled &= (!(p->flags & PF_EXITING));
+
+ if (enabled)
+ return p->signal->autogroup->tg;
+
+ return tg;
+}
+
+static void
+autogroup_move_group(struct task_struct *p, struct autogroup *ag)
+{
+ struct autogroup *prev;
+ struct task_struct *t;
+ struct rq *rq;
+ unsigned long flags;
+
+ rq = task_rq_lock(p, &flags);
+ prev = p->signal->autogroup;
+ if (prev == ag) {
+ task_rq_unlock(rq, &flags);
+ return;
+ }
+
+ p->signal->autogroup = autogroup_kref_get(ag);
+ __sched_move_task(p, rq);
+ task_rq_unlock(rq, &flags);
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(t, &p->thread_group, thread_group) {
+ sched_move_task(t);
+ }
+ rcu_read_unlock();
+
+ autogroup_kref_put(prev);
+}
+
+void sched_autogroup_create_attach(struct task_struct *p)
+{
+ struct autogroup *ag = autogroup_create();
+
+ autogroup_move_group(p, ag);
+ /* drop extra reference added by autogroup_create() */
+ autogroup_kref_put(ag);
+}
+EXPORT_SYMBOL(sched_autogroup_create_attach);
+
+/* currently has no users */
+void sched_autogroup_detach(struct task_struct *p)
+{
+ autogroup_move_group(p, &autogroup_default);
+}
+EXPORT_SYMBOL(sched_autogroup_detach);
+
+void sched_autogroup_fork(struct signal_struct *sig)
+{
+ sig->autogroup = autogroup_kref_get(current->signal->autogroup);
+}
+
+void sched_autogroup_exit(struct signal_struct *sig)
+{
+ autogroup_kref_put(sig->autogroup);
+}
+
+#ifdef CONFIG_SCHED_AUTOGROUP_DEBUG
+int sched_autogroup_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct task_struct *p, *t;
+ int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+ if (ret || !write)
+ return ret;
+
+ /*
+ * Exclude cgroup, task group and task create/destroy
+ * during global classification.
+ */
+ cgroup_lock();
+ spin_lock(&task_group_lock);
+ read_lock(&tasklist_lock);
+
+ do_each_thread(p, t) {
+ sched_move_task(t);
+ } while_each_thread(p, t);
+
+ read_unlock(&tasklist_lock);
+ spin_unlock(&task_group_lock);
+ cgroup_unlock();
+
+ return 0;
+}
+#endif
+
+static int __init setup_autogroup(char *str)
+{
+ sysctl_sched_autogroup_enabled = 0;
+
+ return 1;
+}
+
+__setup("noautogroup", setup_autogroup);
+#endif
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -382,6 +382,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SCHED_AUTOGROUP_DEBUG
+ {
+ .procname = "sched_autogroup_enabled",
+ .data = &sysctl_sched_autogroup_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_autogroup_handler,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -728,6 +728,26 @@ config NET_NS

endif # NAMESPACES

+config SCHED_AUTOGROUP
+ bool "Automatic process group scheduling"
+ select CGROUPS
+ select CGROUP_SCHED
+ select FAIR_GROUP_SCHED
+ help
+ This option optimizes the scheduler for common desktop workloads by
+ automatically creating and populating task groups. This separation
+ of workloads isolates aggressive CPU burners (like build jobs) from
+ desktop applications. Task group autogeneration is currently based
+ upon task tty association.
+
+config SCHED_AUTOGROUP_DEBUG
+ bool "Enable Autogroup debugging"
+ depends on SCHED_AUTOGROUP
+ default n
+ help
+ This option allows the user to enable/disable autogroup on the fly
+ via echo [10] > /proc/sys/kernel/sched_autogroup_enabled.
+
config MM_OWNER
bool

Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt
+++ linux-2.6/Documentation/kernel-parameters.txt
@@ -1622,6 +1622,8 @@ and is between 256 and 4096 characters.
noapic [SMP,APIC] Tells the kernel to not make use of any
IOAPICs that may be present in the system.

+ noautogroup Disable scheduler automatic task group creation.
+
nobats [PPC] Do not use BATs for mapping kernel lowmem
on "Classic" PPC cores.




2010-11-14 17:49:30

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 2010.11.14 at 10:19 -0700, Mike Galbraith wrote:
> On Sat, 2010-11-13 at 04:42 -0700, Mike Galbraith wrote:
> > On Fri, 2010-11-12 at 19:12 +0100, Oleg Nesterov wrote:
> > > On 11/11, Mike Galbraith wrote:
> > > >
> > > > On Thu, 2010-11-11 at 21:27 +0100, Oleg Nesterov wrote:
> > > >
> > > > > But the real problem is that copy_process() can fail after that,
> > > > > and in this case we have the unbalanced kref_get().
> > > >
> > > > Memory leak, will fix.
> > > >
> > > > > > +++ linux-2.6.36.git/kernel/exit.c
> > > > > > @@ -174,6 +174,7 @@ repeat:
> > > > > > write_lock_irq(&tasklist_lock);
> > > > > > tracehook_finish_release_task(p);
> > > > > > __exit_signal(p);
> > > > > > + sched_autogroup_exit(p);
> > > > >
> > > > > This doesn't look right. Note that "p" can run/sleep after that
> > > > > (or in parallel), set_task_rq() can use the freed ->autogroup.
> > > >
> > > > So avoiding refcounting the rcu-released task_group backfired. Crud.
> > >
> > > Just in case, the lock order may be wrong. sched_autogroup_exit()
> > > takes task_group_lock under write_lock(tasklist), while
> > > sched_autogroup_handler() takes them in reverse order.
> >
> > Bug self destructs when global classifier goes away.
>
> I didn't nuke the handler, but did hide it under a debug option since it
> is useful for testing. If the user enables it, and turns autogroup off,
> imho off should mean off NOW, so I stuck with it as is. I coded up a
> lazy (tick time check) move to handle pinned tasks not otherwise being
> moved, but that was too much for even my (lack of) taste to handle.
>
> The locking should be fine as it is now, since autogroup_exit() isn't
> under the tasklist lock any more. (Surprising I didn't hit any problems
> with this or use-after-free in the rt kernel, given how hard I beat on it.)
>
> Pondering adding some debug bits to identify autogroup tasks, maybe
> in /proc/N/cgroup or such.
>
> > > I am not sure, but perhaps this can be simpler?
> > > wake_up_new_task() does autogroup_fork(), and do_exit() does
> > > sched_autogroup_exit() before the last schedule. Possible?
> >
> > That's what I was going to do. That said, I couldn't have had the
> > problem if I'd tied final put directly to life of container, and am
> > thinking I should do that instead when I go back to p->signal.
>
> I ended up tying it directly to p->signal's life, and beat on it with
> CONFIG_PREEMPT. I wanted to give it a thrashing in PREEMPT_RT, but
> when I snagged your signal patches, I apparently didn't snag quite
> enough, as the rt kernel with those patches is an early-boot doorstop.
>
> > > Very basic question. Currently sched_autogroup_create_attach()
> > > has the only caller, __proc_set_tty(). It is a bit strange that
> > > signal->tty change is process-wide, but sched_autogroup_create_attach()
> > > moves the single thread, the caller. What about other threads in
> > > this thread group? The same for proc_clear_tty().
> >
> > Yeah, I really should (will) move all on the spot...
>
> Did that, and the rest. This patch will apply to tip or .git.

Unfortunately it won't. There's a missing "+" at line 507:

markus@arch linux % patch -p1 --dry-run < /home/markus/tty_autogroup2.patch
patching file include/linux/sched.h
Hunk #3 succeeded at 1935 (offset -1 lines).
patching file kernel/sched.c
Hunk #2 succeeded at 607 (offset 1 line).
Hunk #3 succeeded at 2011 with fuzz 2 (offset 1 line).
Hunk #4 succeeded at 7959 (offset -25 lines).
Hunk #5 succeeded at 8489 (offset -25 lines).
Hunk #6 succeeded at 8514 (offset -25 lines).
patching file kernel/fork.c
patching file drivers/tty/tty_io.c
patching file kernel/sched_autogroup.h
patching file kernel/sched_autogroup.c
patch: **** malformed patch at line 507: Index: linux-2.6/kernel/sysctl.c
--
Markus

2010-11-14 18:10:59

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sun, 2010-11-14 at 18:49 +0100, Markus Trippelsdorf wrote:
> On 2010.11.14 at 10:19 -0700, Mike Galbraith wrote:

> > Did that, and the rest. This patch will apply to tip or .git.
>
> Unfortunately it won't. There's a missing "+" at line 507:
>
> markus@arch linux % patch -p1 --dry-run < /home/markus/tty_autogroup2.patch
> patching file include/linux/sched.h
> Hunk #3 succeeded at 1935 (offset -1 lines).
> patching file kernel/sched.c
> Hunk #2 succeeded at 607 (offset 1 line).
> Hunk #3 succeeded at 2011 with fuzz 2 (offset 1 line).
> Hunk #4 succeeded at 7959 (offset -25 lines).
> Hunk #5 succeeded at 8489 (offset -25 lines).
> Hunk #6 succeeded at 8514 (offset -25 lines).
> patching file kernel/fork.c
> patching file drivers/tty/tty_io.c
> patching file kernel/sched_autogroup.h
> patching file kernel/sched_autogroup.c
> patch: **** malformed patch at line 507: Index: linux-2.6/kernel/sysctl.c

Oh well, yesterday ping (one of my sister's laptop-loving cats) opened
_187_ screenshots while I was away from the keyboard. I'll blame any
spelling errors on him too while I'm at it :)

Here's a fresh copy, no cats in sight.

A recurring complaint from CFS users is that parallel kbuild has a negative
impact on desktop interactivity. This patch implements an idea from Linus,
to automatically create task groups. This patch only implements Linus' per
tty task group suggestion, and only for fair class tasks, but leaves the way
open for enhancement.

Implementation: each task's signal struct contains an inherited pointer to a
refcounted autogroup struct containing a task group pointer, the default for
all tasks pointing to the init_task_group. When a task calls __proc_set_tty(),
the process wide reference to the default group is dropped, a new task group is
created, and the process is moved into the new task group. Children thereafter
inherit this task group, and increase its refcount. On exit, a reference to the
current task group is dropped when the last reference to each signal struct is
dropped. The task group is destroyed when the last signal struct referencing
it is freed. At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.
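
For illustration only, here is a userspace analogue of that refcount
lifecycle (hypothetical; the patch itself uses struct kref in kernel
context, as shown below):

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct group {
	atomic_int refs;
};

static struct group *group_get(struct group *g)
{
	atomic_fetch_add(&g->refs, 1);
	return g;
}

static void group_put(struct group *g)
{
	/* the last put destroys the group, as autogroup_destroy() does */
	if (atomic_fetch_sub(&g->refs, 1) == 1) {
		puts("group destroyed");
		free(g);
	}
}

static struct group *group_create(void)
{
	struct group *g = malloc(sizeof(*g));

	if (!g)
		abort();
	atomic_init(&g->refs, 1);	/* creator's reference */
	return g;
}

int main(void)
{
	struct group *g = group_create();	/* __proc_set_tty() */
	struct group *child = group_get(g);	/* child's copy_signal() */

	group_put(child);	/* child's last signal struct ref dropped */
	group_put(g);		/* parent's last ref dropped: destroyed */
	return 0;
}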

The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is
selected, but can be disabled via the boot option noautogroup, and can
also be turned on/off on the fly via..
echo [01] > /proc/sys/kernel/sched_autogroup_enabled.
..which will automatically move tasks to/from the root task group.

Some numbers.

A 100% hog overhead measurement proggy pinned to the same CPU as a make -j10

About measurement proggy:
pert/sec = perturbations/sec
min/max/avg = scheduler service latencies in usecs
sum/s = time accrued by the competition per sample period (1 sec here)
overhead = %CPU received by the competition per sample period

pert/s: 31 >40475.37us: 3 min: 0.37 max:48103.60 avg:29573.74 sum/s:916786us overhead:90.24%
pert/s: 23 >41237.70us: 12 min: 0.36 max:56010.39 avg:40187.01 sum/s:924301us overhead:91.99%
pert/s: 24 >42150.22us: 12 min: 8.86 max:61265.91 avg:39459.91 sum/s:947038us overhead:92.20%
pert/s: 26 >42344.91us: 11 min: 3.83 max:52029.60 avg:36164.70 sum/s:940282us overhead:91.12%
pert/s: 24 >44262.90us: 14 min: 5.05 max:82735.15 avg:40314.33 sum/s:967544us overhead:92.22%

Same load with this patch applied.

pert/s: 229 >5484.43us: 41 min: 0.15 max:12069.42 avg:2193.81 sum/s:502382us overhead:50.24%
pert/s: 222 >5652.28us: 43 min: 0.46 max:12077.31 avg:2248.56 sum/s:499181us overhead:49.92%
pert/s: 211 >5809.38us: 43 min: 0.16 max:12064.78 avg:2381.70 sum/s:502538us overhead:50.25%
pert/s: 223 >6147.92us: 43 min: 0.15 max:16107.46 avg:2282.17 sum/s:508925us overhead:50.49%
pert/s: 218 >6252.64us: 43 min: 0.16 max:12066.13 avg:2324.11 sum/s:506656us overhead:50.27%

Average service latency is an order of magnitude better with autogroup.
(Imagine that pert were Xorg or whatnot instead)
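
For the curious, a rough sketch of what such a measurement loop might
look like (hypothetical; the actual proggy isn't posted in this thread):

#include <stdio.h>
#include <time.h>

static double ts_diff_us(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1e6 +
	       (b->tv_nsec - a->tv_nsec) / 1e3;
}

int main(void)
{
	struct timespec period, start, end;
	const struct timespec req = { 0, 100000 };	/* 100us nap */
	double delay, min = 1e9, max = 0, sum = 0;
	long naps = 0, pert = 0;

	clock_gettime(CLOCK_MONOTONIC, &period);
	for (;;) {
		clock_gettime(CLOCK_MONOTONIC, &start);
		nanosleep(&req, NULL);
		clock_gettime(CLOCK_MONOTONIC, &end);

		/* excess over the requested nap is time spent waiting */
		delay = ts_diff_us(&start, &end) - 100.0;
		if (delay < 0.0)
			delay = 0.0;
		if (delay > 100.0)	/* arbitrary perturbation threshold */
			pert++;
		if (delay < min)
			min = delay;
		if (delay > max)
			max = delay;
		sum += delay;
		naps++;

		if (ts_diff_us(&period, &end) >= 1e6) {	/* 1 sec sample */
			printf("pert/s: %ld min: %.2f max: %.2f avg: %.2f sum/s: %.0fus\n",
			       pert, min, max, sum / naps, sum);
			naps = pert = 0;
			min = 1e9;
			max = sum = 0;
			period = end;
		}
	}
	return 0;
}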

Using Mathieu Desnoyers' wakeup-latency testcase:

With taskset -c 3 make -j 10 running..

taskset -c 3 ./wakeup-latency& sleep 30;killall wakeup-latency

without:
maximum latency: 42963.2 µs
average latency: 9077.0 µs
missed timer events: 0

with:
maximum latency: 4160.7 µs
average latency: 149.4 µs
missed timer events: 0

Signed-off-by: Mike Galbraith <[email protected]>

---
Documentation/kernel-parameters.txt | 2
drivers/tty/tty_io.c | 1
include/linux/sched.h | 22 ++++
init/Kconfig | 20 ++++
kernel/fork.c | 5 -
kernel/sched.c | 25 +++--
kernel/sched_autogroup.c | 170 ++++++++++++++++++++++++++++++++++++
kernel/sched_autogroup.h | 18 +++
kernel/sysctl.c | 11 ++
9 files changed, 265 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -509,6 +509,8 @@ struct thread_group_cputimer {
spinlock_t lock;
};

+struct autogroup;
+
/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
@@ -576,6 +578,9 @@ struct signal_struct {

struct tty_struct *tty; /* NULL if no tty */

+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
/*
* Cumulative resource counters for dead threads in the group,
* and for reaped dead child processes forked by this group.
@@ -1931,6 +1936,23 @@ int sched_rt_handler(struct ctl_table *t

extern unsigned int sysctl_sched_compat_yield;

+#ifdef CONFIG_SCHED_AUTOGROUP
+#ifdef CONFIG_SCHED_AUTOGROUP_DEBUG
+extern unsigned int sysctl_sched_autogroup_enabled;
+int sched_autogroup_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
+#endif
+extern void sched_autogroup_create_attach(struct task_struct *p);
+extern void sched_autogroup_detach(struct task_struct *p);
+extern void sched_autogroup_fork(struct signal_struct *sig);
+extern void sched_autogroup_exit(struct signal_struct *sig);
+#else
+static inline void sched_autogroup_create_attach(struct task_struct *p) { }
+static inline void sched_autogroup_detach(struct task_struct *p) { }
+static inline void sched_autogroup_fork(struct signal_struct *sig) { }
+static inline void sched_autogroup_exit(struct signal_struct *sig) { }
+#endif
+
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
extern void rt_mutex_setprio(struct task_struct *p, int prio);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -78,6 +78,7 @@

#include "sched_cpupri.h"
#include "workqueue_sched.h"
+#include "sched_autogroup.h"

#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
@@ -605,11 +606,14 @@ static inline int cpu_of(struct rq *rq)
*/
static inline struct task_group *task_group(struct task_struct *p)
{
+ struct task_group *tg;
struct cgroup_subsys_state *css;

css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
- return container_of(css, struct task_group, css);
+ tg = container_of(css, struct task_group, css);
+
+ return autogroup_task_group(p, tg);
}

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -2006,6 +2010,7 @@ static void sched_irq_time_avg_update(st
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
+#include "sched_autogroup.c"
#include "sched_stoptask.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
@@ -7979,7 +7984,7 @@ void __init sched_init(void)
#ifdef CONFIG_CGROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
-
+ autogroup_init(&init_task);
#endif /* CONFIG_CGROUP_SCHED */

#if defined CONFIG_FAIR_GROUP_SCHED && defined CONFIG_SMP
@@ -8509,15 +8514,11 @@ void sched_destroy_group(struct task_gro
/* change task's runqueue when it moves between groups.
* The caller of this function should have put the task in its new group
* by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
- * reflect its new group.
+ * reflect its new group. Called with the runqueue lock held.
*/
-void sched_move_task(struct task_struct *tsk)
+void __sched_move_task(struct task_struct *tsk, struct rq *rq)
{
int on_rq, running;
- unsigned long flags;
- struct rq *rq;
-
- rq = task_rq_lock(tsk, &flags);

running = task_current(rq, tsk);
on_rq = tsk->se.on_rq;
@@ -8538,7 +8539,15 @@ void sched_move_task(struct task_struct
tsk->sched_class->set_curr_task(rq);
if (on_rq)
enqueue_task(rq, tsk, 0);
+}

+void sched_move_task(struct task_struct *tsk)
+{
+ struct rq *rq;
+ unsigned long flags;
+
+ rq = task_rq_lock(tsk, &flags);
+ __sched_move_task(tsk, rq);
task_rq_unlock(rq, &flags);
}
#endif /* CONFIG_CGROUP_SCHED */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -174,8 +174,10 @@ static inline void free_signal_struct(st

static inline void put_signal_struct(struct signal_struct *sig)
{
- if (atomic_dec_and_test(&sig->sigcnt))
+ if (atomic_dec_and_test(&sig->sigcnt)) {
+ sched_autogroup_exit(sig);
free_signal_struct(sig);
+ }
}

void __put_task_struct(struct task_struct *tsk)
@@ -904,6 +906,7 @@ static int copy_signal(unsigned long clo
posix_cpu_timers_init_group(sig);

tty_audit_fork(sig);
+ sched_autogroup_fork(sig);

sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
Index: linux-2.6/drivers/tty/tty_io.c
===================================================================
--- linux-2.6.orig/drivers/tty/tty_io.c
+++ linux-2.6/drivers/tty/tty_io.c
@@ -3160,6 +3160,7 @@ static void __proc_set_tty(struct task_s
put_pid(tsk->signal->tty_old_pgrp);
tsk->signal->tty = tty_kref_get(tty);
tsk->signal->tty_old_pgrp = NULL;
+ sched_autogroup_create_attach(tsk);
}

static void proc_set_tty(struct task_struct *tsk, struct tty_struct *tty)
Index: linux-2.6/kernel/sched_autogroup.h
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_autogroup.h
@@ -0,0 +1,18 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+static void __sched_move_task(struct task_struct *tsk, struct rq *rq);
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg);
+
+#else /* !CONFIG_SCHED_AUTOGROUP */
+
+static inline void autogroup_init(struct task_struct *init_task) { }
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ return tg;
+}
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
Index: linux-2.6/kernel/sched_autogroup.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_autogroup.c
@@ -0,0 +1,170 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+unsigned int __read_mostly sysctl_sched_autogroup_enabled = 1;
+
+struct autogroup {
+ struct kref kref;
+ struct task_group *tg;
+};
+
+static struct autogroup autogroup_default;
+
+static void autogroup_init(struct task_struct *init_task)
+{
+ autogroup_default.tg = &init_task_group;
+ kref_init(&autogroup_default.kref);
+ init_task->signal->autogroup = &autogroup_default;
+}
+
+static inline void autogroup_destroy(struct kref *kref)
+{
+ struct autogroup *ag = container_of(kref, struct autogroup, kref);
+ struct task_group *tg = ag->tg;
+
+ kfree(ag);
+ sched_destroy_group(tg);
+}
+
+static inline void autogroup_kref_put(struct autogroup *ag)
+{
+ kref_put(&ag->kref, autogroup_destroy);
+}
+
+static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
+{
+ kref_get(&ag->kref);
+ return ag;
+}
+
+static inline struct autogroup *autogroup_create(void)
+{
+ struct autogroup *ag = kmalloc(sizeof(*ag), GFP_KERNEL);
+
+ if (!ag)
+ goto out_fail;
+
+ ag->tg = sched_create_group(&init_task_group);
+ kref_init(&ag->kref);
+
+ if (!(IS_ERR(ag->tg)))
+ return ag;
+
+out_fail:
+ if (ag) {
+ kfree(ag);
+ WARN_ON(1);
+ } else
+ WARN_ON(1);
+
+ return autogroup_kref_get(&autogroup_default);
+}
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled);
+
+ enabled &= (tg == &root_task_group);
+ enabled &= (p->sched_class == &fair_sched_class);
+ enabled &= (!(p->flags & PF_EXITING));
+
+ if (enabled)
+ return p->signal->autogroup->tg;
+
+ return tg;
+}
+
+static void
+autogroup_move_group(struct task_struct *p, struct autogroup *ag)
+{
+ struct autogroup *prev;
+ struct task_struct *t;
+ struct rq *rq;
+ unsigned long flags;
+
+ rq = task_rq_lock(p, &flags);
+ prev = p->signal->autogroup;
+ if (prev == ag) {
+ task_rq_unlock(rq, &flags);
+ return;
+ }
+
+ p->signal->autogroup = autogroup_kref_get(ag);
+ __sched_move_task(p, rq);
+ task_rq_unlock(rq, &flags);
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(t, &p->thread_group, thread_group) {
+ sched_move_task(t);
+ }
+ rcu_read_unlock();
+
+ autogroup_kref_put(prev);
+}
+
+void sched_autogroup_create_attach(struct task_struct *p)
+{
+ struct autogroup *ag = autogroup_create();
+
+ autogroup_move_group(p, ag);
+ /* drop extra reference added by autogroup_create() */
+ autogroup_kref_put(ag);
+}
+EXPORT_SYMBOL(sched_autogroup_create_attach);
+
+/* currently has no users */
+void sched_autogroup_detach(struct task_struct *p)
+{
+ autogroup_move_group(p, &autogroup_default);
+}
+EXPORT_SYMBOL(sched_autogroup_detach);
+
+void sched_autogroup_fork(struct signal_struct *sig)
+{
+ sig->autogroup = autogroup_kref_get(current->signal->autogroup);
+}
+
+void sched_autogroup_exit(struct signal_struct *sig)
+{
+ autogroup_kref_put(sig->autogroup);
+}
+
+#ifdef CONFIG_SCHED_AUTOGROUP_DEBUG
+int sched_autogroup_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct task_struct *p, *t;
+ int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+ if (ret || !write)
+ return ret;
+
+ /*
+ * Exclude cgroup, task group and task create/destroy
+ * during global classification.
+ */
+ cgroup_lock();
+ spin_lock(&task_group_lock);
+ read_lock(&tasklist_lock);
+
+ do_each_thread(p, t) {
+ sched_move_task(t);
+ } while_each_thread(p, t);
+
+ read_unlock(&tasklist_lock);
+ spin_unlock(&task_group_lock);
+ cgroup_unlock();
+
+ return 0;
+}
+#endif
+
+static int __init setup_autogroup(char *str)
+{
+ sysctl_sched_autogroup_enabled = 0;
+
+ return 1;
+}
+
+__setup("noautogroup", setup_autogroup);
+#endif
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -382,6 +382,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SCHED_AUTOGROUP_DEBUG
+ {
+ .procname = "sched_autogroup_enabled",
+ .data = &sysctl_sched_autogroup_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_autogroup_handler,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -728,6 +728,26 @@ config NET_NS

endif # NAMESPACES

+config SCHED_AUTOGROUP
+ bool "Automatic process group scheduling"
+ select CGROUPS
+ select CGROUP_SCHED
+ select FAIR_GROUP_SCHED
+ help
+ This option optimizes the scheduler for common desktop workloads by
+ automatically creating and populating task groups. This separation
+ of workloads isolates aggressive CPU burners (like build jobs) from
+ desktop applications. Task group autogeneration is currently based
+ upon task tty association.
+
+config SCHED_AUTOGROUP_DEBUG
+ bool "Enable Autogroup debugging"
+ depends on SCHED_AUTOGROUP
+ default n
+ help
+ This option allows the user to enable/disable autogroup on the fly
+ via echo [10] > /proc/sys/kernel/sched_autogroup_enabled.
+
config MM_OWNER
bool

Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt
+++ linux-2.6/Documentation/kernel-parameters.txt
@@ -1622,6 +1622,8 @@ and is between 256 and 4096 characters.
noapic [SMP,APIC] Tells the kernel to not make use of any
IOAPICs that may be present in the system.

+ noautogroup Disable scheduler automatic task group creation.
+
nobats [PPC] Do not use BATs for mapping kernel lowmem
on "Classic" PPC cores.



2010-11-14 19:29:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sun, Nov 14, 2010 at 10:10 AM, Mike Galbraith <[email protected]> wrote:
>
> Oh well, yesterday ping (one of my sister's laptop-loving cats) opened
> _187_ screenshots while I was away from the keyboard. I'll blame any
> spelling errors on him too while I'm at it :)
>
> Here's a fresh copy, no cats in sight.

This catless version looks very good to me. Assuming testing and
approval by the scheduler people, this has an enthusiastic "ack" from
me.

Linus

2010-11-14 20:21:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sun, Nov 14, 2010 at 11:28 AM, Linus Torvalds
<[email protected]> wrote:
>
> This catless version looks very good to me

Oh, btw, maybe one comment: is CONFIG_SCHED_AUTOGROUP_DEBUG even worth
having as a config option? The only thing it enables is the small
/proc interface, and I don't see any downside to just always having
that if you have AUTOGROUP enabled in the first place.

It isn't like it's some traditional kernel debug feature that makes
things slower or adds tons of debug output.

Linus

2010-11-14 20:27:40

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 2010.11.14 at 12:20 -0800, Linus Torvalds wrote:
> On Sun, Nov 14, 2010 at 11:28 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > This catless version looks very good to me
>
> Oh, btw, maybe one comment: is CONFIG_SCHED_AUTOGROUP_DEBUG even worth
> having as a config option? The only thing it enables is the small
> /proc interface, and I don't see any downside to just always having
> that if you have AUTOGROUP enabled in the first place.
>
> It isn't like it's some traditional kernel debug feature that makes
> things slower or adds tons of debug output.

It also enables the sched_autogroup_handler, which you disliked seeing in
the previous version.
--
Markus

2010-11-14 20:49:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sun, Nov 14, 2010 at 12:27 PM, Markus Trippelsdorf
<[email protected]> wrote:
> On 2010.11.14 at 12:20 -0800, Linus Torvalds wrote:
>>
>> It isn't like it's some traditional kernel debug feature that makes
>> things slower or adds tons of debug output.
>
> It also enables the sched_autogroup_handler, which you disliked seeing in
> the previous version.

Yes, but that one exists only to make things "exact". I don't really
see why it exists. What's the point of doing the task movement, if the
group information is just going to be ignored anyway?

IOW, my dislike of the sched_autogroup_handler is not about the debug
option per se, it's about the same thing I said about tty hangups etc:
why do we care so deeply?

I think it would be perfectly acceptably to make the "enable" bit just
decide whether or not to take that autogroup information into account
for scheduling decision. Why do we care so deeply about having to move
the groups around right _then_?

Maybe there is some really deep reason for actually having to do the
reclassification and the sched_move_task() thing for each thread
synchronously when disabling/enabling that thing. If so, I'm just not
seeing it. Why can't we just let things be, and next time things get
scheduled they'll move on their own?

So my objection was really about the same "why do we have to try so
hard to keep the autogroups so 1:1 with the tty's"? It's just a
heuristic, trying to be exact about it seems to miss the point.

Linus

2010-11-14 23:43:29

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sun, 2010-11-14 at 12:48 -0800, Linus Torvalds wrote:
> On Sun, Nov 14, 2010 at 12:27 PM, Markus Trippelsdorf
> <[email protected]> wrote:
> > On 2010.11.14 at 12:20 -0800, Linus Torvalds wrote:
> >>
> >> It isn't like it's some traditional kernel debug feature that makes
> >> things slower or adds tons of debug output.
> >
> > It also enables the sched_autogroup_handler, that you disliked seeing in
> > the previous version.
>
> Yes, but that one exists only to make things "exact". I don't really
> see why it exists. What's the point of doing the task movement, if the
> group information is just going to be ignored anyway?

Not only. Pinned tasks would stay in their autogroup until somebody
moved them to a cgroup. Them wandering back over time would be fine,
and all but pinned tasks will.
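
(Aside, for readers following along: "pinned" means bound to a CPU via
the affinity mask. A minimal illustration, not part of the patch:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(3, &set);	/* same effect as taskset -c 3 */
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");
	/* a task pinned like this never migrates, so any lazy
	 * "reclassify when the task moves" scheme would miss it */
	return 0;
}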

-Mike

2010-11-15 00:02:35

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sun, 2010-11-14 at 12:20 -0800, Linus Torvalds wrote:
> On Sun, Nov 14, 2010 at 11:28 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > This catless version looks very good to me
>
> Oh, btw, maybe one comment: is CONFIG_SCHED_AUTOGROUP_DEBUG even worth
> having as a config option? The only thing it enables is the small
> /proc interface, and I don't see any downside to just always having
> that if you have AUTOGROUP enabled in the first place.

If the on/off is available, you can silently strand pinned tasks forever
in an autogroup. So, I tied the on/off switch to the global move so the
user won't be surprised.

-Mike

2010-11-15 00:15:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sun, Nov 14, 2010 at 3:43 PM, Mike Galbraith <[email protected]> wrote:
>
> Not only. Pinned tasks would stay in their autogroup until somebody
> moved them to a cgroup. Them wandering back over time would be fine,
> and all but pinned tasks will.

But why is that a problem?

That's kind of my point. Why do we care about some special case that
(a) likely doesn't happen and (b) even if it happens, what's the
problem with it happening?

Let me put it this way: the autogroup thing modifies the scheduler in
some subtle ways, but it doesn't make any _semantic_ difference. And
it's not like enabling/disabling autogroup scheduling is something
critical to begin with. It's a heuristic.

THAT is why I think it's so silly to try to be so strict and walk over
all processes while holding a couple of spinlocks.

Ok, so you disable autogroups after having used them, _and_ having
done some really specific things, and some processes that used to be
in a group may end up still scheduling in that group. Why care? It's
not like random people can enable/disable autogroup scheduling and try
to use this as a way to get unfair scheduling.

In fact, I'd go as far as saying that for the tty-based autogroup
scheduling, if you want the autogroup policy to take effect, it would
be perfectly ok if it really only affected the actual group
_allocation_. So when you turn it on, old tty associations that
existed before autogroup being turned on would not actually be in
their own groups at all. And when you turn it off, existing tty groups
would continue to exist, and you'd actually have to open a new window
to get processes to no longer be part of any autogroup behavior.

See what I'm saying? I still think you're bending over backwards to be
"careful" in ways that don't seem to make much sense to me.

Now, if there is some really fundamental reason why processes _have_
to be disassociated with the group when the autogroup policy changes,
that would be a different issue. If the scheduler oopses when it hits
a process that was in a tty autogroup but autogrouping has been turned
off, _that_ would be a reason to say "recalculate everything when you
disable/enable policy groups". But I don't think that's the case.

Linus

2010-11-15 00:27:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sun, Nov 14, 2010 at 4:15 PM, Linus Torvalds
<[email protected]> wrote:
>
> THAT is why I think it's so silly to try to be so strict and walk over
> all processes while holding a couple of spinlocks.

Btw, let me say that I think the patch is great even with that thing
in. It looks clean, the thing I'm complaining about is not a big deal,
and it seems to perform very much as advertized. The difference with
autogroup scheduling is very noticeable with a simple "make -j64"
kernel compile.

So I really don't think it's a big deal. The sysctl handler isn't even
complicated. But boy does it hurt my eyes to see a spinlock held
around a "do_each_thread()". And I do get the feeling that the
simplest way to fix it would be to just remove the code entirely, and
just say that "enabling/disabling may be delayed for old processes
with existing autogroups".

Linus

2010-11-15 01:13:18

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sun, 2010-11-14 at 16:26 -0800, Linus Torvalds wrote:
> On Sun, Nov 14, 2010 at 4:15 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > THAT is why I think it's so silly to try to be so strict and walk over
> > all processes while holding a couple of spinlocks.
>
> Btw, let me say that I think the patch is great even with that thing
> in. It looks clean, the thing I'm complaining about is not a big deal,
> and it seems to perform very much as advertized. The difference with
> autogroup scheduling is very noticeable with a simple "make -j64"
> kernel compile.
>
> So I really don't think it's a big deal. The sysctl handler isn't even
> complicated. But boy does it hurt my eyes to see a spinlock held
> around a "do_each_thread()". And I do get the feeling that the
> simplest way to fix it would be to just remove the code entirely, and
> just say that "enabling/disabling may be delayed for old processes
> with existing autogroups".

Which is what I just did. If the oddball case isn't a big deal, the
patch shrinks, which is a good thing. I just wanted to cover all bases.

Patchlet with handler whacked:

A recurring complaint from CFS users is that parallel kbuild has a negative
impact on desktop interactivity. This patch implements an idea from Linus,
to automatically create task groups. This patch only implements Linus' per
tty task group suggestion, and only for fair class tasks, but leaves the way
open for enhancement.

Implementation: each task's signal struct contains an inherited pointer to a
refcounted autogroup struct containing a task group pointer, the default for
all tasks pointing to the init_task_group. When a task calls __proc_set_tty(),
the process wide reference to the default group is dropped, a new task group is
created, and the process is moved into the new task group. Children thereafter
inherit this task group, and increase its refcount. On exit, a reference to the
current task group is dropped when the last reference to each signal struct is
dropped. The task group is destroyed when the last signal struct referencing
it is freed. At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.

The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is
selected, but can be disabled via the boot option noautogroup, and can
also be turned on/off on the fly via..
echo [01] > /proc/sys/kernel/sched_autogroup_enabled.
..which will automatically move tasks to/from the root task group.

Some numbers.

A 100% hog overhead measurement proggy pinned to the same CPU as a make -j10

About measurement proggy:
pert/sec = perturbations/sec
min/max/avg = scheduler service latencies in usecs
sum/s = time accrued by the competition per sample period (1 sec here)
overhead = %CPU received by the competition per sample period

pert/s: 31 >40475.37us: 3 min: 0.37 max:48103.60 avg:29573.74 sum/s:916786us overhead:90.24%
pert/s: 23 >41237.70us: 12 min: 0.36 max:56010.39 avg:40187.01 sum/s:924301us overhead:91.99%
pert/s: 24 >42150.22us: 12 min: 8.86 max:61265.91 avg:39459.91 sum/s:947038us overhead:92.20%
pert/s: 26 >42344.91us: 11 min: 3.83 max:52029.60 avg:36164.70 sum/s:940282us overhead:91.12%
pert/s: 24 >44262.90us: 14 min: 5.05 max:82735.15 avg:40314.33 sum/s:967544us overhead:92.22%

Same load with this patch applied.

pert/s: 229 >5484.43us: 41 min: 0.15 max:12069.42 avg:2193.81 sum/s:502382us overhead:50.24%
pert/s: 222 >5652.28us: 43 min: 0.46 max:12077.31 avg:2248.56 sum/s:499181us overhead:49.92%
pert/s: 211 >5809.38us: 43 min: 0.16 max:12064.78 avg:2381.70 sum/s:502538us overhead:50.25%
pert/s: 223 >6147.92us: 43 min: 0.15 max:16107.46 avg:2282.17 sum/s:508925us overhead:50.49%
pert/s: 218 >6252.64us: 43 min: 0.16 max:12066.13 avg:2324.11 sum/s:506656us overhead:50.27%

Average service latency is an order of magnitude better with autogroup.
(Imagine that pert were Xorg or whatnot instead)

Using Mathieu Desnoyers' wakeup-latency testcase:

With taskset -c 3 make -j 10 running..

taskset -c 3 ./wakeup-latency& sleep 30;killall wakeup-latency

without:
maximum latency: 42963.2 µs
average latency: 9077.0 µs
missed timer events: 0

with:
maximum latency: 4160.7 µs
average latency: 149.4 µs
missed timer events: 0

Signed-off-by: Mike Galbraith <[email protected]>

---
Documentation/kernel-parameters.txt | 2
drivers/tty/tty_io.c | 1
include/linux/sched.h | 19 ++++
init/Kconfig | 12 +++
kernel/fork.c | 5 +
kernel/sched.c | 25 ++++--
kernel/sched_autogroup.c | 140 ++++++++++++++++++++++++++++++++++++
kernel/sched_autogroup.h | 18 ++++
kernel/sysctl.c | 11 ++
9 files changed, 224 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -509,6 +509,8 @@ struct thread_group_cputimer {
spinlock_t lock;
};

+struct autogroup;
+
/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
@@ -576,6 +578,9 @@ struct signal_struct {

struct tty_struct *tty; /* NULL if no tty */

+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
/*
* Cumulative resource counters for dead threads in the group,
* and for reaped dead child processes forked by this group.
@@ -1931,6 +1936,20 @@ int sched_rt_handler(struct ctl_table *t

extern unsigned int sysctl_sched_compat_yield;

+#ifdef CONFIG_SCHED_AUTOGROUP
+extern unsigned int sysctl_sched_autogroup_enabled;
+
+extern void sched_autogroup_create_attach(struct task_struct *p);
+extern void sched_autogroup_detach(struct task_struct *p);
+extern void sched_autogroup_fork(struct signal_struct *sig);
+extern void sched_autogroup_exit(struct signal_struct *sig);
+#else
+static inline void sched_autogroup_create_attach(struct task_struct *p) { }
+static inline void sched_autogroup_detach(struct task_struct *p) { }
+static inline void sched_autogroup_fork(struct signal_struct *sig) { }
+static inline void sched_autogroup_exit(struct signal_struct *sig) { }
+#endif
+
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
extern void rt_mutex_setprio(struct task_struct *p, int prio);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -78,6 +78,7 @@

#include "sched_cpupri.h"
#include "workqueue_sched.h"
+#include "sched_autogroup.h"

#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
@@ -605,11 +606,14 @@ static inline int cpu_of(struct rq *rq)
*/
static inline struct task_group *task_group(struct task_struct *p)
{
+ struct task_group *tg;
struct cgroup_subsys_state *css;

css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
- return container_of(css, struct task_group, css);
+ tg = container_of(css, struct task_group, css);
+
+ return autogroup_task_group(p, tg);
}

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -2006,6 +2010,7 @@ static void sched_irq_time_avg_update(st
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
+#include "sched_autogroup.c"
#include "sched_stoptask.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
@@ -7979,7 +7984,7 @@ void __init sched_init(void)
#ifdef CONFIG_CGROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
-
+ autogroup_init(&init_task);
#endif /* CONFIG_CGROUP_SCHED */

#if defined CONFIG_FAIR_GROUP_SCHED && defined CONFIG_SMP
@@ -8509,15 +8514,11 @@ void sched_destroy_group(struct task_gro
/* change task's runqueue when it moves between groups.
* The caller of this function should have put the task in its new group
* by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
- * reflect its new group.
+ * reflect its new group. Called with the runqueue lock held.
*/
-void sched_move_task(struct task_struct *tsk)
+void __sched_move_task(struct task_struct *tsk, struct rq *rq)
{
int on_rq, running;
- unsigned long flags;
- struct rq *rq;
-
- rq = task_rq_lock(tsk, &flags);

running = task_current(rq, tsk);
on_rq = tsk->se.on_rq;
@@ -8538,7 +8539,15 @@ void sched_move_task(struct task_struct
tsk->sched_class->set_curr_task(rq);
if (on_rq)
enqueue_task(rq, tsk, 0);
+}

+void sched_move_task(struct task_struct *tsk)
+{
+ struct rq *rq;
+ unsigned long flags;
+
+ rq = task_rq_lock(tsk, &flags);
+ __sched_move_task(tsk, rq);
task_rq_unlock(rq, &flags);
}
#endif /* CONFIG_CGROUP_SCHED */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -174,8 +174,10 @@ static inline void free_signal_struct(st

static inline void put_signal_struct(struct signal_struct *sig)
{
- if (atomic_dec_and_test(&sig->sigcnt))
+ if (atomic_dec_and_test(&sig->sigcnt)) {
+ sched_autogroup_exit(sig);
free_signal_struct(sig);
+ }
}

void __put_task_struct(struct task_struct *tsk)
@@ -904,6 +906,7 @@ static int copy_signal(unsigned long clo
posix_cpu_timers_init_group(sig);

tty_audit_fork(sig);
+ sched_autogroup_fork(sig);

sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
Index: linux-2.6/drivers/tty/tty_io.c
===================================================================
--- linux-2.6.orig/drivers/tty/tty_io.c
+++ linux-2.6/drivers/tty/tty_io.c
@@ -3160,6 +3160,7 @@ static void __proc_set_tty(struct task_s
put_pid(tsk->signal->tty_old_pgrp);
tsk->signal->tty = tty_kref_get(tty);
tsk->signal->tty_old_pgrp = NULL;
+ sched_autogroup_create_attach(tsk);
}

static void proc_set_tty(struct task_struct *tsk, struct tty_struct *tty)
Index: linux-2.6/kernel/sched_autogroup.h
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_autogroup.h
@@ -0,0 +1,18 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+static void __sched_move_task(struct task_struct *tsk, struct rq *rq);
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg);
+
+#else /* !CONFIG_SCHED_AUTOGROUP */
+
+static inline void autogroup_init(struct task_struct *init_task) { }
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ return tg;
+}
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
Index: linux-2.6/kernel/sched_autogroup.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_autogroup.c
@@ -0,0 +1,140 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+unsigned int __read_mostly sysctl_sched_autogroup_enabled = 1;
+
+struct autogroup {
+ struct kref kref;
+ struct task_group *tg;
+};
+
+static struct autogroup autogroup_default;
+
+static void autogroup_init(struct task_struct *init_task)
+{
+ autogroup_default.tg = &init_task_group;
+ kref_init(&autogroup_default.kref);
+ init_task->signal->autogroup = &autogroup_default;
+}
+
+static inline void autogroup_destroy(struct kref *kref)
+{
+ struct autogroup *ag = container_of(kref, struct autogroup, kref);
+ struct task_group *tg = ag->tg;
+
+ kfree(ag);
+ sched_destroy_group(tg);
+}
+
+static inline void autogroup_kref_put(struct autogroup *ag)
+{
+ kref_put(&ag->kref, autogroup_destroy);
+}
+
+static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
+{
+ kref_get(&ag->kref);
+ return ag;
+}
+
+static inline struct autogroup *autogroup_create(void)
+{
+ struct autogroup *ag = kmalloc(sizeof(*ag), GFP_KERNEL);
+
+ if (!ag)
+ goto out_fail;
+
+ ag->tg = sched_create_group(&init_task_group);
+ kref_init(&ag->kref);
+
+ if (!(IS_ERR(ag->tg)))
+ return ag;
+
+out_fail:
+ if (ag) {
+ kfree(ag);
+ WARN_ON(1);
+ } else
+ WARN_ON(1);
+
+ return autogroup_kref_get(&autogroup_default);
+}
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled);
+
+ enabled &= (tg == &root_task_group);
+ enabled &= (p->sched_class == &fair_sched_class);
+ enabled &= (!(p->flags & PF_EXITING));
+
+ if (enabled)
+ return p->signal->autogroup->tg;
+
+ return tg;
+}
+
+static void
+autogroup_move_group(struct task_struct *p, struct autogroup *ag)
+{
+ struct autogroup *prev;
+ struct task_struct *t;
+ struct rq *rq;
+ unsigned long flags;
+
+ rq = task_rq_lock(p, &flags);
+ prev = p->signal->autogroup;
+ if (prev == ag) {
+ task_rq_unlock(rq, &flags);
+ return;
+ }
+
+ p->signal->autogroup = autogroup_kref_get(ag);
+ __sched_move_task(p, rq);
+ task_rq_unlock(rq, &flags);
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(t, &p->thread_group, thread_group) {
+ sched_move_task(t);
+ }
+ rcu_read_unlock();
+
+ autogroup_kref_put(prev);
+}
+
+void sched_autogroup_create_attach(struct task_struct *p)
+{
+ struct autogroup *ag = autogroup_create();
+
+ autogroup_move_group(p, ag);
+ /* drop extra reference added by autogroup_create() */
+ autogroup_kref_put(ag);
+}
+EXPORT_SYMBOL(sched_autogroup_create_attach);
+
+/* currently has no users */
+void sched_autogroup_detach(struct task_struct *p)
+{
+ autogroup_move_group(p, &autogroup_default);
+}
+EXPORT_SYMBOL(sched_autogroup_detach);
+
+void sched_autogroup_fork(struct signal_struct *sig)
+{
+ sig->autogroup = autogroup_kref_get(current->signal->autogroup);
+}
+
+void sched_autogroup_exit(struct signal_struct *sig)
+{
+ autogroup_kref_put(sig->autogroup);
+}
+
+static int __init setup_autogroup(char *str)
+{
+ sysctl_sched_autogroup_enabled = 0;
+
+ return 1;
+}
+
+__setup("noautogroup", setup_autogroup);
+#endif
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -382,6 +382,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SCHED_AUTOGROUP
+ {
+ .procname = "sched_autogroup_enabled",
+ .data = &sysctl_sched_autogroup_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -728,6 +728,18 @@ config NET_NS

endif # NAMESPACES

+config SCHED_AUTOGROUP
+ bool "Automatic process group scheduling"
+ select CGROUPS
+ select CGROUP_SCHED
+ select FAIR_GROUP_SCHED
+ help
+ This option optimizes the scheduler for common desktop workloads by
+ automatically creating and populating task groups. This separation
+ of workloads isolates aggressive CPU burners (like build jobs) from
+ desktop applications. Task group autogeneration is currently based
+ upon task tty association.
+
config MM_OWNER
bool

Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt
+++ linux-2.6/Documentation/kernel-parameters.txt
@@ -1622,6 +1622,8 @@ and is between 256 and 4096 characters.
noapic [SMP,APIC] Tells the kernel to not make use of any
IOAPICs that may be present in the system.

+ noautogroup Disable scheduler automatic task group creation.
+
nobats [PPC] Do not use BATs for mapping kernel lowmem
on "Classic" PPC cores.


2010-11-15 03:13:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sun, Nov 14, 2010 at 5:13 PM, Mike Galbraith <[email protected]> wrote:
>
> Which is what I just did. If the oddball case isn't a big deal, the
> patch shrinks, which is a good thing. I just wanted to cover all bases.

Yeah. And I have to say that I'm (very happily) surprised by just how
small that patch really ends up being, and how it's not intrusive or
ugly either.

I'm also very happy with just what it does to interactive performance.
Admittedly, my "testcase" is really trivial (reading email in a
web-browser, scrolling around a bit, while doing a "make -j64" on the
kernel at the same time), but it's a test-case that is very relevant
for me. And it is a _huge_ improvement.

It's an improvement for things like smooth scrolling around, but what
I found more interesting was how it seems to really make web pages
load a lot faster. Maybe it shouldn't have been surprising, but I
always associated that with network performance. But there's clearly
enough of a CPU load when loading a new web page that if you have a
load average of 50+ at the same time, you _will_ be starved for CPU in
the loading process, and probably won't get all the http requests out
quickly enough.

So I think this is firmly one of those "real improvement" patches.
Good job. Group scheduling goes from "useful for some specific server
loads" to "that's a killer feature".

Linus

2010-11-15 08:57:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sun, 2010-11-14 at 18:13 -0700, Mike Galbraith wrote:
> +static inline struct task_group *
> +autogroup_task_group(struct task_struct *p, struct task_group *tg)
> +{
> + int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled);
> +
> + enabled &= (tg == &root_task_group);
> + enabled &= (p->sched_class == &fair_sched_class);
> + enabled &= (!(p->flags & PF_EXITING));
> +
> + if (enabled)
> + return p->signal->autogroup->tg;
> +
> + return tg;
> +}


That made me go wtf.. curious way of writing things.

I guess the usual way of writing that (which is admittedly a little more
verbose) would be something like:

static bool
task_wants_autogroup(struct task_struct *tsk, struct task_group *tg)
{
if (tg != &root_task_group)
return false;

if (tsk->sched_class != &fair_sched_class)
return false;

if (tsk->flags & PF_EXITING)
return false;

return true;
}

static inline struct task_group *
autogroup_task_group(struct task_struct *tsk, struct task_group *tg)
{
if (task_wants_autogroup(tsk, tg))
return tsk->signal->autogroup->tg;

return tg;
}

2010-11-15 11:33:06

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, 2010-11-15 at 09:57 +0100, Peter Zijlstra wrote:
> On Sun, 2010-11-14 at 18:13 -0700, Mike Galbraith wrote:
> > +static inline struct task_group *
> > +autogroup_task_group(struct task_struct *p, struct task_group *tg)
> > +{
> > + int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled);
> > +
> > + enabled &= (tg == &root_task_group);
> > + enabled &= (p->sched_class == &fair_sched_class);
> > + enabled &= (!(p->flags & PF_EXITING));
> > +
> > + if (enabled)
> > + return p->signal->autogroup->tg;
> > +
> > + return tg;
> > +}
>
>
> That made me go wtf.. curious way of writing things.
>
> I guess the usual way of writing that (which is admittedly a little more
> verbose) would be something like:
>
> static bool
> task_wants_autogroup(struct task_struct *tsk, struct task_group *tg)
> {
> if (tg != &root_task_group)
> return false;
>
> if (tsk->sched_class != &fair_sched_class)
> return false;
>
> if (tsk->flags & PF_EXITING)
> return false;
>
> return true;
> }
>
> static inline struct task_group *
> autogroup_task_group(struct task_struct *tsk, struct task_group *tg)
> {
> if (task_wants_autogroup(tsk, tg))
> return tsk->signal->autogroup->tg;
>
> return tg;
> }


That is a bit easier on the eye, modulo my hysterical attachment to "p".

Ok, so it's hopefully now ready to either take wing or go splat.

Baked:

A recurring complaint from CFS users is that parallel kbuild has a negative
impact on desktop interactivity. This patch implements an idea from Linus,
to automatically create task groups. This patch only implements Linus' per
tty task group suggestion, and only for fair class tasks, but leaves the way
open for enhancement.

Implementation: each task's signal struct contains an inherited pointer to a
refcounted autogroup struct containing a task group pointer, the default for
all tasks pointing to the init_task_group. When a task calls __proc_set_tty(),
the process wide reference to the default group is dropped, a new task group is
created, and the process is moved into the new task group. Children thereafter
inherit this task group, and increase its refcount. On exit, a reference to the
current task group is dropped when the last reference to each signal struct is
dropped. The task group is destroyed when the last signal struct referencing
it is freed. At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.

The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is
selected, but can be disabled via the boot option noautogroup, and can
also be turned on/off on the fly via..
echo [01] > /proc/sys/kernel/sched_autogroup_enabled.
..which will automatically move tasks to/from the root task group.

Some numbers.

A 100% hog overhead measurement proggy pinned to the same CPU as a make -j10

About measurement proggy:
pert/sec = perturbations/sec
min/max/avg = scheduler service latencies in usecs
sum/s = time accrued by the competition per sample period (1 sec here)
overhead = %CPU received by the competition per sample period

pert/s: 31 >40475.37us: 3 min: 0.37 max:48103.60 avg:29573.74 sum/s:916786us overhead:90.24%
pert/s: 23 >41237.70us: 12 min: 0.36 max:56010.39 avg:40187.01 sum/s:924301us overhead:91.99%
pert/s: 24 >42150.22us: 12 min: 8.86 max:61265.91 avg:39459.91 sum/s:947038us overhead:92.20%
pert/s: 26 >42344.91us: 11 min: 3.83 max:52029.60 avg:36164.70 sum/s:940282us overhead:91.12%
pert/s: 24 >44262.90us: 14 min: 5.05 max:82735.15 avg:40314.33 sum/s:967544us overhead:92.22%

Same load with this patch applied.

pert/s: 229 >5484.43us: 41 min: 0.15 max:12069.42 avg:2193.81 sum/s:502382us overhead:50.24%
pert/s: 222 >5652.28us: 43 min: 0.46 max:12077.31 avg:2248.56 sum/s:499181us overhead:49.92%
pert/s: 211 >5809.38us: 43 min: 0.16 max:12064.78 avg:2381.70 sum/s:502538us overhead:50.25%
pert/s: 223 >6147.92us: 43 min: 0.15 max:16107.46 avg:2282.17 sum/s:508925us overhead:50.49%
pert/s: 218 >6252.64us: 43 min: 0.16 max:12066.13 avg:2324.11 sum/s:506656us overhead:50.27%

Average service latency is an order of magnitude better with autogroup.
(Imagine that pert were Xorg or whatnot instead)

Using Mathieu Desnoyers' wakeup-latency testcase:

With taskset -c 3 make -j 10 running..

taskset -c 3 ./wakeup-latency& sleep 30;killall wakeup-latency

without:
maximum latency: 42963.2 µs
average latency: 9077.0 µs
missed timer events: 0

with:
maximum latency: 4160.7 µs
average latency: 149.4 µs
missed timer events: 0

Signed-off-by: Mike Galbraith <[email protected]>

---
Documentation/kernel-parameters.txt | 2
drivers/tty/tty_io.c | 1
include/linux/sched.h | 19 ++++
init/Kconfig | 12 +++
kernel/fork.c | 5 +
kernel/sched.c | 25 ++++--
kernel/sched_autogroup.c | 140 ++++++++++++++++++++++++++++++++++++
kernel/sched_autogroup.h | 18 ++++
kernel/sysctl.c | 11 ++
9 files changed, 224 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -509,6 +509,8 @@ struct thread_group_cputimer {
spinlock_t lock;
};

+struct autogroup;
+
/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
@@ -576,6 +578,9 @@ struct signal_struct {

struct tty_struct *tty; /* NULL if no tty */

+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
/*
* Cumulative resource counters for dead threads in the group,
* and for reaped dead child processes forked by this group.
@@ -1931,6 +1936,20 @@ int sched_rt_handler(struct ctl_table *t

extern unsigned int sysctl_sched_compat_yield;

+#ifdef CONFIG_SCHED_AUTOGROUP
+extern unsigned int sysctl_sched_autogroup_enabled;
+
+extern void sched_autogroup_create_attach(struct task_struct *p);
+extern void sched_autogroup_detach(struct task_struct *p);
+extern void sched_autogroup_fork(struct signal_struct *sig);
+extern void sched_autogroup_exit(struct signal_struct *sig);
+#else
+static inline void sched_autogroup_create_attach(struct task_struct *p) { }
+static inline void sched_autogroup_detach(struct task_struct *p) { }
+static inline void sched_autogroup_fork(struct signal_struct *sig) { }
+static inline void sched_autogroup_exit(struct signal_struct *sig) { }
+#endif
+
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
extern void rt_mutex_setprio(struct task_struct *p, int prio);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -78,6 +78,7 @@

#include "sched_cpupri.h"
#include "workqueue_sched.h"
+#include "sched_autogroup.h"

#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
@@ -605,11 +606,14 @@ static inline int cpu_of(struct rq *rq)
*/
static inline struct task_group *task_group(struct task_struct *p)
{
+ struct task_group *tg;
struct cgroup_subsys_state *css;

css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
- return container_of(css, struct task_group, css);
+ tg = container_of(css, struct task_group, css);
+
+ return autogroup_task_group(p, tg);
}

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -2006,6 +2010,7 @@ static void sched_irq_time_avg_update(st
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
+#include "sched_autogroup.c"
#include "sched_stoptask.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
@@ -7979,7 +7984,7 @@ void __init sched_init(void)
#ifdef CONFIG_CGROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
-
+ autogroup_init(&init_task);
#endif /* CONFIG_CGROUP_SCHED */

#if defined CONFIG_FAIR_GROUP_SCHED && defined CONFIG_SMP
@@ -8509,15 +8514,11 @@ void sched_destroy_group(struct task_gro
/* change task's runqueue when it moves between groups.
* The caller of this function should have put the task in its new group
* by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
- * reflect its new group.
+ * reflect its new group. Called with the runqueue lock held.
*/
-void sched_move_task(struct task_struct *tsk)
+void __sched_move_task(struct task_struct *tsk, struct rq *rq)
{
int on_rq, running;
- unsigned long flags;
- struct rq *rq;
-
- rq = task_rq_lock(tsk, &flags);

running = task_current(rq, tsk);
on_rq = tsk->se.on_rq;
@@ -8538,7 +8539,15 @@ void sched_move_task(struct task_struct
tsk->sched_class->set_curr_task(rq);
if (on_rq)
enqueue_task(rq, tsk, 0);
+}

+void sched_move_task(struct task_struct *tsk)
+{
+ struct rq *rq;
+ unsigned long flags;
+
+ rq = task_rq_lock(tsk, &flags);
+ __sched_move_task(tsk, rq);
task_rq_unlock(rq, &flags);
}
#endif /* CONFIG_CGROUP_SCHED */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -174,8 +174,10 @@ static inline void free_signal_struct(st

static inline void put_signal_struct(struct signal_struct *sig)
{
- if (atomic_dec_and_test(&sig->sigcnt))
+ if (atomic_dec_and_test(&sig->sigcnt)) {
+ sched_autogroup_exit(sig);
free_signal_struct(sig);
+ }
}

void __put_task_struct(struct task_struct *tsk)
@@ -904,6 +906,7 @@ static int copy_signal(unsigned long clo
posix_cpu_timers_init_group(sig);

tty_audit_fork(sig);
+ sched_autogroup_fork(sig);

sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
Index: linux-2.6/drivers/tty/tty_io.c
===================================================================
--- linux-2.6.orig/drivers/tty/tty_io.c
+++ linux-2.6/drivers/tty/tty_io.c
@@ -3160,6 +3160,7 @@ static void __proc_set_tty(struct task_s
put_pid(tsk->signal->tty_old_pgrp);
tsk->signal->tty = tty_kref_get(tty);
tsk->signal->tty_old_pgrp = NULL;
+ sched_autogroup_create_attach(tsk);
}

static void proc_set_tty(struct task_struct *tsk, struct tty_struct *tty)
Index: linux-2.6/kernel/sched_autogroup.h
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_autogroup.h
@@ -0,0 +1,18 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+static void __sched_move_task(struct task_struct *tsk, struct rq *rq);
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg);
+
+#else /* !CONFIG_SCHED_AUTOGROUP */
+
+static inline void autogroup_init(struct task_struct *init_task) { }
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ return tg;
+}
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
Index: linux-2.6/kernel/sched_autogroup.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_autogroup.c
@@ -0,0 +1,140 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+unsigned int __read_mostly sysctl_sched_autogroup_enabled = 1;
+
+struct autogroup {
+ struct kref kref;
+ struct task_group *tg;
+};
+
+static struct autogroup autogroup_default;
+
+static void autogroup_init(struct task_struct *init_task)
+{
+ autogroup_default.tg = &init_task_group;
+ kref_init(&autogroup_default.kref);
+ init_task->signal->autogroup = &autogroup_default;
+}
+
+static inline void autogroup_destroy(struct kref *kref)
+{
+ struct autogroup *ag = container_of(kref, struct autogroup, kref);
+ struct task_group *tg = ag->tg;
+
+ kfree(ag);
+ sched_destroy_group(tg);
+}
+
+static inline void autogroup_kref_put(struct autogroup *ag)
+{
+ kref_put(&ag->kref, autogroup_destroy);
+}
+
+static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
+{
+ kref_get(&ag->kref);
+ return ag;
+}
+
+static inline struct autogroup *autogroup_create(void)
+{
+ struct autogroup *ag = kmalloc(sizeof(*ag), GFP_KERNEL);
+
+ if (!ag)
+ goto out_fail;
+
+ ag->tg = sched_create_group(&init_task_group);
+ kref_init(&ag->kref);
+
+ if (!(IS_ERR(ag->tg)))
+ return ag;
+
+out_fail:
+ if (ag) {
+ kfree(ag);
+ WARN_ON(1);
+ } else
+ WARN_ON(1);
+
+ return autogroup_kref_get(&autogroup_default);
+}
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled);
+
+ enabled &= (tg == &root_task_group);
+ enabled &= (p->sched_class == &fair_sched_class);
+ enabled &= (!(p->flags & PF_EXITING));
+
+ if (enabled)
+ return p->signal->autogroup->tg;
+
+ return tg;
+}
+
+static void
+autogroup_move_group(struct task_struct *p, struct autogroup *ag)
+{
+ struct autogroup *prev;
+ struct task_struct *t;
+ struct rq *rq;
+ unsigned long flags;
+
+ rq = task_rq_lock(p, &flags);
+ prev = p->signal->autogroup;
+ if (prev == ag) {
+ task_rq_unlock(rq, &flags);
+ return;
+ }
+
+ p->signal->autogroup = autogroup_kref_get(ag);
+ __sched_move_task(p, rq);
+ task_rq_unlock(rq, &flags);
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(t, &p->thread_group, thread_group) {
+ sched_move_task(t);
+ }
+ rcu_read_unlock();
+
+ autogroup_kref_put(prev);
+}
+
+void sched_autogroup_create_attach(struct task_struct *p)
+{
+ struct autogroup *ag = autogroup_create();
+
+ autogroup_move_group(p, ag);
+ /* drop extra reference added by autogroup_create() */
+ autogroup_kref_put(ag);
+}
+EXPORT_SYMBOL(sched_autogroup_create_attach);
+
+/* currently has no users */
+void sched_autogroup_detach(struct task_struct *p)
+{
+ autogroup_move_group(p, &autogroup_default);
+}
+EXPORT_SYMBOL(sched_autogroup_detach);
+
+void sched_autogroup_fork(struct signal_struct *sig)
+{
+ sig->autogroup = autogroup_kref_get(current->signal->autogroup);
+}
+
+void sched_autogroup_exit(struct signal_struct *sig)
+{
+ autogroup_kref_put(sig->autogroup);
+}
+
+static int __init setup_autogroup(char *str)
+{
+ sysctl_sched_autogroup_enabled = 0;
+
+ return 1;
+}
+
+__setup("noautogroup", setup_autogroup);
+#endif
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -382,6 +382,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SCHED_AUTOGROUP
+ {
+ .procname = "sched_autogroup_enabled",
+ .data = &sysctl_sched_autogroup_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -728,6 +728,18 @@ config NET_NS

endif # NAMESPACES

+config SCHED_AUTOGROUP
+ bool "Automatic process group scheduling"
+ select CGROUPS
+ select CGROUP_SCHED
+ select FAIR_GROUP_SCHED
+ help
+ This option optimizes the scheduler for common desktop workloads by
+ automatically creating and populating task groups. This separation
+ of workloads isolates aggressive CPU burners (like build jobs) from
+ desktop applications. Task group autogeneration is currently based
+ upon task tty association.
+
config MM_OWNER
bool

Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt
+++ linux-2.6/Documentation/kernel-parameters.txt
@@ -1622,6 +1622,8 @@ and is between 256 and 4096 characters.
noapic [SMP,APIC] Tells the kernel to not make use of any
IOAPICs that may be present in the system.

+ noautogroup Disable scheduler automatic task group creation.
+
nobats [PPC] Do not use BATs for mapping kernel lowmem
on "Classic" PPC cores.



2010-11-15 11:46:48

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, 2010-11-15 at 04:32 -0700, Mike Galbraith wrote:
> On Mon, 2010-11-15 at 09:57 +0100, Peter Zijlstra wrote:
> > On Sun, 2010-11-14 at 18:13 -0700, Mike Galbraith wrote:
> > > +static inline struct task_group *
> > > +autogroup_task_group(struct task_struct *p, struct task_group *tg)
> > > +{
> > > + int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled);
> > > +
> > > + enabled &= (tg == &root_task_group);
> > > + enabled &= (p->sched_class == &fair_sched_class);
> > > + enabled &= (!(p->flags & PF_EXITING));
> > > +
> > > + if (enabled)
> > > + return p->signal->autogroup->tg;
> > > +
> > > + return tg;
> > > +}
> >
> >
> > That made me go wtf.. curious way of writing things.
> >
> > I guess the usual way of writing that (which is admittedly a little more
> > verbose) would be something like:
> >
> > static bool
> > task_wants_autogroup(struct task_struct *tsk, struct task_group *tg)
> > {
> > if (tg != &root_task_group)
> > return false;
> >
> > if (tsk->sched_class != &fair_sched_class)
> > return false;
> >
> > if (tsk->flags & PF_EXITING)
> > return false;
> >
> > return true;
> > }
> >
> > static inline struct task_group *
> > autogroup_task_group(struct task_struct *tsk, struct task_group *tg)
> > {
> > if (task_wants_autogroup(tsk, tg))
> > return tsk->signal->autogroup->tg;
> >
> > return tg;
> > }
>
>
> That is a bit easier on the eye, modulo my hysterical attachment to "p".
>
> Ok, so it's hopefully now ready to either take wing or go splat.
>
> Baked:

The cat forgot to quilt refresh :)

A recurring complaint from CFS users is that parallel kbuild has a negative
impact on desktop interactivity. This patch implements an idea from Linus,
to automatically create task groups. This patch only implements Linus' per
tty task group suggestion, and only for fair class tasks, but leaves the way
open for enhancement.

Implementation: each task's signal struct contains an inherited pointer to a
refcounted autogroup struct containing a task group pointer, the default for
all tasks pointing to the init_task_group. When a task calls __proc_set_tty(),
the process wide reference to the default group is dropped, a new task group is
created, and the process is moved into the new task group. Children thereafter
inherit this task group, and increase its refcount. On exit, a reference to the
current task group is dropped when the last reference to each signal struct is
dropped. The task group is destroyed when the last signal struct referencing
it is freed. At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.

The feature is enabled by default at boot if CONFIG_SCHED_AUTOGROUP is
selected, but can be disabled via the boot option noautogroup, and can also
be turned on/off on the fly via..
echo [01] > /proc/sys/kernel/sched_autogroup_enabled.
..which will automatically move tasks to/from the root task group.

Some numbers.

A 100% hog overhead measurement proggy pinned to the same CPU as a make -j10

About measurement proggy:
pert/sec = perturbations/sec
min/max/avg = scheduler service latencies in usecs
sum/s = time accrued by the competition per sample period (1 sec here)
overhead = %CPU received by the competition per sample period

pert/s: 31 >40475.37us: 3 min: 0.37 max:48103.60 avg:29573.74 sum/s:916786us overhead:90.24%
pert/s: 23 >41237.70us: 12 min: 0.36 max:56010.39 avg:40187.01 sum/s:924301us overhead:91.99%
pert/s: 24 >42150.22us: 12 min: 8.86 max:61265.91 avg:39459.91 sum/s:947038us overhead:92.20%
pert/s: 26 >42344.91us: 11 min: 3.83 max:52029.60 avg:36164.70 sum/s:940282us overhead:91.12%
pert/s: 24 >44262.90us: 14 min: 5.05 max:82735.15 avg:40314.33 sum/s:967544us overhead:92.22%

Same load with this patch applied.

pert/s: 229 >5484.43us: 41 min: 0.15 max:12069.42 avg:2193.81 sum/s:502382us overhead:50.24%
pert/s: 222 >5652.28us: 43 min: 0.46 max:12077.31 avg:2248.56 sum/s:499181us overhead:49.92%
pert/s: 211 >5809.38us: 43 min: 0.16 max:12064.78 avg:2381.70 sum/s:502538us overhead:50.25%
pert/s: 223 >6147.92us: 43 min: 0.15 max:16107.46 avg:2282.17 sum/s:508925us overhead:50.49%
pert/s: 218 >6252.64us: 43 min: 0.16 max:12066.13 avg:2324.11 sum/s:506656us overhead:50.27%

Average service latency is an order of magnitude better with autogroup.
(Imagine that pert were Xorg or whatnot instead)

Using Mathieu Desnoyers' wakeup-latency testcase:

With taskset -c 3 make -j 10 running..

taskset -c 3 ./wakeup-latency& sleep 30;killall wakeup-latency

without:
maximum latency: 42963.2 µs
average latency: 9077.0 µs
missed timer events: 0

with:
maximum latency: 4160.7 µs
average latency: 149.4 µs
missed timer events: 0

Signed-off-by: Mike Galbraith <[email protected]>

---
Documentation/kernel-parameters.txt | 2
drivers/tty/tty_io.c | 1
include/linux/sched.h | 19 ++++
init/Kconfig | 12 ++
kernel/fork.c | 5 -
kernel/sched.c | 25 ++++-
kernel/sched_autogroup.c | 151 ++++++++++++++++++++++++++++++++++++
kernel/sched_autogroup.h | 18 ++++
kernel/sysctl.c | 11 ++
9 files changed, 235 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -509,6 +509,8 @@ struct thread_group_cputimer {
spinlock_t lock;
};

+struct autogroup;
+
/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
@@ -576,6 +578,9 @@ struct signal_struct {

struct tty_struct *tty; /* NULL if no tty */

+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
/*
* Cumulative resource counters for dead threads in the group,
* and for reaped dead child processes forked by this group.
@@ -1931,6 +1936,20 @@ int sched_rt_handler(struct ctl_table *t

extern unsigned int sysctl_sched_compat_yield;

+#ifdef CONFIG_SCHED_AUTOGROUP
+extern unsigned int sysctl_sched_autogroup_enabled;
+
+extern void sched_autogroup_create_attach(struct task_struct *p);
+extern void sched_autogroup_detach(struct task_struct *p);
+extern void sched_autogroup_fork(struct signal_struct *sig);
+extern void sched_autogroup_exit(struct signal_struct *sig);
+#else
+static inline void sched_autogroup_create_attach(struct task_struct *p) { }
+static inline void sched_autogroup_detach(struct task_struct *p) { }
+static inline void sched_autogroup_fork(struct signal_struct *sig) { }
+static inline void sched_autogroup_exit(struct signal_struct *sig) { }
+#endif
+
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
extern void rt_mutex_setprio(struct task_struct *p, int prio);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -78,6 +78,7 @@

#include "sched_cpupri.h"
#include "workqueue_sched.h"
+#include "sched_autogroup.h"

#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
@@ -605,11 +606,14 @@ static inline int cpu_of(struct rq *rq)
*/
static inline struct task_group *task_group(struct task_struct *p)
{
+ struct task_group *tg;
struct cgroup_subsys_state *css;

css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
- return container_of(css, struct task_group, css);
+ tg = container_of(css, struct task_group, css);
+
+ return autogroup_task_group(p, tg);
}

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -2006,6 +2010,7 @@ static void sched_irq_time_avg_update(st
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
+#include "sched_autogroup.c"
#include "sched_stoptask.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
@@ -7979,7 +7984,7 @@ void __init sched_init(void)
#ifdef CONFIG_CGROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
-
+ autogroup_init(&init_task);
#endif /* CONFIG_CGROUP_SCHED */

#if defined CONFIG_FAIR_GROUP_SCHED && defined CONFIG_SMP
@@ -8509,15 +8514,11 @@ void sched_destroy_group(struct task_gro
/* change task's runqueue when it moves between groups.
* The caller of this function should have put the task in its new group
* by now. This function just updates tsk->se.cfs_rq and tsk->se.parent to
- * reflect its new group.
+ * reflect its new group. Called with the runqueue lock held.
*/
-void sched_move_task(struct task_struct *tsk)
+void __sched_move_task(struct task_struct *tsk, struct rq *rq)
{
int on_rq, running;
- unsigned long flags;
- struct rq *rq;
-
- rq = task_rq_lock(tsk, &flags);

running = task_current(rq, tsk);
on_rq = tsk->se.on_rq;
@@ -8538,7 +8539,15 @@ void sched_move_task(struct task_struct
tsk->sched_class->set_curr_task(rq);
if (on_rq)
enqueue_task(rq, tsk, 0);
+}

+void sched_move_task(struct task_struct *tsk)
+{
+ struct rq *rq;
+ unsigned long flags;
+
+ rq = task_rq_lock(tsk, &flags);
+ __sched_move_task(tsk, rq);
task_rq_unlock(rq, &flags);
}
#endif /* CONFIG_CGROUP_SCHED */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -174,8 +174,10 @@ static inline void free_signal_struct(st

static inline void put_signal_struct(struct signal_struct *sig)
{
- if (atomic_dec_and_test(&sig->sigcnt))
+ if (atomic_dec_and_test(&sig->sigcnt)) {
+ sched_autogroup_exit(sig);
free_signal_struct(sig);
+ }
}

void __put_task_struct(struct task_struct *tsk)
@@ -904,6 +906,7 @@ static int copy_signal(unsigned long clo
posix_cpu_timers_init_group(sig);

tty_audit_fork(sig);
+ sched_autogroup_fork(sig);

sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
Index: linux-2.6/drivers/tty/tty_io.c
===================================================================
--- linux-2.6.orig/drivers/tty/tty_io.c
+++ linux-2.6/drivers/tty/tty_io.c
@@ -3160,6 +3160,7 @@ static void __proc_set_tty(struct task_s
put_pid(tsk->signal->tty_old_pgrp);
tsk->signal->tty = tty_kref_get(tty);
tsk->signal->tty_old_pgrp = NULL;
+ sched_autogroup_create_attach(tsk);
}

static void proc_set_tty(struct task_struct *tsk, struct tty_struct *tty)
Index: linux-2.6/kernel/sched_autogroup.h
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_autogroup.h
@@ -0,0 +1,18 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+static void __sched_move_task(struct task_struct *tsk, struct rq *rq);
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg);
+
+#else /* !CONFIG_SCHED_AUTOGROUP */
+
+static inline void autogroup_init(struct task_struct *init_task) { }
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ return tg;
+}
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
Index: linux-2.6/kernel/sched_autogroup.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_autogroup.c
@@ -0,0 +1,151 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+unsigned int __read_mostly sysctl_sched_autogroup_enabled = 1;
+
+struct autogroup {
+ struct kref kref;
+ struct task_group *tg;
+};
+
+static struct autogroup autogroup_default;
+
+static void autogroup_init(struct task_struct *init_task)
+{
+ autogroup_default.tg = &init_task_group;
+ kref_init(&autogroup_default.kref);
+ init_task->signal->autogroup = &autogroup_default;
+}
+
+static inline void autogroup_destroy(struct kref *kref)
+{
+ struct autogroup *ag = container_of(kref, struct autogroup, kref);
+ struct task_group *tg = ag->tg;
+
+ kfree(ag);
+ sched_destroy_group(tg);
+}
+
+static inline void autogroup_kref_put(struct autogroup *ag)
+{
+ kref_put(&ag->kref, autogroup_destroy);
+}
+
+static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
+{
+ kref_get(&ag->kref);
+ return ag;
+}
+
+static inline struct autogroup *autogroup_create(void)
+{
+ struct autogroup *ag = kmalloc(sizeof(*ag), GFP_KERNEL);
+
+ if (!ag)
+ goto out_fail;
+
+ ag->tg = sched_create_group(&init_task_group);
+ kref_init(&ag->kref);
+
+ if (!(IS_ERR(ag->tg)))
+ return ag;
+
+out_fail:
+ if (ag) {
+ kfree(ag);
+ WARN_ON(1);
+ } else
+ WARN_ON(1);
+
+ return autogroup_kref_get(&autogroup_default);
+}
+
+static inline bool
+task_wants_autogroup(struct task_struct *p, struct task_group *tg)
+{
+ if (tg != &root_task_group)
+ return false;
+
+ if (p->sched_class != &fair_sched_class)
+ return false;
+
+ if (p->flags & PF_EXITING)
+ return false;
+
+ return true;
+}
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled);
+
+ if (enabled && task_wants_autogroup(p, tg))
+ return p->signal->autogroup->tg;
+
+ return tg;
+}
+
+static void
+autogroup_move_group(struct task_struct *p, struct autogroup *ag)
+{
+ struct autogroup *prev;
+ struct task_struct *t;
+ struct rq *rq;
+ unsigned long flags;
+
+ rq = task_rq_lock(p, &flags);
+ prev = p->signal->autogroup;
+ if (prev == ag) {
+ task_rq_unlock(rq, &flags);
+ return;
+ }
+
+ p->signal->autogroup = autogroup_kref_get(ag);
+ __sched_move_task(p, rq);
+ task_rq_unlock(rq, &flags);
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(t, &p->thread_group, thread_group) {
+ sched_move_task(t);
+ }
+ rcu_read_unlock();
+
+ autogroup_kref_put(prev);
+}
+
+void sched_autogroup_create_attach(struct task_struct *p)
+{
+ struct autogroup *ag = autogroup_create();
+
+ autogroup_move_group(p, ag);
+ /* drop extra reference added by autogroup_create() */
+ autogroup_kref_put(ag);
+}
+EXPORT_SYMBOL(sched_autogroup_create_attach);
+
+/* currently has no users */
+void sched_autogroup_detach(struct task_struct *p)
+{
+ autogroup_move_group(p, &autogroup_default);
+}
+EXPORT_SYMBOL(sched_autogroup_detach);
+
+void sched_autogroup_fork(struct signal_struct *sig)
+{
+ sig->autogroup = autogroup_kref_get(current->signal->autogroup);
+}
+
+void sched_autogroup_exit(struct signal_struct *sig)
+{
+ autogroup_kref_put(sig->autogroup);
+}
+
+static int __init setup_autogroup(char *str)
+{
+ sysctl_sched_autogroup_enabled = 0;
+
+ return 1;
+}
+
+__setup("noautogroup", setup_autogroup);
+#endif
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -382,6 +382,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SCHED_AUTOGROUP
+ {
+ .procname = "sched_autogroup_enabled",
+ .data = &sysctl_sched_autogroup_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -728,6 +728,18 @@ config NET_NS

endif # NAMESPACES

+config SCHED_AUTOGROUP
+ bool "Automatic process group scheduling"
+ select CGROUPS
+ select CGROUP_SCHED
+ select FAIR_GROUP_SCHED
+ help
+ This option optimizes the scheduler for common desktop workloads by
+ automatically creating and populating task groups. This separation
+ of workloads isolates aggressive CPU burners (like build jobs) from
+ desktop applications. Task group autogeneration is currently based
+ upon task tty association.
+
config MM_OWNER
bool

Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt
+++ linux-2.6/Documentation/kernel-parameters.txt
@@ -1622,6 +1622,8 @@ and is between 256 and 4096 characters.
noapic [SMP,APIC] Tells the kernel to not make use of any
IOAPICs that may be present in the system.

+ noautogroup Disable scheduler automatic task group creation.
+
nobats [PPC] Do not use BATs for mapping kernel lowmem
on "Classic" PPC cores.


2010-11-15 13:04:40

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

I continue to play the advocatus diaboli ;)

On 11/15, Mike Galbraith wrote:
>
> +static inline bool
> +task_wants_autogroup(struct task_struct *p, struct task_group *tg)
> +{
> + if (tg != &root_task_group)
> + return false;
> +
> + if (p->sched_class != &fair_sched_class)
> + return false;
> +
> + if (p->flags & PF_EXITING)
> + return false;

Hmm, why? Perhaps PF_EXITING was needed in the previous version to
avoid the race with release_task(). But now it is always safe to
use signal->autogroup.

And the exiting task can do a lot before it disappears, probably
we shouldn't ignore ->autogroup.

> +static void
> +autogroup_move_group(struct task_struct *p, struct autogroup *ag)
> +{
> + struct autogroup *prev;
> + struct task_struct *t;
> + struct rq *rq;
> + unsigned long flags;
> +
> + rq = task_rq_lock(p, &flags);
> + prev = p->signal->autogroup;
> + if (prev == ag) {
> + task_rq_unlock(rq, &flags);
> + return;
> + }
> +
> + p->signal->autogroup = autogroup_kref_get(ag);
> + __sched_move_task(p, rq);
> + task_rq_unlock(rq, &flags);
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(t, &p->thread_group, thread_group) {
> + sched_move_task(t);
> + }
> + rcu_read_unlock();

Not sure I understand why we need rq->lock...

It can't protect the change of signal->autogroup, multiple callers
can use different rq's.

However. Currently the only caller holds ->siglock, so we are safe.
Perhaps we should just document that autogroup_move_group() needs
->siglock.

This also means the patch can be simplified even more, __sched_move_task()
is not needed.

> +void sched_autogroup_fork(struct signal_struct *sig)
> +{
> + sig->autogroup = autogroup_kref_get(current->signal->autogroup);
> +}

Well, in theory this can race with another thread doing autogroup_move_group().
We can read the old ->autogroup, and then use it after it was already freed.

Probably this needs ->siglock too.

Oleg.

2010-11-15 14:01:16

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sun, 2010-11-14 at 19:12 -0800, Linus Torvalds wrote:

> I'm also very happy with just what it does to interactive performance.
> Admittedly, my "testcase" is really trivial (reading email in a
> web-browser, scrolling around a bit, while doing a "make -j64" on the
> kernel at the same time), but it's a test-case that is very relevant
> for me. And it is a _huge_ improvement.

Next logical auto-step would be to try to subvert cfq. At a glance,
io_context looks highly hackable for !CONFIG_CFQ_GROUP_IOSCHED case.
The other case may get interesting for someone not very familiar with
cfq innards, but I see a tempting looking subversion point.

-Mike

2010-11-15 21:26:10

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, 2010-11-15 at 13:57 +0100, Oleg Nesterov wrote:
> I continue to play the advocatus diaboli ;)

Much appreciated.

> On 11/15, Mike Galbraith wrote:
> >
> > +static inline bool
> > +task_wants_autogroup(struct task_struct *p, struct task_group *tg)
> > +{
> > + if (tg != &root_task_group)
> > + return false;
> > +
> > + if (p->sched_class != &fair_sched_class)
> > + return false;
> > +
> > + if (p->flags & PF_EXITING)
> > + return false;
>
> Hmm, why? Perhaps PF_EXITING was needed in the previous version to
> avoid the race with release_task(). But now it is always safe to
> use signal->autogroup.

That came into existence when I stress-tested a previous version in
PREEMPT_RT (boom). I see no good reason to bother an exiting task
though, so would prefer to leave it as is.

> And the exiting task can do a lot before it disappears, probably
> we shouldn't ignore ->autogroup.

I doubt it would add value.

> Not sure I understand why we need rq->lock

Indeed, should have been whacked. Now gone.

> It can't protect the change of signal->autogroup, multiple callers
> can use different rq's.

Guaranteed live ->autogroup should be good enough for heuristic use, and
had better be so. Having to take ->siglock in the fast path would kill
using ->signal.

> However. Currently the only caller holds ->siglock, so we are safe.
> Perhaps we should just document that autogroup_move_group() needs
> ->siglock.

Done.

> This also means the patch can be simplified even more, __sched_move_task()
> is not needed.

(shrinkage:)

> > +void sched_autogroup_fork(struct signal_struct *sig)
> > +{
> > + sig->autogroup = autogroup_kref_get(current->signal->autogroup);
> > +}
>
> Well, in theory this can race with another thread doing autogroup_move_group().
> We can read the old ->autogroup, and then use it after it was already freed.
>
> Probably this needs ->siglock too.

Another landmine. Done.

A recurring complaint from CFS users is that parallel kbuild has a negative
impact on desktop interactivity. This patch implements an idea from Linus,
to automatically create task groups. This patch only implements Linus' per
tty task group suggestion, and only for fair class tasks, but leaves the way
open for enhancement.

Implementation: each task's signal struct contains an inherited pointer to a
refcounted autogroup struct containing a task group pointer, the default for
all tasks pointing to the init_task_group. When a task calls __proc_set_tty(),
the process wide reference to the default group is dropped, a new task group is
created, and the process is moved into the new task group. Children thereafter
inherit this task group, and increase its refcount. On exit, a reference to the
current task group is dropped when the last reference to each signal struct is
dropped. The task group is destroyed when the last signal struct referencing
it is freed. At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.

The feature is enabled by default at boot if CONFIG_SCHED_AUTOGROUP is
selected, but can be disabled via the boot option noautogroup, and can also
be turned on/off on the fly via..
echo [01] > /proc/sys/kernel/sched_autogroup_enabled.
..which will automatically move tasks to/from the root task group.

Some numbers.

A 100% hog overhead measurement proggy pinned to the same CPU as a make -j10

About measurement proggy:
pert/sec = perturbations/sec
min/max/avg = scheduler service latencies in usecs
sum/s = time accrued by the competition per sample period (1 sec here)
overhead = %CPU received by the competition per sample period

pert/s: 31 >40475.37us: 3 min: 0.37 max:48103.60 avg:29573.74 sum/s:916786us overhead:90.24%
pert/s: 23 >41237.70us: 12 min: 0.36 max:56010.39 avg:40187.01 sum/s:924301us overhead:91.99%
pert/s: 24 >42150.22us: 12 min: 8.86 max:61265.91 avg:39459.91 sum/s:947038us overhead:92.20%
pert/s: 26 >42344.91us: 11 min: 3.83 max:52029.60 avg:36164.70 sum/s:940282us overhead:91.12%
pert/s: 24 >44262.90us: 14 min: 5.05 max:82735.15 avg:40314.33 sum/s:967544us overhead:92.22%

Same load with this patch applied.

pert/s: 229 >5484.43us: 41 min: 0.15 max:12069.42 avg:2193.81 sum/s:502382us overhead:50.24%
pert/s: 222 >5652.28us: 43 min: 0.46 max:12077.31 avg:2248.56 sum/s:499181us overhead:49.92%
pert/s: 211 >5809.38us: 43 min: 0.16 max:12064.78 avg:2381.70 sum/s:502538us overhead:50.25%
pert/s: 223 >6147.92us: 43 min: 0.15 max:16107.46 avg:2282.17 sum/s:508925us overhead:50.49%
pert/s: 218 >6252.64us: 43 min: 0.16 max:12066.13 avg:2324.11 sum/s:506656us overhead:50.27%

Average service latency is an order of magnitude better with autogroup.
(Imagine that pert were Xorg or whatnot instead)

Using Mathieu Desnoyers' wakeup-latency testcase:

With taskset -c 3 make -j 10 running..

taskset -c 3 ./wakeup-latency& sleep 30;killall wakeup-latency

without:
maximum latency: 42963.2 µs
average latency: 9077.0 µs
missed timer events: 0

with:
maximum latency: 4160.7 µs
average latency: 149.4 µs
missed timer events: 0

Signed-off-by: Mike Galbraith <[email protected]>

---
Documentation/kernel-parameters.txt | 2
drivers/tty/tty_io.c | 1
include/linux/sched.h | 19 ++++
init/Kconfig | 12 ++
kernel/fork.c | 5 -
kernel/sched.c | 9 +-
kernel/sched_autogroup.c | 150 ++++++++++++++++++++++++++++++++++++
kernel/sched_autogroup.h | 16 +++
kernel/sysctl.c | 11 ++
9 files changed, 222 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -509,6 +509,8 @@ struct thread_group_cputimer {
spinlock_t lock;
};

+struct autogroup;
+
/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
@@ -576,6 +578,9 @@ struct signal_struct {

struct tty_struct *tty; /* NULL if no tty */

+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
/*
* Cumulative resource counters for dead threads in the group,
* and for reaped dead child processes forked by this group.
@@ -1931,6 +1936,20 @@ int sched_rt_handler(struct ctl_table *t

extern unsigned int sysctl_sched_compat_yield;

+#ifdef CONFIG_SCHED_AUTOGROUP
+extern unsigned int sysctl_sched_autogroup_enabled;
+
+extern void sched_autogroup_create_attach(struct task_struct *p);
+extern void sched_autogroup_detach(struct task_struct *p);
+extern void sched_autogroup_fork(struct signal_struct *sig);
+extern void sched_autogroup_exit(struct signal_struct *sig);
+#else
+static inline void sched_autogroup_create_attach(struct task_struct *p) { }
+static inline void sched_autogroup_detach(struct task_struct *p) { }
+static inline void sched_autogroup_fork(struct signal_struct *sig) { }
+static inline void sched_autogroup_exit(struct signal_struct *sig) { }
+#endif
+
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
extern void rt_mutex_setprio(struct task_struct *p, int prio);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -78,6 +78,7 @@

#include "sched_cpupri.h"
#include "workqueue_sched.h"
+#include "sched_autogroup.h"

#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
@@ -605,11 +606,14 @@ static inline int cpu_of(struct rq *rq)
*/
static inline struct task_group *task_group(struct task_struct *p)
{
+ struct task_group *tg;
struct cgroup_subsys_state *css;

css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
- return container_of(css, struct task_group, css);
+ tg = container_of(css, struct task_group, css);
+
+ return autogroup_task_group(p, tg);
}

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -2006,6 +2010,7 @@ static void sched_irq_time_avg_update(st
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
+#include "sched_autogroup.c"
#include "sched_stoptask.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
@@ -7979,7 +7984,7 @@ void __init sched_init(void)
#ifdef CONFIG_CGROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
-
+ autogroup_init(&init_task);
#endif /* CONFIG_CGROUP_SCHED */

#if defined CONFIG_FAIR_GROUP_SCHED && defined CONFIG_SMP
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -174,8 +174,10 @@ static inline void free_signal_struct(st

static inline void put_signal_struct(struct signal_struct *sig)
{
- if (atomic_dec_and_test(&sig->sigcnt))
+ if (atomic_dec_and_test(&sig->sigcnt)) {
+ sched_autogroup_exit(sig);
free_signal_struct(sig);
+ }
}

void __put_task_struct(struct task_struct *tsk)
@@ -904,6 +906,7 @@ static int copy_signal(unsigned long clo
posix_cpu_timers_init_group(sig);

tty_audit_fork(sig);
+ sched_autogroup_fork(sig);

sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
Index: linux-2.6/drivers/tty/tty_io.c
===================================================================
--- linux-2.6.orig/drivers/tty/tty_io.c
+++ linux-2.6/drivers/tty/tty_io.c
@@ -3160,6 +3160,7 @@ static void __proc_set_tty(struct task_s
put_pid(tsk->signal->tty_old_pgrp);
tsk->signal->tty = tty_kref_get(tty);
tsk->signal->tty_old_pgrp = NULL;
+ sched_autogroup_create_attach(tsk);
}

static void proc_set_tty(struct task_struct *tsk, struct tty_struct *tty)
Index: linux-2.6/kernel/sched_autogroup.h
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_autogroup.h
@@ -0,0 +1,16 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg);
+
+#else /* !CONFIG_SCHED_AUTOGROUP */
+
+static inline void autogroup_init(struct task_struct *init_task) { }
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ return tg;
+}
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
Index: linux-2.6/kernel/sched_autogroup.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_autogroup.c
@@ -0,0 +1,150 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+unsigned int __read_mostly sysctl_sched_autogroup_enabled = 1;
+
+struct autogroup {
+ struct kref kref;
+ struct task_group *tg;
+};
+
+static struct autogroup autogroup_default;
+
+static void autogroup_init(struct task_struct *init_task)
+{
+ autogroup_default.tg = &init_task_group;
+ kref_init(&autogroup_default.kref);
+ init_task->signal->autogroup = &autogroup_default;
+}
+
+static inline void autogroup_destroy(struct kref *kref)
+{
+ struct autogroup *ag = container_of(kref, struct autogroup, kref);
+ struct task_group *tg = ag->tg;
+
+ kfree(ag);
+ sched_destroy_group(tg);
+}
+
+static inline void autogroup_kref_put(struct autogroup *ag)
+{
+ kref_put(&ag->kref, autogroup_destroy);
+}
+
+static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
+{
+ kref_get(&ag->kref);
+ return ag;
+}
+
+static inline struct autogroup *autogroup_create(void)
+{
+ struct autogroup *ag = kmalloc(sizeof(*ag), GFP_KERNEL);
+
+ if (!ag)
+ goto out_fail;
+
+ ag->tg = sched_create_group(&init_task_group);
+ kref_init(&ag->kref);
+
+ if (!(IS_ERR(ag->tg)))
+ return ag;
+
+out_fail:
+ if (ag) {
+ kfree(ag);
+ WARN_ON(1);
+ } else
+ WARN_ON(1);
+
+ return autogroup_kref_get(&autogroup_default);
+}
+
+static inline bool
+task_wants_autogroup(struct task_struct *p, struct task_group *tg)
+{
+ if (tg != &root_task_group)
+ return false;
+
+ if (p->sched_class != &fair_sched_class)
+ return false;
+
+ if (p->flags & PF_EXITING)
+ return false;
+
+ return true;
+}
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled);
+
+ if (enabled && task_wants_autogroup(p, tg))
+ return p->signal->autogroup->tg;
+
+ return tg;
+}
+
+static void
+autogroup_move_group(struct task_struct *p, struct autogroup *ag)
+{
+ struct autogroup *prev;
+ struct task_struct *t;
+
+ prev = p->signal->autogroup;
+ if (prev == ag)
+ return;
+
+ p->signal->autogroup = autogroup_kref_get(ag);
+ sched_move_task(p);
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(t, &p->thread_group, thread_group) {
+ sched_move_task(t);
+ }
+ rcu_read_unlock();
+
+ autogroup_kref_put(prev);
+}
+
+/* Must be called with siglock held */
+void sched_autogroup_create_attach(struct task_struct *p)
+{
+ struct autogroup *ag = autogroup_create();
+
+ autogroup_move_group(p, ag);
+ /* drop extra reference added by autogroup_create() */
+ autogroup_kref_put(ag);
+}
+EXPORT_SYMBOL(sched_autogroup_create_attach);
+
+/* Must be called with siglock held. Currently has no users */
+void sched_autogroup_detach(struct task_struct *p)
+{
+ autogroup_move_group(p, &autogroup_default);
+}
+EXPORT_SYMBOL(sched_autogroup_detach);
+
+void sched_autogroup_fork(struct signal_struct *sig)
+{
+ struct sighand_struct *sighand = current->sighand;
+
+ spin_lock(&sighand->siglock);
+ sig->autogroup = autogroup_kref_get(current->signal->autogroup);
+ spin_unlock(&sighand->siglock);
+}
+
+void sched_autogroup_exit(struct signal_struct *sig)
+{
+ autogroup_kref_put(sig->autogroup);
+}
+
+static int __init setup_autogroup(char *str)
+{
+ sysctl_sched_autogroup_enabled = 0;
+
+ return 1;
+}
+
+__setup("noautogroup", setup_autogroup);
+#endif
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -382,6 +382,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SCHED_AUTOGROUP
+ {
+ .procname = "sched_autogroup_enabled",
+ .data = &sysctl_sched_autogroup_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -728,6 +728,18 @@ config NET_NS

endif # NAMESPACES

+config SCHED_AUTOGROUP
+ bool "Automatic process group scheduling"
+ select CGROUPS
+ select CGROUP_SCHED
+ select FAIR_GROUP_SCHED
+ help
+ This option optimizes the scheduler for common desktop workloads by
+ automatically creating and populating task groups. This separation
+ of workloads isolates aggressive CPU burners (like build jobs) from
+ desktop applications. Task group autogeneration is currently based
+ upon task tty association.
+
config MM_OWNER
bool

Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt
+++ linux-2.6/Documentation/kernel-parameters.txt
@@ -1622,6 +1622,8 @@ and is between 256 and 4096 characters.
noapic [SMP,APIC] Tells the kernel to not make use of any
IOAPICs that may be present in the system.

+ noautogroup Disable scheduler automatic task group creation.
+
nobats [PPC] Do not use BATs for mapping kernel lowmem
on "Classic" PPC cores.


2010-11-15 22:42:20

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Thu, 11 Nov 2010 08:26:40 MST, Mike Galbraith said:

> Implementation: each task struct contains an inherited pointer to a refcounted
> autogroup struct containing a task group pointer, the default for all tasks
> pointing to the init_task_group. When a task calls __proc_set_tty(), the
> task's reference to the default group is dropped, a new task group is created,
> and the task is moved out of the old group and into the new. Children thereafter
> inherit this task group, and increase its refcount. Calls to __tty_hangup()
> and proc_clear_tty() move the caller back to the init_task_group, and possibly
> destroy the task group. On exit, reference to the current task group is dropped,
> and the task group is potentially destroyed. At runqueue selection time, iff
> a task has no cgroup assignment, its current autogroup is used.

So the set of all tasks that never call proc_set_tty() ends up in the same one
big default group, correct? Do we have any provisions for making sure that if
a user has 8 or 10 windows open doing heavy work, the default group (with a lot
of important daemons/etc in it) doesn't get starved with only a 1/10th share of
the CPU? Or am I missing something here?

> +extern void sched_autogroup_detatch(struct task_struct *p);

sched_autogroup_detach() instead?



2010-11-15 22:49:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, 2010-11-15 at 14:25 -0700, Mike Galbraith wrote:
> > However. Currently the only callers holds ->siglock, so we are safe.
> > Perhaps we should just document that autogroup_move_group() needs
> > ->siglock.
>
> Done.

lockdep_assert_held() is a good way to document these things.
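
E.g. (sketch against autogroup_move_group() from the patch):

static void
autogroup_move_group(struct task_struct *p, struct autogroup *ag)
{
	/* documents the rule, and lockdep verifies it at runtime */
	lockdep_assert_held(&p->sighand->siglock);

	/* ... rest as in the patch ... */
}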

2010-11-15 23:25:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, Nov 15, 2010 at 2:41 PM, <[email protected]> wrote:
>
> So the set of all tasks that never call proc_set_tty() ends up in the same one
> big default group, correct?

Well, yes and no.

Yes, that's what the code currently does. But I did ask Mike (and he
delivered) to try to make the code look and work in a way where the
whole "tty thing" is just one of the heuristics.

It's not clear exactly what the non-tty heuristics would be, but I do
have a few suggestions:

- I think it might be a good idea to associate a task group with the
current "cred" of a process, and fall back on it in the absense of a
tty-provided one.

Now, for desktop use, that probably doesn't often matter, but even
for just the desktop it would mean that "system daemons" would at
least get a group of their own, rather than be grouped with
"everything else"

(As is, I think the autogroup thing already ends up protecting
system daemons more than _not_ having the autogroup, since it will
basically mean that a "make -j64" won't be stealing all the CPU time
from everybody else - even if the system daemons all end up in that
single "default" group together with the non-tty graphical apps.)

- I suspect we could make kernel daemons be a group of their own.

> Do we have any provisions for making sure that if
> a user has 8 or 10 windows open doing heavy work, the default group (with a lot
> of important daemons/etc in it) doesn't get starved with only a 1/10th share of
> the CPU? Or am I missing something here?

I think you're missing the fact that without the autogroups, it's
_worse_. If you do "make -j64" without autogroups, those important
daemons get starved by all that work. With the autogroups, they end up
being protected by being in the default group.

So the fact that they are in the default group with lots of other
users doesn't hurt, quite the reverse. User apps are likely to be in
their own groups, so they affect system apps less than they do now.

But I do agree that we might be able to make that all much more explicit.

Linus

2010-11-15 23:46:38

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, 2010-11-15 at 17:41 -0500, [email protected] wrote:
> On Thu, 11 Nov 2010 08:26:40 MST, Mike Galbraith said:
>
> > Implementation: each task struct contains an inherited pointer to a refcounted
> > autogroup struct containing a task group pointer, the default for all tasks
> > pointing to the init_task_group. When a task calls __proc_set_tty(), the
> > task's reference to the default group is dropped, a new task group is created,
> > and the task is moved out of the old group and into the new. Children thereafter
> > inherit this task group, and increase its refcount. Calls to __tty_hangup()
> > and proc_clear_tty() move the caller back to the init_task_group, and possibly
> > destroy the task group. On exit, reference to the current task group is dropped,
> > and the task group is potentially destroyed. At runqueue selection time, iff
> > a task has no cgroup assignment, its current autogroup is used.
>
> So the set of all tasks that never call proc_set_tty() ends up in the same one
> big default group, correct? Do we have any provisions for making sure that if
> a user has 8 or 10 windows open doing heavy work, the default group (with a lot
> of important daemons/etc in it) doesn't get starved with only a 1/10th share of
> the CPU? Or am I missing something here?

Yes, all tasks never having had a tty association are relegated to the
root task group, and no, there is no provision for the root task group
getting more than its fair share of CPU.

The patch is only intended to (hopefully) better suit the general case
desktop. One size has zero chance of fitting all ;-)

> > +extern void sched_autogroup_detatch(struct task_struct *p);
>
> sched_autogroup_detach() instead?

Hm, why?

-Mike

2010-11-15 23:50:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, Nov 15, 2010 at 3:46 PM, Mike Galbraith <[email protected]> wrote:
> On Mon, 2010-11-15 at 17:41 -0500, [email protected] wrote:
>
>> > +extern void sched_autogroup_detatch(struct task_struct *p);
>>
>> sched_autogroup_detach() instead?
>
> Hm, why?

You really aren't a good speller of that word, are you?

Linus

2010-11-16 00:04:47

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, 2010-11-15 at 15:50 -0800, Linus Torvalds wrote:
> On Mon, Nov 15, 2010 at 3:46 PM, Mike Galbraith <[email protected]> wrote:
> > On Mon, 2010-11-15 at 17:41 -0500, [email protected] wrote:
> >
> >> > +extern void sched_autogroup_detatch(struct task_struct *p);
> >>
> >> sched_autogroup_detach() instead?
> >
> > Hm, why?
>
> You really aren't a good speller of that word, are you?

<stare> d e t a c h... d e t a t....t? aw crap. Guess not.

Couldn't even see it.

-Mike

2010-11-16 01:19:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Hmm. Just found a bug. I'm not sure if it's the autogroup patches
themselves, or whether it's just the cgroup code that the autogroup
patch enables for me.

When I do

echo t > /proc/sysrq-trigger

(or "w") I get a NULL pointer dereference (offset 0x38 - decimal 56)
in "cgroup_path+0x7", with a call trace of sched_debug_show,
show_state_filter, sysrq_handle_showstate_blocked. I don't have the
whole oops, because the machine is really dead at that point
(presumably died holding the runqueue lock or some other critical
resource), but if required I could take a photo of it. However, I bet
it is repeatable, so I doubt you need it.

Anyway, that "cgroup_path+0x7" is the very first memory dereference:

movq 56(%rdi), %rsi # cgrp_5(D)->dentry, _________p1

so sched_debug_show() is apparently calling cgroup_path() with a NULL
cgroup. I think it's "print_task()" that is to blame, it does

cgroup_path(task_group(p)->css.cgroup, ..

without checking whether there _is_ any css.cgroup.
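
A minimal guard would look something like this (sketch; "path" is whatever
buffer print_task() already uses):

	if (task_group(p)->css.cgroup)
		cgroup_path(task_group(p)->css.cgroup, path, sizeof(path));
	else	/* autogroup: no cgroup behind it; any placeholder will do */
		snprintf(path, sizeof(path), "(autogroup)");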

Peter, that looks like your code (commit d19ca30874f2)

Guys?

Linus

2010-11-16 01:55:40

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, Nov 15, 2010 at 5:18 PM, Linus Torvalds
<[email protected]> wrote:
>
> so sched_debug_show() is apparently calling cgroup_path() with a NULL
> cgroup. I think it's "print_task()" that is to blame, it does
>
>     cgroup_path(task_group(p)->css.cgroup, ..
>
> without checking whether there _is_ any css.cgroup.

Right - previously the returned task_group would always be associated
with a cgroup. Now, it may not be.

The original task_group() should be made accessible for anything that
wants a real cgroup in the scheduler hierarchy, and called from the
new task_group() function. Not sure what the best naming convention
would be, maybe task_group() and effective_task_group()?
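
Roughly, reusing the body from Mike's patch (sketch; the names are just my
suggestion):

static inline struct task_group *task_group(struct task_struct *p)
{
	struct cgroup_subsys_state *css;

	/* always the real cgroup-backed group */
	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
			lockdep_is_held(&task_rq(p)->lock));
	return container_of(css, struct task_group, css);
}

static inline struct task_group *effective_task_group(struct task_struct *p)
{
	/* what the scheduler itself should use */
	return autogroup_task_group(p, task_group(p));
}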

Paul

2010-11-16 01:57:18

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, Nov 15, 2010 at 02:25:50PM -0700, Mike Galbraith wrote:

[..]
>
> A recurring complaint from CFS users is that parallel kbuild has a negative
> impact on desktop interactivity. This patch implements an idea from Linus,
> to automatically create task groups. This patch only implements Linus' per
> tty task group suggestion, and only for fair class tasks, but leaves the way
> open for enhancement.
>
> Implementation: each task's signal struct contains an inherited pointer to a
> refcounted autogroup struct containing a task group pointer, the default for
> all tasks pointing to the init_task_group. When a task calls __proc_set_tty(),
> the process wide reference to the default group is dropped, a new task group is
> created, and the process is moved into the new task group. Children thereafter
> inherit this task group, and increase its refcount. On exit, a reference to the
> current task group is dropped when the last reference to each signal struct is
> dropped. The task group is destroyed when the last signal struct referencing
> it is freed. At runqueue selection time, IFF a task has no cgroup assignment,
> its current autogroup is used.

Mike,

IIUC, this automatically created task group is invisible to user space? I
mean, generally there is a task group associated with a cgroup, and user space
tools can walk through the cgroup hierarchy to figure out how the system is
configured. Will that be possible with this patch?

I am wondering what will happen to things like per cgroup stats. For
example, the block controller keeps track of the number of sectors
transferred per cgroup. Hence this information will not be available for
these internal task groups?

Looks like everybody likes the idea but let me still ask the following
question.

Should this kind of thing be done in user space? I mean, what we are
essentially doing is providing isolation between two groups. That's why
this cgroup infrastructure is in place. It is just that currently how cgroups
are created fully depends on user space, and the kernel does not create
cgroups of its own by default (except the root cgroup).

I think systemd does something similar, in the sense that it puts every
system service in a cgroup of its own at system startup.

The libcgroup daemon has the facility to listen for kernel events (through
a netlink socket), and then put newly created tasks in cgroups as per
the user specified rules in a config file. For example, if one wants
isolation between tasks of two user ids, one can just write a rule and,
once the user logs in, their login session will be automatically placed
in the right cgroup. Hence one will be able to achieve isolation between
two users. I think it now also has rules for classifying executables
based on names/paths. So one can put "firefox" in one cgroup and, say,
"make -j64" in a separate cgroup and provide isolation between the two
applications. It is just a matter of putting the right rule in the config
file (see the sketch below).
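
For illustration, a rule file of the kind described above might look roughly
like this (a sketch of /etc/cgrules.conf syntax from memory; the destination
group names are made up and not from this thread):

# /etc/cgrules.conf -- illustrative sketch only
# <user>[:<process>]   <controllers>   <destination>
tilly                  cpu             users/tilly/
*:firefox              cpu             apps/browser/
*:make                 cpu             apps/build/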

This patch sounds like an extension to the user id problem, where we want
isolation between the processes of the same user (process groups using
different terminals). Would it make sense to generate some kind of kernel
event for this and let user space execute the rules, instead of creating
this functionality in the kernel?

This way once we extend this functionality to other subsystems, we can
make it more flexible in user space. For example, create these groups
for the cpu controller but, let's say, not for the block controller. Otherwise
we will end up creating more kernel tunables to achieve the same effect.

Thanks
Vivek

2010-11-16 02:19:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, Nov 15, 2010 at 5:56 PM, Vivek Goyal <[email protected]> wrote:
>
> Should this kind of thing be done in user space?

Almost certainly not.

First off, user-space is a fragmented mess. Just from a "let's get it
done" angle, it just doesn't work. There are lots of different thing
that create new tty's, and you can't have them all fixed. Plus it
would be _way_ more code in user space than it is in kernel space.

Secondly, user-space daemons are a total mess. We've tried it many
many times, and every time the _intention_ is to make things simpler
to debug and deploy. And it almost never is. The interfaces end up
being too painful, and the "part of the code is in kernel space, part
of it is in user space" means that things just break all the time.

Finally, the whole "user space is more flexible" is just a lie. It
simply doesn't end up being true. It will be _harder_ to configure
some user-space daemon than it is to just set a flag in /sys or
whatever. The "flexibility" tends to be more a flexibility to get
things wrong than any actual advantage.

Just look at the patch in question. It's simple, it's clean, and it
"just works". Doing the same thing in user space? It would be a total
nightmare, and exactly _because_ it would be a total nightmare, the
code would never be that simple or clean.

Linus

2010-11-16 12:58:40

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, 2010-11-15 at 17:55 -0800, Paul Menage wrote:
> On Mon, Nov 15, 2010 at 5:18 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > so sched_debug_show() is apparently calling cgroup_path() with a NULL
> > cgroup. I think it's "print_task()" that is to blame, it does
> >
> > cgroup_path(task_group(p)->css.cgroup, ..
> >
> > without checking whether there _is_ any css.cgroup.
>
> Right - previously the returned task_group would be always associated
> with a cgroup. Now, it may not be.
>
> The original task_group() should be made accessible for anything that
> wants a real cgroup in the scheduler hierarchy, and called from the
> new task_group() function. Not sure what the best naming convention
> would be, maybe task_group() and effective_task_group() ?

effective_task_group() works for me. Autogroup (currently at least)
only needs to interface with set_task_rq().

A tasty alternative would be to have autogroup be its own subsystem,
with full cgroup userspace visibility/tweakability. I have no idea if
that's feasible though. I glanced at cgroup.c, but didn't see the dirt
simple hook I wanted, and quickly ran away.

---
kernel/sched.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -606,27 +606,33 @@ static inline int cpu_of(struct rq *rq)
*/
static inline struct task_group *task_group(struct task_struct *p)
{
- struct task_group *tg;
struct cgroup_subsys_state *css;

css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
- tg = container_of(css, struct task_group, css);
+ return container_of(css, struct task_group, css);
+}

- return autogroup_task_group(p, tg);
+static inline struct task_group *effective_task_group(struct task_struct *p)
+{
+ return autogroup_task_group(p, task_group(p));
}

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
+#if (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
+ struct task_group *tg = effective_task_group(p);
+#endif
+
#ifdef CONFIG_FAIR_GROUP_SCHED
- p->se.cfs_rq = task_group(p)->cfs_rq[cpu];
- p->se.parent = task_group(p)->se[cpu];
+ p->se.cfs_rq = tg->cfs_rq[cpu];
+ p->se.parent = tg->se[cpu];
#endif

#ifdef CONFIG_RT_GROUP_SCHED
- p->rt.rt_rq = task_group(p)->rt_rq[cpu];
- p->rt.parent = task_group(p)->rt_se[cpu];
+ p->rt.rt_rq = tg->rt_rq[cpu];
+ p->rt.parent = tg->rt_se[cpu];
#endif
}


2010-11-16 13:11:13

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 11/15, Mike Galbraith wrote:
>
> On Mon, 2010-11-15 at 13:57 +0100, Oleg Nesterov wrote:
>
> > And the exiting task can do a lot before it disappears, probably
> > we shouldn't ignore ->autogroup.

I don't really understand what makes the exiting task different,
but OK.

However, I must admit I dislike this check. Because, looking at this
code, it is not clear why we check PF_EXITING. It looks as if it
is needed for correctness.

OK, this is minor. I think the patch is correct, just one nit below.

> > It can't protect the change of signal->autogroup, multiple callers
> > can use different rq's.
>
> Guaranteed live ->autogroup should be good enough for heuristic use, and
> had better be so. Having to take ->siglock in the fast path would kill
> using ->signal.

Yes, sure, rq->lock should ensure signal->autogroup can't go away.
(even if it can be changed under us). And it does, we are moving all
threads before kref_put().

> +static void
> +autogroup_move_group(struct task_struct *p, struct autogroup *ag)
> +{
> + struct autogroup *prev;
> + struct task_struct *t;
> +
> + prev = p->signal->autogroup;
> + if (prev == ag)
> + return;
> +
> + p->signal->autogroup = autogroup_kref_get(ag);
> + sched_move_task(p);
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(t, &p->thread_group, thread_group) {
> + sched_move_task(t);
> + }
> + rcu_read_unlock();
> +
> + autogroup_kref_put(prev);
> +}

Well, this looks a bit strange (but correct).

We are changing ->autogroup assuming the caller holds ->siglock.
But if we hold ->siglock we do not need rcu_read_lock() to iterate
over the thread_group, we can just do

p->signal->autogroup = autogroup_kref_get(ag);

t = p;
do {
        sched_move_task(t);
} while_each_thread(p, t);

Again, this is minor, I won't insist.

Oleg.
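
Spelled out in full, Oleg's suggestion would make the mover look roughly
like this (an illustrative sketch reusing the autogroup_kref_* helpers from
Mike's patch; not a posted patch):

static void
autogroup_move_group(struct task_struct *p, struct autogroup *ag)
{
        struct autogroup *prev;
        struct task_struct *t;

        /*
         * The caller holds ->siglock, which serializes ->autogroup
         * changes and keeps the thread list stable, so no
         * rcu_read_lock() is needed for the walk.
         */
        prev = p->signal->autogroup;
        if (prev == ag)
                return;

        p->signal->autogroup = autogroup_kref_get(ag);

        t = p;
        do {
                sched_move_task(t);
        } while_each_thread(p, t);

        autogroup_kref_put(prev);
}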

2010-11-16 14:00:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, 2010-11-15 at 17:55 -0800, Paul Menage wrote:
> On Mon, Nov 15, 2010 at 5:18 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > so sched_debug_show() is apparently calling cgroup_path() with a NULL
> > cgroup. I think it's "print_task()" that is to blame, it does
> >
> > cgroup_path(task_group(p)->css.cgroup, ..
> >
> > without checking whether there _is_ any css.cgroup.
>
> Right - previously the returned task_group would be always associated
> with a cgroup. Now, it may not be.
>
> The original task_group() should be made accessible for anything that
> wants a real cgroup in the scheduler hierarchy, and called from the
> new task_group() function. Not sure what the best naming convention
> would be, maybe task_group() and effective_task_group() ?

Right, that doesn't solve the full problem though.

/proc/sched_debug should show these automagic task_groups, it's just that
there's currently no way to properly name them. We can of course add
something like a name field to the struct autogroup thing, but what do
we fill it with? "autogroup-%d" and keep a sequence number for each
autogroup? (A rough sketch of that idea follows the patch below.)

Then the below task_group_path() thing can try the autogroup name scheme
if it finds a NULL css.

Something like the below might avoid the explosion:

---
kernel/sched_debug.c | 28 ++++++++++++++--------------
1 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 2e1b0d1..9b5560f 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -87,6 +87,19 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu,
}
#endif

+#if defined(CONFIG_CGROUP_SCHED) && \
+ (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
+static void task_group_path(struct task_group *tg, char *buf, int buflen)
+{
+ /* may be NULL if the underlying cgroup isn't fully-created yet */
+ if (!tg->css.cgroup) {
+ buf[0] = '\0';
+ return;
+ }
+ cgroup_path(tg->css.cgroup, buf, buflen);
+}
+#endif
+
static void
print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
{
@@ -115,7 +128,7 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
char path[64];

rcu_read_lock();
- cgroup_path(task_group(p)->css.cgroup, path, sizeof(path));
+ task_group_path(task_group(p), path, sizeof(path));
rcu_read_unlock();
SEQ_printf(m, " %s", path);
}
@@ -147,19 +160,6 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
read_unlock_irqrestore(&tasklist_lock, flags);
}

-#if defined(CONFIG_CGROUP_SCHED) && \
- (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
-static void task_group_path(struct task_group *tg, char *buf, int buflen)
-{
- /* may be NULL if the underlying cgroup isn't fully-created yet */
- if (!tg->css.cgroup) {
- buf[0] = '\0';
- return;
- }
- cgroup_path(tg->css.cgroup, buf, buflen);
-}
-#endif
-
void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{
s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
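
As a rough sketch of the "autogroup-%d" naming idea above (illustrative
only; the id field, the sequence counter, and the tg->autogroup back-pointer
are assumptions, not part of any posted patch):

/* hand each autogroup a sequence number when it is created */
static atomic_t autogroup_seq_nr;

static inline struct autogroup *autogroup_create(void)
{
        struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);

        if (!ag)
                return NULL;    /* real code would fall back to a default */
        ag->id = atomic_inc_return(&autogroup_seq_nr);
        /* ... task group creation and kref initialization elided ... */
        return ag;
}

/* and let the debug code fall back to that name when there is no cgroup */
static void task_group_path(struct task_group *tg, char *buf, int buflen)
{
        if (!tg->css.cgroup) {
                snprintf(buf, buflen, "/autogroup-%d", tg->autogroup->id);
                return;
        }
        cgroup_path(tg->css.cgroup, buf, buflen);
}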

2010-11-16 14:01:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, 2010-11-15 at 14:25 -0700, Mike Galbraith wrote:
> > > + if (p->flags & PF_EXITING)
> > > + return false;
> >
> > Hmm, why? Perhaps PF_EXITING was needed in the previous version to
> > avoid the race with release_task(). But now it is always safe to
> > use signal->autogroup.
>
> That came into existence when I stress tested previous version in
> PREEMPT_RT (boom). I see no good reason to bother an exiting task
> though, so would prefer to leave it as is.

PREEMPT_RT has a slightly different exit path IIRC. If that was the only
thing you saw it explode on we could leave the check out for now and
revisit it in the -rt patches when and if it pops up. Hmm?

2010-11-16 14:03:13

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, 2010-11-15 at 20:56 -0500, Vivek Goyal wrote:

> Mike,
>
> IIUC, this automatically created task group is invisible to user space? I
> mean generally there is a task group associated with a cgroup and user space
> tools can walk through the cgroup hierarchy to figure out how the system is
> configured. Will that be possible with this patch?

No, it's dirt simple automation only at this point.

> I am wondering what will happen to things like per cgroup stats. For
> example, the block controller keeps track of the number of sectors
> transferred per cgroup. Will this information not be available for
> these internal task groups?

No, it won't. The target audience is those folks who don't _do_ the
configuration they _could_ do, folks who don't use SCHED_IDLE or nice,
or the power available through userspace cgroup tools.. folks who expect
their box to "just work", out of the box.

> Looks like everybody likes the idea but let me still ask the following
> question.
>
> Should this kind of thing be done in user space? I mean, what we are
> essentially doing is providing isolation between two groups. That's why
> the cgroup infrastructure is in place. It's just that how cgroups are
> created currently depends fully on user space, and the kernel does not
> create cgroups of its own by default (except the root cgroup).

I was of the same mind when Linus first broached the subject, but Ingo
convinced me it was worth exploring because of the simple fact that
people are not using the available tools.

Sadly, this includes distros.

AFAIK, no distro has cgroups configured and ready for Aunt Tilly, no
distro has taught the GUI to use cgroups _at all_, even though it's
trivial to launch self-reaping task groups from userspace.

> I think systemd does something similar, in the sense that it puts every
> system service in a cgroup of its own on system startup.
>
> The libcgroup daemon has the facility to listen for kernel events (through
> a netlink socket), and then put newly created tasks in cgroups as per
> the user specified rules in a config file. For example, if one wants
> isolation between tasks of two user ids, one can just write a rule and,
> once the user logs in, their login session will be automatically placed
> in the right cgroup. Hence one will be able to achieve isolation between
> two users. I think it now also has rules for classifying executables
> based on names/paths. So one can put "firefox" in one cgroup and, say,
> "make -j64" in a separate cgroup and provide isolation between the two
> applications. It is just a matter of putting the right rule in the config file.

I fiddled with configuring my system, but found options lacking. For
instance, I found no way to automate per tty (or pgid in my case). I
had to cobble scripts together to get the job done. Nothing that my
distro delivered could just make it happen for me.

When the tools mature, and distros use them, in-kernel automation may
well become more or less moot, but in the here and now, there is a
target audience with a need that is not being serviced.

> This patch sounds like an extension to the user id problem, where we want
> isolation between the processes of the same user (process groups using
> different terminals). Would it make sense to generate some kind of kernel
> event for this and let user space execute the rules, instead of creating
> this functionality in the kernel?

Per user isn't very useful. The typical workstation has one user
whacking away on the kbd/mouse. While you can identify firefox etc,
it's not being done, and requires identifying every application. Heck,
cgroups is built in, but userspace doesn't even mount it. Nothing but
nothing uses cgroups.

> This way once we extend this functionality to other subsystems, we can
> make it more flexible in user space. For example, create these groups
> for the cpu controller but, let's say, not for the block controller. Otherwise
> we will end up creating more kernel tunables to achieve the same effect.

I see your arguments, and agree to a large extent. As Linus noted,
there are other advantages to in-kernel automation, but for me, it all
boils down to the fact that userspace is doing nothing with the tools.

ATM, cgroups is an enterprise or power user tool. The out of the box
distro kernel user sees zero benefit for the overhead investment.

-Mike

2010-11-16 14:11:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 07:02 -0700, Mike Galbraith wrote:
> While you can identify firefox etc,
> it's not being done, and requires identifying every application. Heck,
> cgroups is built in, but userspace doesn't even mount it. Nothing but
> nothing uses cgroups.

The yet another init rewrite called systemd is supposedly cgroup happy..
No idea if it's going to be useful though, I doubt it's going to have an
effect on me launching a konsole or the like, or screen creating a bunch
of ttys.

2010-11-16 14:18:47

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 14:04 +0100, Oleg Nesterov wrote:
> On 11/15, Mike Galbraith wrote:
> >
> > On Mon, 2010-11-15 at 13:57 +0100, Oleg Nesterov wrote:
> >
> > > And the exiting task can do a lot before it disappears, probably
> > > we shouldn't ignore ->autogroup.
>
> I don't really understand what makes the exiting task different,
> but OK.
>
> However, I must admit I dislike this check. Because, looking at this
> code, it is not clear why we check PF_EXITING. It looks as if it
> is needed for correctness.

Is _not_ needed I presume.

I'll remove it, I'm not overly attached (a t t a..;) to it.

> OK, this is minor. I think the patch is correct, just one nit below.
>
> > > It can't protect the change of signal->autogroup, multiple callers
> > > can use different rq's.
> >
> > Guaranteed live ->autogroup should be good enough for heuristic use, and
> > had better be so. Having to take ->siglock in the fast path would kill
> > using ->signal.
>
> Yes, sure, rq->lock should ensure signal->autogroup can't go away.
> (even if it can be changed under us). And it does, we are moving all
> threads before kref_put().

(yeah)

> > +static void
> > +autogroup_move_group(struct task_struct *p, struct autogroup *ag)
> > +{
> > + struct autogroup *prev;
> > + struct task_struct *t;
> > +
> > + prev = p->signal->autogroup;
> > + if (prev == ag)
> > + return;
> > +
> > + p->signal->autogroup = autogroup_kref_get(ag);
> > + sched_move_task(p);
> > +
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(t, &p->thread_group, thread_group) {
> > + sched_move_task(t);
> > + }
> > + rcu_read_unlock();
> > +
> > + autogroup_kref_put(prev);
> > +}
>
> Well, this looks a bit strange (but correct).

My mouse copied it.

> We are changing ->autogroup assuming the caller holds ->siglock.
> But if we hold ->siglock we do not need rcu_read_lock() to iterate
> over the thread_group, we can just do
>
> p->signal->autogroup = autogroup_kref_get(ag);
>
> t = p;
> do {
>         sched_move_task(t);
> } while_each_thread(p, t);
>
> Again, this is minor, I won't insist.

I'll do it that way. I was pondering adding the option to move one or
all as cgroups does, but don't think that will ever be needed.

Thanks,

-Mike

2010-11-16 14:20:10

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 15:01 +0100, Peter Zijlstra wrote:
> On Mon, 2010-11-15 at 14:25 -0700, Mike Galbraith wrote:
> > > > + if (p->flags & PF_EXITING)
> > > > + return false;
> > >
> > > Hmm, why? Perhaps PF_EXITING was needed in the previous version to
> > > avoid the race with release_task(). But now it is always safe to
> > > use signal->autogroup.
> >
> > That came into existence when I stress tested previous version in
> > PREEMPT_RT (boom). I see no good reason to bother an exiting task
> > though, so would prefer to leave it as is.
>
> PREEMPT_RT has a slightly different exit path IIRC. If that was the only
> thing you saw it explode on we could leave the check out for now and
> revisit it in the -rt patches when and if it pops up. Hmm?

Yeah, I'm going to whack it. (and add your lockdep thingy)

-Mike

2010-11-16 14:27:14

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 14:59 +0100, Peter Zijlstra wrote:
> On Mon, 2010-11-15 at 17:55 -0800, Paul Menage wrote:
> > On Mon, Nov 15, 2010 at 5:18 PM, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > so sched_debug_show() is apparently calling cgroup_path() with a NULL
> > > cgroup. I think it's "print_task()" that is to blame, it does
> > >
> > > cgroup_path(task_group(p)->css.cgroup, ..
> > >
> > > without checking whether there _is_ any css.cgroup.
> >
> > Right - previously the returned task_group would be always associated
> > with a cgroup. Now, it may not be.
> >
> > The original task_group() should be made accessible for anything that
> > wants a real cgroup in the scheduler hierarchy, and called from the
> > new task_group() function. Not sure what the best naming convention
> > would be, maybe task_group() and effective_task_group() ?
>
> Right, that doesn't solve the full problem though.
>
> /proc/sched_debug should show these automagic task_groups, it's just that
> there's currently no way to properly name them, we can of course add
> something like a name field to the struct autogroup thing, but what do
> we fill it with? "autogroup-%d" and keep a sequence number for each
> autogroup?

I was considering exactly that for /proc/N/cgroup visibility. Might get
thumped if I re-use that file though.

> Then the below task_group_path() thing can try the autogroup name scheme
> if it finds a NULL css.
>
> Something like the below might avoid the explosion:
>
> ---
> kernel/sched_debug.c | 28 ++++++++++++++--------------
> 1 files changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
> index 2e1b0d1..9b5560f 100644
> --- a/kernel/sched_debug.c
> +++ b/kernel/sched_debug.c
> @@ -87,6 +87,19 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu,
> }
> #endif
>
> +#if defined(CONFIG_CGROUP_SCHED) && \
> + (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
> +static void task_group_path(struct task_group *tg, char *buf, int buflen)
> +{
> + /* may be NULL if the underlying cgroup isn't fully-created yet */
> + if (!tg->css.cgroup) {
> + buf[0] = '\0';
> + return;
> + }
> + cgroup_path(tg->css.cgroup, buf, buflen);
> +}
> +#endif
> +
> static void
> print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
> {
> @@ -115,7 +128,7 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
> char path[64];
>
> rcu_read_lock();
> - cgroup_path(task_group(p)->css.cgroup, path, sizeof(path));
> + task_group_path(task_group(p), path, sizeof(path));
> rcu_read_unlock();
> SEQ_printf(m, " %s", path);
> }
> @@ -147,19 +160,6 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
> read_unlock_irqrestore(&tasklist_lock, flags);
> }
>
> -#if defined(CONFIG_CGROUP_SCHED) && \
> - (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
> -static void task_group_path(struct task_group *tg, char *buf, int buflen)
> -{
> - /* may be NULL if the underlying cgroup isn't fully-created yet */
> - if (!tg->css.cgroup) {
> - buf[0] = '\0';
> - return;
> - }
> - cgroup_path(tg->css.cgroup, buf, buflen);
> -}
> -#endif
> -
> void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> {
> s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
>

2010-11-16 14:47:44

by Dhaval Giani

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 3:11 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2010-11-16 at 07:02 -0700, Mike Galbraith wrote:
>> While you can identify firefox etc,
>> it's not being done, and requires identifying every application. Heck,
>> cgroups is built in, but userspace doesn't even mount it. Nothing but
>> nothing uses cgroups.
>
> The yet another init rewrite called systemd is supposedly cgroup happy..
> No idea if it's going to be useful though, I doubt it's going to have an
> effect on me launching a konsole or the like, or screen creating a bunch
> of ttys.

systemd uses cgroups only for process tracking. No resource
management. Though afaik, Lennart has some plans of doing resource
management using systemd. I wonder how autogroups will interact with
systemd in that case.

Dhaval

2010-11-16 15:10:43

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 11/16, Mike Galbraith wrote:
>
> On Tue, 2010-11-16 at 14:04 +0100, Oleg Nesterov wrote:
> > However, I must admit I dislike this check. Because, looking at this
> > code, it is not clear why we check PF_EXITING. It looks as if it
> > is needed for correctness.
>
> Is _not_ needed I presume.
>
> I'll remove it, I'm not overly attached (a t t a..;) to it.

Argh!

I was wrong, it _is_ needed for correctness. Yes, it is always safe
to read the pointer, but

> > Yes, sure, rq->lock should ensure signal->autogroup can't go away.
> > (even if it can be changed under us). And it does, we are moving all
> > threads before kref_put().
>
> (yeah)

Exactly. And this means we can _only_ assume it can't go away if
autogroup_move_group() can see us on ->thread_group list.

Perhaps this deserves a comment (unless I missed something again).

Mike, sorry for confusion.

Oleg.
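
The comment Oleg asks for might read something like this, folded into the
heuristic check (an illustrative sketch; the helper name and the
init_task_group test are assumptions based on the discussion, not a posted
patch):

static inline bool
task_wants_autogroup(struct task_struct *p, struct task_group *tg)
{
        if (tg != &init_task_group)
                return false;
        /*
         * An exiting task may already be unhashed from ->thread_group,
         * in which case autogroup_move_group() cannot see it when it
         * moves all threads before dropping the old reference, so
         * p->signal->autogroup is not guaranteed to stay live.
         */
        if (p->flags & PF_EXITING)
                return false;
        return true;
}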

2010-11-16 15:42:12

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 16:03 +0100, Oleg Nesterov wrote:
> On 11/16, Mike Galbraith wrote:
> >
> > On Tue, 2010-11-16 at 14:04 +0100, Oleg Nesterov wrote:
> > > However, I must admit I dislike this check. Because, looking at this
> > > code, it is not clear why we check PF_EXITING. It looks as if it
> > > is needed for correctness.
> >
> > Is _not_ needed I presume.
> >
> > I'll remove it, I'm not overly attached (a t t a..;) to it.
>
> Argh!
>
> I was wrong, it _is_ needed for correctness. Yes, it is always safe
> to read the pointer, but
>
> > > Yes, sure, rq->lock should ensure signal->autogroup can't go away.
> > > (even if it can be changed under us). And it does, we are moving all
> > > threads before kref_put().
> >
> > (yeah)
>
> Exactly. And this means we can _only_ assume it can't go away if
> autogroup_move_group() can see us on ->thread_group list.

Aha!

> Perhaps this deserves a comment (unless I missed something again).
>
> Mike, sorry for confusion.

Oh no, thank you. I hadn't figured it out yet, was going to go back and
poke rt kernel with sharp sticks. (exit can be one scary beast)

-Mike

2010-11-16 17:12:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 9:03 AM, Lennart Poettering
<[email protected]> wrote:
>
> Binding something like this to TTYs is just backwards.

Numbers talk, bullshit walks.

The numbers have been quoted. The clear interactive behavior has been seen.

And you're just full of bullshit.

Come back when you have something working and with numbers and better
interactive performance. Until then, nobody cares.

Linus

2010-11-16 17:12:55

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 15:47, Dhaval Giani ([email protected]) wrote:

>
> On Tue, Nov 16, 2010 at 3:11 PM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, 2010-11-16 at 07:02 -0700, Mike Galbraith wrote:
> >> While you can identify firefox etc,
> >> it's not being done, and requires identifying every application. Heck,
> >> cgroups is built in, but userspace doesn't even mount it. Nothing but
> >> nothing uses cgroups.
> >
> > The yet another init rewrite called systemd is supposedly cgroup happy..
> > No idea if it's going to be useful though, I doubt it's going to have an
> > effect on me launching a konsole or the like, or screen creating a bunch
> > of ttys.
>
> systemd uses cgroups only for process tracking. No resource
> management. Though afaik, Lennart has some plans of doing resource
> management using systemd. I wonder how autogroups will interact with
> systemd in that case.

systemd already creates a named cgroup for each user who logs in and each
session inside it. That's implemented via pam_systemd, which is enabled
in all distros doing systemd. We create those groups right now only in
the named "systemd" hierarchy, but iiuc then simply doing the same in
the "cpu" hierarchy would have the exact same behaviour as this patch,
but actually is based on a sane definition of what a session is.

Binding something like this to TTYs is just backwards. No graphical
session has a TTY attached anymore. And there might be multiple TTYs
used in the same session.

I really wonder why logic like this should live in kernel space at all,
since a) the kernel has no real notion of a session, except audit, and b)
this is policy, and as soon as people have this kind of group they will
probably want other kinds of autogrouping as well for the other
controllers, which hence means userspace is a better, and more
configurable, place for this.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-16 17:28:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups


* Mike Galbraith <[email protected]> wrote:

> > Exactly. And this means we can _only_ assume it can't go away if
> > autogroup_move_group() can see us on ->thread_group list.
>
> Aha!

Mike,

Mind sending a new patch with a separate v2 announcement in a new thread, once you
have something i could apply to the scheduler tree (for a v2.6.38 merge)?

You sent a couple of iterations in this discussion and i'd rather not fish out the
wrong one.

Thanks,

Ingo

2010-11-16 17:42:21

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 18:28 +0100, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > > Exactly. And this means we can _only_ assume it can't go away if
> > > autogroup_move_group() can see us on ->thread_group list.
> >
> > Aha!
>
> Mike,
>
> Mind sending a new patch with a separate v2 announcement in a new thread, once you
> have something i could apply to the scheduler tree (for a v2.6.38 merge)?

Will do. (v3->936 became wider than expected;)

-Mike

2010-11-16 18:09:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 18:03 +0100, Lennart Poettering wrote:
> Binding something like this to TTYs is just backwards. No graphical
> session has a TTY attached anymore. And there might be multiple TTYs
> used in the same session.

Using a group per tty makes sense for us console jockeys..

Anyway, nobody uses systemd yet and afaik not all distros even plan on
using it (I know I'm not waiting to learn yet another init variant).

2010-11-16 18:16:25

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 09:11, Linus Torvalds ([email protected]) wrote:

>
> On Tue, Nov 16, 2010 at 9:03 AM, Lennart Poettering
> <[email protected]> wrote:
> >
> > Binding something like this to TTYs is just backwards.
>
> Numbers talk, bullshit walks.
>
> The numbers have been quoted. The clear interactive behavior has been seen.

Here's my super-complex patch btw, to achieve exactly the same thing
from userspace without involving any kernel or systemd patching and
kernel-side logic. Simply edit your own ~/.bashrc and add this to the end:

if [ "$PS1" ] ; then
        mkdir -m 0700 /sys/fs/cgroup/cpu/user/$$
        echo $$ > /sys/fs/cgroup/cpu/user/$$/tasks
fi

Then, as the superuser do this:

        mount -t cgroup cgroup /sys/fs/cgroup/cpu -o cpu
        mkdir -m 0777 /sys/fs/cgroup/cpu/user

Done. Same effect. However: not crazy.

I am not sure I myself will find the time to prep some 'numbers' for
you. They'd be the same as with the kernel patch anyway. But I am sure
somebody else will do it for you...

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-16 18:22:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 19:16 +0100, Lennart Poettering wrote:
> On Tue, 16.11.10 09:11, Linus Torvalds ([email protected]) wrote:
>
> >
> > On Tue, Nov 16, 2010 at 9:03 AM, Lennart Poettering
> > <[email protected]> wrote:
> > >
> > > Binding something like this to TTYs is just backwards.
> >
> > Numbers talk, bullshit walks.
> >
> > The numbers have been quoted. The clear interactive behavior has been seen.
>
> Here's my super-complex patch btw, to achieve exactly the same thing
> from userspace without involving any kernel or systemd patching and
> kernel-side logic. Simply edit your own ~/.bashrc and add this to the end:
>
> if [ "$PS1" ] ; then
> mkdir -m 0700 /sys/fs/cgroup/cpu/user/$$
> echo $$ > /sys/fs/cgroup/cpu/user/$$/tasks
> fi
>
> Then, as the superuser do this:
>
> mount -t cgroup cgroup /sys/fs/cgroup/cpu -o cpu
> mkdir -m 0777 /sys/fs/cgroup/cpu/user
>
> Done. Same effect. However: not crazy.
>
> I am not sure I myself will find the time to prep some 'numbers' for
> you. They'd be the same as with the kernel patch anyway. But I am sure
> somebody else will do it for you...

Not quite the same, you're nesting one level deeper. But the reality is,
not a lot of people will change their userspace.

2010-11-16 18:26:19

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 4:58 AM, Mike Galbraith <[email protected]> wrote:
>
> A tasty alternative would be to have autogroup be its own subsystem,
> with full cgroup userspace visibility/tweakability.

What exactly do you envisage by that? Having autogroup (in its current
incarnation) be a subsystem wouldn't really make sense - there's
already a cgroup subsystem for partitioning CPU scheduler groups. If
autogroups were integrated with cgroups I think that it would be as a
way of automatically creating (and destroying?) groups based on tty
connectedness.

We tried something like this with the ns subsystem, which would
create/enter a new cgroup whenever a new namespace was created; in the
end it turned out to be more of a nuisance than anything else.

People have proposed all sorts of in-kernel approaches for
auto-creation of cgroups based on things like userid, process name,
now tty, etc.

The previous effort for kernel process grouping (CKRM) started off
with a complex in-kernel rules engine that was ultimately dropped and
moved to userspace. My feeling is that userspace is a better place for
this - as Lennart pointed out, you can get a similar effect with a few
lines of tweaking in a bash login script or a pam module that's much more
configurable from userspace and keeps all the existing cgroup stats
available.

Paul

2010-11-16 18:34:19

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 10:21 AM, Peter Zijlstra <[email protected]> wrote:
>
> Not quite the same, you're nesting one level deeper. But the reality is,
> not a lot of people will change their userspace.

That's a weak argument - not a lot of people will (explicitly) change
their kernel either - they'll get a fresh kernel via their distro
updates, as they would get userspace updates. So it's only a few
people (distros) that actually need to make such a change.

Paul

2010-11-16 18:56:29

by David Lang

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16 Nov 2010, Paul Menage wrote:

> On Tue, Nov 16, 2010 at 10:21 AM, Peter Zijlstra <[email protected]> wrote:
>>
>> Not quite the same, you're nesting one level deeper. But the reality is,
>> not a lot of people will change their userspace.
>
> That's a weak argument - not a lot of people will (explicitly) change
> their kernel either - they'll get a fresh kernel via their distro
> updates, as they would get userspace updates. So it's only a few
> people (distros) that actually need to make such a change.

what is the downside of this patch going to be?

people who currently expect all the processes to compete equally will now
find that they no longer do so. If I am understanding this correctly, this
could mean that a box that was dedicated to running one application will
now have that application no longer dominate the system, instead it will
'share equally' with the housekeeping apps on the system.

what would need to be done to revert to the prior situation?

David Lang

2010-11-16 18:56:38

by Stephen Clark

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 11/16/2010 01:08 PM, Peter Zijlstra wrote:
> On Tue, 2010-11-16 at 18:03 +0100, Lennart Poettering wrote:
>
>> Binding something like this to TTYs is just backwards. No graphical
>> session has a TTY attached anymore. And there might be multiple TTYs
>> used in the same session.
>>
> Using a group per tty makes sense for us console jockeys..
>
> Anyway, nobody uses systemd yet and afaik not all distros even plan on
> using it (I know I'm not waiting to learn yet another init variant).
>
Right on!



--

"They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety." (Ben Franklin)

"The course of history shows that as a government grows, liberty
decreases." (Thomas Jefferson)


2010-11-16 18:57:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 10:16 AM, Lennart Poettering
<[email protected]> wrote:
>
> Here's my super-complex patch btw, to achieve exactly the same thing
> from userspace without involving any kernel or systemd patching and
> kernel-side logic. Simply edit your own ~/.bashrc and add this to the end:

Right. And that's basically how this "patch" was actually tested
originally - by doing this by hand, without actually having a patch in
hand. I told people: this seems to work really well. Mike made it work
automatically.

Because it's something we want to do for all users, and for all
shells, and make sure it gets done automatically. Including for users
that have old distributions etc, and make it easy to do in one place.
And then you do it for all the other heuristics we can see easily in
the kernel. And then you do it magically without users even having to
_notice_.

Suddenly it doesn't seem that wonderful any more to play with bashrc, does it?

That's the point. We can push out the kernel change, and everything
will "just work". We can make that feature we already have in the
kernel actually be _useful_.

User-level configuration for something that should just work is
annoying. We can do better.

Put another way: if we find a better way to do something, we should
_not_ say "well, if users want it, they can do this <technical thing
here>". If it really is a better way to do something, we should just
do it. Requiring user setup is _not_ a feature.

Now, I'm not saying that we shouldn't allow users to use cgroups. Of
course they can do things manually too. But we shouldn't require users
to do silly things that we can more easily do ourselves.

If the choice is between telling everybody "you should do this", and
"we should just do this for you", I'll take the second one every time.
We know it should be done. Why should we then tell somebody else to do
it for us?

Linus

2010-11-16 18:58:14

by Stephen Clark

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 11/16/2010 01:16 PM, Lennart Poettering wrote:
> On Tue, 16.11.10 09:11, Linus Torvalds ([email protected]) wrote:
>
>
>> On Tue, Nov 16, 2010 at 9:03 AM, Lennart Poettering
>> <[email protected]> wrote:
>>
>>> Binding something like this to TTYs is just backwards.
>>>
>> Numbers talk, bullshit walks.
>>
>> The numbers have been quoted. The clear interactive behavior has been seen.
>>
> Here's my super-complex patch btw, to achieve exactly the same thing
> from userspace without involving any kernel or systemd patching and
> kernel-side logic. Simply edit your own ~/.bashrc and add this to the end:
>
> if [ "$PS1" ] ; then
> mkdir -m 0700 /sys/fs/cgroup/cpu/user/$$
> echo $$ > /sys/fs/cgroup/cpu/user/$$/tasks
> fi
>
> Then, as the superuser do this:
>
> mount -t cgroup cgroup /sys/fs/cgroup/cpu -o cpu
> mkdir -m 0777 /sys/fs/cgroup/cpu/user
>
> Done. Same effect. However: not crazy.
>
> I am not sure I myself will find the time to prep some 'numbers' for
> you. They'd be the same as with the kernel patch anyway. But I am sure
> somebody else will do it for you...
>
> Lennart
>
>
So you have tested this and have a nice demo and numbers to back it up?

--

"They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety." (Ben Franklin)

"The course of history shows that as a government grows, liberty
decreases." (Thomas Jefferson)


2010-11-16 18:59:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 10:55 -0800, [email protected] wrote:
> On Tue, 16 Nov 2010, Paul Menage wrote:
>
> > On Tue, Nov 16, 2010 at 10:21 AM, Peter Zijlstra <[email protected]> wrote:
> >>
> >> Not quite the same, you're nesting one level deeper. But the reality is,
> >> not a lot of people will change their userspace.
> >
> > That's a weak argument - not a lot of people will (explicitly) change
> > their kernel either - they'll get a fresh kernel via their distro
> > updates, as they would get userspace updates. So it's only a few
> > people (distros) that actually need to make such a change.
>
> what is the downside of this patch going to be?
>
> people who currently expect all the processes to compete equally will now
> find that they no longer do so. If I am understanding this correctly, this
> could mean that a box that was dedicated to running one application will
> now have that application no longer dominate the system, instead it will
> 'share equally' with the housekeeping apps on the system.
>
> what would need to be done to revert to the prior situation?

build with: CONFIG_SCHED_AUTOGROUP=n,
boot with: noautogroup, or at
runtime: echo 0 > /proc/sys/kernel/sched_autogroup_enabled

(although the latter is a lazy one, it won't force existing tasks back
to the root group)

2010-11-16 19:03:59

by Pekka Enberg

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 8:49 PM, Linus Torvalds
<[email protected]> wrote:
> User-level configuration for something that should just work is
> annoying. We can do better.

Completely agreed. Desktop users should not be required to fiddle with
kernel knobs from userspace to fix interactivity problems. Having sane
defaults applies to the kernel as much as it does to userspace.

Pekka

2010-11-16 19:09:19

by David Lang

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16 Nov 2010, Linus Torvalds wrote:

> On Tue, Nov 16, 2010 at 10:16 AM, Lennart Poettering
> <[email protected]> wrote:
>>
>> Here's my super-complex patch btw, to achieve exactly the same thing
>> from userspace without involving any kernel or systemd patching and
>> kernel-side logic. Simply edit your own ~/.bashrc and add this to the end:
>
> Right. And that's basically how this "patch" was actually tested
> originally - by doing this by hand, without actually having a patch in
> hand. I told people: this seems to work really well. Mike made it work
> automatically.
>
> Because it's something we want to do for all users, and for all
> shells, and make sure it gets done automatically. Including for users
> that have old distributions etc, and make it easy to do in one place.
> And then you do it for all the other heuristics we can see easily in
> the kernel. And then you do it magically without users even having to
> _notice_.
>
> Suddenly it doesn't seem that wonderful any more to play with bashrc, does it?

agreed

> That's the point. We can push out the kernel change, and everything
> will "just work". We can make that feature we already have in the
> kernel actually be _useful_.
>
> User-level configuration for something that should just work is
> annoying. We can do better.
>
> Put another way: if we find a better way to do something, we should
> _not_ say "well, if users want it, they can do this <technical thing
> here>". If it really is a better way to do something, we should just
> do it. Requiring user setup is _not_ a feature.

agreed, the question is if this really is a better way. It's definitely a
change.

> Now, I'm not saying that we shouldn't allow users to use cgroups. Of
> course they can do things manually too. But we shouldn't require users
> to do silly things that we can more easily do ourselves.
>
> If the choice is between telling everybody "you should do this", and
> "we should just do this for you", I'll take the second one every time.
> We know it should be done. Why should we then tell somebody else to do
> it for us?

this is good for desktop interactivity because it no longer treats all
processes equally; it gives more CPU to processes that are running
'stand-alone' than it will to processes that are forked off from one
master process.

In the desktop case you really want something like 'make -j64' to be
in the background and not interfere with the other tasks, but what about
a dedicated web server? On that box the apache processes
will get less CPU because they will all be in one group, while other things
that happen on the box will get more CPU as they will be in different
groups.

Is this really the right change to make?

having an option to do this sounds like a wonderful idea, and I would
expect that desktop distros would want to have it default to 'on', but
should it really be a default?

David Lang

2010-11-16 19:10:00

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 07:59:25PM +0100, Peter Zijlstra wrote:
> On Tue, 2010-11-16 at 10:55 -0800, [email protected] wrote:
> > On Tue, 16 Nov 2010, Paul Menage wrote:
> >
> > > On Tue, Nov 16, 2010 at 10:21 AM, Peter Zijlstra <[email protected]> wrote:
> > >>
> > >> Not quite the same, you're nesting one level deeper. But the reality is,
> > >> not a lot of people will change their userspace.
> > >
> > > That's a weak argument - not a lot of people will (explicitly) change
> > > their kernel either - they'll get a fresh kernel via their distro
> > > updates, as they would get userspace updates. So it's only a few
> > > people (distros) that actually need to make such a change.
> >
> > what is the downside of this patch going to be?
> >
> > people who currently expect all the processes to compete equally will now
> > find that they no longer do so. If I am understanding this correctly, this
> > could mean that a box that was dedicated to running one application will
> > now have that application no longer dominate the system, instead it will
> > 'share equally' with the housekeeping apps on the system.
> >
> > what would need to be done to revert to the prior situation?
>
> build with: CONFIG_SCHED_AUTOGROUP=n,
> boot with: noautogroup, or at
> runtime: echo 0 > /proc/sys/kernel/sched_autogroup_enabled
>
> (although the latter is a lazy one, it won't force existing tasks back
> to the root group)

I think this might create some issues with controllers which support some
kind of upper limit on resource usage. These hidden groups can practically
consume any amount of resources, and because user space tools can't see them,
they will not be able to place a limit on them or control them.

If it is done from user space and the cgroups are visible, then the user can
at least monitor the resource usage and do something about it.

Thanks
Vivek

2010-11-16 19:13:14

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 01:57:33PM -0500, Stephen Clark wrote:
> On 11/16/2010 01:16 PM, Lennart Poettering wrote:
> >On Tue, 16.11.10 09:11, Linus Torvalds ([email protected]) wrote:
> >
> >>On Tue, Nov 16, 2010 at 9:03 AM, Lennart Poettering
> >><[email protected]> wrote:
> >>>Binding something like this to TTYs is just backwards.
> >>Numbers talk, bullshit walks.
> >>
> >>The numbers have been quoted. The clear interactive behavior has been seen.
> >Here's my super-complex patch btw, to achieve exactly the same thing
> >from userspace without involving any kernel or systemd patching and
> >kernel-side logic. Simply edit your own ~/.bashrc and add this to the end:
> >
> > if [ "$PS1" ] ; then
> > mkdir -m 0700 /sys/fs/cgroup/cpu/user/$$
> > echo $$ > /sys/fs/cgroup/cpu/user/$$/tasks
> > fi
> >
> >Then, as the superuser do this:
> >
> > mount -t cgroup cgroup /sys/fs/cgroup/cpu -o cpu
> > mkdir -m 0777 /sys/fs/cgroup/cpu/user
> >
> >Done. Same effect. However: not crazy.
> >
> >I am not sure I myself will find the time to prep some 'numbers' for
> >you. They'd be the same as with the kernel patch anyway. But I am sure
> >somebody else will do it for you...
> >
> >Lennart
> >
> So you have tested this and have a nice demo and numbers to back it up?

I just modified my .bashrc, and for each of my ssh sessions it seems to
be working fine, creating a cgroup as soon as I ssh into the box.

It will just need a little modification to reap the group automatically
when the shell exits.

Thanks
Vivek

2010-11-16 19:13:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 14:09 -0500, Vivek Goyal wrote:

> I think this might create some issues with controllers which support some
> kind of upper limit on resource usage. These hidden groups can practically
> consume any amount of resources, and because user space tools can't see them,
> they will not be able to place a limit on them or control them.
>
> If it is done from user space and the cgroups are visible, then the user can
> at least monitor the resource usage and do something about it.

It's cpu-controller only, and then only for SCHED_OTHER tasks which are
proportionally fair.

2010-11-16 19:23:30

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 08:13:20PM +0100, Peter Zijlstra wrote:
> On Tue, 2010-11-16 at 14:09 -0500, Vivek Goyal wrote:
>
> > I think this might create some issues with controllers which support some
> > kind of upper limit on resource usage. These hidden groups can practically
> > consume any amount of resources, and because user space tools can't see them,
> > they will not be able to place a limit on them or control them.
> >
> > If it is done from user space and the cgroups are visible, then the user can
> > at least monitor the resource usage and do something about it.
>
> It's cpu-controller only, and then only for SCHED_OTHER tasks which are
> proportionally fair.

In this thread there are already mentions of extending it to the block
controller also, which now supports upper limits.

Secondly, I am assuming that at some point the cpu controller will also
support throttling (including the SCHED_OTHER class).

So as it stands this is not a problem, but the moment we start extending
this logic to other controllers supporting upper limits, it does pose the
question of how we handle it.

Even if we automatically create groups inside the kernel (for all the
reasons why user space can't do it), it should probably create cgroups
also, so that these are visible to user space and not hidden.

Thanks
Vivek

2010-11-16 19:25:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 14:22 -0500, Vivek Goyal wrote:
> In this thread there are already mentions of extending it to the block
> controller also, which now supports upper limits.
>
> Secondly, I am assuming that at some point the cpu controller will also
> support throttling (including the SCHED_OTHER class).

Possibly, but in both cases it's voluntary -- that is, unless explicitly
configured a cgroup can use as much as possible. These auto groups will
use that.

If you want more control, create your own cgroup so you can poke at the
parameters at will.

2010-11-16 19:27:30

by Dhaval Giani

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 7:49 PM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Nov 16, 2010 at 10:16 AM, Lennart Poettering
> <[email protected]> wrote:
>>
>> Here's my super-complex patch btw, to achieve exactly the same thing
>> from userspace without involving any kernel or systemd patching and
>> kernel-side logic. Simply edit your own ~/.bashrc and add this to the end:
>
> Right. And that's basically how this "patch" was actually tested
> originally - by doing this by hand, without actually having a patch in
> hand. I told people: this seems to work really well. Mike made it work
> automatically.
>
> Because it's something we want to do for all users, and for all
> shells, and make sure it gets done automatically. Including for users
> that have old distributions etc, and make it easy to do in one place.
> And then you do it for all the other heuristics we can see easily in
> the kernel. And then you do it magically without users even having to
> _notice_.
>

So what do you think about something like systemd handling it? systemd
already does a lot of this stuff in the form of process
tracking, so it is quite trivial to do this. And it more happily avoids
all this complexity in the kernel.

Thanks,
Dhaval

2010-11-16 19:41:58

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 08:25:27PM +0100, Peter Zijlstra wrote:
> On Tue, 2010-11-16 at 14:22 -0500, Vivek Goyal wrote:
> > In this thread there are already mentions of extending it to the block
> > controller also, which now supports upper limits.
> >
> > Secondly, I am assuming that at some point the cpu controller will also
> > support throttling (including the SCHED_OTHER class).
>
> Possibly, but in both cases it's voluntary -- that is, unless explicitly
> configured a cgroup can use as much as possible. These auto groups will
> use that.
>
> If you want more control, create your own cgroup so you can poke at the
> parameters at will.

I can create my own cgroups and control these. But one would like to
control these hidden groups also. For example, in a clustered environment
one might not want to allow more than a certain number of IOPS from machine
X. In such a scenario one would like to see all the groups and possibly put
a limit on them.

Now one can argue that these groups are under root, so put an overall limit
on root and these groups get automatically controlled. But controllers
like block and memory also support a flat mode where all the groups are
at the same level.

        pivot
       /  |  \
    root  g2  g3

Here g2 and g3 are hidden auto groups, and because it is a flat
configuration, putting a limit on root will not help: one cannot limit g2
and g3, and one cannot control the total amount of IO from the system.

Now one can say: use hierarchy and not a flat setup. I am writing patches
to enable hierarchy, but that is disabled by default, as the flat setup came
first and there are also concerns about accounting overhead etc.

So the point being that these autogroups being hidden is a concern. Can we
do something so that these groups show up in the cgroup hierarchy and are
then user controllable?

Thanks
Vivek

2010-11-16 19:43:05

by Diego Calleja

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tuesday, 16 November 2010 20:27:27, Dhaval Giani wrote:
> So what do you think about something like systemd handling it? systemd
> already does a lot of this stuff in the form of process
> tracking, so it is quite trivial to do this. And it more happily avoids
> all this complexity in the kernel.

Note that even if this was a mistake, systemd can disable it and use its
own heuristics. It would be nice to know what Lennart thinks about that,
because it would be pointless to have a feature that could eventually
be disabled at each boot in most distros.

2010-11-16 19:43:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 14:40 -0500, Vivek Goyal wrote:
>
> So the point being that these autogroups being hidden is a concern. Can we
> do something so that these groups show up in the cgroup hierarchy and are
> then user controllable?

Well, either you disable the autogroup feature, or explicitly put each
task in a taskgroup other than root and you'll be fine.

Nothing prevents that from working.

2010-11-16 19:44:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 11:13 AM, Peter Zijlstra <[email protected]> wrote:
>
> It's cpu-controller only, and then only for SCHED_OTHER tasks which are
> proportionally fair.

Well, it's _currently_ CPU controller only. People have already
wondered if we should try to do something similar for IO scheduling
too.

So the thing I think is worth keeping in mind is that the "per-tty
scheduling group" is really just an implementation issue. There is
absolutely no question that it could end up being about more than just
scheduling, and about more than just tty's too.

And an important thing to keep in mind is that "user interfaces are
bad". The thinner the interface, the better. One of the reasons I
really like autogroup is that it has _no_ interface at all. It's very
much a heuristic, and it has zero user interface (apart from the knob
that turns it on and off, of course). That is a great feature, because
it means that you cannot break the interface. You will never need to
have applications that have special linux-specific hooks in them, or
system daemons who have to touch magical /proc files etc.

One of the problems I found annoying when just testing it using the
plain cgroup interface (before the patch) was the resource management.
You needed root, and requiring root actually made sense, because
I don't think we _want_ to allow people to create infinite numbers of
cgroups. Vivek's "trivial patch" (shell script) is a major DoS thing,
for example. Letting normal users create cgroups willy-nilly is not a
good idea (and as Vivek already found out, his trivial script leaks
cgroups in a pretty fundamental way).

The tty approach is somewhat self-limiting in that it requires you to
get the tty to get an autogroup. But also, because it's very much a
heuristic and doesn't have any user-visible interfaces, from a kernel
perspective it's wonderful. There are no "semantics" to break. If it
turns out that there is some way to create excessive cgroups, we can
introduce per-user limits etc to say "the heuristic works up to X
cgroups and then you'll just get your own user group". And nobody
would ever notice.

So doing things automatically and without any user interface is about
_more_ than just convenience. If it can be done that way, it is
a fundamentally better way to do things. Because it hides the
implementation details, and leaves us open to do totally different
things in the end.

For example, 'cgroups' itself is pretty heavy-weight, and is really
quite smart. Those things nest, etc etc. But with the "it's just a
heuristic", maybe somebody ends up doing a "simplified non-nesting
grouping thing", and if you don't want the whole cgroup thing (I have
always answered no to CONFIG_CGROUPS myself, for example), you could
still do the autogrouping. But you could _not_ cleanly do the
/proc/sys/cgroup/user scripting, because your implementation is no
longer based on the whole cgroups thing.

Now, will any of this ever happen? I dunno. I doubt it will matter.
But it's an example of why I think it's such a great approach, and why
"it just works" is such an important feature.

Linus

2010-11-16 19:46:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 11:27 AM, Dhaval Giani <[email protected]> wrote:
>
> So what do you think about something like systemd handling it? systemd
> already does a lot of this stuff in the form of process tracking, so it
> is quite trivial to do this. And it more happily avoids all this
> complexity in the kernel.

What complexity? Have you looked at the patch? It has no complexity anywhere.

It's a _lot_ less complex than having system daemons you don't
control. We have not had good results with that approach in the past.
System daemons tend to cause nasty problems, and debugging them is a
nightmare.

Have you ever tried to debug the interaction between acpid and the
kernel? I doubt it. It's not simple.

Did you ever try to understand how the old pcmcia code worked, with
the user daemon to handle the add/remove requests?

Have you ever really worked with the "interesting" situations that
come from the X server handling graphics mode switching, and thus
being involved in suspend, resume and VT switching? Trust me, just
designing the _interfaces_ to do that thing is nasty, and the code
itself was a morass. There's a reason the graphics people wanted to
put modeswitching in the kernel. Because doing it in a daemon is a
f*cking pain in the ass.

Put another way: "do it in user space" usually does _not_ make things simpler.

For example: how do you do reference counting for a cgroup in user
space, when processes fork and exit without you even knowing it? In
kernel space, it's _trivial_. That's what kernel/autogroup.c does, and
it has lots of support for it, because that kind of reference counting
is exactly what the kernel does.

In a system daemon? Good luck with that. It's a nightmare. Maybe you
could just poll all the cgroups, and try to remove them once a minute,
and if they are empty it works. Or something like that. But what a
hacky thing it would be.

And more importantly: I don't run systemd. Neither do a lot of other
people. The way the patch does things, "it just works".

Did you go to the phoronix forum to look at how people reacted to the
phoronix article about the patch? There were a number of testers. It
was just so _easy_ to test and set up. If you want people to run some
specific system daemon, it immediately gets much harder to set up and
do.

Linus

2010-11-16 19:48:08

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 2010.11.16 at 19:16 +0100, Lennart Poettering wrote:
> On Tue, 16.11.10 09:11, Linus Torvalds ([email protected]) wrote:
>
> >
> > On Tue, Nov 16, 2010 at 9:03 AM, Lennart Poettering
> > <[email protected]> wrote:
> > >
> > > Binding something like this to TTYs is just backwards.
> >
> > Numbers talk, bullshit walks.
> >
> > The numbers have been quoted. The clear interactive behavior has been seen.
>
> Here's my super-complex patch btw, to achieve exactly the same thing
> from userspace without involving any kernel or systemd patching and
> kernel-side logic. Simply edit your own ~/.bashrc and add this to the end:
>
> if [ "$PS1" ] ; then
>         mkdir -m 0700 /sys/fs/cgroup/cpu/user/$$
>         echo $$ > /sys/fs/cgroup/cpu/user/$$/tasks
> fi
>
> Then, as the superuser do this:
>
> mount -t cgroup cgroup /sys/fs/cgroup/cpu -o cpu
> mkdir -m 0777 /sys/fs/cgroup/cpu/user
>
> Done. Same effect. However: not crazy.
>
> I am not sure I myself will find the time to prep some 'numbers' for
> you. They'd be the same as with the kernel patch anyway. But I am sure
> somebody else will do it for you...

OK, I've done some tests and the result is that Lennart's approach seems
to work best. It also _feels_ better interactively compared to the
vanilla kernel and in-kernel cgroups on my machine. Also it's really
nice to have an interface to actually see what is going on; with the
kernel patch you're totally in the dark.

Here are some numbers all recorded while running a make -j4 job in one
shell.

perf sched record sleep 30
perf trace -s /usr/libexec/perf-core/scripts/perl/wakeup-latency.pl :

vanilla kernel without cgroups:
total_wakeups: 44306
avg_wakeup_latency (ns): 36784
min_wakeup_latency (ns): 0
max_wakeup_latency (ns): 9378852

with in-kernel patch:
total_wakeups: 43836
avg_wakeup_latency (ns): 67607
min_wakeup_latency (ns): 0
max_wakeup_latency (ns): 8983036

with Lennart's approach:
total_wakeups: 51070
avg_wakeup_latency (ns): 29047
min_wakeup_latency (ns): 0
max_wakeup_latency (ns): 10008237

perf record -a -e sched:sched_switch -e sched:sched_wakeup sleep 10
perf trace -s /usr/libexec/perf-core/scripts/perl/wakeup-latency.pl :

without cgroups:
total_wakeups: 13195
avg_wakeup_latency (ns): 48484
min_wakeup_latency (ns): 0
max_wakeup_latency (ns): 8722497

with in-kernel approach:
total_wakeups: 14106
avg_wakeup_latency (ns): 92532
min_wakeup_latency (ns): 20
max_wakeup_latency (ns): 5642393

Lennart's approach:
total_wakeups: 22215
avg_wakeup_latency (ns): 24118
min_wakeup_latency (ns): 0
max_wakeup_latency (ns): 8001142
--
Markus

2010-11-16 19:56:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 11:40 AM, Vivek Goyal <[email protected]> wrote:
>
> So the point is that these autogroups being hidden is a concern. Can we
> do something so that these groups show up in the cgroup hierarchy and are
> then user controllable?

I think that's valid, and I don't think it would be wrong to have some
interface to show them.

Of course, then it's an interface question, and you'd want to be a bit
careful in designing it.

Linus

2010-11-16 19:57:00

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 11:45 AM, Linus Torvalds
<[email protected]> wrote:
>
> In a system daemon? Good luck with that. It's a nightmare. Maybe you
> could just poll all the cgroups, and try to remove them once a minute,
> and if they are empty it works. Or something like that. But what a
> hacky thing it would be.

There's an existing cgroups API - release_agent and notify_on_release
- whereby the kernel will spawn a userspace command once a given
cgroup is completely empty. It's intended for pretty much exactly this
purpose.
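
A sketch of the two knobs Paul means, assuming a cpu hierarchy mounted at
/cgroup/cpu (the paths and the agent script are illustrative; Vivek's
message further down shows a complete working setup):

    # hierarchy-wide: the command the kernel spawns when a group empties
    echo /usr/local/bin/reap-cgroup > /cgroup/cpu/release_agent
    # per group: opt in to the notification
    echo 1 > /cgroup/cpu/mygroup/notify_on_release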

Paul

2010-11-16 19:58:17

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 14:12 -0500, Vivek Goyal wrote:
> I just modified my .bashrc, and for each of my ssh sessions it seems to
> be working fine and creating a cgroup as soon as I ssh into the box.
>
> Just that it will need a little modification to reap the group automatically
> when the shell exits.

Yeah, self-reaping launcher scripts work fine, I cobbled one together.
There's no question that the infrastructure works.

-Mike

2010-11-16 20:04:18

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 19:21, Peter Zijlstra ([email protected]) wrote:

>
> On Tue, 2010-11-16 at 19:16 +0100, Lennart Poettering wrote:
> > On Tue, 16.11.10 09:11, Linus Torvalds ([email protected]) wrote:
> >
> > >
> > > On Tue, Nov 16, 2010 at 9:03 AM, Lennart Poettering
> > > <[email protected]> wrote:
> > > >
> > > > Binding something like this to TTYs is just backwards.
> > >
> > > Numbers talk, bullshit walks.
> > >
> > > The numbers have been quoted. The clear interactive behavior has been seen.
> >
> > Here's my super-complex patch btw, to achieve exactly the same thing
> > from userspace without involving any kernel or systemd patching and
> > kernel-side logic. Simply edit your own ~/.bashrc and add this to the end:
> >
> > if [ "$PS1" ] ; then
> >         mkdir -m 0700 /sys/fs/cgroup/cpu/user/$$
> >         echo $$ > /sys/fs/cgroup/cpu/user/$$/tasks
> > fi
> >
> > Then, as the superuser do this:
> >
> > mount -t cgroup cgroup /sys/fs/cgroup/cpu -o cpu
> > mkdir -m 0777 /sys/fs/cgroup/cpu/user
> >
> > Done. Same effect. However: not crazy.
> >
> > I am not sure I myself will find the time to prep some 'numbers' for
> > you. They'd be the same as with the kernel patch anyway. But I am sure
> > somebody else will do it for you...
>
> Not quite the same, you're nesting one level deeper. But the reality is,
> not a lot of people will change their userspace.

Well, remove the 'user' part of the path and you have the exact same behaviour.

Userspace usually gets updated way more frequently in most distributions
than the kernel is. Maybe *you* never update userspace. But well, you
are not the exemplary Linux user, are you?

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-16 20:06:08

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 19:08, Peter Zijlstra ([email protected]) wrote:

>
> On Tue, 2010-11-16 at 18:03 +0100, Lennart Poettering wrote:
> > Binding something like this to TTYs is just backwards. No graphical
> > session has a TTY attached anymore. And there might be multiple TTYs
> > used in the same session.
>
> Using a group per tty makes sense for us console jockeys..

Well, then maybe you shouldn't claim this was relevant for anybody but
yourself. Because it is irrelevant for most users if it is bound to the TTY.

> Anyway, nobody uses systemd yet and afaik not all distros even plan on
> using it (I know I'm not waiting to learn yet another init variant).

Well, the way it looks right now we have convinced all big distros.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-16 20:12:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 21:03 +0100, Lennart Poettering wrote:
> Userspace usually gets updated way more frequently in most distributions
> than the kernel is. Maybe *you* never update userspace. But well, you
> are not the exemplary Linux user, are you?

I run very recent userspace on my desktop and laptop, and I get really
tired of fixing it every upgrade. That said, I do run more recent
kernels than get shipped, in fact I never run distro kernels at all.

2010-11-16 20:15:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 21:05 +0100, Lennart Poettering wrote:
> > Using a group per tty makes sense for us console jockeys..
>
> Well, then maybe you shouldn't claim this was relevant for anybody but
> yourself. Because it is irrelevant for most users if it is bound to the TTY.

I never made such a claim, in fact my kernels will have this =n.

> > Anyway, nobody uses systemd yet and afaik not all distros even plan on
> > using it (I know I'm not waiting to learn yet another init variant).
>
> Well, the way it looks right now we have convinced all big distros.

Crap, that means I get to blame you for more than non-working sound. I get
really fed up having to re-discover how to fix broken init scripts on
every upgrade :/

2010-11-16 20:18:48

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 11:56:33AM -0800, Paul Menage wrote:
> On Tue, Nov 16, 2010 at 11:45 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > In a system daemon? Good luck with that. It's a nightmare. Maybe you
> > could just poll all the cgroups, and try to remove them once a minute,
> > and if they are empty it works. Or something like that. But what a
> > hacky thing it would be.
>
> There's an existing cgroups API - release_agent and notify_on_release
> - whereby the kernel will spawn a userspace command once a given
> cgroup is completely empty. It's intended for pretty much exactly this
> purpose.

Yes. And it seems to be working just fine for me. I modified Lennart's
script a bit to achieve that.

Addition to my .bashrc:

if [ "$PS1" ] ; then
        mkdir -m 0700 -p /cgroup/cpu/$$
        echo 1 > /cgroup/cpu/$$/notify_on_release
        echo $$ > /cgroup/cpu/$$/tasks
fi

I created one file, /bin/rmcgroup, to clean up the cgroup (the kernel
invokes the release_agent with the released group's path, relative to the
hierarchy root, as its first argument).

#!/bin/bash
rmdir /cgroup/cpu/$1

And I did the following to mount the cgroup hierarchy and set up
empty-group notification:

mount -t cgroup -o cpu none /cgroup/cpu
echo "/bin/rmcgroup" > /cgroup/cpu/release_agent

And it works fine. Upon ssh to my box, a cpu cgroup is automatically
created, and upon exiting the shell, this group is automatically destroyed.

So the API/interface for automatically reclaiming a cgroup once it is empty
seems to be pretty simple and works.

Thanks
Vivek

2010-11-16 20:22:13

by Kay Sievers

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 20:03, Pekka Enberg <[email protected]> wrote:
> On Tue, Nov 16, 2010 at 8:49 PM, Linus Torvalds
> <[email protected]> wrote:
>> User-level configuration for something that should just work is
>> annoying. We can do better.
>
> Completely agreed. Desktop users should not be required to fiddle with
> kernel knobs from userspace to fix interactivity problems. Having sane
> defaults applies to the kernel as much as it does to userspace.

The only problem is that *desktop* users use *desktop* apps, which
never have a controlling tty.

This is a *nerd* config not helping any regular desktop user. You guys
are optimizing the system for people who build kernels and watch
movies at the same time, not for desktop users. Hooking into TTYs
works for almost nobody sane out there. :)

It's all fine, no comment on the patch, it's a nice hack. But please
stop talking about the *desktop* here, it has almost nothing to do
with it.

Kay

2010-11-16 20:29:02

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 10:49, Linus Torvalds ([email protected]) wrote:

> Because it's something we want to do it for all users, and for all
> shells, and make sure it gets done automatically. Including for users
> that have old distributions etc, and make it easy to do in one place.
> And then you do it for all the other heuristics we can see easily in
> the kernel. And then you do it magically without users even having to
> _notice_.

Well, this feature is pretty much interesting only for kernel hackers
anyway. It is completely irrelevant for normal desktop people, because
a) normal users don't use many terminals anyway, and then only very seldom,
and b) if they do, they do not create a gazillion processes from one of
the terminals and only very few in the other. Only then does this
patch become relevant.

Heck, the patch wouldn't even have any effect on my machine (and I am a
hacker), because I tend to run my builds from within emacs. And since
emacs has no TTY (since it is an X11/gtk build) it wouldn't be in its own
scheduling group.

So, this patch only has an effect for people who build kernels from an
xterm with make -j all day, and at the same time want to watch a movie,
from a player they also start from a terminal, but from another one.

Only in such a setup does this patch matter. For everybody else it is
completely irrelevant. If you don't use terminals it has no effect. If
you don't run massively parallel CPU hoggers it has no effect. If you do
not run your mplayer from a terminal it has no effect.

The kernel bears your name, but that doesn't mean your own use of it is
typical for anyone beyond you and a number of kernel hackers like you.

Also, this implicit assumption that nobody would ever see this because
people upgrade their kernels often and userspace seldom is just nonsense
anyway. That's how *you* do it. But in reality userspace is updated for
most folks way more often than the kernel is.

Or to turn this around: think how awesome it would be if we did this
in userspace; then we could support older kernels too and wouldn't have
to upgrade everything to your not-yet-released new kernel.

Suddenly it doesn't seem that wonderful anymore to play with the kernel
to add this, does it?

> Suddenly it doesn't seem that wonderful any more to play with bashrc,
> does it?

Well, I have no plans to push anybody to do this in bash really. All
I am saying is that tying this to a tty is crazy. And this is policy, and
should be done in userspace, and we are almost there to do it in
userspace.

In fact, I just prepped a patch to systemd to move every service and
every user session into its own cgroup in the 'cpu' hierarchy (in
addition to the group it already creates in the 'systemd' hierarchy). On
a system that runs systemd for both managing users and sessions this
means you are already halfway to what you want.

> User-level configuration for something that should just work is
> annoying. We can do better.

Well, so you say, as a kernel hacker. For you it is easier to hack
the kernel than to fiddle with the rest of the stack. That doesn't make it
the right fix. You know, there are a lot of userspace folks who are afraid
of, and too lazy for, hacking the kernel, and hence would rather work around
kernel fuckups in userspace than fix them properly. But you are doing it the
other way round: since userspace gives you the creeps, you want to add
something to the kernel that doesn't really have any value for anybody
but a number of kernel folks, and is policy anyway, and is what userspace
people are working on anyway.

> Put another way: if we find a better way to do something, we should
> _not_ say "well, if users want it, they can do this <technical thing
> here>". If it really is a better way to do something, we should just
> do it. Requiring user setup is _not_ a feature.

Well, if I make behaviour like this default in systemd, then this means
there won't be user setup for this. Because the distros shipping systemd
will get this as default behaviour.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-16 20:31:25

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 21:03, Pekka Enberg ([email protected]) wrote:

>
> On Tue, Nov 16, 2010 at 8:49 PM, Linus Torvalds
> <[email protected]> wrote:
> > User-level configuration for something that should just work is
> > annoying. We can do better.
>
> Completely agreed. Desktop users should not be required to fiddle with
> kernel knobs from userspace to fix interactivity problems. Having sane
> defaults applies to the kernel as much as it does to userspace.

Jeez. Don't mention the desktop. On the desktop this is completely
irrelevant. There are no TTYs on the desktop. There's no "make -j" of
the kernel tree on the desktop.

The kernel patch discussed here *has* *no* *relevance* for normal users.

The kernel patch discussed here is only relevant for people who start
mplayer from one terminal, and "make -j" from another.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-16 20:33:29

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 11:08, [email protected] ([email protected]) wrote:

> >If the choice is between telling everybody "you should do this", and
> >"we should just do this for you", I'll take the second one every time.
> >We know it should be done. Why should we then tell somebody else to do
> >it for us?
>
> this is good for desktop interactivity because it no longer treats
> all processes equally, it gives more CPU to processes that are
> running 'stand-alone' than it will to processes that are forked off
> from one master process.

This isn't good for desktop interactivity. It is *irrelevant* for
desktop interactivity -- unless you define running "make -j" as a typical
desktop usecase. Which it isn't.

Just stop bringing up the word "desktop" here. It has no place in
this discussion.

> In the desktop case where you really want something like 'make -j64'

No you don't. Because that is not a desktop use case.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-16 20:36:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 12:21 PM, Kay Sievers <[email protected]> wrote:
>
> The only problem is that *desktop* users use *desktop* apps, which
> never have a controlling tty.

Yes. And have you noticed people complaining about stuttering while
they run just their desktop app? No.

That's the thing. If you run a web browser and you use a flash player
to view youtube or whatever in it, and only use other interactive
desktop apps anyway, you won't see any problems _regardless_. It's not
like the other desktop app you have open (and which is idle, because you're
not touching it) will matter.

Ask yourself who is complaining? _I_ have been complaining for years
about desktop latency. I usually do it in private to developers, but
trust me, I do it. Much of it has been about IO (the whole fsync
fiasco), but some of it has been about the scheduler.

Look at who was trying out Con's patches. They were compiling things
and running games at the same time. That's literally the kind of load
people were looking at. Partly because those are the kinds of loads
we haven't been good at. And it's the kind of load that this helps
with.

Anyway, I find it depressing that now that this is solved, people come
out of the woodwork and say "hey you could do this". Where were you
guys a year ago or more?

Tough. I found out that I can solve it using cgroups, I asked people
to comment and help, and I think the kernel approach is wonderful and
_way_ simpler than the scripts I've seen. Yes, I'm biased ("kernels
are easy - user space maintenance is a big pain").

Linus

2010-11-16 20:37:05

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 13:57, Stephen Clark ([email protected]) wrote:

> >I am not sure I myself will find the time to prep some 'numbers' for
> >you. They'd be the same as with the kernel patch anyway. But I am sure
> >somebody else will do it for you...
>
> So you have tested this and have a nice demo and numbers to back it up?

Yes, I tested this. And I thought the paragraph was explicit enough
about the numbers.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-16 20:39:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 12:33 PM, Lennart Poettering
<[email protected]> wrote:
>
> No you don't. Because that is not a desktop use case.

See my other response. You don't care AT ALL, because by your
judgement, all desktop is is a web browser and a word processor.

So why are you even discussing this? It's irrelevant for you. cgroups
will _never_ matter for what you are talking about, and that has
nothing to do with ttys, automation, scripting or anything else.

Because your definition of "desktop" seems to be "only interactive
apps". So this i all irrelevant.

Which is fine by me. It's not what the patch is supposed to affect.

Linus

2010-11-16 20:44:32

by Pekka Enberg

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 11:08, [email protected] ([email protected]) wrote:
>> >If the choice is between telling everybody "you should do this", and
>> >"we should just do this for you", I'll take the second one every time.
>> >We know it should be done. Why should we then tell somebody else to do
>> >it for us?
>>
>> this is good for desktop interactivity because it no longer treats
>> all processes equally, it gives more CPU to processes that are
>> running 'stand-alone' than it will to processes that are forked off
>> from one master process.

On Tue, Nov 16, 2010 at 10:33 PM, Lennart Poettering
<[email protected]> wrote:
> This isn't good for desktop interactivity. It is *irrelevant* for
> desktop interactivity -- unless you define running "make -j" as a typical
> desktop usecase. Which it isn't.
>
> Just stop bringing up the word "desktop" here. It has no place in
> this discussion.
>
>> In the desktop case where you really want something like 'make -j64'
>
> No you don't. Because that is not a desktop use case.

How about you stop bringing *your* narrow definition of "desktop" to
the discussion? I am a kernel hacker but that doesn't mean I'm not
also a desktop user. I shouldn't need to play with cgroups from
userspace to keep music playing while I compile kernels. Things should
just work.

Pekka

2010-11-16 20:45:54

by David Miller

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

From: Lennart Poettering <[email protected]>
Date: Tue, 16 Nov 2010 21:28:39 +0100

> Heck, the patch wouldn't even have any effect on my machine (and I am a
> hacker), because I tend to run my builds from within emacs. And since
> emacs has no TTY (since it is an X11/gtk build) it wouldn't be in its own
> scheduling group.

Type 'tty' in that emacs shell, what does it give you?

It does get its own TTY and it will thus get its own scheduling
group.

2010-11-16 20:51:08

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 11:45, Linus Torvalds ([email protected]) wrote:

>
> On Tue, Nov 16, 2010 at 11:27 AM, Dhaval Giani <[email protected]> wrote:
> >
> > So what do you think about something like systemd handling it? systemd
> > already does a lot of this stuff in the form of process tracking, so it
> > is quite trivial to do this. And it more happily avoids all this
> > complexity in the kernel.
>
> What complexity? Have you looked at the patch? It has no complexity anywhere.
>
> It's a _lot_ less complex than having system daemons you don't
> control. We have not had good results with that approach in the past.
> System daemons tend to cause nasty problems, and debugging them is a
> nightmare.

Well, userspace doesn't bite.

> For example: how do you do reference counting for a cgroup in user
> space, when processes fork and exit without you even knowing it? In
> kernel space, it's _trivial_. That's what kernel/autogroup.c does, and
> it has lots of support for it, because that kind of reference counting
> is exactly what the kernel does.

You can just do an rmdir from the cgroup release handler. Heck, "rmdir"
is a pretty good GC in itself, since it deletes a cgroup only if it is
empty.
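
A sketch of that rmdir-as-GC idea, assuming per-session groups live under
/sys/fs/cgroup/cpu/user (the path is illustrative):

    # reap finished sessions; rmdir fails harmlessly on non-empty cgroups
    for g in /sys/fs/cgroup/cpu/user/*/; do
            rmdir "$g" 2>/dev/null
    done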

> In a system daemon? Good luck with that. It's a nightmare. Maybe you
> could just poll all the cgroups, and try to remove them once a minute,
> and if they are empty it works. Or something like that. But what a
> hacky thing it would be.

Well, that nightmare already exists. It's systemd. We use the cgroup
release handler. If you ask me it's an awful interface, but it works
fine.

> And more importantly: I don't run systemd. Neither do a lot of other
> people. The way the patch does things, "it just works".

So this basically boils down to the fact that this is useful for your
particular usecase. Because you don't want to update userspace. But
don't claim this would be useful for anybody but you. It is definitely
irrelevant for the usual desktop usecase.

> Did you go to the phoronix forum to look at how people reacted to the
> phoronix article about the patch? There were a number of testers. It
> was just so _easy_ to test and set up. If you want people to run some
> specific system daemon, it immediately gets much harder to set up and
> do.

Jeez. Phoronix!

If you truly believe that the Phoronix usecase of running "make -j64"
over the kernel tree was in any way relevant in real life for anybody
but kernel developers, then I can't help you.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-16 20:55:16

by Alan

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

> Well, if I make behaviour like this default in systemd, then this means
> there won't be user setup for this. Because the distros shipping systemd
> will get this as default behaviour.

And within the desktop where would you put this - in the window manager
on the basis of top level windows or in the app startup ?

Alan

2010-11-16 21:08:26

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 12:46, David Miller ([email protected]) wrote:

>
> From: Lennart Poettering <[email protected]>
> Date: Tue, 16 Nov 2010 21:28:39 +0100
>
> > Heck, the patch wouldn't even have any effect on my machine (and I am a
> > hacker), because I tend to run my builds from within emacs. And since
> > emacs has no TTY (since it is an X11/gtk build) it wouldn't be in its own
> > scheduling group.
>
> Type 'tty' in that emacs shell, what does it give you?

I am running "make -j", not a shell, from emacs. It doesn't have a tty.

I just want to build something, not have a terminal emulator in emacs.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-16 21:09:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 12:52 PM, Alan Cox <[email protected]> wrote:
>> Well, if I make behaviour like this default in systemd, then this means
>> there won't be user setup for this. Because the distros shipping systemd
>> will get this as default behaviour.
>
> And within the desktop where would you put this - in the window manager
> on the basis of top level windows or in the app startup ?

Btw, I suspect either of these are reasonable. In fact, I don't think
it would be at all wrong to have the desktop launcher have an option
to "launch in a group" (although I think it would need to be named
better than that). Right now, when you create desktop launchers under
at least gnome, it allows you to specify a "type" for the application
("Application" or "Application in Terminal"), and maybe there could be
a "CPU-bound application" choice that would set it in a CPU group of
its own. Or whatever.

So I do _not_ believe that the autogroup feature should necessarily
mean that you cannot do other grouping decisions too. I just do think
that the whole notion of "it got started from a tty" is actually a
very useful thing for legacy applications, and one where it's just
simpler to do it in the kernel than build up any extra infrastructure
for it.

So it's not necessarily at all an "either-or" thing.

Linus

2010-11-16 21:14:09

by David Miller

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

From: Lennart Poettering <[email protected]>
Date: Tue, 16 Nov 2010 22:08:04 +0100

> On Tue, 16.11.10 12:46, David Miller ([email protected]) wrote:
>
>>
>> From: Lennart Poettering <[email protected]>
>> Date: Tue, 16 Nov 2010 21:28:39 +0100
>>
>> > Heck, the patch wouldn't even have any effect on my machine (and I am a
>> > hacker), because I tend to run my builds from within emacs. And since
>> > emacs has no TTY (since it is an X11/gtk build) it wouldn't be in its own
>> > scheduling group.
>>
>> Type 'tty' in that emacs shell, what does it give you?
>
> I am running "make -j", not a shell, from emacs. It doesn't have a tty.
>
> I just want to build something, not have a terminal emulator in emacs.

It does the same exact thing. Please actually check and verify
your facts instead of tossing back kneejerk responses.

davem@sunset:~/src/GIT$ cat Makefile
all:
	tty
davem@sunset:~/src/GIT$

"M-x compile" in that directory gives:

--------------------
-*- mode: compilation; default-directory: "~/src/GIT/" -*-
Compilation started at Tue Nov 16 13:13:01

make -k
tty
/dev/pts/2

Compilation finished at Tue Nov 16 13:13:01
--------------------

2010-11-16 21:14:53

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 12:38, Linus Torvalds ([email protected]) wrote:

>
> On Tue, Nov 16, 2010 at 12:33 PM, Lennart Poettering
> <[email protected]> wrote:
> >
> > No you don't. Because that is not a desktop use case.
>
> See my other response. You don't care AT ALL, because by your
> judgement, all desktop is is a web browser and a word processor.

Well, I do care. But I care more about *real* problems. For example the
fact that "updatedb" makes your system sluggish while it runs. Or
"man-db". Or anything else that runs from cron in the background.

Doing this tty dance won't help you much with background tasks such as
man-db, updatedb and cron and its jobs, will it? They don't have
ttys. Sorry for you. meh! Meh! meh! meh! meh!

(And along comes systemd, which actually handles this properly, since it
actually has a proper notion of what a service is, and what a session
is, and what an app is. And which hence can control all this sanely.)

Binding this to a tty just solves a tiny bit of the real problem:
i.e. your own use of make -j. End of story.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-16 21:17:56

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 20:52, Alan Cox ([email protected]) wrote:

>
> > Well, if I make behaviour like this default in systemd, then this means
> > there won't be user setup for this. Because the distros shipping systemd
> > will get this as default behaviour.
>
> And within the desktop where would you put this - in the window manager
> on the basis of top level windows or in the app startup ?

The plan with systemd is to make it manage both the system and the
sessions. It's along the lines of what launchd does on MacOS: one
instance for the system, another one for the user, because starting and
supervising a system service and a session service are actually very
very similar things.

In F15 we'll introduce systemd as an init system only. The next step
will be to make it manage sessions too, and replace/augment
gnome-session.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-16 21:19:30

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 13:08, Linus Torvalds ([email protected]) wrote:

>
> On Tue, Nov 16, 2010 at 12:52 PM, Alan Cox <[email protected]> wrote:
> >> Well, if I make behaviour like this default in systemd, then this means
> >> there won't be user setup for this. Because the distros shipping systemd
> >> will get this as default behaviour.
> >
> > And within the desktop where would you put this - in the window manager
> > on the basis of top level windows or in the app startup ?
>
> Btw, I suspect either of these are reasonable. In fact, I don't think
> it would be at all wrong to have the desktop launcher have an option
> to "launch in a group" (although I think it would need to be named
> better than that). Right now, when you create desktop launchers under
> at least gnome, it allows you to specify a "type" for the application
> ("Application" or "Application in Terminal"), and maybe there could be
> a "CPU-bound application" choice that would set it in a CPU group of
> its own. Or whatever.

Well, my plan was actually to by default put everything into its own
group, and then let users opt out of that for specific processes, if they
want to.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-16 23:40:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 10:19:09PM +0100, Lennart Poettering wrote:
> Well, my plan was actually to by default put everything into its own
> group, and then let users opt out of that for specific processes, if they
> want to.

How many users are likely to do this, though?

I think you really want to make this something which the
application can specify by default, i.e. that it should start in its own
cgroup. One idea might be to add it to the application's menu entry:

http://standards.freedesktop.org/desktop-entry-spec/latest/

... so there would be a new key-value pair, "start_in_cgroup", that
would allow the user to start an application in its own cgroup. It
would be up to the desktop launcher to honor that if it was present.
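
A sketch of what such an entry might look like; the start_in_cgroup key is
the hypothetical proposal above, not an existing part of the spec, and the
application named is made up:

    [Desktop Entry]
    Type=Application
    Name=Big Build Tool
    Exec=bigbuild
    start_in_cgroup=true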

One nice thing about having the desktop launch each application in its
own cgroup is that it becomes easier for a desktop task manager to
have a UI listing that presents things in a format somewhat easier
to understand than a process listing. The cgroup would be a
useful way to organize what is going on for each launched application,
and it would allow people to see how much memory an application like
evolution really requires. (On the other hand, maybe they would be
happier not knowing. :-)

- Ted

2010-11-17 00:22:21

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 18:39, Ted Ts'o ([email protected]) wrote:

>
> On Tue, Nov 16, 2010 at 10:19:09PM +0100, Lennart Poettering wrote:
> > Well, my plan was actually to by default put everything into its own
> > group, and then let users opt out of that for specific processes, if they
> > want to.
>
> How many users are likely to do this, though?
>
> I think you really want to make this something which the
> application can specify by default, i.e. that it should start in its own
> cgroup. One idea might be to add it to the application's menu entry:
>
> http://standards.freedesktop.org/desktop-entry-spec/latest/
>
> ... so there would be a new key-value pair, "start_in_cgroup", that
> would allow the user to start an application in its own cgroup. It
> would be up to the desktop launcher to honor that if it was present.

This is pretty much in line with what I want to do, except I want
opt-out, not opt-in behaviour here.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-17 01:32:09

by Kyle McMartin

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Hi Mike,

On Mon, Nov 15, 2010 at 02:25:50PM -0700, Mike Galbraith wrote:
> --- linux-2.6.orig/drivers/tty/tty_io.c
> +++ linux-2.6/drivers/tty/tty_io.c
> @@ -3160,6 +3160,7 @@ static void __proc_set_tty(struct task_s
> put_pid(tsk->signal->tty_old_pgrp);
> tsk->signal->tty = tty_kref_get(tty);
> tsk->signal->tty_old_pgrp = NULL;
> + sched_autogroup_create_attach(tsk);
> }
>

This is a bit of a problem, as it's called in atomic context and kmallocs
with GFP_KERNEL (which can sleep). This results in sleep-under-spinlock
prints when CONFIG_DEBUG_SPINLOCK_SLEEP=y.

I spent a bit of time thinking about how to fix that, but it's a bit
difficult because of the nested spin_lock_irq in that bit of the
tty_ioctl callchain.

I'll think about it some more tonight and follow up if I can think of a
way to fix it.

regards, Kyle

2010-11-17 01:51:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 5:31 PM, Kyle McMartin <[email protected]> wrote:
>
> This is a bit of a problem, as it's called in_atomic context and kmalloc's
> under GFP_KERNEL (which can sleep.) This results in sleep-under-spinlock
> prints when CONFIG_DEBUG_SPINLOCK_SLEEP=y.

Blame me, I threw that out as a single point where this can be done.

In fact, holding the signal spinlock was seen as a bonus, since it
was used to serialize access to signal->autogroup.
Which I think is required.

But yes, it does create problems for the allocation. It could be done
as just a GFP_ATOMIC, of course, and on allocation failure you'd just
punt and not do it. Not pretty, but functional.

Linus

2010-11-17 01:57:11

by Kyle McMartin

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 05:50:41PM -0800, Linus Torvalds wrote:
> Blame me, I threw that out as a single point where this can be done.
>
> In fact, holding the signal spinlock was seen as a bonus, since it
> was used to serialize access to signal->autogroup.
> Which I think is required.
>
> But yes, it does create problems for the allocation. It could be done
> as just a GFP_ATOMIC, of course, and on allocation failure you'd just
> punt and not do it. Not pretty, but functional.
>

Yeah, I didn't look any deeper than kernel/sched.c::sched_create_group,
but that would need to use GFP_ATOMIC as well.

Looking at it now, so would alloc_rt_sched_group/alloc_fair_sched_group,
and we're looking at an awful lot of sleepless allocations. Not sure
that's a feasible plan.

--Kyle

2010-11-17 02:06:46

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Wed, Nov 17, 2010 at 01:21:59AM +0100, Lennart Poettering wrote:
> > I think you really want to make this something which the
> > application can specify by default, i.e. that it should start in its own
> > cgroup. One idea might be to add it to the application's menu entry:
> >
> > http://standards.freedesktop.org/desktop-entry-spec/latest/
> >
> > ... so there would be a new key-value pair, "start_in_cgroup", that
> > would allow the user to start an application in its own cgroup. It
> > would be up to the desktop launcher to honor that if it was present.
>
> This is pretty much in line with what I want to do, except I want
> opt-out, not opt-in behaviour here.

That works for me. One suggestion is that in addition to "opt-out",
it should also be possible for an application launcher file to specify
a specific cgroup name that should be used. That would allow multiple
applications in a group to be assigned to the same cgroup.

There also needs to be a way to start applications launched via
the GNOME and KDE session managers in a specified cgroup as well.
I assume that's in your plan?

- Ted

2010-11-17 02:14:40

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 2010-11-16 at 20:31 -0500, Kyle McMartin wrote:
> Hi Mike,
>
> On Mon, Nov 15, 2010 at 02:25:50PM -0700, Mike Galbraith wrote:
> > --- linux-2.6.orig/drivers/tty/tty_io.c
> > +++ linux-2.6/drivers/tty/tty_io.c
> > @@ -3160,6 +3160,7 @@ static void __proc_set_tty(struct task_s
> > put_pid(tsk->signal->tty_old_pgrp);
> > tsk->signal->tty = tty_kref_get(tty);
> > tsk->signal->tty_old_pgrp = NULL;
> > + sched_autogroup_create_attach(tsk);
> > }
> >
>
> This is a bit of a problem, as it's called in atomic context and kmallocs
> with GFP_KERNEL (which can sleep). This results in sleep-under-spinlock
> prints when CONFIG_DEBUG_SPINLOCK_SLEEP=y.

Yeah, I got another report about that today.

-Mike

2010-11-17 08:07:06

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

* Linus Torvalds <[email protected]> [2010-11-15 18:18:20]:

> On Mon, Nov 15, 2010 at 5:56 PM, Vivek Goyal <[email protected]> wrote:
> >
> > Should this kind of thing be done in user space?
>
> Almost certainly not.
>
> First off, user-space is a fragmented mess. Just from a "let's get it
> done" angle, it just doesn't work. There are lots of different thing
> that create new tty's, and you can't have them all fixed. Plus it
> would be _way_ more code in user space than it is in kernel space.
>
> Secondly, user-space daemons are a total mess. We've tried it many
> many times, and every time the _intention_ is to make things simpler
> to debug and deploy. And it almost never is. The interfaces end up
> being too painful, and the "part of the code is in kernel space, part
> of it is in user space" means that things just break all the time.
>

Please elaborate: is this a generic statement, or a comment on
cgclassify or cgroup user rules?

> Finally, the whole "user space is more flexible" is just a lie. It
> simply doesn't end up being true. It will be _harder_ to configure
> some user-space daemon than it is to just set a flag in /sys or
> whatever. The "flexibility" tends to be more a flexibility to get
> things wrong than any actual advantage.
>
> Just look at the patch in question. It's simple, it's clean, and it
> "just works". Doing the same thing in user space? It would be a total
> nightmare, and exactly _because_ it would be a total nightmare, the
> code would never be that simple or clean.
>
> Linus

--
Three Cheers,
Balbir

2010-11-17 13:22:42

by Stephen Clark

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 11/16/2010 03:31 PM, Lennart Poettering wrote:
> On Tue, 16.11.10 21:03, Pekka Enberg ([email protected]) wrote:
>
>
>> On Tue, Nov 16, 2010 at 8:49 PM, Linus Torvalds
>> <[email protected]> wrote:
>>
>>> User-level configuration for something that should just work is
>>> annoying. We can do better.
>>>
>> Completely agreed. Desktop users should not be required to fiddle with
>> kernel knobs from userspace to fix interactivity problems. Having sane
>> defaults applies to the kernel as much as it does to userspace.
>>
> Jeez. Don't mention the desktop. On the desktop this is completely
> irrelevant. There are no TTYs on the desktop. There's no "make -j" of
> the kernel tree on the desktop.
>
> The kernel patch discussed here *has* *no* *relevance* for normal users.
>
> The kernel patch discussed here is only relevant for people who start
> mplayer from one terminal, and "make -j" from another.
>
> Lennart
>
>
Really, who uses Linux that is not a geek?

--

"They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety." (Ben Franklin)

"The course of history shows that as a government grows, liberty
decreases." (Thomas Jefferson)


2010-11-17 13:24:14

by Stephen Clark

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 11/16/2010 04:14 PM, Lennart Poettering wrote:
> On Tue, 16.11.10 12:38, Linus Torvalds ([email protected]) wrote:
>
>
>> On Tue, Nov 16, 2010 at 12:33 PM, Lennart Poettering
>> <[email protected]> wrote:
>>
>>> No you don't. Because that is not a desktop use case.
>>>
>> See my other response. You don't care AT ALL, because by your
>> judgement, all desktop is is a web browser and a word processor.
>>
> Well, I do care. But I care more about *real* problems. For example the
> fact that "updatedb" makes your system sluggish while it runs. Or
> "man-db". Or anything else that runs from cron in the background.
>
> Doing this tty dance won't help you much with background tasks such as
> man-db, updatedb and cron and its jobs, will it? They don't have
> ttys. Sorry for you. meh! Meh! meh! meh! meh!
>
> (And along comes systemd, which actually handles this properly, since it
> actually has a proper notion of what a service is, and what a session
> is, and what an app is. And which hence can control all this sanely.)
>
> Binding this to a tty just solves a tiny bit of the real problem:
> i.e. your own use of make -j. End of story.
>
> Lennart
>
>
Oh, you mean real problems like systemd can boot 10 seconds faster. What
is 10 seconds when your system is up days or weeks?

--

"They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety." (Ben Franklin)

"The course of history shows that as a government grows, liberty
decreases." (Thomas Jefferson)


2010-11-17 14:58:36

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 09:06:26PM -0500, Ted Ts'o wrote:
> On Wed, Nov 17, 2010 at 01:21:59AM +0100, Lennart Poettering wrote:
> > > I think you really want to make this something which the
> > > application can specify by default, i.e. that it should start in its own
> > > cgroup. One idea might be to add it to the application's menu entry:
> > >
> > > http://standards.freedesktop.org/desktop-entry-spec/latest/
> > >
> > > ... so there would be a new key-value pair, "start_in_cgroup", that
> > > would allow the user to start an application in its own cgroup. It
> > > would be up to the desktop launcher to honor that if it was present.
> >
> > This is pretty much in line with what I want to do, except I want
> > opt-out, not opt-in behaviour here.
>
> That works for me. One suggestion is that in addition to "opt-out",
> it should also be possible for an application launcher file to specify
> a specific cgroup name that should be used. That would allow multiple
> applications in a group to be assigned to the same cgroup.

Being able to specify a cgroup name/path is a good idea. That way one can
make use of the cgroup hierarchy also.

Thinking more about the opt-in vs opt-out issue: generally cgroups provide
some kind of isolation between application groups, and in the process
can be somewhat expensive. More memory allocation, more accounting overhead,
and for the CFQ block controller it can also mean additional idling, which
can result in overall reduced throughput.

Keeping that in mind, is it really a good idea to launch each application
in a separate group? Would it be better to let the user decide whether the
application should be launched in a separate cgroup?

The flip side is: how many people will really know about the functionality
and actually launch applications in a separate group? And maybe it is
a good idea to put everybody in a separate cgroup by default, even if it means
some cost, so that if an application starts consuming too many resources
(make -j64), its impact on the rest of the groups can be contained.

I really don't have a strong inclination for one over the other. Just thinking
out loud...

Vivek

2010-11-17 15:02:18

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Wed, 17.11.10 09:57, Vivek Goyal ([email protected]) wrote:

> Being able to specify a cgroup name/path is a good idea. That way one can
> make use of the cgroup hierarchy also.
>
> Thinking more about the opt-in vs opt-out issue: generally cgroups provide
> some kind of isolation between application groups, and in the process
> can be somewhat expensive. More memory allocation, more accounting overhead,
> and for the CFQ block controller it can also mean additional idling, which
> can result in overall reduced throughput.
>
> Keeping that in mind, is it really a good idea to launch each application
> in a separate group? Would it be better to let the user decide whether the
> application should be launched in a separate cgroup?
>
> The flip side is: how many people will really know about the functionality
> and actually launch applications in a separate group? And maybe it is
> a good idea to put everybody in a separate cgroup by default, even if it means
> some cost, so that if an application starts consuming too many resources
> (make -j64), its impact on the rest of the groups can be contained.
>
> I really don't have a strong inclination for one over the other. Just thinking
> out loud...

I wouldn't be too concerned here. It's not that we end up with 1000s of
groups here. It's way < 40 or so in the end, for a single-user
machine. Which I think isn't that bad.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-17 17:17:33

by John Stoffel

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

>>>>> "Lennart" == Lennart Poettering <[email protected]> writes:

Lennart> I wouldn't be too concerned here. It's not that we end up
Lennart> with 1000s of groups here. It's way < 40 or so in the end, for a
Lennart> single-user machine. Which I think isn't that bad.

Ok, so how will this impact users on a single machine which is used to
host VNC sessions? I've got machines with 30+ users all running one
or more VNC sessions each on there. Nothing CPU bound generally,
unless something freaks out and starts hogging a CPU. But will the
overhead of 1000 groups be noticeable?

Thanks,
John

2010-11-17 21:21:35

by James Cloos

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

>>>>> "LP" == Lennart Poettering <[email protected]> writes:

LP> Heck, the patch wouldn't even have any effect on my machine (and I am a
LP> hacker), because I tend to run my builds from within emacs. And since
LP> emacs has no TTY (since it is an X11/gtk build) it wouldn't be in its own
LP> scheduling group.

M-x compile uses a pty to talk to its sub-process, so if ptys get their
own cgroup then a compile w/in emacs will, too.

(Try using tty as the command M-x compile should run to confirm.)

-JimC
--
James Cloos <[email protected]> OpenPGP: 1024D/ED7DAEA6

2010-11-17 22:34:32

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, 16.11.10 13:08, Linus Torvalds ([email protected]) wrote:

> So I do _not_ believe that the autogroup feature should necessarily
> mean that you cannot do other grouping decisions too. I just do think
> that the whole notion of "it got started from a tty" is actually a
> very useful thing for legacy applications, and one where it's just
> simpler to do it in the kernel than build up any extra infrastructure
> for it.
>
> So it's not necessarily at all an "either-or" thing.

As a last update to this messy discussion, here's a commit I made to
systemd which ensures that every user session by default gets its own
cgroup in the 'cpu' hierarchy.

http://cgit.freedesktop.org/systemd/commit/?id=74fe1fe36e35a26d764f1e3119d5f6d014db573c

And here's the one that ensures that every service systemd manages
gets its own cgroup in the 'cpu' hierarchy by default.

http://cgit.freedesktop.org/systemd/commit/?id=d686d8a97bd7945af0a61504392d01a3167b576f

And finally, here's the bugzilla entry for a patch I prepped for
gnome-terminal's libvte, which creates a subgroup for each terminal widget
shown by the same g-t instance:

https://bugzilla.gnome.org/show_bug.cgi?id=635119

With this all together we get quite a bit further than with the kernel
patch, since we also cover all kinds of services, and treat user
sessions equal to each other, and even users (i.e. user A cannot get
double the amount of CPU time simply by spawning twice as many
processes or sessions as user B).

This should fix things for people with systemd and GNOME. Yes, all
others are left in the cold. Sorry for that.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-17 22:38:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Wed, 2010-11-17 at 23:34 +0100, Lennart Poettering wrote:
>
> This should fix things for people with systemd and GNOME. Yes, all
> others are left in the cold. Sorry for that.

Is there an easy opt out for that, other than booting a CONFIG_CGROUP=n
kernel?

2010-11-17 22:46:09

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Wed, 17.11.10 23:37, Peter Zijlstra ([email protected]) wrote:

>
> On Wed, 2010-11-17 at 23:34 +0100, Lennart Poettering wrote:
> >
> > This should fix things for people with systemd and GNOME. Yes, all
> > others are left in the cold. Sorry for that.
>
> Is there an easy opt out for that, other than booting a CONFIG_CGROUP=n
> kernel?

systemd relies on CONFIG_CGROUP=y, since it uses it for service
management. It creates its own name=systemd hierarchy for that with no
controllers attached. If you turn that off, then systemd will refuse to
boot. However, it does not rely on any of the controllers, and hence you
are welcome to disable all cgroup controllers and systemd won't complain.

If you want to disable the automatic creation of groups in the 'cpu'
hierarchy for user sessions then you can tell pam_systemd that by passing
"controllers=" on the PAM config line. ("controllers=cpu" is the implied
default.)

There's currently no global option to disable the same logic in systemd
when it creates 'cpu' cgroups for the various services it runs. However,
you can disable that individually with "ControlGroups=cpu:/" in the
.service files. I will now add a global option as well.
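
Concretely, the two opt-outs above would look roughly like this (a
sketch from the option names given here; exact syntax may differ
between systemd versions):

    # PAM config line: skip the per-session 'cpu' group
    session optional pam_systemd.so controllers=

    # foo.service: keep this service in the root 'cpu' group
    [Service]
    ControlGroups=cpu:/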

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-17 22:53:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Wed, 2010-11-17 at 23:45 +0100, Lennart Poettering wrote:
> On Wed, 17.11.10 23:37, Peter Zijlstra ([email protected]) wrote:
>
> >
> > On Wed, 2010-11-17 at 23:34 +0100, Lennart Poettering wrote:
> > >
> > > This should fix things for people with systemd and GNOME. Yes, all
> > > others are left in the cold. Sorry for that.
> >
> > Is there an easy opt out for that, other than booting a CONFIG_CGROUP=n
> > kernel?
>
> systemd relies on CONFIG_CGROUP=y, since it uses it for service
> management. It creates its own name=systemd hierarchy for that with no
> controllers attached. If you turn that off, then systemd will refuse to
> boot.

Do expect distro bugzilla entries when this 'awesome'-ness hits the
street.

> However, it does not rely on any of the controllers, and hence you
> are welcome to disable all cgroup controllers and systemd won't complain.
>
> If you want to disable the automatic creation of groups in the 'cpu'
> hierarchy for user sessions then you can tell pam_systemd that by passing
> "controllers=" on the PAM config line. ("controllers=cpu" is the implied
> default.)
>
> There's currently no global option to disable the same logic in systemd
> when it creates 'cpu' cgroups for the various services it runs. However,
> you can disable that individually with "ControlGroups=cpu:/" in the
> .service files. I will now add a global option as well.

A global knob is a must -- preferably with neon signs on so I can find
it. Luckily I don't use this GNOME junk, otherwise I'd have had to ask
how to revert that crap as well.

2010-11-17 23:49:49

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Wed, 17.11.10 23:45, Lennart Poettering ([email protected]) wrote:

> There's currently no global option to disable the same logic in systemd
> when it creates 'cpu' cgroups for the various services it runs. However,
> you can disable that individually with "ControlGroups=cpu:/" in the
> .service files. I will now add a global option as well.

This global option is now commited to systemd git.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-18 15:11:50

by Stephen Clark

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 11/17/2010 05:52 PM, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 23:45 +0100, Lennart Poettering wrote:
>
>> On Wed, 17.11.10 23:37, Peter Zijlstra ([email protected]) wrote:
>>
>>
>>> On Wed, 2010-11-17 at 23:34 +0100, Lennart Poettering wrote:
>>>
>>>> This should fix things for people with systemd and GNOME. Yes, all
>>>> others are left in the cold. Sorry for that.
>>>>
>>> Is there an easy opt out for that, other than booting a CONFIG_CGROUP=n
>>> kernel?
>>>
>> systemd relies on CONFIG_CGROUP=y, since it uses it for service
>> management. It creates its own name=systemd hierarchy for that with no
>> controllers attached. If you turn that off, then systemd will refuse to
>> boot.
>>
> Do expect distro bugzilla entries when this 'awesome'-ness hits the
> street.
>
>
>> However, it does not rely on any of the controllers, and hence you
>> are welcome to disable all cgroup controllers and systemd won't complain.
>>
>> If you want to disable the automatic creation of groups in the 'cpu'
>> hierarchy for user sessions then you can tell pam_systemd that by passing
>> "controllers=" on the PAM config line. ("controllers=cpu" is the implied
>> default.)
>>
>> There's currently no global option to disable the same logic in systemd
>> when it creates 'cpu' cgroups for the various services it runs. However,
>> you can disable that individually with "ControlGroups=cpu:/" in the
>> .service files. I will now add a global option as well.
>>
> A global knob is a must -- preferably with neon signs on so I can find
> it. Luckily I don't use this GNOME junk, otherwise I'd have had to ask
> how to revert that crap as well.
>
>
>
Amen!!



--

"They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety." (Ben Franklin)

"The course of history shows that as a government grows, liberty
decreases." (Thomas Jefferson)


2010-11-18 22:34:43

by Hans-Peter Jansen

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tuesday 16 November 2010, 22:14:31 Lennart Poettering wrote:
> On Tue, 16.11.10 12:38, Linus Torvalds ([email protected]) wrote:
> > On Tue, Nov 16, 2010 at 12:33 PM, Lennart Poettering
> >
> > <[email protected]> wrote:
> > > No you don't. Because that is not a desktop use case.
> >
> > See my other response. You don't care AT ALL, because by your
> > judgement, all desktop is is a web browser and a word processor.
>
> Well, I do care. But I care more about *real* problems. For example
> the fact that "updatedb" makes your system sluggish while it runs. Or
> "man-db". Or anything else that runs from cron in the background.
>
> Doing this tty dance won't help you much with background tasks such
> as man-db, updatedb and cron and its jobs, will it? They don't have
> ttys. Sorry for you. meh! Meh! meh! meh! meh!
>
> (And along comes systemd, which actually handles this properly, since
> it actually has a proper notion of what a service is, and what a
> session is, and what an app is. And which hence can control all this
> sanely.)
>
> Binding this to a tty just solves a tiny bit of the real problem:
> i.e. your own use of make -j. End of story.

Lennart, would you mind pointing me to the paragraph that states
autogroup excludes any other improvements in this area?

On the contrary, Linus clearly states that this solves a long-standing
use case that _he_ suffers from a lot, and I bet most of us do in one
way or another. And it contains all that is needed: a fine selection
of knobs for switching that beast on/off. Hopefully it can be taught
to reveal some of its internal mechanics to the world, then all is fine.

If you think that systemd can solve this and probably other aspects of
responsiveness, go for it, compete with it, and _prove_ it with real
facts and numbers, not just hand waving.

As already mentioned countless times (and some of it was even renamed
for this very fact): the grouping by tty is just a starter. There are
plenty of other possibilities to group the scheduling. The hard part is
to find the right grouping concepts that make sense in the
usability department _and_ are easy enough to be picked up by our
favorite system and desktop environments. That's where the generic
cgroup concept seems to be lacking ATM..

A year from now, our preferred distros will show who won this
competition. Probably both of you ;-)

Pete

2010-11-18 23:12:31

by Samuel Thibault

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Hans-Peter Jansen, le Thu 18 Nov 2010 23:33:46 +0100, a écrit :
> As already mentioned countless times (and some of it was even renamed
> for this very fact): the grouping by tty is just a starter. There are
> plenty of other possibilities to group the scheduling. The hard part is
> to find the right grouping concepts that make sense in the
> usability department _and_ are easy enough to be picked up by our
> favorite system and desktop environments. That's where the generic
> cgroup concept seems to be lacking ATM..

Actually, cgroups should probably be completely hierarchical: sessions
contain process groups, which contain processes, which contain threads.
You could also gather sessions with the same uid.

Samuel

2010-11-18 23:30:03

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Thu, 2010-11-18 at 23:33 +0100, Hans-Peter Jansen wrote:

> If you think, that systemd can solve this and probably other aspects of
> responsiveness, go for it, compete with it, and _prove_ it with real
> facts and numbers, not just hand waving.

Of course systemd can do it. You don't even need systemd, autogroup or
whatever else if you don't mind a little scripting. There's nothing to
prove, any numbers he generates will be identical with any numbers I
generate... it's the same scheduler whether you configure it from
userspace or kernelspace.

> As already mentioned countless times (and some of it was even renamed
> for this very fact): the grouping by tty is just a starter. There are

Actually, I switched to setsid alone for now at least, and the only
thing in the root group is kernel threads.
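
For reference, the triggering event is then just the once-per-session
setsid() a terminal emulator or login wrapper already does; a minimal
userspace sketch of that side (error handling trimmed):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();

	if (pid < 0) {
		perror("fork");
		exit(1);
	}
	if (pid == 0) {
		/* New session: with setsid-based grouping, this is the
		 * once-per-session event that spawns a new task group. */
		if (setsid() < 0) {
			perror("setsid");
			exit(1);
		}
		execlp("sh", "sh", "-c", "ps -o pid,sid,comm -p $$",
		       (char *)NULL);
		perror("execlp");
		exit(1);
	}
	return 0;
}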

-Mike

2010-11-18 23:36:15

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-19 at 00:12 +0100, Samuel Thibault wrote:
> Hans-Peter Jansen, le Thu 18 Nov 2010 23:33:46 +0100, a écrit :
> > As already mentioned countless times (and some of it was even renamed
> > for this very fact): the grouping by tty is just a starter. There are
> > plenty of other possibilities to group the scheduling. The hard part is
> > to find the right grouping concepts that make sense in the
> > usability department _and_ are easy enough to be picked up by our
> > favorite system and desktop environments. That's where the generic
> > cgroup concept seems to be lacking ATM..
>
> Actually, cgroups should probably be completely hierarchical: sessions
> contain process groups, which contain processes, which contain threads.
> You could also gather sessions with the same uid.

Hierarchical is ~tempting, but adds overhead for not much gain.

-Mike

2010-11-18 23:44:15

by Samuel Thibault

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Mike Galbraith, le Thu 18 Nov 2010 16:35:51 -0700, a écrit :
> On Fri, 2010-11-19 at 00:12 +0100, Samuel Thibault wrote:
> > Hans-Peter Jansen, le Thu 18 Nov 2010 23:33:46 +0100, a écrit :
> > > As already mentioned countless times (and some of it was even renamed
> > > for this very fact): the grouping by tty is just a starter. There are
> > > plenty of other possibilities to group the scheduling. The hard part is
> > > to find the right grouping concepts that make sense in the
> > > usability department _and_ are easy enough to be picked up by our
> > > favorite system and desktop environments. That's where the generic
> > > cgroup concept seems to be lacking ATM..
> >
> > Actually, cgroups should probably be completely hierarchical: sessions
> > contain process groups, which contain processes, which contain threads.
> > You could also gather sessions with the same uid.
>
> Hierarchical is ~tempting, but adds overhead for not much gain.

What overhead? The implementation of cgroups is actually already
hierarchical.

Samuel

2010-11-19 00:00:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Thu, Nov 18, 2010 at 3:43 PM, Samuel Thibault
<[email protected]> wrote:
>
> What overhead? The implementation of cgroups is actually already
> hierarchical.

Well, at least the actual group creation overhead.

If it's "only at setsid()", that's a fairly rare thing (although I
think somebody might want to run something like the AIM7 benchmark - I
have this memory of it doing lots of tty tests).

Or if it's only at "user launches new program from window manager",
that's rare too.

But if you do it per process group, now you're doing one for each
command invocation in a shell, for example. If you're doing things per
thread, you've already lost.

Also, remember the goal: it was never about some theoretical end
result. It's all about a simple heuristic that makes things work
better. Trying to do that "perfectly" totally and utterly misses the
whole point.

(google "Perfect is the enemy of good" - Voltaire)

Linus

2010-11-19 00:02:09

by Samuel Thibault

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Linus Torvalds, le Thu 18 Nov 2010 15:51:35 -0800, a écrit :
> On Thu, Nov 18, 2010 at 3:43 PM, Samuel Thibault
> <[email protected]> wrote:
> >
> > What overhead? The implementation of cgroups is actually already
> > hierarchical.
>
> Well, at least the actual group creation overhead.
>
> If it's "only at setsid()", that's a fairly rare thing (although I
> think somebody might want to run something like the AIM7 benchmark - I
> have this memory of it doing lots of tty tests).
>
> Or if it's only at "user launches new program from window manager",
> that's rare too.
>
> But if you do it per process group, now you're doing one for each
> command invocation in a shell, for example.

Well, if it's from an interactive shell, it's not really a problem :)

But when it's from a script it can become one, yes. But are cgroups so
expensive?

> If you're doing things per thread, you've already lost.

Not per thread, per process, i.e. put threads of the same process in the
same cgroup. Again, I would have thought that creating a cgroup is very
lightweight compared to a fork(). If not, maybe we are just looking for
another, more lightweight notion of container that the scheduler would
use [1], and keep more heavyweight containers for the non-automatic
creation path.
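
For scale: in the current cgroupfs interface, creating a group and
entering it is one mkdir() and one write. A minimal sketch (the mount
point /cgroup/cpu is an assumption about the local setup):

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	/* The group itself: a directory in the mounted hierarchy. */
	if (mkdir("/cgroup/cpu/demo", 0755) < 0) {
		perror("mkdir");
		return 1;
	}
	/* Joining it: write our PID to the group's tasks file. */
	FILE *f = fopen("/cgroup/cpu/demo/tasks", "w");
	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "%d\n", (int)getpid());
	fclose(f);
	return 0;
}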

> Also, remember the goal: it was never about some theoretical end
> result. It's all about a simple heuristic that makes things work
> better. Trying to do that "perfectly" totally and utterly misses the
> whole point.

Sure. Using sid should already be quite good, but including the uid
information as well should easily be even better.

Samuel

[1] Actually, I happen to have written a PhD thesis on this, that's why
I'm popping up :)

2010-11-19 00:07:28

by Samuel Thibault

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Samuel Thibault, le Fri 19 Nov 2010 01:02:04 +0100, a écrit :
> Linus Torvalds, le Thu 18 Nov 2010 15:51:35 -0800, a écrit :
> > On Thu, Nov 18, 2010 at 3:43 PM, Samuel Thibault
> > <[email protected]> wrote:
> > >
> > > What overhead? The implementation of cgroups is actually already
> > > hierarchical.
> >
> > Well, at least the actual group creation overhead.
> >
> > If it's "only at setsid()", that's a fairly rare thing (although I
> > think somebody might want to run something like the AIM7 benchmark - I
> > have this memory of it doing lots of tty tests).
> >
> > Or if it's only at "user launches new program from window manager",
> > that's rare too.
> >
> > But if you do it per process group, now you're doing one for each
> > command invocation in a shell, for example.
>
> Well, if it's from an interactive shell, it's not really a problem :)
>
> But when it's from a script it can become one, yes. But are cgroups so
> expensive?
>
> > If you're doing things per thread, you've already lost.
>
> Not per thread, per process, i.e. put threads of the same process in the
> same cgroup. Again, I would have thought that creating a cgroup is very
> lightweight compared to a fork(). If not, maybe we are just looking for
> another, more lightweight notion of container that the scheduler would
> use [1], and keep more heavyweight containers for the non-automatic
> creation path.
>
> > Also, remember the goal: it was never about some theoretical end
> > result. It's all about a simple heuristic that makes things work
> > better. Trying to do that "perfectly" totally and utterly misses the
> > whole point.
>
> Sure. Using sid should already be quite good, but including the uid
> information as well should easily be even better.

Also note that having a hierarchical process structure should help
make things globally more efficient: avoid putting e.g. your cpp, cc1,
and asm processes at three corners of your 4-socket NUMA machine :)

Samuel

2010-11-19 00:36:53

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 11/16/2010 12:05 PM, Lennart Poettering wrote:
> On Tue, 16.11.10 19:08, Peter Zijlstra ([email protected]) wrote:
>
>>
>> On Tue, 2010-11-16 at 18:03 +0100, Lennart Poettering wrote:
>>> Binding something like this to TTYs is just backwards. No graphical
>>> session has a TTY attached anymore. And there might be multiple TTYs
>>> used in the same session.
>>
>> Using a group per tty makes sense for us console jockeys..
>
> Well, then maybe you shouldn't claim this was relevant for anybody but
> yourself. Because it is irrelevant for most users if it is bound to the TTY.
>

For what it's worth, I suspect that the object that should be bound to
is probably not the tty, but rather the session ID of the process (which
generally is 1:1 with controlling TTY for console processes.)
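
A quick way to check that mapping from userspace (minimal sketch;
assumes the process actually has a controlling terminal):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <termios.h>

int main(void)
{
	int fd = open("/dev/tty", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/tty");	/* no controlling tty */
		return 1;
	}
	/* For console processes these two should match. */
	printf("getsid(0) = %d, tcgetsid(tty) = %d\n",
	       (int)getsid(0), (int)tcgetsid(fd));
	close(fd);
	return 0;
}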

-hpa

2010-11-19 00:42:14

by Samuel Thibault

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

H. Peter Anvin, le Thu 18 Nov 2010 16:35:59 -0800, a écrit :
> On 11/16/2010 12:05 PM, Lennart Poettering wrote:
> > On Tue, 16.11.10 19:08, Peter Zijlstra ([email protected]) wrote:
> >
> >>
> >> On Tue, 2010-11-16 at 18:03 +0100, Lennart Poettering wrote:
> >>> Binding something like this to TTYs is just backwards. No graphical
> >>> session has a TTY attached anymore. And there might be multiple TTYs
> >>> used in the same session.
> >>
> >> Using a group per tty makes sense for us console jockeys..
> >
> > Well, then maybe you shouldn't claim this was relevant for anybody but
> > yourself. Because it is irrelevant for most users if it is bound to the TTY.
> >
>
> For what it's worth, I suspect that the object that should be bound to
> is probably not the tty, but rather the session ID of the process (which
> generally is 1:1 with controlling TTY for console processes.)

Agreed.

That'll catch both the tty case (implemented by the proposed patch), and
the rest.

Samuel

2010-11-19 00:49:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Thu, Nov 18, 2010 at 4:02 PM, Samuel Thibault
<[email protected]> wrote:
>
>> If you're doing things per thread, you've already lost.
>
> Not per thread, per process, i.e. put threads of the same process in the
> same cgroup. Again, I would have thought that creating a cgroup is very
> lightweight compared to a fork()

Absolutely not.

We have a good light-weight fork(). We try to avoid any extra
allocations. We *definitely* don't want things at that level.

Seriously. I'd really like somebody running AIM7 just to see that even
just doing it at setsid() doesn't hurt too badly. And that's something
that happens once in a blue moon compared to fork and/or process
groups.

Once per session is about as much as is acceptable. That's the kind of
granularity we should look at. So things like "groups per user",
"groups per session", "groups per one graphical application" are good.
Not things that can happen tens of thousands of times a second.

Linus

2010-11-19 00:59:46

by Samuel Thibault

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Linus Torvalds, le Thu 18 Nov 2010 16:42:27 -0800, a écrit :
> Once per session is about as much as is acceptable. That's the kind of
> granularity we should look at. So things like "groups per user",
> "groups per session", "groups per one graphical application" are good.
> Not things that can happen tens of thousands of times a second.

All right. I believe that should already work quite well both for
desktop and servers indeed. "Per one graphical application" will most
probably require desktop panel patching, however.

Samuel

2010-11-19 01:11:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Thu, Nov 18, 2010 at 4:59 PM, Samuel Thibault
<[email protected]> wrote:
>
> All right. I believe that should already work quite well both for
> desktop and servers indeed. "Per one graphical application" will most
> probably require desktop panel patching, however.

Sure. As mentioned somewhere earlier in this thread, I don't think
this whole grouping thing is some "either or" black-or-white thing.

So I personally like the kernel patch because it's simple, and it
"just works", regardless of what user space you happen to run. But
that doesn't mean that user space couldn't easily give additional
hints (to the point that if you end up having a distro that does
hinting for everything, you could just turn off the kernel side
entirely).

Linus

2010-11-19 01:13:01

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-19 at 01:59 +0100, Samuel Thibault wrote:
> Linus Torvalds, le Thu 18 Nov 2010 16:42:27 -0800, a écrit :
> > Once per session is about as much as is acceptable. That's the kind of
> > granularity we should look at. So things like "groups per user",
> > "groups per session", "groups per one graphical application" are good.
> > Not things that can happen tens of thousands of times a second.
>
> All right. I believe that should already work quite well both for
> desktop and servers indeed. "Per one graphical application" will most
> probably require desktop panel patching, however.

I think that could be done with a fork flag with the same semantics as
reset on fork. Once your task launcher (ala kdeinit) is tagged, it
launches task groups, the children lose that ability. That requires
userspace interaction though, dunno how you'd detect automatically.

-Mike

2010-11-19 01:24:10

by Samuel Thibault

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Mike Galbraith, le Thu 18 Nov 2010 18:12:37 -0700, a écrit :
> I think that could be done with a fork flag with the same semantics as
> reset on fork. Once your task launcher (ala kdeinit) is tagged, it
> launches task groups, the children lose that ability.

Mmm, even if always creating groups can have a slight overhead, you
should probably not prevent userspace from deciding to do so when it
knows it's appropriate.

Samuel

2010-11-19 02:28:45

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-19 at 02:23 +0100, Samuel Thibault wrote:
> Mike Galbraith, le Thu 18 Nov 2010 18:12:37 -0700, a écrit :
> > I think that could be done with a fork flag with the same semantics as
> > reset on fork. Once your task launcher (ala kdeinit) is tagged, it
> > launches task groups, the children lose that ability.
>
> Mmm, even if always creating groups can have a slight overhead, you
> should probably not prevent userspace from deciding to do so when it
> knows it's appropriate.

I think you may have misunderstood. The flag would be a hint that
userland can set to say "I want to fork off task groups", just as
SCHED_RESET_ON_FORK asks the kernel to reset a child's sched class and
nice level on fork.
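
For comparison, this is how the existing reset-on-fork hint is set
today; the group-spawning flag itself would be new, so the sketch below
only shows the analogy (SCHED_RESET_ON_FORK needs a 2.6.32+ kernel):

#include <sched.h>
#include <stdio.h>

#ifndef SCHED_RESET_ON_FORK
#define SCHED_RESET_ON_FORK	0x40000000	/* from linux/sched.h */
#endif

int main(void)
{
	struct sched_param sp = { .sched_priority = 10 };

	/* Parent runs SCHED_FIFO, but every child it forks starts out
	 * back in the default class at nice 0. */
	if (sched_setscheduler(0, SCHED_FIFO | SCHED_RESET_ON_FORK, &sp) < 0) {
		perror("sched_setscheduler");	/* needs CAP_SYS_NICE */
		return 1;
	}
	return 0;
}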

-Mike

2010-11-19 03:15:16

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

* Samuel Thibault ([email protected]) wrote:
> H. Peter Anvin, le Thu 18 Nov 2010 16:35:59 -0800, a écrit :
> > On 11/16/2010 12:05 PM, Lennart Poettering wrote:
> > > On Tue, 16.11.10 19:08, Peter Zijlstra ([email protected]) wrote:
> > >
> > >>
> > >> On Tue, 2010-11-16 at 18:03 +0100, Lennart Poettering wrote:
> > >>> Binding something like this to TTYs is just backwards. No graphical
> > >>> session has a TTY attached anymore. And there might be multiple TTYs
> > >>> used in the same session.
> > >>
> > >> Using a group per tty makes sense for us console jockeys..
> > >
> > > Well, then maybe you shouldn't claim this was relevant for anybody but
> > > yourself. Because it is irrelevant for most users if it is bound to the TTY.
> > >
> >
> > For what it's worth, I suspect that the object that should be bound to
> > is probably not the tty, but rather the session ID of the process (which
> > generally is 1:1 with controlling TTY for console processes.)
>
> Agreed.
>
> That'll catch both the tty case (implemented by the proposed patch), and
> the rest.

This really does make a lot of sense. Tying to the session ID rather than
the TTY would make it possible to deal with graphical applications by
letting them start a new session with setsid() when the application
starts. It seems much more generic than the TTY, and maps to the TTY
already.

Thanks,

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-11-19 05:20:47

by Andev

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Tue, Nov 16, 2010 at 9:06 PM, Ted Ts'o <[email protected]> wrote:

>
> That works for me. One suggestion is that in addition to "opt-out",
> it should also be possible for an application launcher file to specify
> a specific cgroup name that should be used. That would allow multiple
> applications in a group to be assigned to the same cgroup.

+1 for implementing this in systemd rather than in the kernel.

Userspace has much more info about which process needs to go into
which group.

2010-11-19 09:02:48

by Samuel Thibault

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Mike Galbraith, le Thu 18 Nov 2010 19:28:21 -0700, a écrit :
> On Fri, 2010-11-19 at 02:23 +0100, Samuel Thibault wrote:
> > Mike Galbraith, le Thu 18 Nov 2010 18:12:37 -0700, a écrit :
> > > I think that could be done with a fork flag with the same semantics as
> > > reset on fork. Once your task launcher (ala kdeinit) is tagged, it
> > > launches task groups, the children lose that ability.
> >
> > Mmm, even if always creating groups can have a slight overhead, you
> > should probably not prevent userspace from deciding to do so when it
> > knows it's appropriate.
>
> I think you may have misunderstood. The flag would be a hint that
> userland can set to say "I want to fork off task groups", just as
> SCHED_RESET_ON_FORK asks the kernel to reset a child's sched class and
> nice level on fork.

Ah, ok. I would have rather done this per fork call, as there may be a
difference between starting an application and starting a panel widget.

Samuel

2010-11-19 11:50:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-19 at 00:43 +0100, Samuel Thibault wrote:
> What overhead? The implementation of cgroups is actually already
> hierarchical.

It must be nice to be that ignorant ;-) Speaking for the scheduler
cgroup controller (that being the only one I actually know), most all
the load-balance operations are O(n) in the number of active cgroups,
and a lot of the cpu local schedule operations are O(d) where d is the
depth of the cgroup tree.

[ and that's with the .38 targeted code, current mainline is O(n ln(n))
for load balancing and truly sucks on multi-socket ]

You add a lot of pointer chasing to all the scheduler fast paths and
there is quite significant data size bloat for even compiling with the
controller enabled, let alone actually using the stuff.

But sure, treat them as if they were free to use, I guess your machine
is fast enough.

2010-11-19 11:57:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-19 at 01:07 +0100, Samuel Thibault wrote:
>
>
> Also note that having a hierarchical process structure should permit to
> make things globally more efficient: avoid putting e.g. your cpp, cc1,
> and asm processes at three corners of your 4-socket NUMA machine :)

We have the hierarchy mandated by POSIX to track parents, children,
sessions and all that stuff; it's just not the data structure used for
scheduling.

And no, using that to load-balance between CPUs doesn't necessarily help
with the NUMA case, load-balancing is an impossible job (equivalent to
page-replacement -- you simply don't know the future), applications
simply do wildly weird stuff.

From a process hierarchy there's absolutely no difference between a
cc1/cpp/asm and some MPI jobs, both can be parent-child relations with
pipes between, some just run short and have data affinity, others run
long and don't have any.

2010-11-19 12:00:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-19 at 00:20 -0500, Andev wrote:
> On Tue, Nov 16, 2010 at 9:06 PM, Ted Ts'o <[email protected]> wrote:
>
> >
> > That works for me. One suggestion is that in addition to "opt-out",
> > it should also be possible for an application launcher file to specify
> > a specific cgroup name that should be used. That would allow multiple
> > applications in a group to be assigned to the same cgroup.
>
> > +1 for implementing this in systemd rather than in the kernel.
> >
> > Userspace has much more info about which process needs to go into
> which group.

-1 on your systemd/gnome/wm/etc. crap, that means it will become
impossible to sanely switch off (gnome firmly believes knobs are evil),
leaving everybody who _does_ know wth they're doing up a certain creek
without a paddle.

2010-11-19 12:19:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-19 at 12:49 +0100, Peter Zijlstra wrote:
> On Fri, 2010-11-19 at 00:43 +0100, Samuel Thibault wrote:
> > What overhead? The implementation of cgroups is actually already
> > hierarchical.
>
> It must be nice to be that ignorant ;-) Speaking for the scheduler
> cgroup controller (that being the only one I actually know), most all
> the load-balance operations are O(n) in the number of active cgroups,
> and a lot of the cpu local schedule operations are O(d) where d is the
> depth of the cgroup tree.
>
> [ and that's with the .38 targeted code, current mainline is O(n ln(n))
> for load balancing and truly sucks on multi-socket ]
>
> You add a lot of pointer chasing to all the scheduler fast paths and
> there is quite significant data size bloat for even compiling with the
> controller enabled, let alone actually using the stuff.
>
> But sure, treat them as if they were free to use, I guess your machine
> is fast enough.

In general though, I think you can say that: cgroups ass overhead.
Simply because you add constraints, this means you need to 1) account
more, 2) enforce constraints. Both have definite non-zero cost in both
data and time.

2010-11-19 12:32:16

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, Nov 19, 2010 at 3:49 AM, Peter Zijlstra <[email protected]> wrote:
> It must be nice to be that ignorant ;-) Speaking for the scheduler
> cgroup controller (that being the only one I actually know), most all
> the load-balance operations are O(n) in the number of active cgroups,
> and a lot of the cpu local schedule operations are O(d) where d is the
> depth of the cgroup tree.

The same would apply to CPU autogroups, presumably?

Paul

2010-11-19 12:39:20

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-19 at 12:49 +0100, Peter Zijlstra wrote:

> You add a lot of pointer chasing to all the scheduler fast paths and
> there is quite significant data size bloat for even compiling with the
> controller enabled, let alone actually using the stuff.

The pointer chasing hurts, but fast path + global spinlock is what
slapped the hierarchical outta me :)

-Mike

2010-11-19 12:52:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-19 at 04:31 -0800, Paul Menage wrote:
> On Fri, Nov 19, 2010 at 3:49 AM, Peter Zijlstra <[email protected]> wrote:
> > It must be nice to be that ignorant ;-) Speaking for the scheduler
> > cgroup controller (that being the only one I actually know), most all
> > the load-balance operations are O(n) in the number of active cgroups,
> > and a lot of the cpu local schedule operations are O(d) where d is the
> > depth of the cgroup tree.
>
> The same would apply to CPU autogroups, presumably?

Yep, they're not special at all... uses the same mechanism.

2010-11-19 12:55:53

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

* Peter Zijlstra ([email protected]) wrote:
> On Fri, 2010-11-19 at 12:49 +0100, Peter Zijlstra wrote:
> > On Fri, 2010-11-19 at 00:43 +0100, Samuel Thibault wrote:
> > > What overhead? The implementation of cgroups is actually already
> > > hierarchical.
> >
> > It must be nice to be that ignorant ;-) Speaking for the scheduler
> > cgroup controller (that being the only one I actually know), most all
> > the load-balance operations are O(n) in the number of active cgroups,
> > and a lot of the cpu local schedule operations are O(d) where d is the
> > depth of the cgroup tree.
> >
> > [ and that's with the .38 targeted code, current mainline is O(n ln(n))
> > for load balancing and truly sucks on multi-socket ]
> >
> > You add a lot of pointer chasing to all the scheduler fast paths and
> > there is quite significant data size bloat for even compiling with the
> > controller enabled, let alone actually using the stuff.
> >
> > But sure, treat them as if they were free to use, I guess your machine
> > is fast enough.
>
> In general though, I think you can say that: cgroups ass overhead.

I really think you meant "add" here ? (Hey! The keys were next to each other!)
;)

> Simply because you add constraints, this means you need to 1) account
> more, 2) enforce constraints. Both have definite non-zero cost in both
> data and time.

Yep, this looks like one of these perpetual throughput vs latency trade-offs.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-11-19 13:01:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-19 at 07:55 -0500, Mathieu Desnoyers wrote:
>
> > In general though, I think you can say that: cgroups ass overhead.
>
> I really think you meant "add" here ? (Hey! The keys were next to each other!)
> ;)

Uhm, quite!

> > Simply because you add constraints, this means you need to 1) account
> > more, 2) enforce constraints. Both have definite non-zero cost in both
> > data and time.
>
> Yep, this looks like one of these perpetual throughput vs latency trade-offs.
>
Trade-off sure, throughput vs latency only in a specific use-case; it's
more a feature vs cost thing, just like all them trace people want lower
cost tracing but want more features at the same time..

2010-11-19 13:03:40

by Ben Gamari

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 19 Nov 2010 12:59:53 +0100, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2010-11-19 at 00:20 -0500, Andev wrote:
> > +1 for implementing this in systemd rather than in the kernel.
> >
> > Userspace has much more info about which process needs to go into
> > which group.
>
> -1 on your systemd/gnome/wm/etc. crap, that means it will become
> impossible to sanely switch off (gnome firmly believes knobs are evil),
> leaving everybody who _does_ know wth they're doing up a certain creek
> without a paddle.

Please, can we stop with this false dichotomy? This is decidedly not
true and as Lennart has already pointed out, the knob already exists in
systemd. You may like the kernel approach, but this does not mean there
is no place for grouping driven by userspace.

- Ben

2010-11-19 13:03:58

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-19 at 13:51 +0100, Peter Zijlstra wrote:
> On Fri, 2010-11-19 at 04:31 -0800, Paul Menage wrote:
> > On Fri, Nov 19, 2010 at 3:49 AM, Peter Zijlstra <[email protected]> wrote:
> > > It must be nice to be that ignorant ;-) Speaking for the scheduler
> > > cgroup controller (that being the only one I actually know), most all
> > > the load-balance operations are O(n) in the number of active cgroups,
> > > and a lot of the cpu local schedule operations are O(d) where d is the
> > > depth of the cgroup tree.
> >
> > The same would apply to CPU autogroups, presumably?
>
> Yep, they're not special at all... uses the same mechanism.

The only difference is cost of creation and destruction, so cgroups and
autogroups suck boulders of slightly different diameter when creating
and/or destroying at high frequency.

-Mike

2010-11-19 13:07:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups


On Nov 19, 2010, at 8:03 AM, Ben Gamari wrote:

>> that means it will become
>> impossible to sanely switch off (gnome firmly believes knobs are evil),
>> leaving everybody who _does_ know wth they're doing up a certain creek
>> without a paddle.
>
> Please, can we stop with this false dichotomy? This is decidedly not
> true and as Lennart has already pointed out, the knob already exists in
> systemd. You may like the kernel approach, but this does not mean there
> is no place for grouping driven by userspace.

Yes, and then at the next release, some idiotic GNOME engineer will decide to "improve" the system by removing yet another knob....

-- Ted

2010-11-19 13:20:41

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

* Peter Zijlstra ([email protected]) wrote:
> On Fri, 2010-11-19 at 07:55 -0500, Mathieu Desnoyers wrote:
[...]
> > Yep, this looks like one of these perpetual throughput vs latency trade-offs.
> >
> Trade-off sure, throughput vs latency only in a specific use-case; it's
> more a feature vs cost thing, just like all them trace people want lower
> cost tracing but want more features at the same time..

Yep, agreed for "feature vs cost", given that the kind of latency that is fixed
in this case really boils down to a sluggish system under a relatively common
workload -- so making this work might be called a "feature". ;)

FWIW, about tracing, well, we should distinguish between features that add cost
to the fast-path and slow-path only.

But I really don't care anymore, since Ingo, Linus, Thomas and yourself made it
very clear that you don't care. So let's not even start this discussion --
it's going nowhere.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-11-19 13:21:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-19 at 08:03 -0500, Ben Gamari wrote:
> On Fri, 19 Nov 2010 12:59:53 +0100, Peter Zijlstra <[email protected]> wrote:
> > On Fri, 2010-11-19 at 00:20 -0500, Andev wrote:
> > > +1 for implementing this in systemd rather than in the kernel.
> > >
> > > Userspace has much more info about which process needs to go into
> > > which group.
> >
> > -1 on your systemd/gnome/wm/etc. crap, that means it will become
> > impossible to sanely switch off (gnome firmly believes knobs are evil),
> > leaving everybody who _does_ know wth they're doing up a certain creek
> > without a paddle.
>
> Please, can we stop with this false dichotomy? This is decidedly not
> true and as Lennart has already pointed out, the knob already exists in
> systemd.

It does for systemd, but he also said he's proposing patches for other
userspace components (something gnome). And as we all know, gnome is not
big on options, it's their way or the highway.

> You may like the kernel approach, but this does not mean there
> is no place for grouping driven by userspace.

My concern is that I don't want to go trawling through a dozen unrelated
userspace package configurations to disable all this.

2010-11-19 14:34:10

by Samuel Thibault

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Peter Zijlstra, le Fri 19 Nov 2010 12:57:24 +0100, a écrit :
> On Fri, 2010-11-19 at 01:07 +0100, Samuel Thibault wrote:
> > Also note that having a hierarchical process structure should help
> > make things globally more efficient: avoid putting e.g. your cpp, cc1,
> > and asm processes at three corners of your 4-socket NUMA machine :)
>
> And no, using that to load-balance between CPUs doesn't necessarily help
> with the NUMA case,

It doesn't _necessarily_ help, but it should help in quite a few cases.

> load-balancing is an impossible job (equivalent to
> page-replacement -- you simply don't know the future), applications
> simply do wildly weird stuff.

Sure. Not a reason not to get the low-hanging fruits :)

> From a process hierarchy there's absolutely no difference between a
> cc1/cpp/asm and some MPI jobs, both can be parent-child relations with
> pipes between, some just run short and have data affinity, others run
> long and don't have any.

MPI jobs typically communicate with each other. Keeping them on the same
socket permits shared-memory MPI drivers to mostly remain in
e.g. the L3 cache. That typically gives benefits.
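
A minimal sketch of grabbing that fruit by hand from userspace today
(that CPUs 0-3 share a socket is a machine-specific assumption):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;
	int cpu;

	CPU_ZERO(&set);
	for (cpu = 0; cpu < 4; cpu++)	/* socket 0, by assumption */
		CPU_SET(cpu, &set);

	/* The mask is inherited across fork()/exec(), so MPI ranks
	 * launched from here stay on one socket and share its L3. */
	if (sched_setaffinity(0, sizeof(set), &set) < 0) {
		perror("sched_setaffinity");
		return 1;
	}
	return 0;
}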

Samuel

2010-11-19 14:43:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-19 at 15:24 +0100, Samuel Thibault wrote:
> Peter Zijlstra, le Fri 19 Nov 2010 12:57:24 +0100, a écrit :
> > On Fri, 2010-11-19 at 01:07 +0100, Samuel Thibault wrote:
> > > Also note that having a hierarchical process structure should help
> > > make things globally more efficient: avoid putting e.g. your cpp, cc1,
> > > and asm processes at three corners of your 4-socket NUMA machine :)
> >
> > And no, using that to load-balance between CPUs doesn't necessarily help
> > with the NUMA case,
>
> It doesn't _necessarily_ help, but it should help in quite a few cases.

Colour me unconvinced, measuring shared cache footprint using PMUs might
help (and people have actually implemented and played with that at
various times in the past) but again, the added overhead of doing so
will hurt a lot more workloads than might benefit.

> > load-balancing is an impossible job (equivalent to
> > page-replacement -- you simply don't know the future), applications
> > simply do wildly weird stuff.
>
> Sure. Not a reason not to get the low-hanging fruits :)

I'm not at all convinced using the process hierarchy will really help
much, but feel free to write the patch and test it. But making the
migration condition very complex will definitely hurt some workloads.

> > From a process hierarchy there's absolutely no difference between a
> > cc1/cpp/asm and some MPI jobs, both can be parent-child relations with
> > pipes between, some just run short and have data affinity, others run
> > long and don't have any.
>
> MPI jobs typically communicate with each other. Keeping them on the same
> socket permits to keep shared-memory MPI drivers to mostly remain in
> e.g. the L3 cache. That typically gives benefits.

Pushing them away permits them to use a larger part of that same L3
cache allowing them to work on larger data sets. Most of the MPI apps
have a large compute to communication ratio because that is what allows
them to run in parallel so well (traditionally the interconnects were
terribly slow to boot), that suggests that working on larger data sets
is a good thing and running on the same node really doesn't matter since
communication is assumed slow anyway.

There really is no simple solution to this.

2010-11-19 14:55:09

by Samuel Thibault

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Peter Zijlstra, le Fri 19 Nov 2010 15:43:13 +0100, a écrit :
> > MPI jobs typically communicate with each other. Keeping them on the same
> > socket permits shared-memory MPI drivers to mostly remain in
> > e.g. the L3 cache. That typically gives benefits.
>
> Pushing them away permits them to use a larger part of that same L3
> cache allowing them to work on larger data sets.

But then you are not benefitting from all CPU cores.

> Most of the MPI apps
> have a large compute to communication ratio because that is what allows
> them to run in parallel so well (traditionally the interconnects were
> terribly slow to boot), that suggests that working on larger data sets
> is a good thing and running on the same node really doesn't matter since
> communication is assumes slow anyway.

Err, if the compute to communication ratio is big, then you should use
all CPU cores, up to the point where communication becomes a factor
again, and making sure that related MPI processes end up on the same
socket will permit going a bit further.

> There really is no simple solution to his.

I never said there was even a solution, actually (in particular any kind
of generic solution), but that a few simple ways exist to make things
better.

Samuel

2010-11-19 16:29:21

by David Miller

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

From: Theodore Tso <[email protected]>
Date: Fri, 19 Nov 2010 08:07:12 -0500

>
> On Nov 19, 2010, at 8:03 AM, Ben Gamari wrote:
>
>>> that means it will become
>>> impossible to sanely switch off (gnome firmly believes knobs are evil),
>>> leaving everybody who _does_ know wth they're doing up a certain creek
>>> without a paddle.
>>
>> Please, can we stop with this false dichotomy? This is decidedly not
>> true and as Lennart has already pointed out, the knob already exists in
>> systemd. You may like the kernel approach, but this does not mean there
>> is no place for grouping driven by userspace.
>
> Yes, and then at the next release, some idiotic GNOME engineer will
> decide to "improve" the system by removing yet another knob....

I have to agree on this one, there is no reason to believe that this
trend will not continue and I've lost every ounce of optimism I've
ever had in this area.

Besides, I'm having trouble believing someone who can't even get it
through his thick skull that even "M-x compile" in emacs gives a TTY
to the make process. Yet he kept claiming the opposite over and over
again until I absolutely proved it to him, at which point he became
completely silent on that line of reasoning.

Someone who argues and behaves in that way is not someone I put much
stock in as far as implementing things in a wise or well informed way,
it feels more like a mechanism which is being forced and rushed ahead
without enough research or consideration.

2010-11-19 16:34:55

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 19.11.10 08:29, David Miller ([email protected]) wrote:

> Someone who argues and behaves in that way is not someone I put much
> stock in as far as implementing things in a wise or well informed way,
> it feels more like a mechanism which is being forced and rushed ahead
> without enough research or consideration.

Wow, so much hate!

Love you too,

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-19 16:42:38

by David Miller

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

From: Lennart Poettering <[email protected]>
Date: Fri, 19 Nov 2010 17:34:31 +0100

> On Fri, 19.11.10 08:29, David Miller ([email protected]) wrote:
>
>> Someone who argues and behaves in that way is not someone I put much
>> stock in as far as implementing things in a wise or well informed way,
>> it feels more like a mechanism which is being forced and rushed ahead
>> without enough research or consideration.
>
> Wow, so much hate!

Thanks for not addressing my technical arguments at all, which
serves only to further my reservations.

2010-11-19 17:52:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Guys, calm down. It's really not -that- much of a deal.

We can do both. There is basically zero cost in the kernel: you can
disable the feature, and it's not like it's a maintenance headache
(all the complexity that matters is the group scheduling itself, not
the autogroup code that is well separated). And the kernel approach is
useful to just take an otherwise unmodified system and just make it
have nice default behavior.

And the user level approach? I think it's fine too. If you run systemd
for other reasons (or if the gnome people add it to the task launcher
or whatever), doing it there isn't wrong. I personally think it's
somewhat disgusting to have a user-level callback with processes etc
just to clean up a group, but whatever. As long as it's not common,
who cares?

And you really _can_ combine them. As mentioned, I'd be nervous about
AIM benchmarks. I keep mentioning AIM, but that's because it has shown
odd tty issues before. Back when the RT guys wanted to do that crazy
blocking BKL thing (replace the spinlock with a semaphore), AIM7
plummeted by 40%. And people were looking all over the place (file
locking etc), and the tty layer was the biggest reason iirc.

Now, I don't know if AIM7 actually uses setsid() heavily etc, and it's
possible it never hits it at all and only does read/write/kill. And
it's not like AIM7 is the main thing we should look at regardless. But
the point is that we know that there are some tty-heavy loads that are
relevant, and it's very possible that a hybrid approach with "tty's
handled automatically by the kernel, window manager does others in
user space" would be a good way to avoid the (potential) nasty side of
something that has a lot of short-lived tty connections.

So guys, calm down. We don't need to hate each other.

Linus

2010-11-19 19:12:22

by Ben Gamari

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 19 Nov 2010 09:51:14 -0800, Linus Torvalds <[email protected]> wrote:
> And the user level approach? I think it's fine too. If you run systemd
> for other reasons (or if the gnome people add it to the task launcher
> or whatever), doing it there isn't wrong. I personally think it's
> somewhat disgusting to have a user-level callback with processes etc
> just to clean up a group, but whatever. As long as it's not common,
> who cares?
>
On that note, is there a good reason why the notify_on_release interface
works the way it does? Wouldn't it be simpler if the cgroup simply
provided a file on which a process (e.g. systemd) could block?

I guess it's a little too late at this point considering the old
mechanism will still need to be supported, but it seems like this would
provide a slightly cheaper cleanup path. Just my (perhaps flawed) two
pence.

>
> So guys, calm down. We don't need to hate each other.
>
Thanks for the nudge back to sanity.

Cheers,

- Ben

2010-11-19 19:31:50

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 2010-11-19 at 09:51 -0800, Linus Torvalds wrote:

> And you really _can_ combine them. As mentioned, I'd be nervous about
> AIM benchmarks. I keep mentioning AIM, but that's because it has shown
> odd tty issues before. Back when the RT guys wanted to do that crazy
> blocking BKL thing (replace the spinlock with a semaphore), AIM7
> plummeted by 40%. And people were looking all over the place (file
> locking etc), and the tty layer was the biggest reason iirc.

If I were home, I'd have already checked out AIM7 and more, but I didn't
load laptop up with all my toys unfortunately. (or fortunately, since
T130 ain't exactly a speed daemon)

-Mike

2010-11-19 19:54:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, Nov 19, 2010 at 11:12 AM, Ben Gamari <[email protected]> wrote:
>
> On that note, is there a good reason why the notify_on_release interface
> works the way it does? Wouldn't it be simpler if the cgroup simply
> provided a file on which a process (e.g. systemd) could block?

Actually, the sane interface would likely just be to have a "drop on
release" interface.

Maybe some people really want to be _notified_. But my guess would be
that just dropping the cgroup when it becomes empty would satisfy at
least a reasonable subset of users.

Who uses that thing now? The desktop launcher/systemd approach
definitely doesn't seem to want the overhead of being notified and having
to remove it manually. Does anybody else really want it?

But that's really an independent question from all the other things.
But with new cgroup users, it's very possible that it turns out that
some of the original interfaces are just inconvenient and silly.

Linus

2010-11-19 20:38:35

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, Nov 19, 2010 at 11:12 AM, Ben Gamari <[email protected]> wrote:
> On that note, is there a good reason why the notify_on_release interface
> works the way it does? Wouldn't it be simpler if the cgroup simply
> provided a file on which a process (e.g. systemd) could block?

Backwards-compatibility with cpusets, which is what cgroups evolved from.

A delete_on_release option would be possible too, for the cases where
there's really no entity that wants to do more than simply delete the
group in question.
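
For reference, wiring up the current mechanism is two writes: the
hierarchy-wide agent in the root-level release_agent file, plus a
per-group notify_on_release flag; the kernel then execs the agent with
the emptied group's path as argv[1]. A minimal sketch (mount point and
agent path are assumptions):

#include <stdio.h>

static int write_str(const char *path, const char *s)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(s, f);
	return fclose(f);
}

int main(void)
{
	/* One agent per hierarchy... */
	write_str("/cgroup/cpu/release_agent",
		  "/usr/local/bin/cgroup-reaper\n");
	/* ...and an opt-in per group. */
	write_str("/cgroup/cpu/demo/notify_on_release", "1\n");
	return 0;
}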

Paul

2010-11-20 01:13:52

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 19.11.10 14:12, Ben Gamari ([email protected]) wrote:

>
> On Fri, 19 Nov 2010 09:51:14 -0800, Linus Torvalds <[email protected]> wrote:
> > And the user level approach? I think it's fine too. If you run systemd
> > for other reasons (or if the gnome people add it to the task launcher
> > or whatever), doing it there isn't wrong. I personally think it's
> > somewhat disgusting to have a user-level callback with processes etc
> > just to clean up a group, but whatever. As long as it's not common,
> > who cares?
> >
> On that note, is there a good reason why the notify_on_release interface
> works the way it does? Wouldn't it be simpler if the cgroup simply
> provided a file on which a process (e.g. systemd) could block?

The notify_on_release interface is awful indeed. Feels like the old
hotplug interface, where each module request caused the kernel to spawn
a hotplug script.

However, I am not sure I like the idea of having pollable files like that,
because in the systemd case I am very much interested in getting
recursive notifications, i.e. I want to register once for getting
notifications for a full subtree instead of having to register for each
cgroup individually.

My personal favourite solution would be to get a netlink msg when a
cgroup runs empty. That way multiple programs could listen to the events
at the same time, and we'd have an easy way to subscribe to a whole
hierarchy of groups.

Lennart

--
Lennart Poettering - Red Hat, Inc.
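
No such netlink facility exists as of this thread; purely as a sketch of
the proposal, a subscriber could look like the following, where
NETLINK_CGROUP, the multicast group number, and the message payload are all
invented for illustration:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#define NETLINK_CGROUP	31	/* hypothetical protocol number */

int main(void)
{
	struct sockaddr_nl sa;
	char buf[4096];
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_CGROUP);

	if (fd < 0)
		return 1;

	memset(&sa, 0, sizeof(sa));
	sa.nl_family = AF_NETLINK;
	sa.nl_groups = 1;	/* hypothetical "cgroup empty" mcast group */
	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		return 1;

	/* one subscription covers the entire hierarchy, and any number
	 * of listeners can join the same multicast group */
	for (;;) {
		ssize_t len = recv(fd, buf, sizeof(buf), 0);

		if (len <= 0)
			break;
		/* payload would name the cgroup that became empty */
		printf("empty: %.*s\n", (int)len, buf);
	}
	close(fd);
	return 0;
}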

2010-11-20 01:33:43

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Fri, 19.11.10 11:48, Linus Torvalds ([email protected]) wrote:

>
> On Fri, Nov 19, 2010 at 11:12 AM, Ben Gamari <[email protected]> wrote:
> >
> > On that note, is there a good reason why the notify_on_release interface
> > works the way it does? Wouldn't it be simpler if the cgroup simply
> > provided a file on which a process (e.g. systemd) could block?
>
> Actually, the sane interface would likely just be to have a "drop on
> release" interface.

Hmm, I think automatic cleanup would be quite nice, but there are some
niche cases to think about first (e.g. what do you do if a process
creates a cgroup and wants to make itself a member of it, but then
dies before it can do that, and the cgroup stays around, empty, and
never gets cleaned up).

> Who uses that thing now? The desktop launcher/systemd approach
> definitely doesn't seem want the overhead of being notified and having
> to remove it manually. Does anybody else really want it?

I'd like automatic cleanup, but definitely also want to be notified when
a cgroup runs empty. Here's the use case: we run apache in a cgroup (in
a named hierarchy, not attached to any controller; we do this to keep
track of the service and all its children). Now apache dies. Its
children, various CGI scripts, stay around. However, since Apache is
configured to be restarted, systemd now kills all remaining children and
waits until they are all gone, so that when it starts Apache anew we are in
a clean and defined environment, and no remnants of the previous instance
are left. For this to work I need some kind of notification when all
children are gone. Of course if systemd is PID 1, I can just use SIGCHLD
for that, but that's more difficult when we are managing user processes,
and want to do that with a PID != 1. And even when we are PID 1 it's kinda
neat to have an explicit notification for when a cgroup is empty instead
of having to constantly check whether the cgroup is now empty after each
SIGCHLD we receive.

Also, there must be a way to opt out of automatic cleanup for some
groups, since it might make sense to give users access to subtrees of
the hierarchy and if you clean up groups belonging to privileged code
then you get a namespace problem because unprivileged code might
recreate that group and confuse everybody.

So yeah, auto-cleanup is nice, but notifications I want too, please
thanks.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-20 04:26:11

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

* Lennart Poettering <[email protected]> [2010-11-20 02:13:30]:

> On Fri, 19.11.10 14:12, Ben Gamari ([email protected]) wrote:
>
> >
> > On Fri, 19 Nov 2010 09:51:14 -0800, Linus Torvalds <[email protected]> wrote:
> > > And the user level approach? I think it's fine too. If you run systemd
> > > for other reasons (or if the gnome people add it to the task launcher
> > > or whatever), doing it there isn't wrong. I personally think it's
> > > somewhat disgusting to have a user-level callback with processes etc
> > > just to clean up a group, but whatever. As long as it's not common,
> > > who cares?
> > >
> > On that note, is there a good reason why the notify_on_release interface
> > works the way it does? Wouldn't it be simpler if the cgroup simply
> > provided a file on which a process (e.g. systemd) could block?
>
> The notify_on_release interface is awful indeed. Feels like the old
> hotplug interface where each module request by the kernel caused a
> hotplug script to be spawned by the kernel.
>
> However, I am not sure I like the idea of having pollable files like that,
> because in the systemd case I am very much interested in getting
> recursive notifications, i.e. I want to register once for getting
> notifications for a full subtree instead of having to register for each
> cgroup individually.
>
> My personal favourite solution would be to get a netlink msg when a
> cgroup runs empty. That way multiple programs could listen to the events
> at the same time, and we'd have an easy way to subscribe to a whole
> hierarchy of groups.
>

The netlink message should not be hard to do if we agree to work on
it. The largest objection I've heard is that netlink implies
network programming; most users want to be able to script their
automation, and network scripting is hard.

--
Three Cheers,
Balbir

2010-11-20 15:41:24

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sat, 20.11.10 09:55, Balbir Singh ([email protected]) wrote:

> > However, I am not sure I like the idea of having pollable files like that,
> > because in the systemd case I am very much interested in getting
> > recursive notifications, i.e. I want to register once for getting
> > notifications for a full subtree instead of having to register for each
> > cgroup individually.
> >
> > My personal favourite solution would be to get a netlink msg when a
> > cgroup runs empty. That way multiple programs could listen to the events
> > at the same time, and we'd have an easy way to subscribe to a whole
> > hierarchy of groups.
>
> The netlink message should not be hard to do if we agree to work on
> it. The largest objections I've heard is that netlink implies
> network programming and most users want to be able to script in
> their automation and network scripting is hard.

Well, the notify_on_release stuff cannot be dropped anyway at this point
in time, so netlink support would be an addition to, not a replacement
for, the current stuff, which may remain useful for the scripting folks.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-20 19:35:26

by Mike Galbraith

[permalink] [raw]
Subject: [PATCH v4] sched: automated per session task groups

On Tue, 2010-11-16 at 18:28 +0100, Ingo Molnar wrote:

> Mike,
>
> Mind sending a new patch with a separate v2 announcement in a new thread, once you
> have something i could apply to the scheduler tree (for a v2.6.38 merge)?

Changes since last:
- switch to per session vs tty
- make autogroups visible in /proc/sched_debug
- make autogroups visible in /proc/<pid>/autogroup
- add nice level bandwidth tweakability to /proc/<pid>/autogroup

Modulo "kill it" debate outcome...

A recurring complaint from CFS users is that parallel kbuild has a negative
impact on desktop interactivity. This patch implements an idea from Linus,
to automatically create task groups. This patch only implements per session
autogroups, but leaves the way open for enhancement.

Implementation: each task's signal struct contains an inherited pointer to a
refcounted autogroup struct containing a task group pointer, the default for
all tasks pointing to the init_task_group. When a task calls setsid(), the
process wide reference to the default group is dropped, a new task group is
created, and the process is moved into the new task group. Children thereafter
inherit this task group, and increase its refcount. On exit, a reference to the
current task group is dropped when the last reference to each signal struct is
dropped. The task group is destroyed when the last signal struct referencing
it is freed. At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.

Autogroup bandwidth is controllable by setting its nice level through the
proc filesystem. cat /proc/<pid>/autogroup displays the task's group and
the group's nice level. echo <nice level> > /proc/<pid>/autogroup sets the
task group's shares to the weight of a nice <level> task. Setting the nice
level is rate limited for !admin users due to the abuse risk of task group
locking.

The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is
selected, but can be disabled via the boot option noautogroup, and can
also be turned on/off on the fly via..
echo [01] > /proc/sys/kernel/sched_autogroup_enabled.
..which will automatically move tasks to/from the root task group.

Signed-off-by: Mike Galbraith <[email protected]>

---
Documentation/kernel-parameters.txt | 2
fs/proc/base.c | 79 +++++++++++
include/linux/sched.h | 23 +++
init/Kconfig | 12 +
kernel/fork.c | 5
kernel/sched.c | 13 +
kernel/sched_autogroup.c | 243 ++++++++++++++++++++++++++++++++++++
kernel/sched_autogroup.h | 23 +++
kernel/sched_debug.c | 29 ++--
kernel/sys.c | 4
kernel/sysctl.c | 11 +
11 files changed, 426 insertions(+), 18 deletions(-)

Index: linux-2.6.37.git/include/linux/sched.h
===================================================================
--- linux-2.6.37.git.orig/include/linux/sched.h
+++ linux-2.6.37.git/include/linux/sched.h
@@ -509,6 +509,8 @@ struct thread_group_cputimer {
spinlock_t lock;
};

+struct autogroup;
+
/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
@@ -576,6 +578,9 @@ struct signal_struct {

struct tty_struct *tty; /* NULL if no tty */

+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
/*
* Cumulative resource counters for dead threads in the group,
* and for reaped dead child processes forked by this group.
@@ -1931,6 +1936,24 @@ int sched_rt_handler(struct ctl_table *t

extern unsigned int sysctl_sched_compat_yield;

+#ifdef CONFIG_SCHED_AUTOGROUP
+extern unsigned int sysctl_sched_autogroup_enabled;
+
+extern void sched_autogroup_create_attach(struct task_struct *p);
+extern void sched_autogroup_detach(struct task_struct *p);
+extern void sched_autogroup_fork(struct signal_struct *sig);
+extern void sched_autogroup_exit(struct signal_struct *sig);
+#ifdef CONFIG_PROC_FS
+extern void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m);
+extern int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice);
+#endif
+#else
+static inline void sched_autogroup_create_attach(struct task_struct *p) { }
+static inline void sched_autogroup_detach(struct task_struct *p) { }
+static inline void sched_autogroup_fork(struct signal_struct *sig) { }
+static inline void sched_autogroup_exit(struct signal_struct *sig) { }
+#endif
+
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
extern void rt_mutex_setprio(struct task_struct *p, int prio);
Index: linux-2.6.37.git/kernel/sched.c
===================================================================
--- linux-2.6.37.git.orig/kernel/sched.c
+++ linux-2.6.37.git/kernel/sched.c
@@ -78,6 +78,7 @@

#include "sched_cpupri.h"
#include "workqueue_sched.h"
+#include "sched_autogroup.h"

#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
@@ -268,6 +269,10 @@ struct task_group {
struct task_group *parent;
struct list_head siblings;
struct list_head children;
+
+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
};

#define root_task_group init_task_group
@@ -605,11 +610,14 @@ static inline int cpu_of(struct rq *rq)
*/
static inline struct task_group *task_group(struct task_struct *p)
{
+ struct task_group *tg;
struct cgroup_subsys_state *css;

css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
- return container_of(css, struct task_group, css);
+ tg = container_of(css, struct task_group, css);
+
+ return autogroup_task_group(p, tg);
}

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -2006,6 +2014,7 @@ static void sched_irq_time_avg_update(st
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
+#include "sched_autogroup.c"
#include "sched_stoptask.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
@@ -7979,7 +7988,7 @@ void __init sched_init(void)
#ifdef CONFIG_CGROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
-
+ autogroup_init(&init_task);
#endif /* CONFIG_CGROUP_SCHED */

#if defined CONFIG_FAIR_GROUP_SCHED && defined CONFIG_SMP
Index: linux-2.6.37.git/kernel/fork.c
===================================================================
--- linux-2.6.37.git.orig/kernel/fork.c
+++ linux-2.6.37.git/kernel/fork.c
@@ -174,8 +174,10 @@ static inline void free_signal_struct(st

static inline void put_signal_struct(struct signal_struct *sig)
{
- if (atomic_dec_and_test(&sig->sigcnt))
+ if (atomic_dec_and_test(&sig->sigcnt)) {
+ sched_autogroup_exit(sig);
free_signal_struct(sig);
+ }
}

void __put_task_struct(struct task_struct *tsk)
@@ -904,6 +906,7 @@ static int copy_signal(unsigned long clo
posix_cpu_timers_init_group(sig);

tty_audit_fork(sig);
+ sched_autogroup_fork(sig);

sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
Index: linux-2.6.37.git/kernel/sys.c
===================================================================
--- linux-2.6.37.git.orig/kernel/sys.c
+++ linux-2.6.37.git/kernel/sys.c
@@ -1080,8 +1080,10 @@ SYSCALL_DEFINE0(setsid)
err = session;
out:
write_unlock_irq(&tasklist_lock);
- if (err > 0)
+ if (err > 0) {
proc_sid_connector(group_leader);
+ sched_autogroup_create_attach(group_leader);
+ }
return err;
}

Index: linux-2.6.37.git/kernel/sched_debug.c
===================================================================
--- linux-2.6.37.git.orig/kernel/sched_debug.c
+++ linux-2.6.37.git/kernel/sched_debug.c
@@ -87,6 +87,20 @@ static void print_cfs_group_stats(struct
}
#endif

+#if defined(CONFIG_CGROUP_SCHED) && \
+ (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
+static void task_group_path(struct task_group *tg, char *buf, int buflen)
+{
+ /* may be NULL if the underlying cgroup isn't fully-created yet */
+ if (!tg->css.cgroup) {
+ if (!autogroup_path(tg, buf, buflen))
+ buf[0] = '\0';
+ return;
+ }
+ cgroup_path(tg->css.cgroup, buf, buflen);
+}
+#endif
+
static void
print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
{
@@ -115,7 +129,7 @@ print_task(struct seq_file *m, struct rq
char path[64];

rcu_read_lock();
- cgroup_path(task_group(p)->css.cgroup, path, sizeof(path));
+ task_group_path(task_group(p), path, sizeof(path));
rcu_read_unlock();
SEQ_printf(m, " %s", path);
}
@@ -147,19 +161,6 @@ static void print_rq(struct seq_file *m,
read_unlock_irqrestore(&tasklist_lock, flags);
}

-#if defined(CONFIG_CGROUP_SCHED) && \
- (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
-static void task_group_path(struct task_group *tg, char *buf, int buflen)
-{
- /* may be NULL if the underlying cgroup isn't fully-created yet */
- if (!tg->css.cgroup) {
- buf[0] = '\0';
- return;
- }
- cgroup_path(tg->css.cgroup, buf, buflen);
-}
-#endif
-
void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{
s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
Index: linux-2.6.37.git/fs/proc/base.c
===================================================================
--- linux-2.6.37.git.orig/fs/proc/base.c
+++ linux-2.6.37.git/fs/proc/base.c
@@ -1407,6 +1407,82 @@ static const struct file_operations proc

#endif

+#ifdef CONFIG_SCHED_AUTOGROUP
+/*
+ * Print out autogroup related information:
+ */
+static int sched_autogroup_show(struct seq_file *m, void *v)
+{
+ struct inode *inode = m->private;
+ struct task_struct *p;
+
+ p = get_proc_task(inode);
+ if (!p)
+ return -ESRCH;
+ proc_sched_autogroup_show_task(p, m);
+
+ put_task_struct(p);
+
+ return 0;
+}
+
+static ssize_t
+sched_autogroup_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *offset)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ struct task_struct *p;
+ char buffer[PROC_NUMBUF];
+ long nice;
+ int err;
+
+ memset(buffer, 0, sizeof(buffer));
+ if (count > sizeof(buffer) - 1)
+ count = sizeof(buffer) - 1;
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+
+ err = strict_strtol(strstrip(buffer), 0, &nice);
+ if (err)
+ return -EINVAL;
+
+ p = get_proc_task(inode);
+ if (!p)
+ return -ESRCH;
+
+ err = nice;
+ err = proc_sched_autogroup_set_nice(p, &err);
+ if (err)
+ count = err;
+
+ put_task_struct(p);
+
+ return count;
+}
+
+static int sched_autogroup_open(struct inode *inode, struct file *filp)
+{
+ int ret;
+
+ ret = single_open(filp, sched_autogroup_show, NULL);
+ if (!ret) {
+ struct seq_file *m = filp->private_data;
+
+ m->private = inode;
+ }
+ return ret;
+}
+
+static const struct file_operations proc_pid_sched_autogroup_operations = {
+ .open = sched_autogroup_open,
+ .read = seq_read,
+ .write = sched_autogroup_write,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
+
static ssize_t comm_write(struct file *file, const char __user *buf,
size_t count, loff_t *offset)
{
@@ -2733,6 +2809,9 @@ static const struct pid_entry tgid_base_
#ifdef CONFIG_SCHED_DEBUG
REG("sched", S_IRUGO|S_IWUSR, proc_pid_sched_operations),
#endif
+#ifdef CONFIG_SCHED_AUTOGROUP
+ REG("autogroup", S_IRUGO|S_IWUSR, proc_pid_sched_autogroup_operations),
+#endif
REG("comm", S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
INF("syscall", S_IRUSR, proc_pid_syscall),
Index: linux-2.6.37.git/kernel/sched_autogroup.h
===================================================================
--- /dev/null
+++ linux-2.6.37.git/kernel/sched_autogroup.h
@@ -0,0 +1,23 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg);
+
+#else /* !CONFIG_SCHED_AUTOGROUP */
+
+static inline void autogroup_init(struct task_struct *init_task) { }
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ return tg;
+}
+
+#ifdef CONFIG_SCHED_DEBUG
+static inline int autogroup_path(struct task_group *tg, char *buf, int buflen)
+{
+ return 0;
+}
+#endif
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
Index: linux-2.6.37.git/kernel/sched_autogroup.c
===================================================================
--- /dev/null
+++ linux-2.6.37.git/kernel/sched_autogroup.c
@@ -0,0 +1,243 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/utsname.h>
+
+unsigned int __read_mostly sysctl_sched_autogroup_enabled = 1;
+
+struct autogroup {
+ struct task_group *tg;
+ struct kref kref;
+ struct rw_semaphore lock;
+ unsigned long id;
+ int nice;
+};
+
+static struct autogroup autogroup_default;
+static atomic_t autogroup_seq_nr;
+
+static void autogroup_init(struct task_struct *init_task)
+{
+ autogroup_default.tg = &init_task_group;
+ init_task_group.autogroup = &autogroup_default;
+ kref_init(&autogroup_default.kref);
+ init_rwsem(&autogroup_default.lock);
+ init_task->signal->autogroup = &autogroup_default;
+}
+
+static inline void autogroup_destroy(struct kref *kref)
+{
+ struct autogroup *ag = container_of(kref, struct autogroup, kref);
+ struct task_group *tg = ag->tg;
+
+ kfree(ag);
+ sched_destroy_group(tg);
+}
+
+static inline void autogroup_kref_put(struct autogroup *ag)
+{
+ kref_put(&ag->kref, autogroup_destroy);
+}
+
+static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
+{
+ kref_get(&ag->kref);
+ return ag;
+}
+
+static inline struct autogroup *autogroup_create(void)
+{
+ struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
+
+ if (!ag)
+ goto out_fail;
+
+ ag->tg = sched_create_group(&init_task_group);
+
+ if (IS_ERR(ag->tg))
+ goto out_fail;
+
+ ag->tg->autogroup = ag;
+ kref_init(&ag->kref);
+ init_rwsem(&ag->lock);
+ ag->id = atomic_inc_return(&autogroup_seq_nr);
+
+ return ag;
+
+out_fail:
+ if (ag) {
+ kfree(ag);
+ WARN_ON(1);
+ } else
+ WARN_ON(1);
+
+ return autogroup_kref_get(&autogroup_default);
+}
+
+static inline bool
+task_wants_autogroup(struct task_struct *p, struct task_group *tg)
+{
+ if (tg != &root_task_group)
+ return false;
+
+ if (p->sched_class != &fair_sched_class)
+ return false;
+
+ /*
+ * We can only assume the task group can't go away on us if
+ * autogroup_move_group() can see us on ->thread_group list.
+ */
+ if (p->flags & PF_EXITING)
+ return false;
+
+ return true;
+}
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled);
+
+ if (enabled && task_wants_autogroup(p, tg))
+ return p->signal->autogroup->tg;
+
+ return tg;
+}
+
+static void
+autogroup_move_group(struct task_struct *p, struct autogroup *ag)
+{
+ struct autogroup *prev;
+ struct task_struct *t;
+
+ spin_lock(&p->sighand->siglock);
+
+ prev = p->signal->autogroup;
+ if (prev == ag) {
+ spin_unlock(&p->sighand->siglock);
+ return;
+ }
+
+ p->signal->autogroup = autogroup_kref_get(ag);
+ t = p;
+
+ do {
+ sched_move_task(t);
+ } while_each_thread(p, t);
+
+ spin_unlock(&p->sighand->siglock);
+
+ autogroup_kref_put(prev);
+}
+
+/* Allocates GFP_KERNEL, cannot be called under any spinlock */
+void sched_autogroup_create_attach(struct task_struct *p)
+{
+ struct autogroup *ag = autogroup_create();
+
+ autogroup_move_group(p, ag);
+ /* drop extra reference added by autogroup_create() */
+ autogroup_kref_put(ag);
+}
+EXPORT_SYMBOL(sched_autogroup_create_attach);
+
+/* Cannot be called under siglock. Currently has no users */
+void sched_autogroup_detach(struct task_struct *p)
+{
+ autogroup_move_group(p, &autogroup_default);
+}
+EXPORT_SYMBOL(sched_autogroup_detach);
+
+void sched_autogroup_fork(struct signal_struct *sig)
+{
+ struct sighand_struct *sighand = current->sighand;
+
+ spin_lock(&sighand->siglock);
+ sig->autogroup = autogroup_kref_get(current->signal->autogroup);
+ spin_unlock(&sighand->siglock);
+}
+
+void sched_autogroup_exit(struct signal_struct *sig)
+{
+ autogroup_kref_put(sig->autogroup);
+}
+
+static int __init setup_autogroup(char *str)
+{
+ sysctl_sched_autogroup_enabled = 0;
+
+ return 1;
+}
+
+__setup("noautogroup", setup_autogroup);
+
+#ifdef CONFIG_PROC_FS
+
+static inline struct autogroup *autogroup_get(struct task_struct *p)
+{
+ struct autogroup *ag;
+
+ /* task may be moved after we unlock.. tough */
+ spin_lock(&p->sighand->siglock);
+ ag = autogroup_kref_get(p->signal->autogroup);
+ spin_unlock(&p->sighand->siglock);
+
+ return ag;
+}
+
+int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
+{
+ static unsigned long next = INITIAL_JIFFIES;
+ struct autogroup *ag;
+ int err;
+
+ if (*nice < -20 || *nice > 19)
+ return -EINVAL;
+
+ err = security_task_setnice(current, *nice);
+ if (err)
+ return err;
+
+ if (*nice < 0 && !can_nice(current, *nice))
+ return -EPERM;
+
+ /* this is a heavy operation taking global locks.. */
+ if (!capable(CAP_SYS_ADMIN) && time_before(jiffies, next))
+ return -EAGAIN;
+
+ next = HZ / 10 + jiffies;
+ ag = autogroup_get(p);
+
+ down_write(&ag->lock);
+ err = sched_group_set_shares(ag->tg, prio_to_weight[*nice + 20]);
+ if (!err)
+ ag->nice = *nice;
+ up_write(&ag->lock);
+
+ autogroup_kref_put(ag);
+
+ return err;
+}
+
+void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m)
+{
+ struct autogroup *ag = autogroup_get(p);
+
+ down_read(&ag->lock);
+ seq_printf(m, "/autogroup-%ld nice %d\n", ag->id, ag->nice);
+ up_read(&ag->lock);
+
+ autogroup_kref_put(ag);
+}
+#endif /* CONFIG_PROC_FS */
+
+#ifdef CONFIG_SCHED_DEBUG
+static inline int autogroup_path(struct task_group *tg, char *buf, int buflen)
+{
+ return snprintf(buf, buflen, "%s-%ld", "/autogroup", tg->autogroup->id);
+}
+#endif /* CONFIG_SCHED_DEBUG */
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
Index: linux-2.6.37.git/kernel/sysctl.c
===================================================================
--- linux-2.6.37.git.orig/kernel/sysctl.c
+++ linux-2.6.37.git/kernel/sysctl.c
@@ -382,6 +382,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SCHED_AUTOGROUP
+ {
+ .procname = "sched_autogroup_enabled",
+ .data = &sysctl_sched_autogroup_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
Index: linux-2.6.37.git/init/Kconfig
===================================================================
--- linux-2.6.37.git.orig/init/Kconfig
+++ linux-2.6.37.git/init/Kconfig
@@ -728,6 +728,18 @@ config NET_NS

endif # NAMESPACES

+config SCHED_AUTOGROUP
+ bool "Automatic process group scheduling"
+ select CGROUPS
+ select CGROUP_SCHED
+ select FAIR_GROUP_SCHED
+ help
+ This option optimizes the scheduler for common desktop workloads by
+ automatically creating and populating task groups. This separation
+ of workloads isolates aggressive CPU burners (like build jobs) from
+ desktop applications. Task group autogeneration is currently based
+ upon task session.
+
config MM_OWNER
bool

Index: linux-2.6.37.git/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.37.git.orig/Documentation/kernel-parameters.txt
+++ linux-2.6.37.git/Documentation/kernel-parameters.txt
@@ -1622,6 +1622,8 @@ and is between 256 and 4096 characters.
noapic [SMP,APIC] Tells the kernel to not make use of any
IOAPICs that may be present in the system.

+ noautogroup Disable scheduler automatic task group creation.
+
nobats [PPC] Do not use BATs for mapping kernel lowmem
on "Classic" PPC cores.


2010-11-20 19:46:20

by Jesper Juhl

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, 15 Nov 2010, Linus Torvalds wrote:

> On Mon, Nov 15, 2010 at 2:41 PM, <[email protected]> wrote:
> >
> > So the set of all tasks that never call proc_set_tty() ends up in the same one
> > big default group, correct?
>
> Well, yes and no.
>
> Yes, that's what the code currently does. But I did ask Mike (and he
> delivered) to try to make the code look and work in a way where the
> whole "tty thing" is just one of the heuristics.
>
> It's not clear exactly what the non-tty heuristics would be, but I do
> have a few suggestions:
>
> - I think it might be a good idea to associate a task group with the
> current "cred" of a process, and fall back on it in the absense of a
> tty-provided one.
>
Or how about (just brainstorming here) a group per 'process group'?


--
Jesper Juhl <[email protected]> http://www.chaosbits.net/
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.

2010-11-20 19:51:40

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sat, 2010-11-20 at 20:33 +0100, Jesper Juhl wrote:
> On Mon, 15 Nov 2010, Linus Torvalds wrote:
>
> > On Mon, Nov 15, 2010 at 2:41 PM, <[email protected]> wrote:
> > >
> > > So the set of all tasks that never call proc_set_tty() ends up in the same one
> > > big default group, correct?
> >
> > Well, yes and no.
> >
> > Yes, that's what the code currently does. But I did ask Mike (and he
> > delivered) to try to make the code look and work in a way where the
> > whole "tty thing" is just one of the heuristics.
> >
> > It's not clear exactly what the non-tty heuristics would be, but I do
> > have a few suggestions:
> >
> > - I think it might be a good idea to associate a task group with the
> > current "cred" of a process, and fall back on it in the absense of a
> > tty-provided one.
> >
> Or how about (just brainstorming here) a group per 'process group'?

I switched to per session, which on my system at least looks like more
than enough granularity.

-Mike

2010-11-20 20:25:12

by Samuel Thibault

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Jesper Juhl, le Sat 20 Nov 2010 20:33:21 +0100, a écrit :
> Or how about (just brainstorming here) a group per 'process group'?

That'd be too heavy by Linus' measurement, as you have a process group
for each shell command.

Samuel

2010-11-20 20:50:07

by Jesper Juhl

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sat, 20 Nov 2010, Mike Galbraith wrote:

> On Sat, 2010-11-20 at 20:33 +0100, Jesper Juhl wrote:
> > On Mon, 15 Nov 2010, Linus Torvalds wrote:
> >
> > > On Mon, Nov 15, 2010 at 2:41 PM, <[email protected]> wrote:
> > > >
> > > > So the set of all tasks that never call proc_set_tty() ends up in the same one
> > > > big default group, correct?
> > >
> > > Well, yes and no.
> > >
> > > Yes, that's what the code currently does. But I did ask Mike (and he
> > > delivered) to try to make the code look and work in a way where the
> > > whole "tty thing" is just one of the heuristics.
> > >
> > > It's not clear exactly what the non-tty heuristics would be, but I do
> > > have a few suggestions:
> > >
> > > - I think it might be a good idea to associate a task group with the
> > > current "cred" of a process, and fall back on it in the absence of a
> > > tty-provided one.
> > >
> > Or how about (just brainstorming here) a group per 'process group'?
>
> I switched to per session, which on my system at least looks like more
> than enough granularity.
>
Sounds sane and it's closer to the original 'per tty' which was so
successful.

--
Jesper Juhl <[email protected]> http://www.chaosbits.net/
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.

2010-11-20 22:02:39

by Konstantin Svist

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On 11/20/2010 11:51 AM, Mike Galbraith wrote:
> On Sat, 2010-11-20 at 20:33 +0100, Jesper Juhl wrote:
>> On Mon, 15 Nov 2010, Linus Torvalds wrote:
>>
>>> On Mon, Nov 15, 2010 at 2:41 PM,<[email protected]> wrote:
>>>> So the set of all tasks that never call proc_set_tty() ends up in the same one
>>>> big default group, correct?
>>> Well, yes and no.
>>>
>>> Yes, that's what the code currently does. But I did ask Mike (and he
>>> delivered) to try to make the code look and work in a way where the
>>> whole "tty thing" is just one of the heuristics.
>>>
>>> It's not clear exactly what the non-tty heuristics would be, but I do
>>> have a few suggestions:
>>>
>>> - I think it might be a good idea to associate a task group with the
>>> current "cred" of a process, and fall back on it in the absense of a
>>> tty-provided one.
>>>
>> Or how about (just brainstorming here) a group per 'process group'?
> I switched to per session, which on my system at least looks like more
> than enough granularity

Will that have an effect on software like Chromium which creates a fork
for each tab? If a user opens Thunderbird and Chromium with 100 tabs,
Thunderbird should probably get 50% CPU time instead of just 1%...

2010-11-20 22:15:06

by Samuel Thibault

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

Konstantin Svist, le Sat 20 Nov 2010 14:02:32 -0800, a écrit :
> Will that have an effect on software like Chromium which creates a fork
> for each tab? If a user opens Thunderbird and Chromium with 100 tabs,
> Thunderbird should probably get 50% CPU time instead of just 1%...

Chromium doesn't create a session or even a process group for each tab.

Samuel
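
Easy to check: fork() preserves the session id, so a per-tab child reports
the same sid as the browser unless it explicitly calls setsid(). A minimal
demonstration:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		/* child: same session as the parent, no setsid() call */
		printf("child  sid: %d\n", (int)getsid(0));
		_exit(0);
	}
	printf("parent sid: %d\n", (int)getsid(0));
	waitpid(pid, NULL, 0);
	return 0;
}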

2010-11-20 22:17:03

by Mika Laitio

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

> > What complexity? Have you looked at the patch? It has no complexity anywhere.
> >
> > It's a _lot_ less complex than having system daemons you don't
> > control. We have not had good results with that approach in the past.
> > System daemons tend to cause nasty problems, and debugging them is a
> > nightmare.
>
> Well, userspace doesn't bite.

To me, this starts sounding like the legendary "userspace suspend" versus
"kernel suspend" fights of many years ago... Since then, I have
_sometimes_but_rarely_ been lucky enough to get my laptops to recover
from their suspend states successfully...

And as the kernel patches allow disabling the scheduler feature either via
kernel build parameters, boot parameters or at runtime _if_you_ really want,
I do not understand why you complain when there will be much better defaults
in the kernel immediately once you just build and boot the new kernel.
(I really do not plan to start converting the init-rd boot scripts of all
of my computers to the systemd way of doing things in the near future...)

Mika

2010-11-20 22:18:27

by Thomas Fjellstrom

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On November 20, 2010, you wrote:
> On 11/20/2010 11:51 AM, Mike Galbraith wrote:
> > On Sat, 2010-11-20 at 20:33 +0100, Jesper Juhl wrote:
> >> On Mon, 15 Nov 2010, Linus Torvalds wrote:
> >>> On Mon, Nov 15, 2010 at 2:41 PM,<[email protected]> wrote:
> >>>> So the set of all tasks that never call proc_set_tty() ends up in the
> >>>> same one big default group, correct?
> >>>
> >>> Well, yes and no.
> >>>
> >>> Yes, that's what the code currently does. But I did ask Mike (and he
> >>> delivered) to try to make the code look and work in a way where the
> >>> whole "tty thing" is just one of the heuristics.
> >>>
> >>> It's not clear exactly what the non-tty heuristics would be, but I do
> >>>
> >>> have a few suggestions:
> >>> - I think it might be a good idea to associate a task group with the
> >>>
> >>> current "cred" of a process, and fall back on it in the absence of a
> >>> tty-provided one.
> >>
> >> Or how about (just brainstorming here) a group per 'process group'?
> >
> > I switched to per session, which on my system at least looks like more
> > than enough granularity
>
> Will that have an effect on software like Chromium which creates a fork
> for each tab? If a user opens Thunderbird and Chromium with 100 tabs,
> Thunderbird should probably get 50% CPU time instead of just 1%...

At least on my machine, all of the chromium processes have the same session
id.



--
Thomas Fjellstrom
[email protected]

2010-11-21 00:19:28

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Sun, 2010-11-21 at 00:16 +0200, Mika Laitio wrote:
> > > What complexity? Have you looked at the patch? It has no complexity anywhere.
> > >
> > > It's a _lot_ less complex than having system daemons you don't
> > > control. We have not had good results with that approach in the past.
> > > System daemons tend to cause nasty problems, and debugging them is a
> > > nightmare.
> >
> > Well, userspace doesn't bite.
>
> To me, this starts sounding like the legendary "userspace suspend" versus
> "kernel suspend" fights of many years ago...

Nah. Patch accepted, rejected, obsoleted 10 minutes from now etc etc,
it'll never be a big deal.

-Mike

2010-11-21 13:38:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups


Hello Mike,

* Mike Galbraith <[email protected]> wrote:

> On Tue, 2010-11-16 at 18:28 +0100, Ingo Molnar wrote:
>
> > Mike,
> >
> > Mind sending a new patch with a separate v2 announcement in a new thread, once you
> > have something i could apply to the scheduler tree (for a v2.6.38 merge)?
>
> Changes since last:
> - switch to per session vs tty
> - make autogroups visible in /proc/sched_debug
> - make autogroups visible in /proc/<pid>/autogroup
> - add nice level bandwidth tweakability to /proc/<pid>/autogroup

I tested it a bit, and autosched-v4 crashes on bootup with the attached config.

Note: the box has serial logging enabled and there's UART code in the stacktrace -
maybe it's related. Let me know if you need the full bootup log.

Thanks,

Ingo

[FAILED]
Enabling local filesystem quotas: [ OK ]
PPS event at 4294886381
Enabling /etc/fstab swaps: swapon: /dev/hda2: Function not implemented
[FAILED]
INIT: Entering runleveBUG: unable to handle kernel paging request at f548604c
IP:l: 3 [<c10307f0>] update_cfs_shares+0x60/0x160
*pdpt = 0000000002017001 *pde = 00000000029d4067 *pte = 8000000035486160
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/block/sr0/dev

Pid: 1, comm: init Not tainted 2.6.37-rc2-tip+ #64308 A8N-E/System Product Name
EIP: 0060:[<c10307f0>] EFLAGS: 00010086 CPU: 1
EIP is at update_cfs_shares+0x60/0x160
EAX: fffffffe EBX: f547603b ECX: 00000400 EDX: 00000002
ESI: f5486000 EDI: 0000013b EBP: f6459d48 ESP: f6459d3c
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process init (pid: 1, ti=f6458000 task=f6450000 task.ti=f6458000)
Stack:
f5475a80 f6f066c0 00000004 f6459d84 c103256f 00000002 00000001 00000000
c10324d0 c200e6c0 00000001 f6f06b34 00000046 f5475a80 f5475ac8 f6f066c0
00000001 ffffffff f6459dfc c1b32820 f64a0010 f6459dc4 00000046 00000000
Call Trace:
[<c103256f>] update_shares+0x9f/0x170
[<c10324d0>] ? update_shares+0x0/0x170
[<c1b32820>] schedule+0x580/0x9d0
[<c1039335>] ? sub_preempt_count+0xa5/0xe0
[<c1b330e5>] schedule_timeout+0x125/0x2a0
[<c104fe60>] ? process_timeout+0x0/0x10
[<c15aef4f>] uart_close+0x17f/0x350
[<c105fea0>] ? autoremove_wake_function+0x0/0x50
[<c1471f72>] tty_release+0x102/0x500
[<c1125fdf>] ? locks_remove_posix+0xf/0xa0
[<c1119a43>] ? fsnotify+0x1e3/0x2f0
[<c11198d3>] ? fsnotify+0x73/0x2f0
[<c10ea1e1>] fput+0xb1/0x230
[<c10e7e7e>] filp_close+0x4e/0x70
[<c10e7f14>] sys_close+0x74/0xc0
[<c1002b90>] sysenter_do_call+0x12/0x31
Code: 00 00 00 8b 18 8b 79 1c 8b 49 18 2b b8 84 00 00 00 01 d3 89 d8 0f af c1 01 fb 74 07 89 c2 c1 fa 1f f7 fb 83 f8 02 ba 02 00 00 00 <8b> 5e 4c 0f 4d d0 39 d1 0f 42 d1 8b 4e 1c 85 c9 0f 84 6a 00 00
EIP: [<c10307f0>] update_cfs_shares+0x60/0x160 SS:ESP 0068:f6459d3c
CR2: 00000000f548604c
---[ end trace f0ad48f53e29a8fe ]---
Kernel panic - not syncing: Fatal exception
Pid: 1, comm: init Tainted: G D 2.6.37-rc2-tip+ #64308
Call Trace:
[<c1b31ef1>] ? panic+0x66/0x15c
[<c10065c3>] ? oops_end+0x83/0x90
[<c10220fc>] ? no_context+0xbc/0x190
[<c102225d>] ? __bad_area_nosemaphore+0x8d/0x130
[<c10219a4>] ? vmalloc_fault+0x14/0x1c0
[<c1021b64>] ? spurious_fault+0x14/0x110
[<c1022317>] ? bad_area_nosemaphore+0x17/0x20
[<c1022741>] ? do_page_fault+0x281/0x4c0
[<c1008756>] ? native_sched_clock+0x26/0x90
[<c1066033>] ? sched_clock_local+0xd3/0x1c0
[<c10224c0>] ? do_page_fault+0x0/0x4c0
[<c1b361e2>] ? error_code+0x5a/0x60
[<c10224c0>] ? do_page_fault+0x0/0x4c0
[<c10307f0>] ? update_cfs_shares+0x60/0x160
[<c103256f>] ? update_shares+0x9f/0x170
[<c10324d0>] ? update_shares+0x0/0x170
[<c1b32820>] ? schedule+0x580/0x9d0
[<c1039335>] ? sub_preempt_count+0xa5/0xe0
[<c1b330e5>] ? schedule_timeout+0x125/0x2a0
[<c104fe60>] ? process_timeout+0x0/0x10
[<c15aef4f>] ? uart_close+0x17f/0x350
[<c105fea0>] ? autoremove_wake_function+0x0/0x50
[<c1471f72>] ? tty_release+0x102/0x500
[<c1125fdf>] ? locks_remove_posix+0xf/0xa0
[<c1119a43>] ? fsnotify+0x1e3/0x2f0
[<c11198d3>] ? fsnotify+0x73/0x2f0
[<c10ea1e1>] ? fput+0xb1/0x230
[<c10e7e7e>] ? filp_close+0x4e/0x70
[<c10e7f14>] ? sys_close+0x74/0xc0
[<c1002b90>] ? sysenter_do_call+0x12/0x31
Rebooting in 1 seconds..Press any key to enter the menu



2010-11-21 13:39:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups


Btw., there's a small cleanup in the patch that i picked up (see below), and i also
edited the commit log a bit - so you might want to pick up the version below.

Ingo

--------------->
>From 741430a6af2bdefe3d017226bbcfe96f9ed46b58 Mon Sep 17 00:00:00 2001
From: Mike Galbraith <[email protected]>
Date: Sat, 20 Nov 2010 12:35:00 -0700
Subject: [PATCH] sched: Improve desktop interactivity: Implement automated per session task groups

A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity. This patch
implements an idea from Linus, to automatically create task
groups. This patch only implements per session autogroups, but
leaves the way open for enhancement.

Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group. When a task calls setsid(), the process wide
reference to the default group is dropped, a new task group is
created, and the process is moved into the new task group.
Children thereafter inherit this task group, and increase its
refcount. On exit, a reference to the current task group is
dropped when the last reference to each signal struct is
dropped. The task group is destroyed when the last signal
struct referencing it is freed.

At runqueue selection time, IFF a task has no cgroup assignment, its
current autogroup is used.

Autogroup bandwidth is controllable by setting its nice level
through the proc filesystem. cat /proc/<pid>/autogroup displays
the task's group and the group's nice level.

echo <nice level> > /proc/<pid>/autogroup

sets the task group's shares to the weight of a nice <level> task.
Setting the nice level is rate limited for !admin users due to the
abuse risk of task group locking.

The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via the
boot option noautogroup, and can also be turned on/off on the
fly via.. echo [01] > /proc/sys/kernel/sched_autogroup_enabled.
..which will automatically move tasks to/from the root task
group.

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Markus Trippelsdorf <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/kernel-parameters.txt | 2 +
fs/proc/base.c | 79 ++++++++++++
include/linux/sched.h | 23 ++++
init/Kconfig | 12 ++
kernel/fork.c | 5 +-
kernel/sched.c | 13 ++-
kernel/sched_autogroup.c | 240 +++++++++++++++++++++++++++++++++++
kernel/sched_autogroup.h | 23 ++++
kernel/sched_debug.c | 29 ++--
kernel/sys.c | 4 +-
kernel/sysctl.c | 11 ++
11 files changed, 423 insertions(+), 18 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 92e83e5..86820a7 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1622,6 +1622,8 @@ and is between 256 and 4096 characters. It is defined in the file
noapic [SMP,APIC] Tells the kernel to not make use of any
IOAPICs that may be present in the system.

+ noautogroup Disable scheduler automatic task group creation.
+
nobats [PPC] Do not use BATs for mapping kernel lowmem
on "Classic" PPC cores.

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..2fa0ce2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1407,6 +1407,82 @@ static const struct file_operations proc_pid_sched_operations = {

#endif

+#ifdef CONFIG_SCHED_AUTOGROUP
+/*
+ * Print out autogroup related information:
+ */
+static int sched_autogroup_show(struct seq_file *m, void *v)
+{
+ struct inode *inode = m->private;
+ struct task_struct *p;
+
+ p = get_proc_task(inode);
+ if (!p)
+ return -ESRCH;
+ proc_sched_autogroup_show_task(p, m);
+
+ put_task_struct(p);
+
+ return 0;
+}
+
+static ssize_t
+sched_autogroup_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *offset)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ struct task_struct *p;
+ char buffer[PROC_NUMBUF];
+ long nice;
+ int err;
+
+ memset(buffer, 0, sizeof(buffer));
+ if (count > sizeof(buffer) - 1)
+ count = sizeof(buffer) - 1;
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+
+ err = strict_strtol(strstrip(buffer), 0, &nice);
+ if (err)
+ return -EINVAL;
+
+ p = get_proc_task(inode);
+ if (!p)
+ return -ESRCH;
+
+ err = nice;
+ err = proc_sched_autogroup_set_nice(p, &err);
+ if (err)
+ count = err;
+
+ put_task_struct(p);
+
+ return count;
+}
+
+static int sched_autogroup_open(struct inode *inode, struct file *filp)
+{
+ int ret;
+
+ ret = single_open(filp, sched_autogroup_show, NULL);
+ if (!ret) {
+ struct seq_file *m = filp->private_data;
+
+ m->private = inode;
+ }
+ return ret;
+}
+
+static const struct file_operations proc_pid_sched_autogroup_operations = {
+ .open = sched_autogroup_open,
+ .read = seq_read,
+ .write = sched_autogroup_write,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
+
static ssize_t comm_write(struct file *file, const char __user *buf,
size_t count, loff_t *offset)
{
@@ -2733,6 +2809,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_SCHED_DEBUG
REG("sched", S_IRUGO|S_IWUSR, proc_pid_sched_operations),
#endif
+#ifdef CONFIG_SCHED_AUTOGROUP
+ REG("autogroup", S_IRUGO|S_IWUSR, proc_pid_sched_autogroup_operations),
+#endif
REG("comm", S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
INF("syscall", S_IRUSR, proc_pid_syscall),
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 840f127..bc6dca5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -509,6 +509,8 @@ struct thread_group_cputimer {
spinlock_t lock;
};

+struct autogroup;
+
/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
@@ -576,6 +578,9 @@ struct signal_struct {

struct tty_struct *tty; /* NULL if no tty */

+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
/*
* Cumulative resource counters for dead threads in the group,
* and for reaped dead child processes forked by this group.
@@ -1926,6 +1931,24 @@ int sched_rt_handler(struct ctl_table *table, int write,

extern unsigned int sysctl_sched_compat_yield;

+#ifdef CONFIG_SCHED_AUTOGROUP
+extern unsigned int sysctl_sched_autogroup_enabled;
+
+extern void sched_autogroup_create_attach(struct task_struct *p);
+extern void sched_autogroup_detach(struct task_struct *p);
+extern void sched_autogroup_fork(struct signal_struct *sig);
+extern void sched_autogroup_exit(struct signal_struct *sig);
+#ifdef CONFIG_PROC_FS
+extern void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m);
+extern int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice);
+#endif
+#else
+static inline void sched_autogroup_create_attach(struct task_struct *p) { }
+static inline void sched_autogroup_detach(struct task_struct *p) { }
+static inline void sched_autogroup_fork(struct signal_struct *sig) { }
+static inline void sched_autogroup_exit(struct signal_struct *sig) { }
+#endif
+
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
extern void rt_mutex_setprio(struct task_struct *p, int prio);
diff --git a/init/Kconfig b/init/Kconfig
index 88c1046..f6f44d0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -728,6 +728,18 @@ config NET_NS

endif # NAMESPACES

+config SCHED_AUTOGROUP
+ bool "Automatic process group scheduling"
+ select CGROUPS
+ select CGROUP_SCHED
+ select FAIR_GROUP_SCHED
+ help
+ This option optimizes the scheduler for common desktop workloads by
+ automatically creating and populating task groups. This separation
+ of workloads isolates aggressive CPU burners (like build jobs) from
+ desktop applications. Task group autogeneration is currently based
+ upon task session.
+
config MM_OWNER
bool

diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..b6f2475 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -174,8 +174,10 @@ static inline void free_signal_struct(struct signal_struct *sig)

static inline void put_signal_struct(struct signal_struct *sig)
{
- if (atomic_dec_and_test(&sig->sigcnt))
+ if (atomic_dec_and_test(&sig->sigcnt)) {
+ sched_autogroup_exit(sig);
free_signal_struct(sig);
+ }
}

void __put_task_struct(struct task_struct *tsk)
@@ -904,6 +906,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
posix_cpu_timers_init_group(sig);

tty_audit_fork(sig);
+ sched_autogroup_fork(sig);

sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
diff --git a/kernel/sched.c b/kernel/sched.c
index 550cf3a..2bc19cb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -78,6 +78,7 @@

#include "sched_cpupri.h"
#include "workqueue_sched.h"
+#include "sched_autogroup.h"

#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
@@ -270,6 +271,10 @@ struct task_group {
struct task_group *parent;
struct list_head siblings;
struct list_head children;
+
+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
};

#define root_task_group init_task_group
@@ -612,11 +617,14 @@ static inline int cpu_of(struct rq *rq)
*/
static inline struct task_group *task_group(struct task_struct *p)
{
+ struct task_group *tg;
struct cgroup_subsys_state *css;

css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
- return container_of(css, struct task_group, css);
+ tg = container_of(css, struct task_group, css);
+
+ return autogroup_task_group(p, tg);
}

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -1878,6 +1886,7 @@ static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { }
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
+#include "sched_autogroup.c"
#include "sched_stoptask.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
@@ -7734,7 +7743,7 @@ void __init sched_init(void)
#ifdef CONFIG_CGROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
-
+ autogroup_init(&init_task);
#endif /* CONFIG_CGROUP_SCHED */

for_each_possible_cpu(i) {
diff --git a/kernel/sched_autogroup.c b/kernel/sched_autogroup.c
new file mode 100644
index 0000000..2bd4020
--- /dev/null
+++ b/kernel/sched_autogroup.c
@@ -0,0 +1,240 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/utsname.h>
+
+unsigned int __read_mostly sysctl_sched_autogroup_enabled = 1;
+
+struct autogroup {
+ struct task_group *tg;
+ struct kref kref;
+ struct rw_semaphore lock;
+ unsigned long id;
+ int nice;
+};
+
+static struct autogroup autogroup_default;
+static atomic_t autogroup_seq_nr;
+
+static void autogroup_init(struct task_struct *init_task)
+{
+ autogroup_default.tg = &init_task_group;
+ init_task_group.autogroup = &autogroup_default;
+ kref_init(&autogroup_default.kref);
+ init_rwsem(&autogroup_default.lock);
+ init_task->signal->autogroup = &autogroup_default;
+}
+
+static inline void autogroup_destroy(struct kref *kref)
+{
+ struct autogroup *ag = container_of(kref, struct autogroup, kref);
+ struct task_group *tg = ag->tg;
+
+ kfree(ag);
+ sched_destroy_group(tg);
+}
+
+static inline void autogroup_kref_put(struct autogroup *ag)
+{
+ kref_put(&ag->kref, autogroup_destroy);
+}
+
+static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
+{
+ kref_get(&ag->kref);
+ return ag;
+}
+
+static inline struct autogroup *autogroup_create(void)
+{
+ struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
+
+ if (!ag)
+ goto out_fail;
+
+ ag->tg = sched_create_group(&init_task_group);
+
+ if (IS_ERR(ag->tg))
+ goto out_fail;
+
+ ag->tg->autogroup = ag;
+ kref_init(&ag->kref);
+ init_rwsem(&ag->lock);
+ ag->id = atomic_inc_return(&autogroup_seq_nr);
+
+ return ag;
+
+out_fail:
+ kfree(ag);
+ WARN_ON_ONCE(1);
+
+ return autogroup_kref_get(&autogroup_default);
+}
+
+static inline bool
+task_wants_autogroup(struct task_struct *p, struct task_group *tg)
+{
+ if (tg != &root_task_group)
+ return false;
+
+ if (p->sched_class != &fair_sched_class)
+ return false;
+
+ /*
+ * We can only assume the task group can't go away on us if
+ * autogroup_move_group() can see us on ->thread_group list.
+ */
+ if (p->flags & PF_EXITING)
+ return false;
+
+ return true;
+}
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled);
+
+ if (enabled && task_wants_autogroup(p, tg))
+ return p->signal->autogroup->tg;
+
+ return tg;
+}
+
+static void
+autogroup_move_group(struct task_struct *p, struct autogroup *ag)
+{
+ struct autogroup *prev;
+ struct task_struct *t;
+
+ spin_lock(&p->sighand->siglock);
+
+ prev = p->signal->autogroup;
+ if (prev == ag) {
+ spin_unlock(&p->sighand->siglock);
+ return;
+ }
+
+ p->signal->autogroup = autogroup_kref_get(ag);
+ t = p;
+
+ do {
+ sched_move_task(t);
+ } while_each_thread(p, t);
+
+ spin_unlock(&p->sighand->siglock);
+
+ autogroup_kref_put(prev);
+}
+
+/* Allocates GFP_KERNEL, cannot be called under any spinlock */
+void sched_autogroup_create_attach(struct task_struct *p)
+{
+ struct autogroup *ag = autogroup_create();
+
+ autogroup_move_group(p, ag);
+ /* drop extra reference added by autogroup_create() */
+ autogroup_kref_put(ag);
+}
+EXPORT_SYMBOL(sched_autogroup_create_attach);
+
+/* Cannot be called under siglock. Currently has no users */
+void sched_autogroup_detach(struct task_struct *p)
+{
+ autogroup_move_group(p, &autogroup_default);
+}
+EXPORT_SYMBOL(sched_autogroup_detach);
+
+void sched_autogroup_fork(struct signal_struct *sig)
+{
+ struct sighand_struct *sighand = current->sighand;
+
+ spin_lock(&sighand->siglock);
+ sig->autogroup = autogroup_kref_get(current->signal->autogroup);
+ spin_unlock(&sighand->siglock);
+}
+
+void sched_autogroup_exit(struct signal_struct *sig)
+{
+ autogroup_kref_put(sig->autogroup);
+}
+
+static int __init setup_autogroup(char *str)
+{
+ sysctl_sched_autogroup_enabled = 0;
+
+ return 1;
+}
+
+__setup("noautogroup", setup_autogroup);
+
+#ifdef CONFIG_PROC_FS
+
+static inline struct autogroup *autogroup_get(struct task_struct *p)
+{
+ struct autogroup *ag;
+
+ /* task may be moved after we unlock.. tough */
+ spin_lock(&p->sighand->siglock);
+ ag = autogroup_kref_get(p->signal->autogroup);
+ spin_unlock(&p->sighand->siglock);
+
+ return ag;
+}
+
+int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
+{
+ static unsigned long next = INITIAL_JIFFIES;
+ struct autogroup *ag;
+ int err;
+
+ if (*nice < -20 || *nice > 19)
+ return -EINVAL;
+
+ err = security_task_setnice(current, *nice);
+ if (err)
+ return err;
+
+ if (*nice < 0 && !can_nice(current, *nice))
+ return -EPERM;
+
+ /* this is a heavy operation taking global locks.. */
+ if (!capable(CAP_SYS_ADMIN) && time_before(jiffies, next))
+ return -EAGAIN;
+
+ next = HZ / 10 + jiffies;
+ ag = autogroup_get(p);
+
+ down_write(&ag->lock);
+ err = sched_group_set_shares(ag->tg, prio_to_weight[*nice + 20]);
+ if (!err)
+ ag->nice = *nice;
+ up_write(&ag->lock);
+
+ autogroup_kref_put(ag);
+
+ return err;
+}
+
+void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m)
+{
+ struct autogroup *ag = autogroup_get(p);
+
+ down_read(&ag->lock);
+ seq_printf(m, "/autogroup-%ld nice %d\n", ag->id, ag->nice);
+ up_read(&ag->lock);
+
+ autogroup_kref_put(ag);
+}
+#endif /* CONFIG_PROC_FS */
+
+#ifdef CONFIG_SCHED_DEBUG
+static inline int autogroup_path(struct task_group *tg, char *buf, int buflen)
+{
+ return snprintf(buf, buflen, "%s-%ld", "/autogroup", tg->autogroup->id);
+}
+#endif /* CONFIG_SCHED_DEBUG */
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
diff --git a/kernel/sched_autogroup.h b/kernel/sched_autogroup.h
new file mode 100644
index 0000000..40deaef
--- /dev/null
+++ b/kernel/sched_autogroup.h
@@ -0,0 +1,23 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg);
+
+#else /* !CONFIG_SCHED_AUTOGROUP */
+
+static inline void autogroup_init(struct task_struct *init_task) { }
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ return tg;
+}
+
+#ifdef CONFIG_SCHED_DEBUG
+static inline int autogroup_path(struct task_group *tg, char *buf, int buflen)
+{
+ return 0;
+}
+#endif
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index e6590e7..3e5b067 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -87,6 +87,20 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu,
}
#endif

+#if defined(CONFIG_CGROUP_SCHED) && \
+ (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
+static void task_group_path(struct task_group *tg, char *buf, int buflen)
+{
+ /* may be NULL if the underlying cgroup isn't fully-created yet */
+ if (!tg->css.cgroup) {
+ if (!autogroup_path(tg, buf, buflen))
+ buf[0] = '\0';
+ return;
+ }
+ cgroup_path(tg->css.cgroup, buf, buflen);
+}
+#endif
+
static void
print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
{
@@ -115,7 +129,7 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
char path[64];

rcu_read_lock();
- cgroup_path(task_group(p)->css.cgroup, path, sizeof(path));
+ task_group_path(task_group(p), path, sizeof(path));
rcu_read_unlock();
SEQ_printf(m, " %s", path);
}
@@ -147,19 +161,6 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
read_unlock_irqrestore(&tasklist_lock, flags);
}

-#if defined(CONFIG_CGROUP_SCHED) && \
- (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
-static void task_group_path(struct task_group *tg, char *buf, int buflen)
-{
- /* may be NULL if the underlying cgroup isn't fully-created yet */
- if (!tg->css.cgroup) {
- buf[0] = '\0';
- return;
- }
- cgroup_path(tg->css.cgroup, buf, buflen);
-}
-#endif
-
void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{
s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
diff --git a/kernel/sys.c b/kernel/sys.c
index 7f5a0cd..2745dcd 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1080,8 +1080,10 @@ SYSCALL_DEFINE0(setsid)
err = session;
out:
write_unlock_irq(&tasklist_lock);
- if (err > 0)
+ if (err > 0) {
proc_sid_connector(group_leader);
+ sched_autogroup_create_attach(group_leader);
+ }
return err;
}

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9b520d7..eb4b493 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -370,6 +370,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SCHED_AUTOGROUP
+ {
+ .procname = "sched_autogroup_enabled",
+ .data = &sysctl_sched_autogroup_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",

2010-11-21 15:51:15

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On 11/21, Ingo Molnar wrote:
>
> Btw., there's a small cleanup in the patch that i picked up (see below), and i also
> edited the commit log a bit - so you might want to pick up the version below.

I didn't read the patch in detail, but a couple of nits...

> +static void
> +autogroup_move_group(struct task_struct *p, struct autogroup *ag)
> +{
> + struct autogroup *prev;
> + struct task_struct *t;
> +
> + spin_lock(&p->sighand->siglock);

This needs spin_lock_irq(), ->siglock is irq-safe.

The same for other lockers, but:

> +static inline struct autogroup *autogroup_get(struct task_struct *p)
> +{
> + struct autogroup *ag;
> +
> + /* task may be moved after we unlock.. tough */
> + spin_lock(&p->sighand->siglock);

This is called by fs/proc. In this case nothing protects us from
release_task(), we can hit ->siglock == NULL (or we can race with
exec which changes ->sighand in theory).

This needs lock_task_sighand() (it can fail). Perhaps something
else has the same problem...

If the task is current and it is not exiting, or it is the new
child (sched_autogroup_fork), then it is safe to use ->siglock
directly.
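
Something like this, perhaps (a rough sketch only, not tested; the
autogroup_default fallback mirrors the one sched_autogroup_detach()
already uses):

static inline struct autogroup *autogroup_get(struct task_struct *p)
{
	struct autogroup *ag;
	unsigned long flags;

	/*
	 * lock_task_sighand() takes ->siglock irq-safely and fails once
	 * the task has passed through release_task(); fall back to the
	 * default group instead of chasing a NULL ->sighand.
	 */
	if (!lock_task_sighand(p, &flags))
		return autogroup_kref_get(&autogroup_default);

	ag = autogroup_kref_get(p->signal->autogroup);
	unlock_task_sighand(p, &flags);

	return ag;
}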

And a pure cosmetic nit,

> +void sched_autogroup_fork(struct signal_struct *sig)
> +{
> + struct sighand_struct *sighand = current->sighand;
> +
> + spin_lock(&sighand->siglock);
> + sig->autogroup = autogroup_kref_get(current->signal->autogroup);
> + spin_unlock(&sighand->siglock);
> +}

This looks a bit confusing. We do not need current->sighand->siglock
to set sig->autogroup. Nobody except us can see this new signal_struct,
and in any case current->sighand->siglock can't help.

It is needed for autogroup_kref_get(), but we already have autogroup_get().
I'd suggest

void sched_autogroup_fork(struct signal_struct *sig)
{
sig->autogroup = autogroup_get(current);
}

Oleg.

2010-11-21 16:16:07

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sun, 2010-11-21 at 14:37 +0100, Ingo Molnar wrote:
> Hello Mike,
>
> * Mike Galbraith <[email protected]> wrote:
>
> > On Tue, 2010-11-16 at 18:28 +0100, Ingo Molnar wrote:
> >
> > > Mike,
> > >
> > > Mind sending a new patch with a separate v2 announcement in a new thread, once you
> > > have something i could apply to the scheduler tree (for a v2.6.38 merge)?
> >
> > Changes since last:
> > - switch to per session vs tty
> > - make autogroups visible in /proc/sched_debug
> > - make autogroups visible in /proc/<pid>/autogroup
> > - add nice level bandwidth tweakability to /proc/<pid>/autogroup
>
> I tested it a bit, and autosched-v4 crashes on bootup with attached config.

Oh crud. I ran 37, but not tip; it's toxic with my own config. So much
for the darn thing being ready :(

-Mike

2010-11-21 16:36:09

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sun, 2010-11-21 at 16:44 +0100, Oleg Nesterov wrote:
> On 11/21, Ingo Molnar wrote:
> >
> > Btw., there's a small cleanup in the patch that i picked up (see below), and i also
> > edited the commit log a bit - so you might want to pick up the version below.
>
> I didn't read the patch in detail, but a couple of nits...
>
> > +static void
> > +autogroup_move_group(struct task_struct *p, struct autogroup *ag)
> > +{
> > + struct autogroup *prev;
> > + struct task_struct *t;
> > +
> > + spin_lock(&p->sighand->siglock);
>
> This needs spin_lock_irq(), ->siglock is irq-safe.

Ok.

> The same for other lockers, but:
>
> > +static inline struct autogroup *autogroup_get(struct task_struct *p)
> > +{
> > + struct autogroup *ag;
> > +
> > + /* task may be moved after we unlock.. tough */
> > + spin_lock(&p->sighand->siglock);
>
> This is called by fs/proc. In this case nothing protects us from
> release_task(), we can hit ->siglock == NULL (or we can race with
> exec which changes ->sighand in theory).

Oh my. Thanks.

(gad, signal_struct is sooo much harder to get right)

> This needs lock_task_sighand() (it can fail). Perhaps something
> else has the same problem...
>
> If the task is current and it is not exiting, or it is the new
> child (sched_autogroup_fork), then it is safe to use ->siglock
> directly.

Ok,

> And a pure cosmetic nit,
>
> > +void sched_autogroup_fork(struct signal_struct *sig)
> > +{
> > + struct sighand_struct *sighand = current->sighand;
> > +
> > + spin_lock(&sighand->siglock);
> > + sig->autogroup = autogroup_kref_get(current->signal->autogroup);
> > + spin_unlock(&sighand->siglock);
> > +}
>
> This looks a bit confusing. We do not need current->sighand->siglock
> to set sig->autogroup. Nobody except us can see this new signal_struct,
> and in any case current->sighand->siglock can't help.
>
> It is needed for autogroup_kref_get(), but we already have autogroup_get().
> I'd suggest
>
> void sched_autogroup_fork(struct signal_struct *sig)
> {
> sig->autogroup = autogroup_get(current);
> }

I'll do that.. once the thing is non-toxic to tip.

Thanks a lot Oleg.

-Mike

2010-11-21 18:43:40

by Gene Heskett

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sunday, November 21, 2010, Ingo Molnar wrote:
>Hello Mike,
>
>* Mike Galbraith <[email protected]> wrote:
>> On Tue, 2010-11-16 at 18:28 +0100, Ingo Molnar wrote:
>> > Mike,
>> >
>> > Mind sending a new patch with a separate v2 announcement in a new
>> > thread, once you have something i could apply to the scheduler tree
>> > (for a v2.6.38 merge)?
>>
>> Changes since last:
>> - switch to per session vs tty
>> - make autogroups visible in /proc/sched_debug
>> - make autogroups visible in /proc/<pid>/autogroup
>> - add nice level bandwidth tweakability to /proc/<pid>/autogroup
>
>I tested it a bit, and autosched-v4 crashes on bootup with attached
>config.
>
>Note: the box has serial logging enabled and there's UART code in the
>stacktrace - maybe it's related. Let me know if you need the full bootup
>log.
>
>Thanks,
>
> Ingo
>
>[FAILED]
>Enabling local filesystem quotas: [ OK ]
>PPS event at 4294886381
>Enabling /etc/fstab swaps: swapon: /dev/hda2: Function not implemented
>[FAILED]
>INIT: Entering runleveBUG: unable to handle kernel paging request at
>f548604c IP:l: 3 [<c10307f0>] update_cfs_shares+0x60/0x160
>*pdpt = 0000000002017001 *pde = 00000000029d4067 *pte = 8000000035486160
>Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>last sysfs file: /sys/block/sr0/dev
>
>Pid: 1, comm: init Not tainted 2.6.37-rc2-tip+ #64308 A8N-E/System
>Product Name EIP: 0060:[<c10307f0>] EFLAGS: 00010086 CPU: 1
>EIP is at update_cfs_shares+0x60/0x160
>EAX: fffffffe EBX: f547603b ECX: 00000400 EDX: 00000002
>ESI: f5486000 EDI: 0000013b EBP: f6459d48 ESP: f6459d3c
> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
>Process init (pid: 1, ti=f6458000 task=f6450000 task.ti=f6458000)
>Stack:
> f5475a80 f6f066c0 00000004 f6459d84 c103256f 00000002 00000001 00000000
> c10324d0 c200e6c0 00000001 f6f06b34 00000046 f5475a80 f5475ac8 f6f066c0
> 00000001 ffffffff f6459dfc c1b32820 f64a0010 f6459dc4 00000046 00000000
>Call Trace:
> [<c103256f>] update_shares+0x9f/0x170
> [<c10324d0>] ? update_shares+0x0/0x170
> [<c1b32820>] schedule+0x580/0x9d0
> [<c1039335>] ? sub_preempt_count+0xa5/0xe0
> [<c1b330e5>] schedule_timeout+0x125/0x2a0
> [<c104fe60>] ? process_timeout+0x0/0x10
> [<c15aef4f>] uart_close+0x17f/0x350
> [<c105fea0>] ? autoremove_wake_function+0x0/0x50
> [<c1471f72>] tty_release+0x102/0x500
> [<c1125fdf>] ? locks_remove_posix+0xf/0xa0
> [<c1119a43>] ? fsnotify+0x1e3/0x2f0
> [<c11198d3>] ? fsnotify+0x73/0x2f0
> [<c10ea1e1>] fput+0xb1/0x230
> [<c10e7e7e>] filp_close+0x4e/0x70
> [<c10e7f14>] sys_close+0x74/0xc0
> [<c1002b90>] sysenter_do_call+0x12/0x31
>Code: 00 00 00 8b 18 8b 79 1c 8b 49 18 2b b8 84 00 00 00 01 d3 89 d8 0f
>af c1 01 fb 74 07 89 c2 c1 fa 1f f7 fb 83 f8 02 ba 02 00 00 00 <8b> 5e
>4c 0f 4d d0 39 d1 0f 42 d1 8b 4e 1c 85 c9 0f 84 6a 00 00 EIP:
>[<c10307f0>] update_cfs_shares+0x60/0x160 SS:ESP 0068:f6459d3c CR2:
>00000000f548604c
>---[ end trace f0ad48f53e29a8fe ]---
>Kernel panic - not syncing: Fatal exception
>Pid: 1, comm: init Tainted: G D 2.6.37-rc2-tip+ #64308
>Call Trace:
> [<c1b31ef1>] ? panic+0x66/0x15c
> [<c10065c3>] ? oops_end+0x83/0x90
> [<c10220fc>] ? no_context+0xbc/0x190
> [<c102225d>] ? __bad_area_nosemaphore+0x8d/0x130
> [<c10219a4>] ? vmalloc_fault+0x14/0x1c0
> [<c1021b64>] ? spurious_fault+0x14/0x110
> [<c1022317>] ? bad_area_nosemaphore+0x17/0x20
> [<c1022741>] ? do_page_fault+0x281/0x4c0
> [<c1008756>] ? native_sched_clock+0x26/0x90
> [<c1066033>] ? sched_clock_local+0xd3/0x1c0
> [<c10224c0>] ? do_page_fault+0x0/0x4c0
> [<c1b361e2>] ? error_code+0x5a/0x60
> [<c10224c0>] ? do_page_fault+0x0/0x4c0
> [<c10307f0>] ? update_cfs_shares+0x60/0x160
> [<c103256f>] ? update_shares+0x9f/0x170
> [<c10324d0>] ? update_shares+0x0/0x170
> [<c1b32820>] ? schedule+0x580/0x9d0
> [<c1039335>] ? sub_preempt_count+0xa5/0xe0
> [<c1b330e5>] ? schedule_timeout+0x125/0x2a0
> [<c104fe60>] ? process_timeout+0x0/0x10
> [<c15aef4f>] ? uart_close+0x17f/0x350
> [<c105fea0>] ? autoremove_wake_function+0x0/0x50
> [<c1471f72>] ? tty_release+0x102/0x500
> [<c1125fdf>] ? locks_remove_posix+0xf/0xa0
> [<c1119a43>] ? fsnotify+0x1e3/0x2f0
> [<c11198d3>] ? fsnotify+0x73/0x2f0
> [<c10ea1e1>] ? fput+0xb1/0x230
> [<c10e7e7e>] ? filp_close+0x4e/0x70
> [<c10e7f14>] ? sys_close+0x74/0xc0
> [<c1002b90>] ? sysenter_do_call+0x12/0x31
>Rebooting in 1 seconds..Press any key to enter the menu

And just 2 hours ago I got it working on 2.6.36.1 (rc1), but had to learn and
add to my 'makeit' script before I could make X work again. Yeah, I'm a
bad bad boy, I run the latest nvidia drivers. A tail on the syslog is
clean (so far anyway, uptime is 2:06).

So you can have (FWIW) my Reviewed-by: Gene Heskett

These patches are a definite keeper IMNSHO.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
On the whole, I'd rather be in Philadelphia.
-- W.C. Fields' epitaph

2010-11-22 06:16:32

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

* Linus Torvalds <[email protected]> [2010-11-16 10:49:22]:

> That's the point. We can push out the kernel change, and everything
> will "just work". We can make that feature we already have in the
> kernel actually be _useful_.
>

This is absolutely fine as long as everyone wants the feature; "just
works" for us as kernel developers might not be the same as "just
works" for all end users. How does one find and disable this feature
if one is not happy? I don't think it is very hard; it could be a
simple tool that the distro provides, or documentation. But being
hidden works both ways, IMHO. In summary, we need tooling with good
defaults.

--
Three Cheers,
Balbir

2010-11-22 06:22:54

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

* Peter Zijlstra <[email protected]> [2010-11-19 12:49:36]:

> On Fri, 2010-11-19 at 00:43 +0100, Samuel Thibault wrote:
> > What overhead? The implementation of cgroups is actually already
> > hierarchical.
>
> It must be nice to be that ignorant ;-) Speaking for the scheduler
> cgroup controller (that being the only one I actually know), most all
> the load-balance operations are O(n) in the number of active cgroups,
> and a lot of the cpu local schedule operations are O(d) where d is the
> depth of the cgroup tree.
>
> [ and that's with the .38 targeted code, current mainline is O(n ln(n))
> for load balancing and truly sucks on multi-socket ]
>

I can say that for memory, with hierarchies we account all the way up,
which can be a visible overhead, depending on how often you fault.


--
Three Cheers,
Balbir

2010-11-22 06:24:48

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

* Lennart Poettering <[email protected]> [2010-11-20 16:41:01]:

> On Sat, 20.11.10 09:55, Balbir Singh ([email protected]) wrote:
>
> > > However, I am not sure I like the idea of having pollable files like that,
> > > because in the systemd case I am very much interested in getting
> > > recursive notifications, i.e. I want to register once for getting
> > > notifications for a full subtree instead of having to register for each
> > > cgroup individually.
> > >
> > > My personal favourite solution would be to get a netlink msg when a
> > > cgroup runs empty. That way multiple programs could listen to the events
> > > at the same time, and we'd have an easy way to subscribe to a whole
> > > hierarchy of groups.
> >
> > The netlink message should not be hard to do if we agree to work on
> > it. The largest objections I've heard is that netlink implies
> > network programming and most users want to be able to script in
> > their automation and network scripting is hard.
>
> Well, the notify_on_release stuff cannot be dropped anyway at this point
> in time, so netlink support would be an addition to, not a replacement for
> the current stuff that might be useful for scripting folks.

Agreed, we still need the old notify_on_release. Are you suggesting
that for scripting we use the old interface and newer tools use
netlink?

--
Three Cheers,
Balbir

2010-11-22 19:21:31

by Lennart Poettering

[permalink] [raw]
Subject: Re: [RFC/RFT PATCH v3] sched: automated per tty task groups

On Mon, 22.11.10 11:54, Balbir Singh ([email protected]) wrote:

>
> * Lennart Poettering <[email protected]> [2010-11-20 16:41:01]:
>
> > On Sat, 20.11.10 09:55, Balbir Singh ([email protected]) wrote:
> >
> > > > However, I am not sure I like the idea of having pollable files like that,
> > > > because in the systemd case I am very much interested in getting
> > > > recursive notifications, i.e. I want to register once for getting
> > > > notifications for a full subtree instead of having to register for each
> > > > cgroup individually.
> > > >
> > > > My personal favourite solution would be to get a netlink msg when a
> > > > cgroup runs empty. That way multiple programs could listen to the events
> > > > at the same time, and we'd have an easy way to subscribe to a whole
> > > > hierarchy of groups.
> > >
> > > The netlink message should not be hard to do if we agree to work on
> > > it. The largest objections I've heard is that netlink implies
> > > network programming and most users want to be able to script in
> > > their automation and network scripting is hard.
> >
> > Well, the notify_on_release stuff cannot be dropped anyway at this point
> > in time, so netlink support would be an addition to, not a replacement for
> > the current stuff that might be useful for scripting folks.
>
> Agreed, we still need the old notify_on_release. Are you suggesting
> that for scripting we use the old interface and newer tools use
> netlink?

No, on the contrary. I was referring to "the current stuff that might be
useful for scripting folks". The netlink stuff would then be for
everything beyond scripting, but not scripting itself.

Lennart

--
Lennart Poettering - Red Hat, Inc.

2010-11-25 16:00:47

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sun, 2010-11-21 at 14:37 +0100, Ingo Molnar wrote:

> I tested it a bit, and autosched-v4 crashes on bootup with attached config.

Hah. Took a while between vacation activities and my wimpy little
memory ordering knowledge, but I've finally got it fingered out. Tip's
update_shares() changes exposed previously invisible (to me) memory
ordering woes nicely it seems.

My vacation is (sniff) over, so I won't get a fully tested patch out the
door for review until I get back home.

-Mike

2010-11-28 14:25:11

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Thu, 2010-11-25 at 09:00 -0700, Mike Galbraith wrote:

> My vacation is (sniff) over, so I won't get a fully tested patch out the
> door for review until I get back home.

Either I forgot to pack my eyeballs, or laptop is just too dinky and
annoying. Now back home on beloved box, this little bugger poked me
dead in the eye.

Something else is seriously wrong though. 36.1 with attached (plus
sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
37.git and tip with fixlet below both suck rocks. With a make -j40
running, wakeup-latency is showing latencies of >100ms, amarok skips,
mouse lurches badly.. generally horrid. Something went south.

sched: fix 3d4b47b4 typo.

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
LKML-Reference: new submission
---
kernel/sched.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -8087,7 +8087,6 @@ static inline void unregister_fair_sched
{
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
- int i;

/*
* Only empty task groups can be destroyed; so we can speculatively
@@ -8097,7 +8096,7 @@ static inline void unregister_fair_sched
return;

raw_spin_lock_irqsave(&rq->lock, flags);
- list_del_leaf_cfs_rq(tg->cfs_rq[i]);
+ list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
raw_spin_unlock_irqrestore(&rq->lock, flags);
}
#else /* !CONFIG_FAIR_GROUP_SCHED */


Attachments:
sched_autogroup.36.1.diff (19.45 kB)

2010-11-28 19:32:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sun, Nov 28, 2010 at 6:24 AM, Mike Galbraith <[email protected]> wrote:
>
> Something else is seriously wrong though. 36.1 with attached (plus
> sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
> 37.git and tip with fixlet below both suck rocks. With a make -j40
> running, wakeup-latency is showing latencies of >100ms, amarok skips,
> mouse lurches badly.. generally horrid. Something went south.

Can you test -rc3? Is that still ok? And are you perhaps using
Nouveau? There's a report of some graphics (?) regression since -rc3
about bad desktop performance:

https://bugzilla.kernel.org/show_bug.cgi?id=23912

but it doesn't have any more information yet (so if -rc3 _is_ good for
you, and you can add anything to that report, it would be good. The
original reporter is hopefully bisecting it now)

Linus

2010-11-28 20:19:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups


* Linus Torvalds <[email protected]> wrote:

> On Sun, Nov 28, 2010 at 6:24 AM, Mike Galbraith <[email protected]> wrote:
> >
> > Something else is seriously wrong though. 36.1 with attached (plus
> > sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
> > 37.git and tip with fixlet below both suck rocks. With a make -j40
> > running, wakeup-latency is showing latencies of >100ms, amarok skips,
> > mouse lurches badly.. generally horrid. Something went south.
>
> Can you test -rc3? Is that still ok? And are you perhaps using
> Nouveau? There's a report of some graphics (?) regression since -rc3
> about bad desktop performance:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=23912
>
> but it doesn't have any more information yet (so if -rc3 _is_ good for
> you, and you can add anything to that report, it would be good. The
> original reporter is hopefully bisecting it now)

Mike, the last pure -rc3 -tip commit is 92c883adf03b - you could try to check that
out too: it has most of the current sched/core commits, but has none of the post-rc3
DRM changes.

Thanks,

Ingo

2010-11-29 05:45:35

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sun, 2010-11-28 at 11:31 -0800, Linus Torvalds wrote:
> On Sun, Nov 28, 2010 at 6:24 AM, Mike Galbraith <[email protected]> wrote:
> >
> > Something else is seriously wrong though. 36.1 with attached (plus
> > sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
> > 37.git and tip with fixlet below both suck rocks. With a make -j40
> > running, wakeup-latency is showing latencies of >100ms, amarok skips,
> > mouse lurches badly.. generally horrid. Something went south.
>
> Can you test -rc3? Is that still ok? And are you perhaps using
> Nouveau? There's a report of some graphics (?) regression since -rc3
> about bad desktop performance:

No Nouveau here, plain old boring nv.

> https://bugzilla.kernel.org/show_bug.cgi?id=23912
>
> but it doesn't have any more information yet (so if -rc3 _is_ good for
> you, and you can add anything to that report, it would be good. The
> original reporter is hopefully bisecting it now)

I'll hunt as soon as I can (inbox runneth over).

-Mike

2010-11-29 11:53:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sun, 2010-11-28 at 21:18 +0100, Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>
> > On Sun, Nov 28, 2010 at 6:24 AM, Mike Galbraith <[email protected]> wrote:
> > >
> > > Something else is seriously wrong though. 36.1 with attached (plus
> > > sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
> > > 37.git and tip with fixlet below both suck rocks. With a make -j40
> > > running, wakeup-latency is showing latencies of >100ms, amarok skips,
> > > mouse lurches badly.. generally horrid. Something went south.
> >
> > Can you test -rc3? Is that still ok? And are you perhaps using
> > Nouveau? There's a report of some graphics (?) regression since -rc3
> > about bad desktop performance:
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=23912
> >
> > but it doesn't have any more information yet (so if -rc3 _is_ good for
> > you, and you can add anything to that report, it would be good. The
> > original reporter is hopefully bisecting it now)
>
> Mike, the last pure -rc3 -tip commit is 92c883adf03b - you could try to check that
> out too: it has most of the current sched/core commits, but has none of the post-rc3
> DRM changes.

Well we totally re-wrote the cgroup load-balancer in -tip. The thing
currently in -linus is utter crap because it's very strongly serialized
across all cores (some people spend like 25% of their time in there).

2010-11-29 12:31:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups


* Peter Zijlstra <[email protected]> wrote:

> On Sun, 2010-11-28 at 21:18 +0100, Ingo Molnar wrote:
> > * Linus Torvalds <[email protected]> wrote:
> >
> > > On Sun, Nov 28, 2010 at 6:24 AM, Mike Galbraith <[email protected]> wrote:
> > > >
> > > > Something else is seriously wrong though. 36.1 with attached (plus
> > > > sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
> > > > 37.git and tip with fixlet below both suck rocks. With a make -j40
> > > > running, wakeup-latency is showing latencies of >100ms, amarok skips,
> > > > mouse lurches badly.. generally horrid. Something went south.
> > >
> > > Can you test -rc3? Is that still ok? And are you perhaps using
> > > Nouveau? There's a report of some graphics (?) regression since -rc3
> > > about bad desktop performance:
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=23912
> > >
> > > but it doesn't have any more information yet (so if -rc3 _is_ good for
> > > you, and you can add anything to that report, it would be good. The
> > > original reporter is hopefully bisecting it now)
> >
> > Mike, the last pure -rc3 -tip commit is 92c883adf03b - you could try to check that
> > out too: it has most of the current sched/core commits, but has none of the post-rc3
> > DRM changes.
>

> Well we totally re-wrote the cgroup load-balancer in -tip. The thing currently in
> -linus is utter crap because it's very strongly serialized across all cores (some
> people spend like 25% of their time in there).

Yes, 92c883adf03b includes those changes:

08f3c3065f4c: Merge branch 'sched/core'
9437178f623a: sched: Update tg->shares after cpu.shares write
d6b5591829bd: sched: Allow update_cfs_load() to update global load
3b3d190ec368: sched: Implement demand based update_cfs_load()
c66eaf619c0c: sched: Update shares on idle_balance
a7a4f8a752ec: sched: Add sysctl_sched_shares_window
67e86250f8ea: sched: Introduce hierarchal order on shares update list
e33078baa4d3: sched: Fix update_cfs_load() synchronization
f0d7442a5924: sched: Fix load corruption from update_cfs_shares()
9e3081ca6114: sched: Make tg_shares_up() walk on-demand
3d4b47b4b040: sched: Implement on-demand (active) cfs_rq list
2069dd75c7d0: sched: Rewrite tg_shares_up)
48c5ccae88dc: sched: Simplify cpu-hot-unplug task migration
92fd4d4d67b9: Merge commit 'v2.6.37-rc2' into sched/core

I just wanted to give Mike a known-stable sha1 that has -rc3 but not the post-rc3
DRM changes.

Thanks,

Ingo

2010-11-29 13:45:18

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Mon, 2010-11-29 at 13:30 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Sun, 2010-11-28 at 21:18 +0100, Ingo Molnar wrote:
> > > * Linus Torvalds <[email protected]> wrote:
> > >
> > > > On Sun, Nov 28, 2010 at 6:24 AM, Mike Galbraith <[email protected]> wrote:
> > > > >
> > > > > Something else is seriously wrong though. 36.1 with attached (plus
> > > > > sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
> > > > > 37.git and tip with fixlet below both suck rocks. With a make -j40
> > > > > running, wakeup-latency is showing latencies of >100ms, amarok skips,
> > > > > mouse lurches badly.. generally horrid. Something went south.
> > > >
> > > > Can you test -rc3? Is that still ok? And are you perhaps using
> > > > Nouveau? There's a report of some graphics (?) regression since -rc3
> > > > about bad desktop performance:
> > > >
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=23912
> > > >
> > > > but it doesn't have any more information yet (so if -rc3 _is_ good for
> > > > you, and you can add anything to that report, it would be good. The
> > > > original reporter is hopefully bisecting it now)
> > >
> > > Mike, the last pure -rc3 -tip commit is 92c883adf03b - you could try to check that
> > > out too: it has most of the current sched/core commits, but has none of the post-rc3
> > > DRM changes.
> >
>
> > Well we totally re-wrote the cgroup load-balancer in -tip. The thing currently in
> > -linus is utter crap because it's very strongly serialized across all cores (some
> > people spend like 25% of their time in there).
>
> Yes, 92c883adf03b includes those changes:
>
> 08f3c3065f4c: Merge branch 'sched/core'
> 9437178f623a: sched: Update tg->shares after cpu.shares write
> d6b5591829bd: sched: Allow update_cfs_load() to update global load
> 3b3d190ec368: sched: Implement demand based update_cfs_load()
> c66eaf619c0c: sched: Update shares on idle_balance
> a7a4f8a752ec: sched: Add sysctl_sched_shares_window
> 67e86250f8ea: sched: Introduce hierarchal order on shares update list
> e33078baa4d3: sched: Fix update_cfs_load() synchronization
> f0d7442a5924: sched: Fix load corruption from update_cfs_shares()
> 9e3081ca6114: sched: Make tg_shares_up() walk on-demand
> 3d4b47b4b040: sched: Implement on-demand (active) cfs_rq list
> 2069dd75c7d0: sched: Rewrite tg_shares_up)
> 48c5ccae88dc: sched: Simplify cpu-hot-unplug task migration
> 92fd4d4d67b9: Merge commit 'v2.6.37-rc2' into sched/core
>
> I just wanted to give Mike a known-stable sha1 that has -rc3 but not the post-rc3
> DRM changes.

The good news is that the 37.git kernel was mislabeled in grub, was also
booting the _tip_ kernel, and is actually just fine. It's only tip, and
tip 92fd4d4d67b9 is still bad. I'll try a quick bisect before getting
back to backlog.

-Mike

2010-11-29 13:48:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups


* Mike Galbraith <[email protected]> wrote:

> On Mon, 2010-11-29 at 13:30 +0100, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Sun, 2010-11-28 at 21:18 +0100, Ingo Molnar wrote:
> > > > * Linus Torvalds <[email protected]> wrote:
> > > >
> > > > > On Sun, Nov 28, 2010 at 6:24 AM, Mike Galbraith <[email protected]> wrote:
> > > > > >
> > > > > > Something else is seriously wrong though. 36.1 with attached (plus
> > > > > > sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
> > > > > > 37.git and tip with fixlet below both suck rocks. With a make -j40
> > > > > > running, wakeup-latency is showing latencies of >100ms, amarok skips,
> > > > > > mouse lurches badly.. generally horrid. Something went south.
> > > > >
> > > > > Can you test -rc3? Is that still ok? And are you perhaps using
> > > > > Nouveau? There's a report of some graphics (?) regression since -rc3
> > > > > about bad desktop performance:
> > > > >
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=23912
> > > > >
> > > > > but it doesn't have any more information yet (so if -rc3 _is_ good for
> > > > > you, and you can add anything to that report, it would be good. The
> > > > > original reporter is hopefully bisecting it now)
> > > >
> > > > Mike, the last pure -rc3 -tip commit is 92c883adf03b - you could try to check that
> > > > out too: it has most of the current sched/core commits, but has none of the post-rc3
> > > > DRM changes.
> > >
> >
> > > Well we totally re-wrote the cgroup load-balancer in -tip. The thing currently in
> > > -linus is utter crap because it's very strongly serialized across all cores (some
> > > people spend like 25% of their time in there).
> >
> > Yes, 92c883adf03b includes those changes:
> >
> > 08f3c3065f4c: Merge branch 'sched/core'
> > 9437178f623a: sched: Update tg->shares after cpu.shares write
> > d6b5591829bd: sched: Allow update_cfs_load() to update global load
> > 3b3d190ec368: sched: Implement demand based update_cfs_load()
> > c66eaf619c0c: sched: Update shares on idle_balance
> > a7a4f8a752ec: sched: Add sysctl_sched_shares_window
> > 67e86250f8ea: sched: Introduce hierarchal order on shares update list
> > e33078baa4d3: sched: Fix update_cfs_load() synchronization
> > f0d7442a5924: sched: Fix load corruption from update_cfs_shares()
> > 9e3081ca6114: sched: Make tg_shares_up() walk on-demand
> > 3d4b47b4b040: sched: Implement on-demand (active) cfs_rq list
> > 2069dd75c7d0: sched: Rewrite tg_shares_up)
> > 48c5ccae88dc: sched: Simplify cpu-hot-unplug task migration
> > 92fd4d4d67b9: Merge commit 'v2.6.37-rc2' into sched/core
> >
> > I just wanted to give Mike a known-stable sha1 that has -rc3 but not the post-rc3
> > DRM changes.
>
> The good news is that the 37.git kernel was mislabeled in grub, was also
> booting the _tip_ kernel, and is actually just fine. It's only tip, and
> tip 92fd4d4d67b9 is still bad. I'll try a quick bisect before getting
> back to backlog.

Just curious, what's the freshest still good -tip sha1?

Thanks,

Ingo

2010-11-29 14:04:33

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Mon, 2010-11-29 at 14:47 +0100, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:

> > The good news is that the 37.git kernel was mislabeled in grub, was also
> > booting the _tip_ kernel, and is actually just fine. It's only tip, and
> > tip 92fd4d4d67b9 is still bad. I'll try a quick bisect before getting
> > back to backlog.
>
> Just curious, what's the freshest still good -tip sha1?

I don't have a good tip yet. My bisection started with a merge, which
tested bad, so it spat out...

marge:..git/linux-2.6 # git bisect bad
The merge base e53beacd23d9cb47590da6a7a7f6d417b941a994 is bad.
This means the bug has been fixed between e53beacd23d9cb47590da6a7a7f6d417b941a994 and [19650e8580987c0ffabc2fe2cbc16b944789df8b].

marge:..git/linux-2.6 # git bisect log
git bisect start
# good: [19650e8580987c0ffabc2fe2cbc16b944789df8b] Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
git bisect good 19650e8580987c0ffabc2fe2cbc16b944789df8b
# bad: [92fd4d4d67b945c0766416284d4ab236b31542c4] Merge commit 'v2.6.37-rc2' into sched/core
git bisect bad 92fd4d4d67b945c0766416284d4ab236b31542c4
# bad: [e53beacd23d9cb47590da6a7a7f6d417b941a994] Linux 2.6.37-rc2
git bisect bad e53beacd23d9cb47590da6a7a7f6d417b941a994
marge:..git/linux-2.6 #

2010-11-29 16:28:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Mon, Nov 29, 2010 at 3:53 AM, Peter Zijlstra <[email protected]> wrote:
>
> Well we totally re-wrote the cgroup load-balancer in -tip. The thing
> currently in -linus is utter crap because it's very strongly serialized
> across all cores (some people spend like 25% of their time in there).

Well, it seems that the rewrite is more crap than the "utter crap" in
current -git. What does that make -tip? Super-utter-crap?

Peter - getting the wrong answer quickly is not any better than strong
serialization.

Anyway, I'm happy to hear that the problem hasn't reached mainline yet.

Linus

2010-11-29 16:44:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups


* Linus Torvalds <[email protected]> wrote:

> On Mon, Nov 29, 2010 at 3:53 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > Well we totally re-wrote the cgroup load-balancer in -tip. The thing currently
> > in -linus is utter crap because it's very strongly serialized across all cores
> > (some people spend like 25% of their time in there).
>
> Well, it seems that the rewrite is more crap than the "utter crap" in current
> -git. What does that make -tip? Super-utter-crap?

Something along that line, or worse.

> Peter - getting the wrong answer quickly is not any better than strong
> serialization.

Yeah, obviously.

> Anyway, I'm happy to hear that the problem hasn't reached mainline yet.

It wont.

Thanks,

Ingo

2010-11-29 17:38:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Mon, 2010-11-29 at 08:27 -0800, Linus Torvalds wrote:
> On Mon, Nov 29, 2010 at 3:53 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > Well we totally re-wrote the cgroup load-balancer in -tip. The thing
> > currently in -linus is utter crap because it's very strongly serialized
> > across all cores (some people spend like 25% of their time in there).
>
> Well, it seems that the rewrite is more crap than the "utter crap" in
> current -git. What does that make -tip? Super-utter-crap?
>
> Peter - getting the wrong answer quickly is not any better than strong
> serialization.

I know, from the testing so far we _thought_ it was fairly sane.
Apparently there's still some work to do.

2010-11-29 18:03:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups


* Peter Zijlstra <[email protected]> wrote:

> On Mon, 2010-11-29 at 08:27 -0800, Linus Torvalds wrote:
> > On Mon, Nov 29, 2010 at 3:53 AM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > Well we totally re-wrote the cgroup load-balancer in -tip. The thing
> > > currently in -linus is utter crap because it's very strongly serialized
> > > across all cores (some people spend like 25% of their time in there).
> >
> > Well, it seems that the rewrite is more crap than the "utter crap" in
> > current -git. What does that make -tip? Super-utter-crap?
> >
> > Peter - getting the wrong answer quickly is not any better than strong
> > serialization.
>
> I know, from the testing so far we _thought_ it was fairly sane.
> Apparently there's still some work to do.

Btw., i think it shows the conceptual power of Mike's patch that this cgroups
scheduling suckage was exposed so clearly. Previously it took weeks (sometimes
months) for bugs to reach those who are using cgroups.

Thanks,

Ingo

2010-11-29 19:06:11

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Mon, 2010-11-29 at 18:37 +0100, Peter Zijlstra wrote:
> On Mon, 2010-11-29 at 08:27 -0800, Linus Torvalds wrote:
> > On Mon, Nov 29, 2010 at 3:53 AM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > Well we totally re-wrote the cgroup load-balancer in -tip. The thing
> > > currently in -linus is utter crap because it's very strongly serialized
> > > across all cores (some people spend like 25% of their time in there).
> >
> > Well, it seems that the rewrite is more crap than the "utter crap" in
> > current -git. What does that make -tip? Super-utter-crap?
> >
> > Peter - getting the wrong answer quickly is not any better than strong
> > serialization.
>
> I know, from the testing so far we _thought_ it was fairly sane.
> Apparently there's still some work to do.

Damn thing bisected to:

commit 92fd4d4d67b945c0766416284d4ab236b31542c4
Merge: fe7de49 e53beac
Author: Ingo Molnar <[email protected]>
Date: Thu Nov 18 13:22:14 2010 +0100

Merge commit 'v2.6.37-rc2' into sched/core

Merge reason: Move to a .37-rc base.

Signed-off-by: Ingo Molnar <[email protected]>

92fd4d4d67b945c0766416284d4ab236b31542c4 is the first bad commit

git bisect start
# good: [f6f94e2ab1b33f0082ac22d71f66385a60d8157f] Linux 2.6.36
git bisect good f6f94e2ab1b33f0082ac22d71f66385a60d8157f
# bad: [3a2b7f908d45fa45670e8ba9e7e24c0409ba43d8] Merge branch 'linus'
git bisect bad 3a2b7f908d45fa45670e8ba9e7e24c0409ba43d8
# good: [520045db940a381d2bee1c1b2179f7921b40fb10] Merge branches 'upstream/xenfs' and 'upstream/core' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen
git bisect good 520045db940a381d2bee1c1b2179f7921b40fb10
# good: [520045db940a381d2bee1c1b2179f7921b40fb10] Merge branches 'upstream/xenfs' and 'upstream/core' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen
git bisect good 520045db940a381d2bee1c1b2179f7921b40fb10
# good: [120a795da07c9a02221ca23464c28a7c6ad7de1d] audit mmap
git bisect good 120a795da07c9a02221ca23464c28a7c6ad7de1d
# good: [19650e8580987c0ffabc2fe2cbc16b944789df8b] Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
git bisect good 19650e8580987c0ffabc2fe2cbc16b944789df8b
# good: [11259d65a61b84ad954953a194c41fe84dff889a] Merge branch 'out-of-tree'
git bisect good 11259d65a61b84ad954953a194c41fe84dff889a
# good: [eae0932ceba16e7ee0b5690455a13ef8364845da] Merge branch 'x86/mm'
git bisect good eae0932ceba16e7ee0b5690455a13ef8364845da
# good: [0464a38aaca10e1a8afed003d16d25dca2168d86] Merge branch 'sched/urgent'
git bisect good 0464a38aaca10e1a8afed003d16d25dca2168d86
# good: [22d1b202a8d0e1dedc35086b8f3df0a7b37d1371] Merge branch 'x86/urgent'
git bisect good 22d1b202a8d0e1dedc35086b8f3df0a7b37d1371
# bad: [282810f891cf6587dfc04fc5e26ec7772330c8cb] Merge branch 'sched/core'
git bisect bad 282810f891cf6587dfc04fc5e26ec7772330c8cb
# bad: [2932e532dd8fbd699ce072a4badc7fbe69451be6] Merge branch 'out-of-tree'
git bisect bad 2932e532dd8fbd699ce072a4badc7fbe69451be6
# bad: [d6b5591829bd348a5fbe1c428d28dea00621cdba] sched: Allow update_cfs_load() to update global load
git bisect bad d6b5591829bd348a5fbe1c428d28dea00621cdba
# bad: [f0d7442a5924a802b66eef79b3708f77297bfb35] sched: Fix load corruption from update_cfs_shares()
git bisect bad f0d7442a5924a802b66eef79b3708f77297bfb35
# bad: [2069dd75c7d0f49355939e5586daf5a9ab216db7] sched: Rewrite tg_shares_up)
git bisect bad 2069dd75c7d0f49355939e5586daf5a9ab216db7
# bad: [48c5ccae88dcd989d9de507e8510313c6cbd352b] sched: Simplify cpu-hot-unplug task migration
git bisect bad 48c5ccae88dcd989d9de507e8510313c6cbd352b
# bad: [92fd4d4d67b945c0766416284d4ab236b31542c4] Merge commit 'v2.6.37-rc2' into sched/core
git bisect bad 92fd4d4d67b945c0766416284d4ab236b31542c4

2010-11-29 19:20:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups


* Mike Galbraith <[email protected]> wrote:

> > I know, from the testing so far we _thought_ it was fairly sane. Apparently
> > there's still some work to do.
>
> Damn thing bisected to:
>
> commit 92fd4d4d67b945c0766416284d4ab236b31542c4
> Merge: fe7de49 e53beac
> Author: Ingo Molnar <[email protected]>
> Date: Thu Nov 18 13:22:14 2010 +0100
>
> Merge commit 'v2.6.37-rc2' into sched/core
>
> Merge reason: Move to a .37-rc base.
>
> Signed-off-by: Ingo Molnar <[email protected]>
>
> 92fd4d4d67b945c0766416284d4ab236b31542c4 is the first bad commit

Hm, i'd suggest to double check the two originator points:

e53beac - is it really 'bad' ?
fe7de49 - is it really 'good'?

Thanks,

Ingo

2010-11-30 03:40:03

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Mon, Nov 29, 2010 at 11:20 AM, Ingo Molnar <[email protected]> wrote:
>
> * Mike Galbraith <[email protected]> wrote:
>
>> > I know, from the testing so far we _thought_ it was fairly sane. Apparently
>> > there's still some work to do.
>>
>> Damn thing bisected to:
>>
>> commit 92fd4d4d67b945c0766416284d4ab236b31542c4
>> Merge: fe7de49 e53beac
>> Author: Ingo Molnar <[email protected]>
>> Date: Thu Nov 18 13:22:14 2010 +0100
>>
>> Merge commit 'v2.6.37-rc2' into sched/core
>>
>> Merge reason: Move to a .37-rc base.
>>
>> Signed-off-by: Ingo Molnar <[email protected]>
>>
>> 92fd4d4d67b945c0766416284d4ab236b31542c4 is the first bad commit
>
> Hm, i'd suggest to double check the two originator points:
>
> e53beac - is it really 'bad' ?
> fe7de49 - is it really 'good'?
>
> Thanks,
>
> Ingo
>

https://lkml.org/lkml/2010/11/29/566

Should fix this. We missed this in testing as the delay between
last-task-exit and group destruction was always sufficiently large as
to ensure that the task_group had aged out of shares updates (as
opposed to requiring explicit removal).

With autogroup obviously the window here is essentially instantaneous
which leads to the buggy removal code being executed.
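
Roughly, the timeline is (an illustration pieced together from the oops
and the unregister_fair_sched_group() fixlet earlier in this thread, not
quoted from the fix itself):

	CPU0: last task of session exits     CPU1: load balancer
	---------------------------------    --------------------------------
	put_signal_struct()
	  sched_autogroup_exit()
	    ... sched_destroy_group(tg)
	      unregister_fair_sched_group()
	        list_del_leaf_cfs_rq(...)    update_shares(sd)
	                                       walks the leaf cfs_rq list;
	                                       update_cfs_shares() can still
	                                       see the dying group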

2010-11-30 04:14:20

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Mon, 2010-11-29 at 19:39 -0800, Paul Turner wrote:
> On Mon, Nov 29, 2010 at 11:20 AM, Ingo Molnar <[email protected]> wrote:
> >
> > * Mike Galbraith <[email protected]> wrote:
> >
> >> > I know, from the testing so far we _thought_ it was fairly sane. Apparently
> >> > there's still some work to do.
> >>
> >> Damn thing bisected to:
> >>
> >> commit 92fd4d4d67b945c0766416284d4ab236b31542c4
> >> Merge: fe7de49 e53beac
> >> Author: Ingo Molnar <[email protected]>
> >> Date: Thu Nov 18 13:22:14 2010 +0100
> >>
> >> Merge commit 'v2.6.37-rc2' into sched/core
> >>
> >> Merge reason: Move to a .37-rc base.
> >>
> >> Signed-off-by: Ingo Molnar <[email protected]>
> >>
> >> 92fd4d4d67b945c0766416284d4ab236b31542c4 is the first bad commit
> >
> > Hm, i'd suggest to double check the two originator points:
> >
> > e53beac - is it really 'bad' ?
> > fe7de49 - is it really 'good'?
> >
> > Thanks,
> >
> > Ingo
> >
>
> https://lkml.org/lkml/2010/11/29/566
>
> Should fix this.

No, I had it in place where pertinent. Problem with bisection is that
there are a couple of spots where X doesn't work. With X, it's obvious,
less so in text console. Looks like I must have miscalled one of those.

-Mike

2010-11-30 04:24:21

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Mon, Nov 29, 2010 at 8:14 PM, Mike Galbraith <[email protected]> wrote:
> On Mon, 2010-11-29 at 19:39 -0800, Paul Turner wrote:
>> On Mon, Nov 29, 2010 at 11:20 AM, Ingo Molnar <[email protected]> wrote:
>> >
>> > * Mike Galbraith <[email protected]> wrote:
>> >
>> >> > I know, from the testing so far we _thought_ it was fairly sane. Apparently
>> >> > there's still some work to do.
>> >>
>> >> Damn thing bisected to:
>> >>
>> >> commit 92fd4d4d67b945c0766416284d4ab236b31542c4
>> >> Merge: fe7de49 e53beac
>> >> Author: Ingo Molnar <[email protected]>
>> >> Date: Thu Nov 18 13:22:14 2010 +0100
>> >>
>> >> Merge commit 'v2.6.37-rc2' into sched/core
>> >>
>> >> Merge reason: Move to a .37-rc base.
>> >>
>> >> Signed-off-by: Ingo Molnar <[email protected]>
>> >>
>> >> 92fd4d4d67b945c0766416284d4ab236b31542c4 is the first bad commit
>> >
>> > Hm, i'd suggest to double check the two originator points:
>> >
>> > e53beac - is it really 'bad' ?
>> > fe7de49 - is it really 'good'?
>> >
>> > Thanks,
>> >
>> > Ingo
>> >
>>
>> https://lkml.org/lkml/2010/11/29/566
>>
>> Should fix this.
>
> No, I had it in place where pertinent. Problem with bisection is that
> there are a couple of spots where X doesn't work. With X, it's obvious,
> less so in text console. Looks like I must have miscalled one of those.
>
> -Mike

I've left some machines running tip + fix above + autogroup to see if
anything else emerges. Hasn't crashed yet, I'll leave it going
overnight.


2010-11-30 07:54:13

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Mon, 2010-11-29 at 20:20 +0100, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > > I know, from the testing so far we _thought_ it was fairly sane. Apparently
> > > there's still some work to do.
> >
> > Damn thing bisected to:
> >
> > commit 92fd4d4d67b945c0766416284d4ab236b31542c4
> > Merge: fe7de49 e53beac
> > Author: Ingo Molnar <[email protected]>
> > Date: Thu Nov 18 13:22:14 2010 +0100
> >
> > Merge commit 'v2.6.37-rc2' into sched/core
> >
> > Merge reason: Move to a .37-rc base.
> >
> > Signed-off-by: Ingo Molnar <[email protected]>
> >
> > 92fd4d4d67b945c0766416284d4ab236b31542c4 is the first bad commit
>
> Hm, i'd suggest to double check the two originator points:
>
> e53beac - is it really 'bad' ?
> fe7de49 - is it really 'good'?

Nope. I did a bisection this morning in text mode with a pipe-test
based measurement proggy, and it bisected cleanly.

2069dd75c7d0f49355939e5586daf5a9ab216db7 is the first bad commit

commit 2069dd75c7d0f49355939e5586daf5a9ab216db7
Author: Peter Zijlstra <[email protected]>
Date: Mon Nov 15 15:47:00 2010 -0800

sched: Rewrite tg_shares_up)
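
(The measurement proggy is nothing fancy, just a pipe ping-pong along
these lines; a from-memory sketch, not the actual tool:

	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <time.h>

	int main(void)
	{
		int ping[2], pong[2];
		char byte = 0;
		struct timespec t0, t1;
		long i, iters = 100000;

		if (pipe(ping) || pipe(pong))
			exit(1);

		if (fork() == 0) {
			/* child: echo each byte straight back */
			while (read(ping[0], &byte, 1) == 1)
				write(pong[1], &byte, 1);
			exit(0);
		}

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < iters; i++) {
			write(ping[1], &byte, 1);
			read(pong[0], &byte, 1);
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("%.1f usecs/round trip\n",
		       ((t1.tv_sec - t0.tv_sec) * 1e9 +
			(t1.tv_nsec - t0.tv_nsec)) / 1e3 / iters);
		return 0;
	}

..run that alongside a make -j and compare round trip times across
kernels.)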

-Mike

2010-11-30 13:18:13

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Mon, 2010-11-29 at 20:23 -0800, Paul Turner wrote:

> I've left some machines running tip + fix above + autogroup to see if
> anything else emerges. Hasn't crashed yet, I'll leave it going
> overnight.

Thanks. Below is the hopefully final version against tip. The last I
sent contained a couple remnants.

From: Mike Galbraith <[email protected]>
Date: Tue Nov 30 14:07:12 CET 2010
Subject: [PATCH] sched: Improve desktop interactivity: Implement automated per session task groups

A recurring complaint from CFS users is that parallel kbuild has a negative
impact on desktop interactivity. This patch implements an idea from Linus,
to automatically create task groups. Currently, only per session autogroups
are implemented, but the patch leaves the way open for enhancement.

Implementation: each task's signal struct contains an inherited pointer to
a refcounted autogroup struct containing a task group pointer, the default
for all tasks pointing to the init_task_group. When a task calls setsid(),
a new task group is created, the process is moved into the new task group,
and a reference to the previous task group is dropped. Child processes
inherit this task group thereafter, and increase its refcount. When the
last thread of a process exits, the process's reference is dropped, such
that when the last process referencing an autogroup exits, the autogroup
is destroyed.

At runqueue selection time, IFF a task has no cgroup assignment, its current
autogroup is used.

Autogroup bandwidth is controllable by setting its nice level through the
proc filesystem: cat /proc/<pid>/autogroup displays the task's group and the
group's nice level, and echo <nice level> > /proc/<pid>/autogroup sets the
task group's shares to the weight of a nice <level> task. Setting the nice
level is rate limited for !admin users due to the abuse risk of task group
locking.

The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP=y is
selected, but can be disabled via the boot option noautogroup, and can also
be turned on/off on the fly via..
echo [01] > /proc/sys/kernel/sched_autogroup_enabled.
..which will automatically move tasks to/from the root task group.

Signed-off-by: Mike Galbraith <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Markus Trippelsdorf <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/kernel-parameters.txt | 2
fs/proc/base.c | 79 ++++++++++++
include/linux/sched.h | 23 +++
init/Kconfig | 12 +
kernel/fork.c | 5
kernel/sched.c | 13 +-
kernel/sched_autogroup.c | 229 ++++++++++++++++++++++++++++++++++++
kernel/sched_autogroup.h | 32 +++++
kernel/sched_debug.c | 29 ++--
kernel/sys.c | 4
kernel/sysctl.c | 11 +
11 files changed, 421 insertions(+), 18 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -513,6 +513,8 @@ struct thread_group_cputimer {
spinlock_t lock;
};

+struct autogroup;
+
/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
@@ -580,6 +582,9 @@ struct signal_struct {

struct tty_struct *tty; /* NULL if no tty */

+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
/*
* Cumulative resource counters for dead threads in the group,
* and for reaped dead child processes forked by this group.
@@ -1931,6 +1936,24 @@ int sched_rt_handler(struct ctl_table *t

extern unsigned int sysctl_sched_compat_yield;

+#ifdef CONFIG_SCHED_AUTOGROUP
+extern unsigned int sysctl_sched_autogroup_enabled;
+
+extern void sched_autogroup_create_attach(struct task_struct *p);
+extern void sched_autogroup_detach(struct task_struct *p);
+extern void sched_autogroup_fork(struct signal_struct *sig);
+extern void sched_autogroup_exit(struct signal_struct *sig);
+#ifdef CONFIG_PROC_FS
+extern void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m);
+extern int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice);
+#endif
+#else
+static inline void sched_autogroup_create_attach(struct task_struct *p) { }
+static inline void sched_autogroup_detach(struct task_struct *p) { }
+static inline void sched_autogroup_fork(struct signal_struct *sig) { }
+static inline void sched_autogroup_exit(struct signal_struct *sig) { }
+#endif
+
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
extern void rt_mutex_setprio(struct task_struct *p, int prio);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -79,6 +79,7 @@

#include "sched_cpupri.h"
#include "workqueue_sched.h"
+#include "sched_autogroup.h"

#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
@@ -271,6 +272,10 @@ struct task_group {
struct task_group *parent;
struct list_head siblings;
struct list_head children;
+
+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
};

#define root_task_group init_task_group
@@ -603,11 +608,14 @@ static inline int cpu_of(struct rq *rq)
*/
static inline struct task_group *task_group(struct task_struct *p)
{
+ struct task_group *tg;
struct cgroup_subsys_state *css;

css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
- return container_of(css, struct task_group, css);
+ tg = container_of(css, struct task_group, css);
+
+ return autogroup_task_group(p, tg);
}

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -1869,6 +1877,7 @@ static void sched_irq_time_avg_update(st
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
+#include "sched_autogroup.c"
#include "sched_stoptask.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
@@ -7752,7 +7761,7 @@ void __init sched_init(void)
#ifdef CONFIG_CGROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
-
+ autogroup_init(&init_task);
#endif /* CONFIG_CGROUP_SCHED */

for_each_possible_cpu(i) {
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -174,8 +174,10 @@ static inline void free_signal_struct(st

static inline void put_signal_struct(struct signal_struct *sig)
{
- if (atomic_dec_and_test(&sig->sigcnt))
+ if (atomic_dec_and_test(&sig->sigcnt)) {
+ sched_autogroup_exit(sig);
free_signal_struct(sig);
+ }
}

void __put_task_struct(struct task_struct *tsk)
@@ -904,6 +906,7 @@ static int copy_signal(unsigned long clo
posix_cpu_timers_init_group(sig);

tty_audit_fork(sig);
+ sched_autogroup_fork(sig);

sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
Index: linux-2.6/kernel/sys.c
===================================================================
--- linux-2.6.orig/kernel/sys.c
+++ linux-2.6/kernel/sys.c
@@ -1080,8 +1080,10 @@ SYSCALL_DEFINE0(setsid)
err = session;
out:
write_unlock_irq(&tasklist_lock);
- if (err > 0)
+ if (err > 0) {
proc_sid_connector(group_leader);
+ sched_autogroup_create_attach(group_leader);
+ }
return err;
}

Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c
+++ linux-2.6/kernel/sched_debug.c
@@ -87,6 +87,20 @@ static void print_cfs_group_stats(struct
}
#endif

+#if defined(CONFIG_CGROUP_SCHED) && \
+ (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
+static void task_group_path(struct task_group *tg, char *buf, int buflen)
+{
+ /* may be NULL if the underlying cgroup isn't fully-created yet */
+ if (!tg->css.cgroup) {
+ if (!autogroup_path(tg, buf, buflen))
+ buf[0] = '\0';
+ return;
+ }
+ cgroup_path(tg->css.cgroup, buf, buflen);
+}
+#endif
+
static void
print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
{
@@ -115,7 +129,7 @@ print_task(struct seq_file *m, struct rq
char path[64];

rcu_read_lock();
- cgroup_path(task_group(p)->css.cgroup, path, sizeof(path));
+ task_group_path(task_group(p), path, sizeof(path));
rcu_read_unlock();
SEQ_printf(m, " %s", path);
}
@@ -147,19 +161,6 @@ static void print_rq(struct seq_file *m,
read_unlock_irqrestore(&tasklist_lock, flags);
}

-#if defined(CONFIG_CGROUP_SCHED) && \
- (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
-static void task_group_path(struct task_group *tg, char *buf, int buflen)
-{
- /* may be NULL if the underlying cgroup isn't fully-created yet */
- if (!tg->css.cgroup) {
- buf[0] = '\0';
- return;
- }
- cgroup_path(tg->css.cgroup, buf, buflen);
-}
-#endif
-
void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{
s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
Index: linux-2.6/fs/proc/base.c
===================================================================
--- linux-2.6.orig/fs/proc/base.c
+++ linux-2.6/fs/proc/base.c
@@ -1407,6 +1407,82 @@ static const struct file_operations proc

#endif

+#ifdef CONFIG_SCHED_AUTOGROUP
+/*
+ * Print out autogroup related information:
+ */
+static int sched_autogroup_show(struct seq_file *m, void *v)
+{
+ struct inode *inode = m->private;
+ struct task_struct *p;
+
+ p = get_proc_task(inode);
+ if (!p)
+ return -ESRCH;
+ proc_sched_autogroup_show_task(p, m);
+
+ put_task_struct(p);
+
+ return 0;
+}
+
+static ssize_t
+sched_autogroup_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *offset)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ struct task_struct *p;
+ char buffer[PROC_NUMBUF];
+ long nice;
+ int err;
+
+ memset(buffer, 0, sizeof(buffer));
+ if (count > sizeof(buffer) - 1)
+ count = sizeof(buffer) - 1;
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+
+ err = strict_strtol(strstrip(buffer), 0, &nice);
+ if (err)
+ return -EINVAL;
+
+ p = get_proc_task(inode);
+ if (!p)
+ return -ESRCH;
+
+ err = nice;
+ err = proc_sched_autogroup_set_nice(p, &err);
+ if (err)
+ count = err;
+
+ put_task_struct(p);
+
+ return count;
+}
+
+static int sched_autogroup_open(struct inode *inode, struct file *filp)
+{
+ int ret;
+
+ ret = single_open(filp, sched_autogroup_show, NULL);
+ if (!ret) {
+ struct seq_file *m = filp->private_data;
+
+ m->private = inode;
+ }
+ return ret;
+}
+
+static const struct file_operations proc_pid_sched_autogroup_operations = {
+ .open = sched_autogroup_open,
+ .read = seq_read,
+ .write = sched_autogroup_write,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
+
static ssize_t comm_write(struct file *file, const char __user *buf,
size_t count, loff_t *offset)
{
@@ -2733,6 +2809,9 @@ static const struct pid_entry tgid_base_
#ifdef CONFIG_SCHED_DEBUG
REG("sched", S_IRUGO|S_IWUSR, proc_pid_sched_operations),
#endif
+#ifdef CONFIG_SCHED_AUTOGROUP
+ REG("autogroup", S_IRUGO|S_IWUSR, proc_pid_sched_autogroup_operations),
+#endif
REG("comm", S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
INF("syscall", S_IRUSR, proc_pid_syscall),
Index: linux-2.6/kernel/sched_autogroup.h
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_autogroup.h
@@ -0,0 +1,32 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+struct autogroup {
+ struct kref kref;
+ struct task_group *tg;
+ struct rw_semaphore lock;
+ unsigned long id;
+ int nice;
+};
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg);
+
+#else /* !CONFIG_SCHED_AUTOGROUP */
+
+static inline void autogroup_init(struct task_struct *init_task) { }
+static inline void autogroup_free(struct task_group *tg) { }
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ return tg;
+}
+
+#ifdef CONFIG_SCHED_DEBUG
+static inline int autogroup_path(struct task_group *tg, char *buf, int buflen)
+{
+ return 0;
+}
+#endif
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
Index: linux-2.6/kernel/sched_autogroup.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_autogroup.c
@@ -0,0 +1,229 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/utsname.h>
+
+unsigned int __read_mostly sysctl_sched_autogroup_enabled = 1;
+static struct autogroup autogroup_default;
+static atomic_t autogroup_seq_nr;
+
+static void autogroup_init(struct task_struct *init_task)
+{
+ autogroup_default.tg = &init_task_group;
+ init_task_group.autogroup = &autogroup_default;
+ kref_init(&autogroup_default.kref);
+ init_rwsem(&autogroup_default.lock);
+ init_task->signal->autogroup = &autogroup_default;
+}
+
+static inline void autogroup_free(struct task_group *tg)
+{
+ kfree(tg->autogroup);
+}
+
+static inline void autogroup_destroy(struct kref *kref)
+{
+ struct autogroup *ag = container_of(kref, struct autogroup, kref);
+
+ sched_destroy_group(ag->tg);
+}
+
+static inline void autogroup_kref_put(struct autogroup *ag)
+{
+ kref_put(&ag->kref, autogroup_destroy);
+}
+
+static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
+{
+ kref_get(&ag->kref);
+ return ag;
+}
+
+static inline struct autogroup *autogroup_create(void)
+{
+ struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
+ struct task_group *tg;
+
+ if (!ag)
+ goto out_fail;
+
+ tg = sched_create_group(&init_task_group);
+
+ if (IS_ERR(tg))
+ goto out_free;
+
+ kref_init(&ag->kref);
+ init_rwsem(&ag->lock);
+ ag->id = atomic_inc_return(&autogroup_seq_nr);
+ ag->tg = tg;
+ tg->autogroup = ag;
+
+ return ag;
+
+out_free:
+ kfree(ag);
+out_fail:
+ if (printk_ratelimit()) {
+ printk(KERN_WARNING "autogroup_create: %s failure.\n",
+ ag ? "sched_create_group()" : "kmalloc()");
+ }
+
+ return autogroup_kref_get(&autogroup_default);
+}
+
+static inline bool
+task_wants_autogroup(struct task_struct *p, struct task_group *tg)
+{
+ if (tg != &root_task_group)
+ return false;
+
+ if (p->sched_class != &fair_sched_class)
+ return false;
+
+ /*
+ * We can only assume the task group can't go away on us if
+ * autogroup_move_group() can see us on ->thread_group list.
+ */
+ if (p->flags & PF_EXITING)
+ return false;
+
+ return true;
+}
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled);
+
+ if (enabled && task_wants_autogroup(p, tg))
+ return p->signal->autogroup->tg;
+
+ return tg;
+}
+
+static void
+autogroup_move_group(struct task_struct *p, struct autogroup *ag)
+{
+ struct autogroup *prev;
+ struct task_struct *t;
+ unsigned long flags;
+
+ BUG_ON(!lock_task_sighand(p, &flags));
+
+ prev = p->signal->autogroup;
+ if (prev == ag) {
+ unlock_task_sighand(p, &flags);
+ return;
+ }
+
+ p->signal->autogroup = autogroup_kref_get(ag);
+
+ t = p;
+ do {
+ sched_move_task(t);
+ } while_each_thread(p, t);
+
+ unlock_task_sighand(p, &flags);
+ autogroup_kref_put(prev);
+}
+
+/* Allocates GFP_KERNEL, cannot be called under any spinlock */
+void sched_autogroup_create_attach(struct task_struct *p)
+{
+ struct autogroup *ag = autogroup_create();
+
+ autogroup_move_group(p, ag);
+ /* drop extra reference added by autogroup_create() */
+ autogroup_kref_put(ag);
+}
+EXPORT_SYMBOL(sched_autogroup_create_attach);
+
+/* Cannot be called under siglock. Currently has no users */
+void sched_autogroup_detach(struct task_struct *p)
+{
+ autogroup_move_group(p, &autogroup_default);
+}
+EXPORT_SYMBOL(sched_autogroup_detach);
+
+void sched_autogroup_fork(struct signal_struct *sig)
+{
+ struct task_struct *p = current;
+
+ spin_lock_irq(&p->sighand->siglock);
+ sig->autogroup = autogroup_kref_get(p->signal->autogroup);
+ spin_unlock_irq(&p->sighand->siglock);
+}
+
+void sched_autogroup_exit(struct signal_struct *sig)
+{
+ autogroup_kref_put(sig->autogroup);
+}
+
+static int __init setup_autogroup(char *str)
+{
+ sysctl_sched_autogroup_enabled = 0;
+
+ return 1;
+}
+
+__setup("noautogroup", setup_autogroup);
+
+#ifdef CONFIG_PROC_FS
+
+/* Called with siglock held. */
+int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
+{
+ static unsigned long next = INITIAL_JIFFIES;
+ struct autogroup *ag;
+ int err;
+
+ if (*nice < -20 || *nice > 19)
+ return -EINVAL;
+
+ err = security_task_setnice(current, *nice);
+ if (err)
+ return err;
+
+ if (*nice < 0 && !can_nice(current, *nice))
+ return -EPERM;
+
+ /* this is a heavy operation taking global locks.. */
+ if (!capable(CAP_SYS_ADMIN) && time_before(jiffies, next))
+ return -EAGAIN;
+
+ next = HZ / 10 + jiffies;
+ ag = autogroup_kref_get(p->signal->autogroup);
+
+ down_write(&ag->lock);
+ err = sched_group_set_shares(ag->tg, prio_to_weight[*nice + 20]);
+ if (!err)
+ ag->nice = *nice;
+ up_write(&ag->lock);
+
+ autogroup_kref_put(ag);
+
+ return err;
+}
+
+void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m)
+{
+ struct autogroup *ag = autogroup_kref_get(p->signal->autogroup);
+
+ down_read(&ag->lock);
+ seq_printf(m, "/autogroup-%ld nice %d\n", ag->id, ag->nice);
+ up_read(&ag->lock);
+
+ autogroup_kref_put(ag);
+}
+#endif /* CONFIG_PROC_FS */
+
+#ifdef CONFIG_SCHED_DEBUG
+static inline int autogroup_path(struct task_group *tg, char *buf, int buflen)
+{
+ return snprintf(buf, buflen, "%s-%ld", "/autogroup", tg->autogroup->id);
+}
+#endif /* CONFIG_SCHED_DEBUG */
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -370,6 +370,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SCHED_AUTOGROUP
+ {
+ .procname = "sched_autogroup_enabled",
+ .data = &sysctl_sched_autogroup_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -741,6 +741,18 @@ config NET_NS

endif # NAMESPACES

+config SCHED_AUTOGROUP
+ bool "Automatic process group scheduling"
+ select CGROUPS
+ select CGROUP_SCHED
+ select FAIR_GROUP_SCHED
+ help
+ This option optimizes the scheduler for common desktop workloads by
+ automatically creating and populating task groups. This separation
+ of workloads isolates aggressive CPU burners (like build jobs) from
+ desktop applications. Task group autogeneration is currently based
+ upon task session.
+
config MM_OWNER
bool

Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt
+++ linux-2.6/Documentation/kernel-parameters.txt
@@ -1622,6 +1622,8 @@ and is between 256 and 4096 characters.
noapic [SMP,APIC] Tells the kernel to not make use of any
IOAPICs that may be present in the system.

+ noautogroup Disable scheduler automatic task group creation.
+
nobats [PPC] Do not use BATs for mapping kernel lowmem
on "Classic" PPC cores.




2010-11-30 13:49:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Tue, 2010-11-30 at 14:18 +0100, Mike Galbraith wrote:
>
> From: Mike Galbraith <[email protected]>
> Date: Tue Nov 30 14:07:12 CET 2010
> Subject: [PATCH] sched: Improve desktop interactivity: Implement automated per session task groups
>
> A recurring complaint from CFS users is that parallel kbuild has a negative
> impact on desktop interactivity. This patch implements an idea from Linus,
> to automatically create task groups. Currently, only per session autogroups
> are implemented, but the patch leaves the way open for enhancement.
>
> Implementation: each task's signal struct contains an inherited pointer to
> a refcounted autogroup struct containing a task group pointer, the default
> for all tasks pointing to the init_task_group. When a task calls setsid(),
> a new task group is created, the process is moved into the new task group,
> and a reference to the previous task group is dropped. Child processes
> inherit this task group thereafter, and increase its refcount. When the
> last thread of a process exits, the process's reference is dropped, such
> that when the last process referencing an autogroup exits, the autogroup
> is destroyed.
>
> At runqueue selection time, IFF a task has no cgroup assignment, its current
> autogroup is used.
>
> Autogroup bandwidth is controllable by setting its nice level through the
> proc filesystem. cat /proc/<pid>/autogroup displays the task's group and the
> group's nice level. echo <nice level> > /proc/<pid>/autogroup sets the task
> group's shares to the weight of a nice <level> task. Setting the nice level
> is rate limited for !admin users due to the abuse risk of task group locking.
>
> The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP=y is
> selected, but can be disabled via the boot option noautogroup, and can also
> be turned on/off on the fly via:
> echo [01] > /proc/sys/kernel/sched_autogroup_enabled
> ...which will automatically move tasks to/from the root task group.
>
> Signed-off-by: Mike Galbraith <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Markus Trippelsdorf <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> LKML-Reference: <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>

Looks good to me, Thanks Mike!

Acked-by: Peter Zijlstra <[email protected]>

2010-11-30 14:00:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups


* Peter Zijlstra <[email protected]> wrote:

> On Tue, 2010-11-30 at 14:18 +0100, Mike Galbraith wrote:
> >
> > From: Mike Galbraith <[email protected]>
> > Date: Tue Nov 30 14:07:12 CET 2010
> > Subject: [PATCH] sched: Improve desktop interactivity: Implement automated per session task groups
> >
> > A recurring complaint from CFS users is that parallel kbuild has a negative
> > impact on desktop interactivity. This patch implements an idea from Linus,
> > to automatically create task groups. Currently, only per session autogroups
> > are implemented, but the patch leaves the way open for enhancement.
> >
> > Implementation: each task's signal struct contains an inherited pointer to
> > a refcounted autogroup struct containing a task group pointer, the default
> > for all tasks pointing to the init_task_group. When a task calls setsid(),
> > a new task group is created, the process is moved into the new task group,
> > and a reference to the previous task group is dropped. Child processes
> > inherit this task group thereafter, and increase its refcount. When the
> > last thread of a process exits, the process's reference is dropped, such
> > that when the last process referencing an autogroup exits, the autogroup
> > is destroyed.
> >
> > At runqueue selection time, IFF a task has no cgroup assignment, its current
> > autogroup is used.
> >
> > Autogroup bandwidth is controllable by setting its nice level through the
> > proc filesystem. cat /proc/<pid>/autogroup displays the task's group and the
> > group's nice level. echo <nice level> > /proc/<pid>/autogroup sets the task
> > group's shares to the weight of a nice <level> task. Setting the nice level
> > is rate limited for !admin users due to the abuse risk of task group locking.
> >
> > The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP=y is
> > selected, but can be disabled via the boot option noautogroup, and can also
> > be turned on/off on the fly via:
> > echo [01] > /proc/sys/kernel/sched_autogroup_enabled
> > ...which will automatically move tasks to/from the root task group.
> >
> > Signed-off-by: Mike Galbraith <[email protected]>
> > Cc: Oleg Nesterov <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Linus Torvalds <[email protected]>
> > Cc: Markus Trippelsdorf <[email protected]>
> > Cc: Mathieu Desnoyers <[email protected]>
> > LKML-Reference: <[email protected]>
> > Signed-off-by: Ingo Molnar <[email protected]>
>
> Looks good to me, Thanks Mike!
>
> Acked-by: Peter Zijlstra <[email protected]>

Ok, great!

I've queued it up in tip:sched/core and started testing it - will push it out if it
passes basic tests. Added Linus's Acked-by - I presume that's still valid for v4 as
well, right?

Thanks,

Ingo

2010-11-30 14:13:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups


* Mike Galbraith <[email protected]> wrote:

> On Mon, 2010-11-29 at 20:23 -0800, Paul Turner wrote:
>
> > I've left some machines running tip + fix above + autogroup to see if
> > anything else emerges. Hasn't crashed yet, I'll leave it going
> > overnight.
>
> Thanks. Below is the hopefully final version against tip. The last I
> sent contained a couple remnants.

Note, I removed this chunk:

> kernel/sched_debug.c | 29 ++--

> Index: linux-2.6/kernel/sched_debug.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_debug.c
> +++ linux-2.6/kernel/sched_debug.c
> @@ -87,6 +87,20 @@ static void print_cfs_group_stats(struct
> }
> #endif
>
> +#if defined(CONFIG_CGROUP_SCHED) && \
> + (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
> +static void task_group_path(struct task_group *tg, char *buf, int buflen)
> +{
> + /* may be NULL if the underlying cgroup isn't fully-created yet */
> + if (!tg->css.cgroup) {
> + if (!autogroup_path(tg, buf, buflen))
> + buf[0] = '\0';
> + return;
> + }
> + cgroup_path(tg->css.cgroup, buf, buflen);
> +}
> +#endif
> +
> static void
> print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
> {
> @@ -115,7 +129,7 @@ print_task(struct seq_file *m, struct rq
> char path[64];
>
> rcu_read_lock();
> - cgroup_path(task_group(p)->css.cgroup, path, sizeof(path));
> + task_group_path(task_group(p), path, sizeof(path));
> rcu_read_unlock();
> SEQ_printf(m, " %s", path);
> }
> @@ -147,19 +161,6 @@ static void print_rq(struct seq_file *m,
> read_unlock_irqrestore(&tasklist_lock, flags);
> }
>
> -#if defined(CONFIG_CGROUP_SCHED) && \
> - (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
> -static void task_group_path(struct task_group *tg, char *buf, int buflen)
> -{
> - /* may be NULL if the underlying cgroup isn't fully-created yet */
> - if (!tg->css.cgroup) {
> - buf[0] = '\0';
> - return;
> - }
> - cgroup_path(tg->css.cgroup, buf, buflen);
> -}
> -#endif
> -
> void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> {
> s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,

Because it didn't build (for obvious reasons - the CONFIG conditions don't match up),
but more importantly it's quite ugly. Some existing 'path' variables are 64 bytes,
some are 128 bytes - so there's pre-existing damage - I removed it all.

Could we do this debugging code in a bit saner way please? (As a delta patch on top
of the -tip that I'll push out in the next hour or so.)

Thanks,

Ingo
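
For illustration, a sketch of the kind of unified helper being asked for,
using only functions already present in the patch above; it assumes
autogroup_path() grows a stub in every configuration sched_debug.c is
built under, and that all callers settle on one buffer size. Illustrative
only - not necessarily the cleanup that eventually went in.

/* one buffer size for all callers, instead of the mixed 64/128 bytes */
#define TG_PATH_LEN	128

static void task_group_path(struct task_group *tg, char *buf, int buflen)
{
	/*
	 * A NULL css.cgroup means either an autogroup (which has no
	 * cgroup attached) or a cgroup that isn't fully created yet.
	 */
	if (!tg->css.cgroup) {
		if (!autogroup_path(tg, buf, buflen))
			buf[0] = '\0';
		return;
	}
	cgroup_path(tg->css.cgroup, buf, buflen);
}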

2010-11-30 14:18:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups


* Mike Galbraith <[email protected]> wrote:

> On Mon, 2010-11-29 at 20:20 +0100, Ingo Molnar wrote:
> > * Mike Galbraith <[email protected]> wrote:
> >
> > > > I know, from the testing so far we _thought_ it was fairly sane. Apparently
> > > > there's still some work to do.
> > >
> > > Damn thing bisected to:
> > >
> > > commit 92fd4d4d67b945c0766416284d4ab236b31542c4
> > > Merge: fe7de49 e53beac
> > > Author: Ingo Molnar <[email protected]>
> > > Date: Thu Nov 18 13:22:14 2010 +0100
> > >
> > > Merge commit 'v2.6.37-rc2' into sched/core
> > >
> > > Merge reason: Move to a .37-rc base.
> > >
> > > Signed-off-by: Ingo Molnar <[email protected]>
> > >
> > > 92fd4d4d67b945c0766416284d4ab236b31542c4 is the first bad commit
> >
> > Hm, i'd suggest to double check the two originator points:
> >
> > e53beac - is it really 'bad' ?
> > fe7de49 - is it really 'good'?
>
> Nope. I did a bisection this morning in text mode with a pipe-test
> based measurement proggy, and it bisected cleanly.
>
> 2069dd75c7d0f49355939e5586daf5a9ab216db7 is the first bad commit
>
> commit 2069dd75c7d0f49355939e5586daf5a9ab216db7
> Author: Peter Zijlstra <[email protected]>
> Date: Mon Nov 15 15:47:00 2010 -0800
>
> sched: Rewrite tg_shares_up)

Ok. And has this fixed it:

822bc180a7f7: sched: Fix unregister_fair_sched_group()

... or are there two bugs?

Thanks,

Ingo

2010-11-30 14:53:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups


another detail is that i needed this fix:

--- linux.orig/init/Kconfig
+++ linux/init/Kconfig
@@ -790,6 +790,7 @@ endif # NAMESPACES

config SCHED_AUTOGROUP
bool "Automatic process group scheduling"
+ select EVENTFD
select CGROUPS
select CGROUP_SCHED
select FAIR_GROUP_SCHED

Because CGROUPS depends on eventfd infrastructure.

Thanks,

Ingo

2010-11-30 15:01:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Tue, 2010-11-30 at 15:53 +0100, Ingo Molnar wrote:
> another detail is that i needed this fix:
>
> --- linux.orig/init/Kconfig
> +++ linux/init/Kconfig
> @@ -790,6 +790,7 @@ endif # NAMESPACES
>
> config SCHED_AUTOGROUP
> bool "Automatic process group scheduling"
> + select EVENTFD
> select CGROUPS
> select CGROUP_SCHED
> select FAIR_GROUP_SCHED
>
> Because CGROUPS depends on eventfd infrastructure.

Shouldn't then cgroups select that?

2010-11-30 15:11:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups


* Peter Zijlstra <[email protected]> wrote:

> On Tue, 2010-11-30 at 15:53 +0100, Ingo Molnar wrote:
> > another detail is that i needed this fix:
> >
> > --- linux.orig/init/Kconfig
> > +++ linux/init/Kconfig
> > @@ -790,6 +790,7 @@ endif # NAMESPACES
> >
> > config SCHED_AUTOGROUP
> > bool "Automatic process group scheduling"
> > + select EVENTFD
> > select CGROUPS
> > select CGROUP_SCHED
> > select FAIR_GROUP_SCHED
> >
> > Because CGROUPS depends on eventfd infrastructure.
>
> Shouldn't then cgroups select that?

It depends on it, so selecting it is fine. !EVENTFD is a CONFIG_EMBEDDED-only thing
anyway.

Thanks,

Ingo

2010-11-30 15:18:41

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Tue, Nov 30, 2010 at 02:18:03PM +0100, Mike Galbraith wrote:
> On Mon, 2010-11-29 at 20:23 -0800, Paul Turner wrote:
>
> > I've left some machines running tip + fix above + autogroup to see if
> > anything else emerges. Hasn't crashed yet, I'll leave it going
> > overnight.
>
> Thanks. Below is the hopefully final version against tip. The last I
> sent contained a couple remnants.
>
> From: Mike Galbraith <[email protected]>
> Date: Tue Nov 30 14:07:12 CET 2010
> Subject: [PATCH] sched: Improve desktop interactivity: Implement automated per session task groups
>
> A recurring complaint from CFS users is that parallel kbuild has a negative
> impact on desktop interactivity. This patch implements an idea from Linus,
> to automatically create task groups. Currently, only per session autogroups
> are implemented, but the patch leaves the way open for enhancement.
>
> Implementation: each task's signal struct contains an inherited pointer to
> a refcounted autogroup struct containing a task group pointer, the default
> for all tasks pointing to the init_task_group. When a task calls setsid(),
> a new task group is created, the process is moved into the new task group,
> and a reference to the previous task group is dropped. Child processes
> inherit this task group thereafter, and increase its refcount. When the
> last thread of a process exits, the process's reference is dropped, such
> that when the last process referencing an autogroup exits, the autogroup
> is destroyed.
>
> At runqueue selection time, IFF a task has no cgroup assignment, its current
> autogroup is used.
>
> Autogroup bandwidth is controllable by setting its nice level through the
> proc filesystem. cat /proc/<pid>/autogroup displays the task's group and the
> group's nice level. echo <nice level> > /proc/<pid>/autogroup sets the task
> group's shares to the weight of a nice <level> task. Setting the nice level
> is rate limited for !admin users due to the abuse risk of task group locking.
>

Hi Mike,

I was wondering if these autogroups can be made visible in the regular
cgroup hierarchy, so that once somebody mounts the cpu controller they
show up there?

I was also wondering why it is a good idea to create a separate interface
for autogroups through proc rather than integrating it with the cgroup
interface.

Without that, any user space tool will have to either disable the
autogroup feature completely or also deal with the /proc interface,
where autogroups can only be found through a pid and there is no
direct way to access them.

IIUC, these autogroups create a flat setup and sit at the same level as
init_task_group rather than being children of it. Currently the cpu cgroup
is hierarchical by default and any new cgroup is a child of init_task_group,
so that could lead to representation issues.

Well, would we not get the same kind of latency boost if we made these
autogroups children of root? If yes, then the hierarchical representation
issue of autogroups would be a moot point.

We already have the /proc/<pid>/cgroup interface, which points to a task's
cgroup. We could probably avoid creating /proc/<pid>/autogroup if there
were an associated cgroup visible in the cgroup hierarchy; the user could
then change the group's weight through the cgroup interface (instead of
introducing another one).
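
For illustration, the kind of pid walk a userspace tool is forced into
with the /proc-only interface - a sketch, not proposed code:

#include <dirent.h>
#include <stdio.h>

/*
 * Enumerate autogroups by walking /proc/<pid>/autogroup for every pid;
 * there is no directory of autogroups in a cgroup mount to read instead.
 */
int main(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;
	char path[64], line[128];
	FILE *f;

	if (!proc)
		return 1;

	while ((de = readdir(proc))) {
		if (de->d_name[0] < '0' || de->d_name[0] > '9')
			continue;	/* not a pid entry */
		snprintf(path, sizeof(path), "/proc/%s/autogroup", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;	/* task exited, or no autogroup file */
		if (fgets(line, sizeof(line), f))
			printf("%5s  %s", de->d_name, line);
		fclose(f);
	}
	closedir(proc);
	return 0;
}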

Thanks
Vivek


2010-11-30 15:41:12

by Mike Galbraith

[permalink] [raw]
Subject: [tip:sched/core] sched: Add 'autogroup' scheduling feature: automated per session task groups

Commit-ID: 5091faa449ee0b7d73bc296a93bca9540fc51d0a
Gitweb: http://git.kernel.org/tip/5091faa449ee0b7d73bc296a93bca9540fc51d0a
Author: Mike Galbraith <[email protected]>
AuthorDate: Tue, 30 Nov 2010 14:18:03 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 30 Nov 2010 16:03:35 +0100

sched: Add 'autogroup' scheduling feature: automated per session task groups

A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity. This patch
implements an idea from Linus, to automatically create task
groups. Currently, only per session autogroups are implemented,
but the patch leaves the way open for enhancement.

Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group. When a task calls setsid(), a new task group
is created, the process is moved into the new task group, and a
reference to the previous task group is dropped. Child
processes inherit this task group thereafter, and increase its
refcount. When the last thread of a process exits, the
process's reference is dropped, such that when the last process
referencing an autogroup exits, the autogroup is destroyed.

At runqueue selection time, IFF a task has no cgroup assignment,
its current autogroup is used.

Autogroup bandwidth is controllable by setting its nice level
through the proc filesystem:

cat /proc/<pid>/autogroup

Displays the task's group and the group's nice level.

echo <nice level> > /proc/<pid>/autogroup

Sets the task group's shares to the weight of a nice <level> task.
Setting the nice level is rate limited for !admin users due to the
abuse risk of task group locking.

The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
the boot option noautogroup, and can also be turned on/off on
the fly via:

echo [01] > /proc/sys/kernel/sched_autogroup_enabled

... which will automatically move tasks to/from the root task group.

Signed-off-by: Mike Galbraith <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Markus Trippelsdorf <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Oleg Nesterov <[email protected]>
[ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
Signed-off-by: Ingo Molnar <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/kernel-parameters.txt | 2 +
fs/proc/base.c | 79 ++++++++++++
include/linux/sched.h | 23 ++++
init/Kconfig | 13 ++
kernel/fork.c | 5 +-
kernel/sched.c | 13 ++-
kernel/sched_autogroup.c | 229 +++++++++++++++++++++++++++++++++++
kernel/sched_autogroup.h | 32 +++++
kernel/sched_debug.c | 47 +-------
kernel/sys.c | 4 +-
kernel/sysctl.c | 11 ++
11 files changed, 409 insertions(+), 49 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 92e83e5..86820a7 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1622,6 +1622,8 @@ and is between 256 and 4096 characters. It is defined in the file
noapic [SMP,APIC] Tells the kernel to not make use of any
IOAPICs that may be present in the system.

+ noautogroup Disable scheduler automatic task group creation.
+
nobats [PPC] Do not use BATs for mapping kernel lowmem
on "Classic" PPC cores.

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f3d02ca..2fa0ce2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1407,6 +1407,82 @@ static const struct file_operations proc_pid_sched_operations = {

#endif

+#ifdef CONFIG_SCHED_AUTOGROUP
+/*
+ * Print out autogroup related information:
+ */
+static int sched_autogroup_show(struct seq_file *m, void *v)
+{
+ struct inode *inode = m->private;
+ struct task_struct *p;
+
+ p = get_proc_task(inode);
+ if (!p)
+ return -ESRCH;
+ proc_sched_autogroup_show_task(p, m);
+
+ put_task_struct(p);
+
+ return 0;
+}
+
+static ssize_t
+sched_autogroup_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *offset)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ struct task_struct *p;
+ char buffer[PROC_NUMBUF];
+ long nice;
+ int err;
+
+ memset(buffer, 0, sizeof(buffer));
+ if (count > sizeof(buffer) - 1)
+ count = sizeof(buffer) - 1;
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+
+ err = strict_strtol(strstrip(buffer), 0, &nice);
+ if (err)
+ return -EINVAL;
+
+ p = get_proc_task(inode);
+ if (!p)
+ return -ESRCH;
+
+ err = nice;
+ err = proc_sched_autogroup_set_nice(p, &err);
+ if (err)
+ count = err;
+
+ put_task_struct(p);
+
+ return count;
+}
+
+static int sched_autogroup_open(struct inode *inode, struct file *filp)
+{
+ int ret;
+
+ ret = single_open(filp, sched_autogroup_show, NULL);
+ if (!ret) {
+ struct seq_file *m = filp->private_data;
+
+ m->private = inode;
+ }
+ return ret;
+}
+
+static const struct file_operations proc_pid_sched_autogroup_operations = {
+ .open = sched_autogroup_open,
+ .read = seq_read,
+ .write = sched_autogroup_write,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
+
static ssize_t comm_write(struct file *file, const char __user *buf,
size_t count, loff_t *offset)
{
@@ -2733,6 +2809,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_SCHED_DEBUG
REG("sched", S_IRUGO|S_IWUSR, proc_pid_sched_operations),
#endif
+#ifdef CONFIG_SCHED_AUTOGROUP
+ REG("autogroup", S_IRUGO|S_IWUSR, proc_pid_sched_autogroup_operations),
+#endif
REG("comm", S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
INF("syscall", S_IRUSR, proc_pid_syscall),
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a5b92c7..9c2d46d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -509,6 +509,8 @@ struct thread_group_cputimer {
spinlock_t lock;
};

+struct autogroup;
+
/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
@@ -576,6 +578,9 @@ struct signal_struct {

struct tty_struct *tty; /* NULL if no tty */

+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
/*
* Cumulative resource counters for dead threads in the group,
* and for reaped dead child processes forked by this group.
@@ -1927,6 +1932,24 @@ int sched_rt_handler(struct ctl_table *table, int write,

extern unsigned int sysctl_sched_compat_yield;

+#ifdef CONFIG_SCHED_AUTOGROUP
+extern unsigned int sysctl_sched_autogroup_enabled;
+
+extern void sched_autogroup_create_attach(struct task_struct *p);
+extern void sched_autogroup_detach(struct task_struct *p);
+extern void sched_autogroup_fork(struct signal_struct *sig);
+extern void sched_autogroup_exit(struct signal_struct *sig);
+#ifdef CONFIG_PROC_FS
+extern void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m);
+extern int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice);
+#endif
+#else
+static inline void sched_autogroup_create_attach(struct task_struct *p) { }
+static inline void sched_autogroup_detach(struct task_struct *p) { }
+static inline void sched_autogroup_fork(struct signal_struct *sig) { }
+static inline void sched_autogroup_exit(struct signal_struct *sig) { }
+#endif
+
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
extern void rt_mutex_setprio(struct task_struct *p, int prio);
diff --git a/init/Kconfig b/init/Kconfig
index 88c1046..f1bba0a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -728,6 +728,19 @@ config NET_NS

endif # NAMESPACES

+config SCHED_AUTOGROUP
+ bool "Automatic process group scheduling"
+ select EVENTFD
+ select CGROUPS
+ select CGROUP_SCHED
+ select FAIR_GROUP_SCHED
+ help
+ This option optimizes the scheduler for common desktop workloads by
+ automatically creating and populating task groups. This separation
+ of workloads isolates aggressive CPU burners (like build jobs) from
+ desktop applications. Task group autogeneration is currently based
+ upon task session.
+
config MM_OWNER
bool

diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..b6f2475 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -174,8 +174,10 @@ static inline void free_signal_struct(struct signal_struct *sig)

static inline void put_signal_struct(struct signal_struct *sig)
{
- if (atomic_dec_and_test(&sig->sigcnt))
+ if (atomic_dec_and_test(&sig->sigcnt)) {
+ sched_autogroup_exit(sig);
free_signal_struct(sig);
+ }
}

void __put_task_struct(struct task_struct *tsk)
@@ -904,6 +906,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
posix_cpu_timers_init_group(sig);

tty_audit_fork(sig);
+ sched_autogroup_fork(sig);

sig->oom_adj = current->signal->oom_adj;
sig->oom_score_adj = current->signal->oom_score_adj;
diff --git a/kernel/sched.c b/kernel/sched.c
index 66ef579..b646dad 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -79,6 +79,7 @@

#include "sched_cpupri.h"
#include "workqueue_sched.h"
+#include "sched_autogroup.h"

#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
@@ -271,6 +272,10 @@ struct task_group {
struct task_group *parent;
struct list_head siblings;
struct list_head children;
+
+#ifdef CONFIG_SCHED_AUTOGROUP
+ struct autogroup *autogroup;
+#endif
};

#define root_task_group init_task_group
@@ -603,11 +608,14 @@ static inline int cpu_of(struct rq *rq)
*/
static inline struct task_group *task_group(struct task_struct *p)
{
+ struct task_group *tg;
struct cgroup_subsys_state *css;

css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
- return container_of(css, struct task_group, css);
+ tg = container_of(css, struct task_group, css);
+
+ return autogroup_task_group(p, tg);
}

/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
@@ -1869,6 +1877,7 @@ static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { }
#include "sched_idletask.c"
#include "sched_fair.c"
#include "sched_rt.c"
+#include "sched_autogroup.c"
#include "sched_stoptask.c"
#ifdef CONFIG_SCHED_DEBUG
# include "sched_debug.c"
@@ -7750,7 +7759,7 @@ void __init sched_init(void)
#ifdef CONFIG_CGROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
-
+ autogroup_init(&init_task);
#endif /* CONFIG_CGROUP_SCHED */

for_each_possible_cpu(i) {
diff --git a/kernel/sched_autogroup.c b/kernel/sched_autogroup.c
new file mode 100644
index 0000000..57a7ac2
--- /dev/null
+++ b/kernel/sched_autogroup.c
@@ -0,0 +1,229 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/utsname.h>
+
+unsigned int __read_mostly sysctl_sched_autogroup_enabled = 1;
+static struct autogroup autogroup_default;
+static atomic_t autogroup_seq_nr;
+
+static void autogroup_init(struct task_struct *init_task)
+{
+ autogroup_default.tg = &init_task_group;
+ init_task_group.autogroup = &autogroup_default;
+ kref_init(&autogroup_default.kref);
+ init_rwsem(&autogroup_default.lock);
+ init_task->signal->autogroup = &autogroup_default;
+}
+
+static inline void autogroup_free(struct task_group *tg)
+{
+ kfree(tg->autogroup);
+}
+
+static inline void autogroup_destroy(struct kref *kref)
+{
+ struct autogroup *ag = container_of(kref, struct autogroup, kref);
+
+ sched_destroy_group(ag->tg);
+}
+
+static inline void autogroup_kref_put(struct autogroup *ag)
+{
+ kref_put(&ag->kref, autogroup_destroy);
+}
+
+static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
+{
+ kref_get(&ag->kref);
+ return ag;
+}
+
+static inline struct autogroup *autogroup_create(void)
+{
+ struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
+ struct task_group *tg;
+
+ if (!ag)
+ goto out_fail;
+
+ tg = sched_create_group(&init_task_group);
+
+ if (IS_ERR(tg))
+ goto out_free;
+
+ kref_init(&ag->kref);
+ init_rwsem(&ag->lock);
+ ag->id = atomic_inc_return(&autogroup_seq_nr);
+ ag->tg = tg;
+ tg->autogroup = ag;
+
+ return ag;
+
+out_free:
+ kfree(ag);
+out_fail:
+ if (printk_ratelimit()) {
+ printk(KERN_WARNING "autogroup_create: %s failure.\n",
+ ag ? "sched_create_group()" : "kmalloc()");
+ }
+
+ return autogroup_kref_get(&autogroup_default);
+}
+
+static inline bool
+task_wants_autogroup(struct task_struct *p, struct task_group *tg)
+{
+ if (tg != &root_task_group)
+ return false;
+
+ if (p->sched_class != &fair_sched_class)
+ return false;
+
+ /*
+ * We can only assume the task group can't go away on us if
+ * autogroup_move_group() can see us on ->thread_group list.
+ */
+ if (p->flags & PF_EXITING)
+ return false;
+
+ return true;
+}
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled);
+
+ if (enabled && task_wants_autogroup(p, tg))
+ return p->signal->autogroup->tg;
+
+ return tg;
+}
+
+static void
+autogroup_move_group(struct task_struct *p, struct autogroup *ag)
+{
+ struct autogroup *prev;
+ struct task_struct *t;
+ unsigned long flags;
+
+ BUG_ON(!lock_task_sighand(p, &flags));
+
+ prev = p->signal->autogroup;
+ if (prev == ag) {
+ unlock_task_sighand(p, &flags);
+ return;
+ }
+
+ p->signal->autogroup = autogroup_kref_get(ag);
+
+ t = p;
+ do {
+ sched_move_task(t);
+ } while_each_thread(p, t);
+
+ unlock_task_sighand(p, &flags);
+ autogroup_kref_put(prev);
+}
+
+/* Allocates GFP_KERNEL, cannot be called under any spinlock */
+void sched_autogroup_create_attach(struct task_struct *p)
+{
+ struct autogroup *ag = autogroup_create();
+
+ autogroup_move_group(p, ag);
+ /* drop extra reference added by autogroup_create() */
+ autogroup_kref_put(ag);
+}
+EXPORT_SYMBOL(sched_autogroup_create_attach);
+
+/* Cannot be called under siglock. Currently has no users */
+void sched_autogroup_detach(struct task_struct *p)
+{
+ autogroup_move_group(p, &autogroup_default);
+}
+EXPORT_SYMBOL(sched_autogroup_detach);
+
+void sched_autogroup_fork(struct signal_struct *sig)
+{
+ struct task_struct *p = current;
+
+ spin_lock_irq(&p->sighand->siglock);
+ sig->autogroup = autogroup_kref_get(p->signal->autogroup);
+ spin_unlock_irq(&p->sighand->siglock);
+}
+
+void sched_autogroup_exit(struct signal_struct *sig)
+{
+ autogroup_kref_put(sig->autogroup);
+}
+
+static int __init setup_autogroup(char *str)
+{
+ sysctl_sched_autogroup_enabled = 0;
+
+ return 1;
+}
+
+__setup("noautogroup", setup_autogroup);
+
+#ifdef CONFIG_PROC_FS
+
+/* Called with siglock held. */
+int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
+{
+ static unsigned long next = INITIAL_JIFFIES;
+ struct autogroup *ag;
+ int err;
+
+ if (*nice < -20 || *nice > 19)
+ return -EINVAL;
+
+ err = security_task_setnice(current, *nice);
+ if (err)
+ return err;
+
+ if (*nice < 0 && !can_nice(current, *nice))
+ return -EPERM;
+
+ /* this is a heavy operation taking global locks.. */
+ if (!capable(CAP_SYS_ADMIN) && time_before(jiffies, next))
+ return -EAGAIN;
+
+ next = HZ / 10 + jiffies;
+ ag = autogroup_kref_get(p->signal->autogroup);
+
+ down_write(&ag->lock);
+ err = sched_group_set_shares(ag->tg, prio_to_weight[*nice + 20]);
+ if (!err)
+ ag->nice = *nice;
+ up_write(&ag->lock);
+
+ autogroup_kref_put(ag);
+
+ return err;
+}
+
+void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m)
+{
+ struct autogroup *ag = autogroup_kref_get(p->signal->autogroup);
+
+ down_read(&ag->lock);
+ seq_printf(m, "/autogroup-%ld nice %d\n", ag->id, ag->nice);
+ up_read(&ag->lock);
+
+ autogroup_kref_put(ag);
+}
+#endif /* CONFIG_PROC_FS */
+
+#ifdef CONFIG_SCHED_DEBUG
+static inline int autogroup_path(struct task_group *tg, char *buf, int buflen)
+{
+ return snprintf(buf, buflen, "%s-%ld", "/autogroup", tg->autogroup->id);
+}
+#endif /* CONFIG_SCHED_DEBUG */
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
diff --git a/kernel/sched_autogroup.h b/kernel/sched_autogroup.h
new file mode 100644
index 0000000..5358e24
--- /dev/null
+++ b/kernel/sched_autogroup.h
@@ -0,0 +1,32 @@
+#ifdef CONFIG_SCHED_AUTOGROUP
+
+struct autogroup {
+ struct kref kref;
+ struct task_group *tg;
+ struct rw_semaphore lock;
+ unsigned long id;
+ int nice;
+};
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg);
+
+#else /* !CONFIG_SCHED_AUTOGROUP */
+
+static inline void autogroup_init(struct task_struct *init_task) { }
+static inline void autogroup_free(struct task_group *tg) { }
+
+static inline struct task_group *
+autogroup_task_group(struct task_struct *p, struct task_group *tg)
+{
+ return tg;
+}
+
+#ifdef CONFIG_SCHED_DEBUG
+static inline int autogroup_path(struct task_group *tg, char *buf, int buflen)
+{
+ return 0;
+}
+#endif
+
+#endif /* CONFIG_SCHED_AUTOGROUP */
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index e95b774..1dfae3d 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -54,8 +54,7 @@ static unsigned long nsec_low(unsigned long long nsec)
#define SPLIT_NS(x) nsec_high(x), nsec_low(x)

#ifdef CONFIG_FAIR_GROUP_SCHED
-static void print_cfs_group_stats(struct seq_file *m, int cpu,
- struct task_group *tg)
+static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group *tg)
{
struct sched_entity *se = tg->se[cpu];
if (!se)
@@ -110,16 +109,6 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L);
#endif

-#ifdef CONFIG_CGROUP_SCHED
- {
- char path[64];
-
- rcu_read_lock();
- cgroup_path(task_group(p)->css.cgroup, path, sizeof(path));
- rcu_read_unlock();
- SEQ_printf(m, " %s", path);
- }
-#endif
SEQ_printf(m, "\n");
}

@@ -147,19 +136,6 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
read_unlock_irqrestore(&tasklist_lock, flags);
}

-#if defined(CONFIG_CGROUP_SCHED) && \
- (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
-static void task_group_path(struct task_group *tg, char *buf, int buflen)
-{
- /* may be NULL if the underlying cgroup isn't fully-created yet */
- if (!tg->css.cgroup) {
- buf[0] = '\0';
- return;
- }
- cgroup_path(tg->css.cgroup, buf, buflen);
-}
-#endif
-
void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{
s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
@@ -168,16 +144,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
struct sched_entity *last;
unsigned long flags;

-#if defined(CONFIG_CGROUP_SCHED) && defined(CONFIG_FAIR_GROUP_SCHED)
- char path[128];
- struct task_group *tg = cfs_rq->tg;
-
- task_group_path(tg, path, sizeof(path));
-
- SEQ_printf(m, "\ncfs_rq[%d]:%s\n", cpu, path);
-#else
SEQ_printf(m, "\ncfs_rq[%d]:\n", cpu);
-#endif
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "exec_clock",
SPLIT_NS(cfs_rq->exec_clock));

@@ -215,7 +182,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %ld\n", "load_contrib",
cfs_rq->load_contribution);
SEQ_printf(m, " .%-30s: %d\n", "load_tg",
- atomic_read(&tg->load_weight));
+ atomic_read(&cfs_rq->tg->load_weight));
#endif

print_cfs_group_stats(m, cpu, cfs_rq->tg);
@@ -224,17 +191,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)

void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
{
-#if defined(CONFIG_CGROUP_SCHED) && defined(CONFIG_RT_GROUP_SCHED)
- char path[128];
- struct task_group *tg = rt_rq->tg;
-
- task_group_path(tg, path, sizeof(path));
-
- SEQ_printf(m, "\nrt_rq[%d]:%s\n", cpu, path);
-#else
SEQ_printf(m, "\nrt_rq[%d]:\n", cpu);
-#endif
-

#define P(x) \
SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(rt_rq->x))
diff --git a/kernel/sys.c b/kernel/sys.c
index 7f5a0cd..2745dcd 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1080,8 +1080,10 @@ SYSCALL_DEFINE0(setsid)
err = session;
out:
write_unlock_irq(&tasklist_lock);
- if (err > 0)
+ if (err > 0) {
proc_sid_connector(group_leader);
+ sched_autogroup_create_attach(group_leader);
+ }
return err;
}

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a00fdef..121e4ff 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -370,6 +370,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SCHED_AUTOGROUP
+ {
+ .procname = "sched_autogroup_enabled",
+ .data = &sysctl_sched_autogroup_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",

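For illustration, a minimal userspace sketch of the resulting interface
(assuming the /proc/<pid>/autogroup read/write wiring from the full patch;
untested, error handling trimmed):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[64];
	FILE *f;

	/* becoming a session leader runs the setsid() hook above,
	 * attaching this process to a fresh autogroup */
	if (setsid() < 0)
		perror("setsid");	/* fails if already a group leader */

	f = fopen("/proc/self/autogroup", "r");
	if (f && fgets(buf, sizeof(buf), f))
		printf("now in: %s", buf);	/* e.g. "/autogroup-63 nice 0" */
	if (f)
		fclose(f);

	/* deweight the whole session; range is -20..19 as in set_nice()
	 * above, and negative values need the usual nice privileges */
	f = fopen("/proc/self/autogroup", "w");
	if (f) {
		fprintf(f, "10\n");
		fclose(f);
	}
	return 0;
}
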
2010-11-30 16:28:59

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Tue, 2010-11-30 at 15:18 +0100, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > On Mon, 2010-11-29 at 20:20 +0100, Ingo Molnar wrote:
> > > * Mike Galbraith <[email protected]> wrote:
> > >
> > > > > I know, from the testing so far we _thought_ it was fairly sane. Apparently
> > > > > there's still some work to do.
> > > >
> > > > Damn thing bisected to:
> > > >
> > > > commit 92fd4d4d67b945c0766416284d4ab236b31542c4
> > > > Merge: fe7de49 e53beac
> > > > Author: Ingo Molnar <[email protected]>
> > > > Date: Thu Nov 18 13:22:14 2010 +0100
> > > >
> > > > Merge commit 'v2.6.37-rc2' into sched/core
> > > >
> > > > Merge reason: Move to a .37-rc base.
> > > >
> > > > Signed-off-by: Ingo Molnar <[email protected]>
> > > >
> > > > 92fd4d4d67b945c0766416284d4ab236b31542c4 is the first bad commit
> > >
> > > Hm, i'd suggest to double check the two originator points:
> > >
> > > e53beac - is it really 'bad' ?
> > > fe7de49 - is it really 'good'?
> >
> > Nope. I did a bisection this morning in text mode with a pipe-test
> > based measurement proggy, and it bisected cleanly.
> >
> > 2069dd75c7d0f49355939e5586daf5a9ab216db7 is the first bad commit
> >
> > commit 2069dd75c7d0f49355939e5586daf5a9ab216db7
> > Author: Peter Zijlstra <[email protected]>
> > Date: Mon Nov 15 15:47:00 2010 -0800
> >
> > sched: Rewrite tg_shares_up)
>
> Ok. And has this fixed it:
>
> 822bc180a7f7: sched: Fix unregister_fair_sched_group()
>
> ... or are there two bugs?

Two bugs. 822bc180a7f7 fixes the explosions that were happening in tip.
The interactivity issue is some problem in the update_shares() stuff.

-Mike

2010-11-30 16:41:53

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Tue, 2010-11-30 at 15:13 +0100, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > On Mon, 2010-11-29 at 20:23 -0800, Paul Turner wrote:
> >
> > > I've left some machines running tip + fix above + autogroup to see if
> > > anything else emerges. Hasn't crashed yet, I'll leave it going
> > > overnight.
> >
> > Thanks. Below is the hopefully final version against tip. The last I
> > sent contained a couple remnants.
>
> Note, I removed this chunk:
>
> > kernel/sched_debug.c | 29 ++--
>
> > Index: linux-2.6/kernel/sched_debug.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched_debug.c
> > +++ linux-2.6/kernel/sched_debug.c
> > @@ -87,6 +87,20 @@ static void print_cfs_group_stats(struct
> > }
> > #endif
> >
> > +#if defined(CONFIG_CGROUP_SCHED) && \
> > + (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
> > +static void task_group_path(struct task_group *tg, char *buf, int buflen)
> > +{
> > + /* may be NULL if the underlying cgroup isn't fully-created yet */
> > + if (!tg->css.cgroup) {
> > + if (!autogroup_path(tg, buf, buflen))
> > + buf[0] = '\0';
> > + return;
> > + }
> > + cgroup_path(tg->css.cgroup, buf, buflen);
> > +}
> > +#endif
> > +
> > static void
> > print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
> > {
> > @@ -115,7 +129,7 @@ print_task(struct seq_file *m, struct rq
> > char path[64];
> >
> > rcu_read_lock();
> > - cgroup_path(task_group(p)->css.cgroup, path, sizeof(path));
> > + task_group_path(task_group(p), path, sizeof(path));
> > rcu_read_unlock();
> > SEQ_printf(m, " %s", path);
> > }
> > @@ -147,19 +161,6 @@ static void print_rq(struct seq_file *m,
> > read_unlock_irqrestore(&tasklist_lock, flags);
> > }
> >
> > -#if defined(CONFIG_CGROUP_SCHED) && \
> > - (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
> > -static void task_group_path(struct task_group *tg, char *buf, int buflen)
> > -{
> > - /* may be NULL if the underlying cgroup isn't fully-created yet */
> > - if (!tg->css.cgroup) {
> > - buf[0] = '\0';
> > - return;
> > - }
> > - cgroup_path(tg->css.cgroup, buf, buflen);
> > -}
> > -#endif
> > -
> > void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> > {
> > s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
>
> Because it didn't build (for obvious reasons - the CONFIG conditions dont match up),
> but more importantly it's quite ugly. Some existing 'path' variables are 64 byte,
> some are 128 byte - so there's pre-existing damage - i removed it all.

Won't removing that hunk bring back oops if you cat /proc/sched_debug?

cfs_rq[0]:/autogroup-88
.exec_clock : 0.228697
.MIN_vruntime : 0.000001
.min_vruntime : 0.819879
.max_vruntime : 0.000001
.spread : 0.000000
.spread0 : -22925903.100800

> Could we do this debugging code in a bit saner way please? (as a delta patch on top
> of the -tip that i'll push out in the next hour or so.)

Guess I'll try.

-Mike


2010-11-30 17:13:48

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Tue, 2010-11-30 at 10:17 -0500, Vivek Goyal wrote:

> Hi Mike,

Hi,

> I was wondering if these autogroups can be visible in regular cgroup
> hierarchy so that once somebody mounts cpu controller, these are visible?

No, autogroup is not auto-cgroup. You get zero whistles and zero bells
with autogroup. Only dirt simple automated task groups.

> I was wondering why it is a good idea to create a separate interface for
> autogroups through proc and not try to integrate it with cgroup interface.

I only put an interface there at all because it was requested, and made
it a dirt simple 'nice level' interface because there's nothing simpler
than 'nice'. The whole autogroup thing is intended for folks who don't
want to set up cgroups, shares yadayada, so tying it into the cgroups
interface seems kinda pointless.

> Without it now any user space tool shall have to either disable the
> autogroup feature completely or now also worry about /proc interface
> and there also autogroups are searchable through pid and there is no
> direct way to access these.

Maybe I should make it disable itself when you mount big brother.

> IIUC, these autogroups create flat setup and are at same level as
> init_task_group and are not children of it. Currently cpu cgroup
> is hierarchical by default and any new cgroup is child of init_task_group
> and that could lead to representation issues.

Well, it's flat, but autogroup does..
tg = sched_create_group(&init_task_group);

> Well, will we not get same kind of latency boost if we make these autogroups
> children of root? If yes, then hierarchical representation issue of autogroup
> will be a moot point.

No problem then.

> We already have /proc/<pid>/cgroup interface which points to task's
> cgroup. We probably can avoid creating /proc/<pid>/autogroup if there
> is an associated cgroup which appears in cgroup hierarchy and then user
> can change the weight of group through cgroup interface (instead of
> introducing another interface).

That's possible (for someone familiar with cgroups;), but I don't see
any reason for a wedding.

-Mike

2010-11-30 19:36:49

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Tue, Nov 30, 2010 at 06:13:41PM +0100, Mike Galbraith wrote:
> On Tue, 2010-11-30 at 10:17 -0500, Vivek Goyal wrote:
>
> > Hi Mike,
>
> Hi,
>
> > I was wondering if these autogroups can be visible in regular cgroup
> > hierarchy so that once somebody mounts cpu controller, these are visible?
>
> No, autogroup is not auto-cgroup. You get zero whistles and zero bells
> with autogroup. Only dirt simple automated task groups.
>
> > I was wondering why it is a good idea to create a separate interface for
> > autogroups through proc and not try to integrate it with cgroup interface.
>
> I only put an interface there at all because it was requested, and made
> it a dirt simple 'nice level' interface because there's nothing simpler
> than 'nice'. The whole autogroup thing is intended for folks who don't
> want to set up cgroups, shares yadayada, so tying it into the cgroups
> interface seems kinda pointless.
>
> > Without it now any user space tool shall have to either disable the
> > autogroup feature completely or now also worry about /proc interface
> > and there also autogroups are searchable through pid and there is no
> > direct way to access these.
>
> Maybe I should make it disable itself when you mount big brother.
>
> > IIUC, these autogroups create flat setup and are at same level as
> > init_task_group and are not children of it. Currently cpu cgroup
> > is hierarchical by default and any new cgroup is child of init_task_group
> > and that could lead to representation issues.
>
> Well, it's flat, but autogroup does..
> tg = sched_create_group(&init_task_group);
>
> > Well, will we not get same kind of latency boost if we make these autogroups
> > children of root? If yes, then hierarchical representation issue of autogroup
> > will be a moot point.
>
> No problem then.
>
> > We already have /proc/<pid>/cgroup interface which points to task's
> > cgroup. We probably can avoid creating /proc/<pid>/autogroup if there
> > is an associated cgroup which appears in cgroup hierarchy and then user
> > can change the weight of group through cgroup interface (instead of
> > introducing another interface).
>
> That's possible (for someone familiar with cgroups;), but I don't see
> any reason for a wedding.

Few things.

- This /proc/<pid>/autogroup is good for doing this single thing but when
I start thinking of possible extensions of it down the line, it creates
issues.

- Once we have some kind of upper limit support in cpu controller, these
autogroups are beyond control. If you want to impose some kind of
limits on them then you shall have to extend the parallel interface
/proc/<pid>/autogroup to also specify an upper limit (like nice levels).

- Similarly, if this autogroup notion is extended to other cgroup
controllers, then you shall have to again extend /proc/<pid>/autogroup
to be able to specify these additional parameters.

- If there is a monitoring tool which is monitoring the system for
resource usage by the groups, then I think these autogroups are beyond
reach and any stats exported by cgroup interface will not be available.
(though right now I can't see any stats being exported by cgroup files
in cpu controller but other controllers like block and memory do.).

- I am doing some testing with the patch and w.r.t. cgroup interface some
things don't seem right.

I have applied your patch and enabled CONFIG_AUTO_GROUP. Now I boot
into the kernel and open a new ssh connection to the machine.

# echo $$
3555
# cat /proc/3555/autogroup
/autogroup-63 nice 0

IIUC, task 3555 has been moved into an autogroup. Now I mount the cpu
controller and this task is visible in root cgroup.

# mount -t cgroup -o cpu none /cgroup/cpu
# cat /cgroup/cpu/tasks | grep 3555
3555

First of all this gives user a wrong impression that task 3555 is in
root cgroup.

Now I create a child group test1 and move the task there and also change
the weight/shares of the cgroup to 10240.

# mkdir test1
# echo 3555 > test1/tasks
# echo 10240 > test1/cpu.shares
# cat /proc/3555/cgroup
3:cpu:/test1
# cat /proc/3555/autogroup
/autogroup-63 nice 0

So again, user will think that task is in cgroup test1 and is being
controlled by the respective weight but that's not the case.

Even if we prevent autogroup task from being visible in cpu controller
root group, then comes the question what happens if cpu and some other
controller is comounted. Say cpuset. Now in that case will task be
visible in root group task file and can one operate on that. Now showing
up there does not make much sense as task should still be controllable
by other controllers and its policies.

So yes, creating a /proc/<pid>/autogroup is dirt cheap and makes life
easier in terms of implementation of this patch and it should work well.
But it is also a new user interface which does not sound too extensible and
does not seem to cooperate well with cgroup interface.

It also introduces this new notion of niceness for task groups which is sort
of equivalent to cpu.shares in cpu controller. First of all why should we
not stick to shares notion even for autogroup. Even if we introduce the notion
of niceness for groups, IMHO, it should be through cgroup interface instead of
group niceness for autogroup and shares/weights for cgroup despite the
fact that in the background they do similar things.

I think above concerns can possibly be reason enough to think about
the wedding.

Thanks
Vivek

2010-12-01 03:39:32

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On 11/28/10 06:24, Mike Galbraith wrote:
> On Thu, 2010-11-25 at 09:00 -0700, Mike Galbraith wrote:
>
>> My vacation is (sniff) over, so I won't get a fully tested patch out the
>> door for review until I get back home.
>
> Either I forgot to pack my eyeballs, or laptop is just too dinky and
> annoying. Now back home on beloved box, this little bugger poked me
> dead in the eye.
>
> Something else is seriously wrong though. 36.1 with attached (plus
> sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
> 37.git and tip with fixlet below both suck rocks. With a make -j40
> running, wakeup-latency is showing latencies of >100ms, amarok skips,
> mouse lurches badly.. generally horrid. Something went south.
>

I'm looking at this.

The share:share ratios looked good in static testing, but perhaps we
need a little more wake-up boost to improve interactivity.

Should have something tomorrow.

- Paul

> sched: fix 3d4b47b4 typo.
>
> Signed-off-by: Mike Galbraith<[email protected]>
> Cc: Peter Zijlstra<[email protected]>
> Cc: Ingo Molnar<[email protected]>
> LKML-Reference: new submission
> ---
> kernel/sched.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -8087,7 +8087,6 @@ static inline void unregister_fair_sched
> {
> struct rq *rq = cpu_rq(cpu);
> unsigned long flags;
> - int i;
>
> /*
> * Only empty task groups can be destroyed; so we can speculatively
> @@ -8097,7 +8096,7 @@ static inline void unregister_fair_sched
> return;
>
> raw_spin_lock_irqsave(&rq->lock, flags);
> - list_del_leaf_cfs_rq(tg->cfs_rq[i]);
> + list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
> raw_spin_unlock_irqrestore(&rq->lock, flags);
> }
> #else /* !CONFIG_FAIR_GROUP_SCHED */
>

2010-12-01 04:56:49

by Cong Wang

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Tue, Nov 30, 2010 at 02:36:22PM -0500, Vivek Goyal wrote:
>
>So again, user will think that task is in cgroup test1 and is being
>controlled by the respective weight but that's not the case.
>
>Even if we prevent autogroup task from being visible in cpu controller
>root group, then comes the question what happens if cpu and some other
>controller is comounted. Say cpuset. Now in that case will task be
>visible in root group task file and can one operate on that. Now showing
>up there does not make much sense as task should still be controllable
>by other controllers and its policies.
>
>So yes, creating a /proc/<pid>/autogroup is dirt cheap and makes life
>easier in terms of implementation of this patch and it should work well.
>But it is also a new user interface which does not sound too extensible and
>does not seem to cooperate well with cgroup interface.
>
>It also introduces this new notion of niceness for task groups which is sort
>of equivalent to cpu.shares in cpu controller. First of all why should we
>not stick to shares notion even for autogroup. Even if we introduce the notion
>of niceness for groups, IMHO, it should be through cgroup interface instead of
>group niceness for autogroup and shares/weights for cgroup despite the
>fact that in the background they do similar things.
>

Hmm, maybe we can make AUTO_GROUP depend on !CGROUPS?

It seems that autogroup only uses 'struct task_group', no other cgroup things,
so I think that is reasonable and doable.

2010-12-01 05:57:54

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Tue, 2010-11-30 at 14:36 -0500, Vivek Goyal wrote:

> Few things.
>
> - This /proc/<pid>/autogroup is good for doing this single thing but when
> I start thinking of possible extensions of it down the line, it creates
> issues.
>
> - Once we have some kind of upper limit support in cpu controller, these
> autogroups are beyond control. If you want to impose some kind of
> limits on them then you shall have to extend the parallel interface
> /proc/<pid>/autogroup to also specify an upper limit (like nice levels).
>
> - Similarly, if this autogroup notion is extended to other cgroup
> controllers, then you shall have to again extend /proc/<pid>/autogroup
> to be able to specify these additional parameters.

Yes, if it evolves, its interface will need to evolve as well. It
could have been a directory containing buttons, knobs and statistics,
but KISS won.

> - If there is a monitoring tool which is monitoring the system for
> resource usage by the groups, then I think these autogroups are beyond
> reach and any stats exported by cgroup interface will not be available.
> (though right now I can't see any stats being exported by cgroup files
> in cpu controller but other controllers like block and memory do.).

If you're manually assigning bandwidth et al from userland, there's not
much point to in-kernel automation, is there?

If I had married the two, the first thing that would have happened is
gripes about things appearing and disappearing in cgroups directories,
resulting in mayhem and confusion for scripts and tools.

> - I am doing some testing with the patch and w.r.t. cgroup interface some
> things don't seem right.
>
> I have applied your patch and enabled CONFIG_AUTO_GROUP. Now I boot
> into the kernel and open a new ssh connection to the machine.
>
> # echo $$
> 3555
> # cat /proc/3555/autogroup
> /autogroup-63 nice 0
>
> IIUC, task 3555 has been moved into an autogroup. Now I mount the cpu
> controller and this task is visible in root cgroup.
>
> # mount -t cgroup -o cpu none /cgroup/cpu
> # cat /cgroup/cpu/tasks | grep 3555
> 3555
>
> First of all this gives user a wrong impression that task 3555 is in
> root cgroup.

It is in the root cgroup, but not in the root task group; autogroup is
not auto-cgroup.

> Now I create a child group test1 and move the task there and also change
> the weight/shares of the cgroup to 10240.
>
> # mkdir test1
> # echo 3555 > test1/tasks
> # echo 10240 > test1/cpu.shares
> # cat /proc/3555/cgroup
> 3:cpu:/test1
> # cat /proc/3555/autogroup
> /autogroup-63 nice 0
>
> So again, user will think that task is in cgroup test1 and is being
> controlled by the respective weight but that's not the case.

It is the case here.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
7573 root 20 0 7996 340 256 R 50 0.0 3:35.86 3 pert
7572 root 20 0 7996 340 256 R 50 0.0 9:21.68 3 pert
...
marge:/cgroups/test # echo 7572 > tasks
marge:/cgroups/test # echo 4096 > cpu.shares

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
7572 root 20 0 7996 340 256 R 80 0.0 10:06.92 3 pert
7573 root 20 0 7996 340 256 R 20 0.0 4:05.80 3 pert

When you move a task into a cgroup, it still has an autogroup
association, as all tasks (processes actually) do, but it's not used.
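
(With the other task at the default 1024 cpu.shares, the 4096 above works
out to a 4:1 ratio, hence the 80%/20% split. The precedence is the
autogroup_task_group() selection from the patch: the autogroup's tg is only
substituted when the task would otherwise land in the root task group, so
an explicit cgroup placement always wins.)

static inline struct task_group *
autogroup_task_group(struct task_struct *p, struct task_group *tg)
{
	int enabled = ACCESS_ONCE(sysctl_sched_autogroup_enabled);

	if (enabled && task_wants_autogroup(p, tg))
		return p->signal->autogroup->tg;

	return tg;	/* task was placed in a cgroup: cgroup wins */
}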

> Even if we prevent autogroup task from being visible in cpu controller
> root group, then comes the question what happens if cpu and some other
> controller is comounted. Say cpuset. Now in that case will task be
> visible in root group task file and can one operate on that. Now showing
> up there does not make much sense as task should still be controllable
> by other controllers and its policies.

The user has to specifically ask for it in his config, can turn it on or
off on the fly or at boot..

> So yes, creating a /proc/<pid>/autogroup is dirt cheap and makes life
> easier in terms of implementation of this patch and it should work well.
> But it is also a new user interface which does not sound too extensible and
> does not seem to cooperate well with cgroup interface

..it has a different mission, with different users being targeted, so
why does it need to hold hands?

> It also introduces this new notion of niceness for task groups which is sort
> of equivalent to cpu.shares in cpu controller. First of all why should we
> not stick to shares notion even for autogroup. Even if we introduce the notion
> of niceness for groups, IMHO, it should be through cgroup interface instead of
> group niceness for autogroup and shares/weights for cgroup despite the
> fact that in the background they do similar things.

IMHO, cgroups should have been 'nice' from the start, but the folks who
wrote it did what they thought best. I like nice a lot better than
shares, so I used nice.

> I think above concerns can possibly be reason enough to think about about
> the wedding.

Perhaps in future, they'll get married, and perhaps they should, but in
the here and now, I think they have similar but not identical missions.
If you turn on one, turn off the other. Maybe that should be automated.
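
Untested sketch of what that automation might look like (hypothetical, not
part of the patch; cpu_cgroup_create() is the cpu controller's cgroup
creation callback in kernel/sched.c):

/* hypothetical: the first cpu cgroup created flips the existing sysctl,
 * on the theory that a mounted big brother wants manual control */
static inline void autogroup_cgroup_in_use(void)
{
	if (sysctl_sched_autogroup_enabled) {
		sysctl_sched_autogroup_enabled = 0;
		printk_once(KERN_INFO
			    "autogroup: disabled, cpu cgroup in use\n");
	}
}

/* ...called from cpu_cgroup_create() */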

Systemd thingy may make autogroup short lived anyway. I had a query
from an embedded guy (hm, which I spaced) suggesting autogroup may be
quite nice for handheld stuff though, so who knows.

-Mike

2010-12-01 06:09:16

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Wed, 2010-12-01 at 13:01 +0800, Américo Wang wrote:

> Hmm, maybe we can make AUTO_GROUP depend on !CGROUPS?
>
> It seems that autogroup only uses 'struct task_group', no other cgroup things,
> so I think that is reasonable and doable.

Build time exclusion is not as flexible. As is, the user can have one
kernel, and use whatever he likes.

-Mike

2010-12-01 06:16:18

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Tue, 2010-11-30 at 19:39 -0800, Paul Turner wrote:
> On 11/28/10 06:24, Mike Galbraith wrote:
> >
> > Something else is seriously wrong though. 36.1 with attached (plus
> > sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
> > 37.git and tip with fixlet below both suck rocks. With a make -j40
> > running, wakeup-latency is showing latencies of >100ms, amarok skips,
> > mouse lurches badly.. generally horrid. Something went south.
>
> I'm looking at this.
>
> The share:share ratios looked good in static testing, but perhaps we
> need a little more wake-up boost to improve interactivity.

Yeah, feels like a wakeup issue. I too did a (brief) static test, and
that looked ok.

-Mike

2010-12-01 11:33:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Wed, 2010-12-01 at 06:57 +0100, Mike Galbraith wrote:
> IMHO, cgroups should have been 'nice' from the start, but the folks who
> wrote it did what they thought best. I like nice a lot better than
> shares, so I used nice.

Agreed, but by the time I realized that the shares thing was already in
the wild. I did (probably still do) have a patch that adds a nice file
to the cgroup file.

Anyway, I think the whole proc/nice interface for autogroups is already
a tad too far. If people want control they can use cgroups, but I really
don't care enough to argue much about it.


2010-12-01 11:34:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Tue, 2010-11-30 at 19:39 -0800, Paul Turner wrote:
> On 11/28/10 06:24, Mike Galbraith wrote:
> > On Thu, 2010-11-25 at 09:00 -0700, Mike Galbraith wrote:
> >
> >> My vacation is (sniff) over, so I won't get a fully tested patch out the
> >> door for review until I get back home.
> >
> > Either I forgot to pack my eyeballs, or laptop is just too dinky and
> > annoying. Now back home on beloved box, this little bugger poked me
> > dead in the eye.
> >
> > Something else is seriously wrong though. 36.1 with attached (plus
> > sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
> > 37.git and tip with fixlet below both suck rocks. With a make -j40
> > > running, wakeup-latency is showing latencies of >100ms, amarok skips,
> > mouse lurches badly.. generally horrid. Something went south.
> >
>
> I'm looking at this.
>
> The share:share ratios looked good in static testing, but perhaps we
> need a little more wake-up boost to improve interactivity.
>
> Should have something tomorrow.

Right, the previous thing cheated quite enormously with wakeups simply
because it was way too expensive to compute proper shares on wakeups.

Maybe we should re-instate some of that cheating.

2010-12-01 11:36:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Wed, 2010-12-01 at 13:01 +0800, Américo Wang wrote:
>
> Hmm, maybe we can make AUTO_GROUP depend on !CGROUPS?
>
> It seems that autogroup only uses 'struct task_group', no other cgroup things,
> so I think that is reasonable and doable.

That's only going to create more #ifdefery in sched.c (and we already
got way too much of that), for little to no gain.

But yes, technically that could be done.

2010-12-01 11:55:50

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Wed, 2010-12-01 at 12:33 +0100, Peter Zijlstra wrote:

> Anyway, I think the whole proc/nice interface for autogroups is already
> a tad too far. If people want control they can use cgroups, but I really
> don't care enough to argue much about it.

Agreed. I've no intention of expanding it.

-Mike

2010-12-01 14:56:28

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Wed, Dec 01, 2010 at 06:57:48AM +0100, Mike Galbraith wrote:

[..]
>
> > - I am doing some testing with the patch and w.r.t. cgroup interface some
> > things don't seem right.
> >
> > I have applied your patch and enabled CONFIG_AUTO_GROUP. Now I boot
> > into the kernel and open a new ssh connection to the machine.
> >
> > # echo $$
> > 3555
> > # cat /proc/3555/autogroup
> > /autogroup-63 nice 0
> >
> > IIUC, task 3555 has been moved into an autogroup. Now I mount the cpu
> > controller and this task is visible in root cgroup.
> >
> > # mount -t cgroup -o cpu none /cgroup/cpu
> > # cat /cgroup/cpu/tasks | grep 3555
> > 3555
> >
> > First of all this gives user a wrong impression that task 3555 is in
> > root cgroup.
>
> It is in the root cgroup, but not in the root task group; autogroup is
> not auto-cgroup.
>
> > Now I create a child group test1 and move the task there and also change
> > the weight/shares of the cgroup to 10240.
> >
> > # mkdir test1
> > # echo 3555 > test1/tasks
> > # echo 10240 > test1/cpu.shares
> > # cat /proc/3555/cgroup
> > 3:cpu:/test1
> > # cat /proc/3555/autogroup
> > /autogroup-63 nice 0
> >
> > So again, user will think that task is in cgroup test1 and is being
> > controlled by the respective weight but that's not the case.
>
> It is the case here.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 7573 root 20 0 7996 340 256 R 50 0.0 3:35.86 3 pert
> 7572 root 20 0 7996 340 256 R 50 0.0 9:21.68 3 pert
> ...
> marge:/cgroups/test # echo 7572 > tasks
> marge:/cgroups/test # echo 4096 > cpu.shares
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 7572 root 20 0 7996 340 256 R 80 0.0 10:06.92 3 pert
> 7573 root 20 0 7996 340 256 R 20 0.0 4:05.80 3 pert
>
> When you move a task into a cgroup, it still has an autogroup
> association, as all tasks (processes actually) do, but it's not used.

Ok, so I got confused with the fact that after moving a task into a
cgroup it is still associated with an autogroup.

So IIUC, if a task is in root cgroup, then it would not necessarily be driven
by cpu.shares of root cgroup (as task could be in its own autogroup). But
if I move the task into a non-root cgroup, then it will for sure be
subjected to rules imposed by non-root cgroup cpu.shares. That's not too
bad.

Thanks
Vivek

2010-12-01 15:04:09

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Wed, 2010-12-01 at 09:55 -0500, Vivek Goyal wrote:

> So IIUC, if a task is in root cgroup, then it would not necessarily be driven
> by cpu.shares of root cgroup (as task could be in its own autogroup). But
> if I move the task into a non-root cgroup, then it will for sure be
> subjected to rules imposed by non-root cgroup cpu.shares. That's not too
> bad.

I think the normal case would be either one or the other being in use at
any given time, but yes, if both are active, that's how it'd work.

-Mike

2010-12-01 22:13:18

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Wed, 01 Dec 2010 13:01:29 +0800, Américo Wang said:

> Hmm, maybe we can make AUTO_GROUP depend on !CGROUPS?
>
> It seems that autogroup only uses 'struct task_group', no other cgroup things,
> so I think that is reasonable and doable.

A non-starter if you have a Fedora Rawhide box that has systemd installed, as
that won't even make it to single-user if you don't have CGROUPS in the kernel config.



2010-12-03 05:11:57

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On 11/30/10 22:16, Mike Galbraith wrote:
> On Tue, 2010-11-30 at 19:39 -0800, Paul Turner wrote:
>> On 11/28/10 06:24, Mike Galbraith wrote:
>>>
>>> Something else is seriously wrong though. 36.1 with attached (plus
>>> sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
>>> 37.git and tip with fixlet below both suck rocks. With a make -j40
>>> running, wakeup-latency is showing latencies of >100ms, amarok skips,
>>> mouse lurches badly.. generally horrid. Something went south.
>>
>> I'm looking at this.
>>
>> The share:share ratios looked good in static testing, but perhaps we
>> need a little more wake-up boost to improve interactivity.
>
> Yeah, feels like a wakeup issue. I too did a (brief) static test, and
> that looked ok.
>
> -Mike
>

Hey Mike,

Does something like the below help?

We're quick to drive the load_contribution up (to avoid over-commit).
However on sleepy workloads this results in lots of weight being
stranded (since it reaches maximum contribution instantaneously but
decays slowly) as the thread migrates between cpus.

Avoid this by averaging "up" in the wake-up direction as well as the sleep.

We also get a boost from the fact that we use the instantaneous weight
in computing the actual received shares.
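
A toy model of the difference (plain userspace C, illustrative only; the
1/8 averaging step is made up): under the old rule a task that runs one
tick and migrates away leaves its full weight stranded, decaying slowly,
while averaging up as well as down keeps the contribution proportional to
actual occupancy.

#include <stdio.h>

int main(void)
{
	double avg_jump = 0.0, avg_sym = 0.0, inst;
	int t;

	for (t = 0; t < 8; t++) {
		/* task runs one tick here, then migrates away */
		inst = (t == 0) ? 1024.0 : 0.0;

		/* old rule: jump straight up, decay slowly down */
		if (inst > avg_jump)
			avg_jump = inst;
		else
			avg_jump += (inst - avg_jump) / 8;

		/* proposed: same averaging in both directions */
		avg_sym += (inst - avg_sym) / 8;

		printf("t=%d jump=%6.1f sym=%6.1f\n", t, avg_jump, avg_sym);
	}
	return 0;
}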

I actually don't have a desktop setup handy to test "interactivity" (sad
but true -- working on grabbing one). But it looks better on under
synthetic load.

- Paul

===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -743,12 +743,19 @@ static void update_cfs_load(struct cfs_r
return;

now = rq_of(cfs_rq)->clock;
- delta = now - cfs_rq->load_stamp;
+
+ if (likely(cfs_rq->load_stamp))
+ delta = now - cfs_rq->load_stamp;
+ else {
+ /* avoid large initial delta and initialize load_period */
+ delta = 1;
+ cfs_rq->load_stamp = 1;
+ }

/* truncate load history at 4 idle periods */
if (cfs_rq->load_stamp > cfs_rq->load_last &&
now - cfs_rq->load_last > 4 * period) {
- cfs_rq->load_period = 0;
+ cfs_rq->load_period = period/2;
cfs_rq->load_avg = 0;
}

2010-12-03 06:48:34

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Thu, 2010-12-02 at 21:11 -0800, Paul Turner wrote:
> On 11/30/10 22:16, Mike Galbraith wrote:
> > On Tue, 2010-11-30 at 19:39 -0800, Paul Turner wrote:
> >> On 11/28/10 06:24, Mike Galbraith wrote:
> >>>
> >>> Something else is seriously wrong though. 36.1 with attached (plus
> >>> sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
> >>> 37.git and tip with fixlet below both suck rocks. With a make -j40
> >>> running, wakeup-latency is showing latencies of>100ms, amarok skips,
> >>> mouse lurches badly.. generally horrid. Something went south.
> >>
> >> I'm looking at this.
> >>
> >> The share:share ratios looked good in static testing, but perhaps we
> >> need a little more wake-up boost to improve interactivity.
> >
> > Yeah, feels like a wakeup issue. I too did a (brief) static test, and
> > that looked ok.
> >
> > -Mike
> >
>
> Hey Mike,
>
> Does something like the below help?

Unfortunately not. For example, Xorg+mplayer needs (30 sec refresh)..

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
6487 root 20 0 366m 30m 5100 S 31 0.4 2:04.83 2 Xorg
4454 root 20 0 318m 28m 15m S 29 0.4 0:38.06 3 mplayer

..but gets this when a heavy kbuild is running along with them.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
6487 root 20 0 366m 30m 5136 S 12 0.4 2:25.98 1 Xorg
5595 root 20 0 318m 28m 15m R 8 0.4 0:09.31 3 mplayer

There are 4 task groups active at this time, Xorg, mplayer, Amarok and
konsole where the kbuild is running make -j40.

-Mike

2010-12-03 08:37:54

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Thu, Dec 2, 2010 at 10:48 PM, Mike Galbraith <[email protected]> wrote:
> On Thu, 2010-12-02 at 21:11 -0800, Paul Turner wrote:
>> On 11/30/10 22:16, Mike Galbraith wrote:
>> > On Tue, 2010-11-30 at 19:39 -0800, Paul Turner wrote:
>> >> On 11/28/10 06:24, Mike Galbraith wrote:
>> >>>
>> >>> Something else is seriously wrong though. 36.1 with attached (plus
>> >>> sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
>> >>> 37.git and tip with fixlet below both suck rocks. With a make -j40
>> >>> running, wakeup-latency is showing latencies of >100ms, amarok skips,
>> >>> mouse lurches badly.. generally horrid. Something went south.
>> >>
>> >> I'm looking at this.
>> >>
>> >> The share:share ratios looked good in static testing, but perhaps we
>> >> need a little more wake-up boost to improve interactivity.
>> >
>> > Yeah, feels like a wakeup issue. I too did a (brief) static test, and
>> > that looked ok.
>> >
>> >    -Mike
>> >
>>
>> Hey Mike,
>>
>> Does something like the below help?
>
> Unfortunately not. For example, Xorg+mplayer needs (30 sec refresh)..
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 6487 root 20 0 366m 30m 5100 S 31 0.4 2:04.83 2 Xorg
> 4454 root 20 0 318m 28m 15m S 29 0.4 0:38.06 3 mplayer
>
> ..but gets this when a heavy kbuild is running along with them.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 6487 root 20 0 366m 30m 5136 S 12 0.4 2:25.98 1 Xorg
> 5595 root 20 0 318m 28m 15m R 8 0.4 0:09.31 3 mplayer
>
> There are 4 task groups active at this time, Xorg, mplayer, Amarok and
> konsole where the kbuild is running make -j40.
>

Hmm.. unfortunate. Ok -- based on the traces of synthetic loads and
the traces of their share on wake-up I think this is the right track
at least, will refine it tomorrow.

Thanks for trying it.

>        -Mike
>
>
>

2010-12-04 17:39:51

by Colin Walters

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sat, Nov 20, 2010 at 2:35 PM, Mike Galbraith <[email protected]> wrote:

> A recurring complaint from CFS users is that parallel kbuild has a negative
> impact on desktop interactivity.  This patch implements an idea from Linus,
> to automatically create task groups.  This patch only per session autogroups,
> but leaves the way open for enhancement.

Resurrecting this thread a bit, one question I didn't see discussed is simply:

Why doesn't "nice" work for this? On my Fedora 14 system, "ps alxf"
shows almost everything in my session is running at the default nice
0. The only exceptions are "/usr/libexec/tracker-miner-fs" at 19, and
pulseaudio at -11.

I don't know what would happen if, say, the scheduler effectively
group-scheduled each nice value. Then, what we tell people to do is
run "nice make". Which in fact has been documented as a thing to do
for decades. Actually I tend to use "ionice" too, which is also
useful if any of your desktop applications happen to make the mistake
of doing I/O in the mainloop (emacs fsync()ing in UI thread, I'm
looking at you).

Quickly testing kernel-2.6.35.6-48.fc14.x86_64 on a "Intel(R)
Core(TM)2 Quad CPU Q9400 @ 2.66GHz", the difference between "make
-j 128" and "nice make -j 128" is quite noticeable. As you'd expect.
The CFS docs already say:

"The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH
than the previous vanilla scheduler: both types of workloads are isolated much
more aggressively"

Does it just need to be even more aggressive, and people use "nice"?

2010-12-04 18:33:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sat, Dec 4, 2010 at 9:39 AM, Colin Walters <[email protected]> wrote:
>
> Why doesn't "nice" work for this? ?On my Fedora 14 system, "ps alxf"
> shows almost everything in my session is running at the default nice
> 0. The only exceptions are "/usr/libexec/tracker-miner-fs" at 19, and
> pulseaudio at -11.

"nice" doesn't work. It never has. Nobody ever uses it, and that has
always been true.

As you note, you can find occasional cases of it being used, but they
are either for things that are _so_ unimportant (and know they are)
and annoying cpu hogs that they wouldn't be allowed to live unless
they were niced down maximally (your tracker-miner example), or they
use nice not because they really want to, but because it is an
approximation for what they really do want (ie pulseaudio wants low
latencies, and is set up by the distro, so you'll find it niced up).

But the fundamental issue is that 'nice' is broken. It's very much
broken at a conceptual and technical design angle (absolute priority
levels, no fairness), but it's broken also from a psychological and
practical angle (ie expecting people to manually do extra work is
ridiculous and totally unrealistic).

> I don't know What would happen if say the scheduler effectively
> group-scheduled each nice value?

Why would you want to do that? If you are willing to do group
scheduling, do it on something sane and meaningful, and something that
doesn't need user interaction or decisions. And do it on something
that has more than 20 levels.

You could, for example, decide to do it per session.

> Then, what we tell people to do is
> run "nice make". Which in fact has been documented as a thing to do
> for decades.

Nobody but morons ever "documented" that. Sure, you can find people
saying it, but you won't be finding people actually _doing_ it. Look
around.

Seriously. Nobody _ever_ does "nice make", unless they are seriously
repressed beta-males (eg MIS people who get shouted at when they do
system maintenance unless they hide in dark corners and don't get
discovered). It just doesn't happen.

But more fundamentally, it's still the wrong thing to do. What nice
level should you use?

And btw, it's not just "make". One of the things that originally
caused me to want something like this is that you can enable some
pretty aggressive threading with "git diff". If you use the
"core.preloadindex" setting, git will fire up 20 threads just to do
"lstat()" system calls as quickly as it humanly can. Or "git grep"
will happily use lots of threads and really mess with your system,
except it limits the threads to a smallish number just to not be
asocial.

Do you want to do "nice git" too? Especially as the reason the
threaded lstat was implemented was that over NFS, you actually want
the threads not because you're using lots of CPU, but because you want
to fire up lots of concurrent network traffic - and you actually want
low latency. So you do NOT want to mark these threads as
"unimportant". They're not.

But what you do want is a basic and automatic fairness. When I do "git
grep", I want the full resources of the machine to do the grep for me,
so that I can get the answer in half a second (which is about the
limit at which point I start getting impatient). That's an _important_
job for me. It should get all the resources it can, there is
absolutely no excuse for nicing it down.

But at the same time, if I just happen to have sound or something
going on at the same time, I would definitely like some amount of
fairness. Just because git is smart and can use lots of threads to do
its work quickly, it shouldn't be _unfair_. It should hog the machine
- but only up to a point of some fairness.

That is something that "nice" can never give you. It's not what nice
was designed for, it's not how nice works. And if you ask people to
say "this work isn't important", you shouldn't expect them to actually
do it. If something isn't important, I certainly won't then spend
extra effort on it, for chrissake!

Now, I'm not saying that cgroups are necessarily the answer either.
But using sessions as input to group scheduling is certainly _one_
answer. And it's a hell of a better answer than 'nice' has ever been,
or will ever be.

Linus

2010-12-04 20:01:21

by Colin Walters

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sat, Dec 4, 2010 at 1:33 PM, Linus Torvalds
<[email protected]> wrote:
>
> But the fundamental issue is that 'nice' is broken. It's very much
> broken at a conceptual and technical design angle (absolute priority
> levels, no fairness), but it's broken also from a psychological and
> practical angle (ie expecting people to manually do extra work is
> ridiculous and totally unrealistic).

I don't see it as ridiculous - for the simple reason that it really
has existed for so long and is documented (see below).

> Why would you want to do that? If you are willing to do group
> scheduling, do it on something sane and meaningful, and something that
> doesn't need user interaction or decisions. And do it on something
> that has more than 20 levels.

In this case, the "user interaction" component is pretty damn small.
We're talking about 4 extra characters.

> Nobody but morons ever "documented" that. Sure, you can find people
> saying it, but you won't be finding people actually _doing_ it. Look
> around.

Look around...where? On what basis are you making that claim? I did
a quick web search for "unix background process", and this tutorial
(in the first page of Google search results) aimed at grad students
who use Unix at college definitely describes "nice make":
http://acs.ucsd.edu/info/jobctrl.shtml

There are some that don't, like:
http://linux.about.com/od/itl_guide/a/gdeitl35t01.htm and
http://www.albany.edu/its/quickstarts/qs-common_unix.html

But then again here's a Berkeley "Unix Tutorial" that does cover it:
http://people.ischool.berkeley.edu/~kevin/unix-tutorial/section13.html

So, does your random Linux-using college student or professional
developer know about "nice"? My guess is "likely". Do they use it
for "make"? No data. The issue is that you really only have a bad
experience on *large* projects. But if we just said to people who
come to us "Hey, when I compile webkit/linux/mozilla my system slows
down" we can tell them "use nice", especially since it's already
documented on the web, that seems to me like a pretty damn good
answer.

> Seriously. Nobody _ever_ does "nice make", unless they are seriously
> repressed beta-males (eg MIS people who get shouted at when they do
> system maintenance unless they hide in dark corners and don't get
> discovered). It just doesn't happen.

Heh. Well, I do at least (or rather, my personal automagic build
wrapper script does (it detects Makefile/autotools etc. and tries to
DTRT)).

> But more fundamentally, it's still the wrong thing to do. What nice
> level should you use?

Doesn't matter - if they all got group-scheduled together, then the
default of 10 (0+10) is totally fine.

> Do you want to do "nice git" too? Especially as the reason the
> threaded lstat was implemented was that over NFS, you actually want
> the threads not because you're using lots of CPU, but because you want
> to fire up lots of concurrent network traffic - and you actually want
> low latency. So you do NOT want to mark these threads as
> "unimportant". They're not.

Hmm...how many threads are we talking about here? If it's just say
one per core, then I doubt it needs nicing. The reason people nice
make is because the whole thing alternates between being CPU bound and
I/O bound, so you need to start more jobs than cores (sometimes a lot
more) to ensure maximal utilization.

> But what you do want is a basic and automatic fairness. When I do "git
> grep", I want the full resources of the machine to do the grep for me,
> so that I can get the answer in half a second (which is about the
> limit at which point I start getting impatient). That's an _important_
> job for me. It should get all the resources it can, there is
> absolutely no excuse for nicing it down.

Sure...though I imagine for "most" people that's totally I/O bound
(either on ext4 journal or hard disk seeks).

> Now, I'm not saying that cgroups are necessarily the answer either.
> But using sessions as input to group scheduling is certainly _one_
> answer. And it's a hell of a better answer than 'nice' has ever been,
> or will ever be.

Well, the text of Documentation/scheduler/sched-design-CFS.txt
certainly seems to be claiming it was a big improvement in this kind
of situation from the previous scheduler. If we're finding out there
are cases where it's not, it's definitely worth asking the question
why it's not working.

Speaking of the scheduler documentation - note that its sample shell
code contains exactly the problem showing what's wrong with
auto-grouping-by-tty, which is:

# firefox & # Launch firefox and move it to "browser" group

As soon as you do that from the same terminal that you're going to
launch the "make" from, you're back to total lossage. Are you going
to explain to a student that "oh, you need to create a new
gnome-terminal tab and launch firefox from that"?

2010-12-04 22:40:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sat, Dec 4, 2010 at 12:01 PM, Colin Walters <[email protected]> wrote:
>
> But then again here's a Berkeley "Unix Tutorial" that does cover it:
> http://people.ischool.berkeley.edu/~kevin/unix-tutorial/section13.html

What part of "nobody does that" didn't you understand?

I know about "nice". I think it was part of the original Unix course I
ever took, and it probably made more sense back then (sixteen people
at a time on a microvax, compiling stuff). And I never used it, afaik.
Nor does really anybody else.

But hey, whatever floats your boat. You can use it. And you can feel
special and better than the rest of us exactly because you know you
really _are_ special.

> Hmm...how many threads are we talking about here?  If it's just say
> one per core, then I doubt it needs nicing.

I think git defaults to a maximum of 20 for it. Remember: it's not
about "cores". It's about IO, and then 20 is a "let's not mess up
everybody else _too_ much when we're actually CPU-bound".

But that's not the point. The point is that "nice" is totally the
wrong thing to do. It's _always_ the wrong thing to do. The only
reason it's in tutorials and taught in intro Unix classes is that it's
the only thing there is in traditional unix.

And we can be better. We don't need to be stupid and traditional.

But you go right on and use it. Nobody stops you.

> Sure...though I imagine for "most" people that's totally I/O bound
> (either on ext4 journal or hard disk seeks).

Sure. And "most" people do something totally different. What's your
point? The fact is, the session-based group scheduling really does
work. It works on a lot of different loads. It's nice for things like
my use, but it's _also_ nice for things like me ssh'ing into my kids'
or wife's computers to update their kernel. And it's nice for things
like "make -j test" for git etc.

And it doesn't hurt you. If you're happy with "nice", go on and use
it. Why are you even discussing it? I'm telling you the FACT that
others aren't happy with nice, and that smart people consider nice to
be totally useless.

But none of that means that you can't go on using it. Comprende?

> # firefox &     # Launch firefox and move it to "browser" group
>
> As soon as you do that from the same terminal that you're going to
> launch the "make" from, you're back to total lossage.

"Mommy mommy, it hurts when I stick forks in my eyes!"

What's your point again? It's a heuristic. It works great for the
cases many normal people have. If you have a graphical desktop, most
sane people would tend to start the browser from that nice big browser
icon. But again, if you want to stick forks in your eyes, go right
ahead. It's not _my_ problem.

And similarly, it's not _your_ problem if other people want to do
saner things, is it?

Linus

2010-12-04 23:32:38

by David Lang

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sat, 4 Dec 2010, Colin Walters wrote:

> Speaking of the scheduler documentation - note that its sample shell
> code contains exactly the problem showing what's wrong with
> auto-grouping-by-tty, which is:
>
> # firefox & # Launch firefox and move it to "browser" group
>
> As soon as you do that from the same terminal that you're going to
> launch the "make" from, you're back to total lossage. Are you going
> to explain to a student that "oh, you need to create a new
> gnome-terminal tab and launch firefox from that"?

as someone who starts firefox from a terminal session all the time, I
always want to start it from its own dedicated session, if for no other
reason than that it spits out a TON of error messages over time, and I
don't want them popping up in a window where I'm doing something else.

so this is a very bad example.

David Lang

2010-12-04 23:43:47

by Colin Walters

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sat, Dec 4, 2010 at 5:39 PM, Linus Torvalds
<[email protected]> wrote:

> And it doesn't hurt you. If you're happy with "nice", go on and use
> it. Why are you even discussing it?

Because it seems to me like a bug if it isn't as good as group
scheduling? Most of your message is saying it's worthless, and I
don't disagree that it's not very good *right now*. I guess where we
disagree is whether it's worth fixing.

> What's your point again? It's a heuristic.

So if it's a heuristic the OS can get wrong, wouldn't it be a good
idea to support a way for programs and/or interactive users to
explicitly specify things? Unfortunately the cgroups utilities don't
make this easy (and of course there's the issue that no major released
OS exports write permission to the cpu cgroup for a desktop session
uid). I guess "nice" could be patched to, if the user has permission
to the cgroups, to auto-create a group. Or...nice could be fixed.
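
A minimal sketch of that patched-nice idea, assuming a cgroup v1 cpu
controller mounted at /sys/fs/cgroup/cpu and a user with write permission
there (both assumptions -- 2010-era systems often mounted cgroupfs
elsewhere; the wrapper name and group naming are invented for illustration):

/* Hypothetical "gnice" wrapper: create a cpu cgroup, move ourselves
 * into it, then exec the command. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    char path[256];
    FILE *f;

    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }

    /* one group per invocation, named after our pid */
    snprintf(path, sizeof(path), "/sys/fs/cgroup/cpu/gnice-%d", getpid());
    if (mkdir(path, 0755) && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }

    /* attach ourselves; the exec'd command inherits the group */
    strncat(path, "/tasks", sizeof(path) - strlen(path) - 1);
    f = fopen(path, "w");
    if (!f) {
        perror("fopen");
        return 1;
    }
    fprintf(f, "%d\n", getpid());
    fclose(f);

    execvp(argv[1], &argv[1]);
    perror("execvp");
    return 1;
}

Invoked as, say, "gnice make -j64", everything make spawns would stay in
the one group and contend as a single entity against other top-level tasks.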

On a more productive note, I see now
Documentation/scheduler/sched-nice-design.txt has a lot of really
useful history regarding "nice" and the complaints over time (I guess
this is where some of your assertions that it's failed/worthless come
from).

2010-12-04 23:55:59

by James Dutton

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On 3 December 2010 05:11, Paul Turner <[email protected]> wrote:
>
> I actually don't have a desktop setup handy to test "interactivity" (sad but
> true -- working on grabbing one).  But it looks better under synthetic
> load.
>

What tools are actually used to test "interactivity" ?
I posted a tool to the list some time ago, but I don't think anyone noticed.
My tool is very simple.
When you hold a key down, it should repeat. It should repeat at a
constant predictable interval.
So, my tool just waits for key presses and times when each one occurred.
The tester simply presses a key and holds it down.
If the time between each key press is constant, it indicates good
"interactivity". If the time between each key press varies a lot, it
indicates bad "interactivity".
You can reliably test if one kernel is better than the next using
actual measurable figures.
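
The tool itself isn't linked in the thread, but the idea is simple enough
that a rough sketch can stand in for it (a reconstruction of the idea, not
James's actual code): put the terminal in non-canonical mode and print the
interval between successive autorepeat events.

/* Hold a key down and watch the printed intervals; 'q' quits.
 * Constant intervals suggest good interactivity, jitter suggests bad. */
#include <stdio.h>
#include <termios.h>
#include <time.h>
#include <unistd.h>

static double now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

int main(void)
{
    struct termios old, raw;
    double prev = 0.0;
    char c;

    tcgetattr(STDIN_FILENO, &old);
    raw = old;
    raw.c_lflag &= ~(ICANON | ECHO);    /* byte-at-a-time, no echo */
    tcsetattr(STDIN_FILENO, TCSANOW, &raw);

    while (read(STDIN_FILENO, &c, 1) == 1 && c != 'q') {
        double t = now_ms();

        if (prev > 0.0)
            printf("%8.2f ms\n", t - prev);
        prev = t;
    }

    tcsetattr(STDIN_FILENO, TCSANOW, &old);
    return 0;
}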

Kind Regards

James

2010-12-05 00:32:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sat, Dec 4, 2010 at 3:43 PM, Colin Walters <[email protected]> wrote:
> On Sat, Dec 4, 2010 at 5:39 PM, Linus Torvalds
> <[email protected]> wrote:
>
>> And it doesn't hurt you. If you're happy with "nice", go on and use
>> it. Why are you even discussing it?
>
> Because it seems to me like a bug if it isn't as good as group
> scheduling?  Most of your message is saying it's worthless, and I
> don't disagree that it's not very good *right now*.  I guess where we
> disagree is whether it's worth fixing.

It's not worth 'fixing", because it works exactly like it's designed -
and supposed - to work.

There really isn't anything to fix. 'nice' is what it is. It's a
simple legacy interface to scheduler priority. The fact that it's also
almost totally useless is irrelevant. It's like male nipples. We
wouldn't be better off lactating, and they look like some odd wart
that doesn't do much good. But it would be worse to remove it.

'nice' is a bad idea. It's a bad idea that has perfectly
understandable historical reasons for it, but it's an _unfixably_ bad
idea.

Linus

2010-12-05 05:12:06

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sat, Dec 4, 2010 at 3:55 PM, James Courtier-Dutton
<[email protected]> wrote:
> On 3 December 2010 05:11, Paul Turner <[email protected]> wrote:
>>
>> I actually don't have a desktop setup handy to test "interactivity" (sad but
> true -- working on grabbing one).  But it looks better under synthetic
>> load.
>>
>
> What tools are actually used to test "interactivity" ?
> I posted a tool to the list some time ago, but I don't think anyone noticed.
> My tool is very simple.
> When you hold a key down, it should repeat. It should repeat at a
> constant predictable interval.
> So, my tool just waits for key presses and times when each one occurred.
> The tester simply presses a key and holds it down.
> If the time between each key press is constant, it indicates good
> "interactivity". If the time between each key press varies a lot, it
> indicates bad "interactivity".
> You can reliably test if one kernel is better than the next using
> actual measurable figures.
>
> Kind Regards
>
> James
>

Could you drop me a pointer? I can certainly give it a try. It would
be extra useful if it included any histogram functionality.

I've been using a combination of various synthetic wakeup and load
scripts and measuring the received bandwidth / wakeup latency.

They have not succeeded in reproducing the starvation or poor latency
observed by Mike above however. (Although I've pulled a box to try
reproducing his exact conditions [ e.g. user environment ] on Monday).

2010-12-05 07:47:33

by Ray Lee

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sat, Dec 4, 2010 at 3:43 PM, Colin Walters <[email protected]> wrote:
> So if it's a heuristic the OS can get wrong, wouldn't it be a good
> idea to support a way for programs and/or interactive users to
> explicitly specify things?

Consider a multi-user machine. `nice` is an orthogonal concern in that
case. Therefore, fixing nice doesn't address all issues. Also: Most
linux systems are multi-user (root and the physical tty user.)
Further, even a single user wears multiple hats on a single system.
The idea is to infer those hats, and deal with them fairly.

No one is taking nice away from you. Keep using it if you like.

If you want to allow users to explicitly specify group scheduling,
then good news: we already have that feature. You just seem to not be
using it. Much like the other 99.993% of us.

The kernel is supposed to have *sane defaults*. That's what is under
discussion here.

~r.

2010-12-05 10:19:17

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

Greets.

I applaud your efforts to continue addressing interactivity and responsiveness
but, I know I'm going to regret this, I feel strongly enough to speak up about
this change.

On Sun, 5 Dec 2010 10:43:44 Colin Walters wrote:
> On Sat, Dec 4, 2010 at 5:39 PM, Linus Torvalds
> <[email protected]> wrote:
> > What's your point again? It's a heuristic.
>
> So if it's a heuristic the OS can get wrong,

This is precisely what I see as the flaw in this approach. The whole reason
you have CFS now is that we had a scheduler which was pretty good for all the
other things in the O(1) scheduler, but needed heuristics to get interactivity
right. I put them there. Then I spent the next few years trying to find a way
to get rid of them. The reason is precisely what Colin says above. Heuristics
get it wrong sometimes. So no matter how smart you think your heuristics are,
it is impossible to get it right 100% of the time. If the heuristics make it
better 99% of the time, and introduce disastrous corner cases, regressions and
exploits 1% of the time, that's unforgivable. That's precisely what we had
with the old O(1) scheduler and that's what you got rid of when you put CFS
into mainline. The whole reason CFS was better was it was mostly fair and
concentrated on ensuring decent latency rather than trying to guess what would
be right, so it was predictable and reliable.

So if you introduce heuristics once again into the scheduler to try and
improve the desktop by unfairly distributing CPU, you will go back to where
you once were. Mostly better but sometimes really badly wrong. No matter how
smart you think you can be with heuristics they cannot be right all the time.
And there are regressions with these tty followed by per session group
patches. Search forums where desktop users go and you'll see that people are
afraid to speak up on lkml but some users are having mplayer and amarok
skipping under light load when trying them. You want to program more
intelligence in to work around these regressions, you'll just get yourself
deeper and deeper into the same quagmire. The 'quick fix' you seek now is not
something you should be defending so vehemently. The "I have a solution now"
just doesn't make sense in this light. I for one do not welcome our new
heuristic overlords.

If you're serious about really improving the desktop from within the kernel,
as you seem to be with this latest change, then make a change that's
predictable and gets it right ALL the time and is robust for the future. Stop
working within all the old fashioned concepts and allow userspace to tell the
kernel what it wants, and give the user the power to choose. If you think this
is too hard and not doable, or that the user is too uninformed or unwilling
to modify things themselves, then allow me to propose a relatively simple change
that can expedite this.

There are two aspects to getting good desktop behaviour, enough CPU and low
latency. 'nice' by your own admission is too crude and doesn't really describe
how either of these should really be modified. Furthermore there are 40 levels
of it and only about 4 or 5 are ever used. We also know that users don't even
bother using it.

What I propose is a new syscall latnice for "latency nice". It need only have
4 levels: 1 for default, 0 for latency insensitive, 2 for relatively latency
sensitive GUI apps, and 3 for exquisitely latency sensitive uses such as
audio. These should not require extra privileges to use and thus should also
not be usable for "exploiting" extra CPU by default. It's simply a matter of
working with lower latencies yet shorter quota (or timeslices), which would
mean throughput on these apps is sacrificed due to cache thrashing, but then
that's not what latency sensitive applications need. These can then be
encouraged to be included within the applications themselves, making this a
more long term change. 'Firefox' could set itself 2, 'Amarok' and 'mplayer' 3,
and 'make' - bless its soul - 0, and so on. Keeping the range simple and
defined will make it easy for userspace developers to cope with, and users to
fiddle with.
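
To make the shape of the proposal concrete, the userspace side might look
like the purely hypothetical sketch below. Nothing in it exists: there is
no __NR_latnice in any kernel, and the constants and wrapper are invented
only to illustrate the four levels described above (the placeholder syscall
number just lets it compile; the call would return ENOSYS).

#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_latnice
#define __NR_latnice -1             /* placeholder, does not exist */
#endif

enum {
    LATNICE_INSENSITIVE = 0,        /* make, indexers, batch work */
    LATNICE_DEFAULT     = 1,
    LATNICE_GUI         = 2,        /* relatively latency sensitive */
    LATNICE_AUDIO       = 3,        /* exquisitely latency sensitive */
};

static inline long latnice(pid_t pid, int level)
{
    return syscall(__NR_latnice, pid, level);   /* hypothetical */
}

/* Applications would then tag themselves, e.g.:
 *   mplayer:  latnice(0, LATNICE_AUDIO);
 *   firefox:  latnice(0, LATNICE_GUI);
 *   make:     latnice(0, LATNICE_INSENSITIVE);
 */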

But that would only be the first step. The second step is to take the plunge
and accept that we DO want selective unfairness on the desktop, but where WE
want it, not where the kernel thinks we might want it. It's not an exploit if
my full screen HD video continues to consume 80% of the CPU while make is
running - on a desktop. Take a leaf out of other desktop OSs and allow the
user to choose say levels 0, 1, or 2 for desktop interactivity with a simple
/proc/sys/kernel/interactive tunable, a bit like the "optimise for foreground
applications" seen elsewhere. This could then be used to decide whether to use
the scheduling hints from latnice to either just ensure low latency but keep
the same CPU usage - 0, or actually give progressively more CPU for latniced
tasks as the interactive tunable is increased. Then distros can set this on
installation and make it part of the many funky GUIs to choose between the
different levels. This then takes the user out of the picture almost entirely,
yet gives them the power to change it if they so desire.

The actual scheduler changes required to implement this are absurdly simple
and doable now, and will not cost in overhead the way cgroups do. It also
should cause no regressions when interactive mode is disabled and would have
no effect till changes are made elsewhere, or the users use the latnice
utility.

Move away from the fragile heuristic tweaks and find a longer term robust
solution.

Regards,
Con

--
-ck

P.S. I'm very happy for someone else to do it. Alternatively you could include
BFS and I'd code it up for that in my spare time.

2010-12-05 11:11:06

by Nikos Chantziaras

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On 12/04/2010 10:01 PM, Colin Walters wrote:
>[...]
> Speaking of the scheduler documentation - note that its sample shell
> code contains exactly the problem showing what's wrong with
> auto-grouping-by-tty, which is:
>
> # firefox& # Launch firefox and move it to "browser" group
>
> As soon as you do that from the same terminal that you're going to
> launch the "make" from, you're back to total lossage. Are you going
> to explain to a student that "oh, you need to create a new
> gnome-terminal tab and launch firefox from that"?

Btw, most people don't do that anymore. They don't use terminals. They
click the application icons on their desktops and start menus or double
click the executables in their file managers. So it's not a matter of
opening a second terminal tab, because the first one isn't even open.

To have a fluid desktop, one shouldn't be required to hack with
terminal commands.

2010-12-05 11:36:40

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sun, 2010-12-05 at 21:18 +1100, Con Kolivas wrote:
> Greets.
>
> I applaud your efforts to continue addressing interactivity and responsiveness
> but, I know I'm going to regret this, I feel strongly enough to speak up about
> this change.
>
> On Sun, 5 Dec 2010 10:43:44 Colin Walters wrote:
> > On Sat, Dec 4, 2010 at 5:39 PM, Linus Torvalds
> > <[email protected]> wrote:
> > > What's your point again? It's a heuristic.
> >
> > So if it's a heuristic the OS can get wrong,
>
> This is precisely what I see as the flaw in this approach. The whole reason
> you have CFS now is that we had a scheduler which was pretty good for all the
> other things in the O(1) scheduler, but needed heuristics to get interactivity
> right. I put them there.

Actually, Linus laid the foundation with sleeper fairness, Ingo expanded
it to requeue "interactive" tasks in the active array, and you tweaked
the result.

> Then I spent the next few years trying to find a way
> to get rid of them. The reason is precisely what Colin says above. Heuristics
> get it wrong sometimes. So no matter how smart you think your heuristics are,
> it is impossible to get it right 100% of the time. If the heuristics make it
> better 99% of the time, and introduce disastrous corner cases, regressions and
> exploits 1% of the time, that's unforgivable. That's precisely what we had
> with the old O(1) scheduler and that's what you got rid of when you put CFS
> into mainline. The whole reason CFS was better was it was mostly fair and
> concentrated on ensuring decent latency rather than trying to guess what would
> be right, so it was predictable and reliable.

And it still is, Con. I didn't rewrite the thing, I just added an
automated task grouping. Session to session fairness is just as holy as
any sacred cow definition of fair you care to trot out.

> So if you introduce heuristics once again into the scheduler to try and
> improve the desktop by unfairly distributing CPU, you will go back to where
> you once were. Mostly better but sometimes really badly wrong. No matter how
> smart you think you can be with heuristics they cannot be right all the time.
> And there are regressions with these tty followed by per session group
> patches. Search forums where desktop users go and you'll see that people are
> afraid to speak up on lkml but some users are having mplayer and amarok
> skipping under light load when trying them.

Shrug. I can't debug what isn't reported.

> You want to program more
> intelligence in to work around these regressions, you'll just get yourself
> deeper and deeper into the same quagmire. The 'quick fix' you seek now is not
> something you should be defending so vehemently. The "I have a solution now"
> just doesn't make sense in this light. I for one do not welcome our new
> heuristic overlords.

I for one don't welcome childish name calling.

> If you're serious about really improving the desktop from within the kernel,
> as you seem to be with this latest change, then make a change that's
> predictable and gets it right ALL the time and is robust for the future. Stop
> working within all the old fashioned concepts and allow userspace to tell the
> kernel what it wants, and give the user the power to choose. If you think this
> is too hard and not doable, or that the user is too uninformed or want to
> modify things themselves, then allow me to propose a relatively simple change
> that can expedite this.
>
> There are two aspects to getting good desktop behaviour, enough CPU and low
> latency. 'nice' by your own admission is too crude and doesn't really describe
> how either of these should really be modified. Furthermore there are 40 levels
> of it and only about 4 or 5 are ever used. We also know that users don't even
> bother using it.
>
> What I propose is a new syscall latnice for "latency nice". It need only have
> 4 levels: 1 for default, 0 for latency insensitive, 2 for relatively latency
> sensitive GUI apps, and 3 for exquisitely latency sensitive uses such as
> audio. These should not require extra privileges to use and thus should also
> not be usable for "exploiting" extra CPU by default. It's simply a matter of
> working with lower latencies yet shorter quota (or timeslices), which would
> mean throughput on these apps is sacrificed due to cache thrashing, but then
> that's not what latency sensitive applications need. These can then be
> encouraged to be included within the applications themselves, making this a
> more long term change. 'Firefox' could set itself 2, 'Amarok' and 'mplayer' 3,
> and 'make' - bless its soul - 0, and so on. Keeping the range simple and
> defined will make it easy for userspace developers to cope with, and users to
> fiddle with.

An automated per session task group is an evil heuristic, so we should
use kinda sorta sensitive, really REALLY sensitive, don't give a damn,
or no frickin' clue... to make 100% accurate non-heuristic scheduling
decisions instead? Did I get that right?

Goodbye.

-Mike

2010-12-05 15:12:39

by Alan

[permalink] [raw]
Subject: Re: [PATCH v4] Regression: sched: automated per session task groups

> > As soon as you do that from the same terminal that you're going to
> > launch the "make" from, you're back to total lossage. Are you going
> > to explain to a student that "oh, you need to create a new
> > gnome-terminal tab and launch firefox from that"?
>
> Btw, most people don't do that anymore. They don't use terminals. They

It's a regression for those who do - and often have good reason to do
so. This is of course why you don't put policy in the kernel and the
original patch was bogus anyway.

> To have a fluid desktop, one shouldn't be required to hack with
> terminal commands.

Which is the classic mentality that ruins the big bloated GNOME Linux
desktop "It works my way and every other way is wrong so go screw". Of
course the other half of the problem is exactly that - firefox was once a
small browser, it's now a bloated monster too, so the scheduler is quite
sensible to pick on it.

Alan

2010-12-05 16:17:13

by Florian Mickler

[permalink] [raw]
Subject: Re: [PATCH v4] Regression: sched: automated per session task groups

On Sun, 5 Dec 2010 15:12:20 +0000
Alan Cox <[email protected]> wrote:

> > To have a fluid desktop, one shouldn't be required to hack with
> > terminal commands.
>
> Which is the classic mentality that ruins the big bloated GNOME Linux
> desktop "It works my way and every other way is wrong so go screw". Of
> course the other half of the problem is exactly that - firefox was once a
> small browser, it's now a bloated monster too, so the scheduler is quite
> sensible to pick on it.
>
> Alan

Your rant about big bloated GNOME is... well, just a rant. You will
never be able to change it. You can just hope that over time the
evolutionary aspects of open source development will fix it.

There is nothing wrong with trying to provide ease of use. Graphical
interfaces that are well designed are easier to use.
Most command-line people just can't cope with the unstable nature of
interfaces in the graphical world.

CLI's are mostly better from an ergonomic view (old people,
heavy working hackers and other power users) because they provide
a stable focus point.

But this comes at some cost because the human mind is (originally) not
tuned for text processing and remembering abstract things like 'words'.
Its unique ability to adapt itself to this is... extraordinary.
Most hackers probably don't realize this, but images/icons and other
graphical interfaces are more similar to real life and are thus easier
to use for 'unadapted' people.

"Master minds" that can remember numbers with [really-big-number] of
digits often use a trick to do this: They associate every digit-pair
with an everyday item. When they learn the number, they construct a
story using those items. This story is what they then later use to
restore the original number. All that is because humans can remember
real life things better than digits or words.

Regards,
Flo

2010-12-05 16:59:55

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] Regression: sched: automated per session task groups

On Sun, 2010-12-05 at 15:12 +0000, Alan Cox wrote:
> > > As soon as you do that from the same terminal that you're going to
> > > launch the "make" from, you're back to total lossage. Are you going
> > > to explain to a student that "oh, you need to create a new
> > > gnome-terminal tab and launch firefox from that"?
> >
> > Btw, most people don't do that anymore. They don't use terminals. They
>
> It's a regression for those who do - and often have good reason to do
> so. This is of course why you don't put policy in the kernel and the
> original patch was bogus anyway.

What is a very clear regression is a threaded app (say firefox) vs a
single threaded app, particularly on UP. The per thread scheduling
model wins hands down there, because the scheduler very heavily favors
the threaded application. Take that unfairness away, and you have an
undeniable regression. Yes, it's not black and white, never is.

-Mike

2010-12-05 17:09:45

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] Regression: sched: automated per session task groups

On Sun, 2010-12-05 at 17:59 +0100, Mike Galbraith wrote:
> On Sun, 2010-12-05 at 15:12 +0000, Alan Cox wrote:
> > > > As soon as you do that from the same terminal that you're going to
> > > > launch the "make" from, you're back to total lossage. Are you going
> > > > to explain to a student that "oh, you need to create a new
> > > > gnome-terminal tab and launch firefox from that"?
> > >
> > > Btw, most people don't do that anymore. They don't use terminals. They
> >
> > It's a regression for those who do - and often have good reason to do
> > so. This is of course why you don't put policy in the kernel and the
> > original patch was bogus anyway.
>
> What is a very clear regression is a threaded app (say firefox) vs a
> single threaded app, particularly on UP. The per thread scheduling
> model wins hands down there, because the scheduler very heavily favors
> the threaded application. Take that unfairness away, and you have an
> undeniable regression. Yes, it's not black and white, never is.

P.S. You also have an obvious _progression_ from the perspective of the
single threaded application, which may just as well be interactive.

2010-12-05 17:15:50

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v4] Regression: sched: automated per session task groups

On Sun, 2010-12-05 at 18:09 +0100, Mike Galbraith wrote:
> On Sun, 2010-12-05 at 17:59 +0100, Mike Galbraith wrote:
> > On Sun, 2010-12-05 at 15:12 +0000, Alan Cox wrote:
> > > > > As soon as you do that from the same terminal that you're going to
> > > > > launch the "make" from, you're back to total lossage. Are you going
> > > > > to explain to a student that "oh, you need to create a new
> > > > > gnome-terminal tab and launch firefox from that"?
> > > >
> > > > Btw, most people don't do that anymore. They don't use terminals. They
> > >
> > > It's a regression for those who do - and often have good reason to do
> > > so. This is of course why you don't put policy in the kernel and the
> > > original patch was bogus anyway.
> >
> > What is a very clear regression is a threaded app (say firefox) vs a
> > single threaded app, particularly on UP. The per thread scheduling
> > model wins hands down there, because the scheduler very heavily favors
> > the threaded application. Take that unfairness away, and you have an
> > undeniable regression. Yes, it's not black and white, never is.
>
> P.S. You also have an obvious _progression_ from the perspective of the
> single threaded application, which may just as well be interactive.

P.P.S :)

systemd will have the same regressions/progressions. It doesn't matter
one whit whether it's kernel/userland making policy. If distro-X
includes either one, or neither, they are guaranteed to be wrong :)

2010-12-05 19:22:19

by Colin Walters

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sun, Dec 5, 2010 at 2:47 AM, Ray Lee <[email protected]> wrote:
> On Sat, Dec 4, 2010 at 3:43 PM, Colin Walters <[email protected]> wrote:
>> So if it's a heuristic the OS can get wrong, wouldn't it be a good
>> idea to support a way for programs and/or interactive users to
>> explicitly specify things?
>
> Consider a multi-user machine. `nice` is an orthogonal concern in that
> case. Therefore, fixing nice doesn't address all issues.

For the purposes of this discussion again, let's say "fixing nice"
means say "group schedule each nice level above 0". There are
obviously many possibilities here, but let's consider this one
precisely.

How, exactly, under what scenario in a "multi-user machine" does this
break? How exactly is it orthogonal?

Two people logged in would get their "make" jobs group scheduled
together. What is the problem?

Since Linus appears to be more interested in talking about nipples
than explaining exactly what it would break, and you appear to agree
with him, hopefully you'll be able to explain...

2010-12-05 19:48:59

by Alan

[permalink] [raw]
Subject: Re: [PATCH v4] Regression: sched: automated per session task groups

> Your rant about big bloated GNOME is... well, just a rant. You will
> never be able to change it. You can just hope that over time the
> evolutionary aspects of open source development will fix it.

They already are - it's dying slowly but surely.

> There is nothing wrong with trying to provide ease of use. Graphical
> interfaces that are well designed are easier to use.

Interesting how you associate ease of use with being bloated and oversize.

> Most command-line people just can't cope with the unstable nature of
> interfaces in the graphical world.

Ah of course. How sweet, your response to a point about the arrogance of
certain desktop attitudes is to lecture, and make bogus pronouncements
about command-line people. You might want to put the shovel away instead.

> CLI's are mostly better from an ergonomic view (old people,
> heavy working hackers and other power users) because they provide
> a stable focus point.

And a bit of pop psychology to go with it. I assume you are trying to
talk about internalised and externalised models ?

> Most hackers probably don't realize this, but images/icons and other

No, of course not, we are all dim, thank you for using small words. I am
actually familiar with the real models here, btw, and there are a couple of
rather important basics you are missing:

- Different people have different strengths in different areas - these
don't specifically line up with the senses. There aren't vast amounts of
evidence to support computing people are all strong in a particular
area either. You'll see in studies that some of them prefer to
diagram and flowchart, some write text, some have kinesthetic models
(movement and flow). I don't doubt there are others who sense it in
different ways.

- Visual and textual data communicate different things more effectively
(as do sounds, smells, movements, ....) and are processed with
different natural preferences by different people

- Oh and there is no evidence I've ever seen to suggest old people are
more text oriented.

So any notion of CLI's or GUI's being better for [class of people] is
generally naïve. It depends what is being done, who is doing it and the
situation.

If you really want to understand the trade-offs in a graphical world
watch some good CAD operators for an hour. They'll use the same tools in
very different ways - some very command line, some heavily
mouse/trackball, others graphical but with hotkeys, and they'll often
shift approach according to task.

> graphical interfaces are more similar to real life and are thus easier
> to use for 'unadapted' people.

If you wish to see an unadapted person you probably want to go look at
Amazonian tribes. Bit different.

But then the GUI world is the world that put "logout" under "start" 8)
and thinks that "Insert OLE Object" is a good thing to put on a menu.

Alan

2010-12-05 20:48:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sun, Dec 5, 2010 at 11:22 AM, Colin Walters <[email protected]> wrote:
>
> For the purposes of this discussion again, let's say "fixing nice"
> means say "group schedule each nice level above 0". ?There are
> obviously many possibilities here, but let's consider this one
> precisely.

THAT IS NOT HOW 'nice' WORKS!

For chrissake, how hard is it to understand?

The semantics of "nice" are not - and have never been - to put things
into process scheduling groups of their own.

When somebody says "nice xyzzy", they are explicitly stating that
"xyzzy" isn't as important as other processes. It's done for stuff
that you don't care about, and more specifically, for stuff that you
really don't want to impact anything else. So if there are other
things to be run, 'nice' means that those should get more CPU time.

(Obviously, negative nice levels work the other way around).

This is very much documented. People rely on it. Look at the man-page.
It talks about "most favorable" vs "least favorable" scheduling.

> Two people logged in would get their "make" jobs group scheduled
> together.  What is the problem?

The problem is that you don't know what the hell you are talking about.

Different nice levels shouldn't get group scheduled together - they
should be scheduled *less*. And it's not about "make", since nobody
really ever uses nice on make anyway, it's about things like
pulseaudio (that wants higher priorities) and random background
filesystem indexers etc (that want lower priorities).

Nice levels are _not_ about group scheduling. They're about
priorities. And since the cgroup code doesn't even support priority
levels for the groups, it's a really *horrible* match.

And the thing is, the nice semantics are traditional. They are also
*horrible*, but that doesn't allow you to change their semantics.
People rely on those crazy traditional and mostly useless semantics.
Not very much (because they are mostly useless), but there really are
people who use it.

And they use it knowing that positive nice levels means that something
is less important.

In contrast, giving processes a scheduling group doesn't imply "less
important". Not AT ALL. It doesn't really mean "more important"
either, it just means "somewhat insulated from other groups".

So let's say that you have a filesystem indexer, and you nice it up to
make sure that it doesn't steal CPU bandwidth from your "real work".
Now, let's say that you start a "make -j16" to build something
important.

Do you *really* think that the person who niced the filesystem indexer
down wants the indexer to get 50% of the CPU, just because it's
scheduled separately from the parallel make?

HELL NO!

So stop this idiocy. "nice" has absolutely nothing to do with group
scheduling. It cannot. It must not. It's a legacy interface, and it
has real semantics.

> Since Linus appears to be more interested in talking about nipples
> than explaining exactly what it would break, but you appear to agree
> with him, hopefully you'll be able to explain...

The reason I was talking about male nipples should be clear by now.
Think "legacy interface". Think "don't mess with it, because people
are used to it".

They may be useless, but dammit, they do what they do.

Don't try to turn male nipples into something they aren't. And don't
try to turn 'nice' into something it isn't.

Linus

2010-12-05 20:59:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups


* Con Kolivas <[email protected]> wrote:

> Greets.
>
> I applaud your efforts to continue addressing interactivity and responsiveness
> but, I know I'm going to regret this, I feel strongly enough to speak up about
> this change.
>
> On Sun, 5 Dec 2010 10:43:44 Colin Walters wrote:
> > On Sat, Dec 4, 2010 at 5:39 PM, Linus Torvalds
> > <[email protected]> wrote:
> > > What's your point again? It's a heuristic.
> >
> > So if it's a heuristic the OS can get wrong,
>
> This is precisely what I see as the flaw in this approach. [...]

I think you are misunderstanding Mike's auto-group scheduling feature.

The scheduling itself is not 'heuristics'.

It is the _composition of a group_ that has a heuristic default. (We use the 'tty'
to act as the grouping.)

But that can be changed: the cgroup interfaces can be (and are) used by Gnome to
create different groups. They can be used by users as well, using cgroup tooling.

What the kernel does is that it provides sane defaults.

> [...]
>
> Move away from the fragile heuristic tweaks and find a longer term robust
> solution.

This is not some kernel heuristic that cannot be modified - which was the main
problem of the O(1) scheduler. This is a common-sense default that can be overridden
by user-space if it wants to.

So I definitely think you are confusing the two cases.

Thanks,

Ingo

2010-12-05 22:47:52

by Colin Walters

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sun, Dec 5, 2010 at 3:47 PM, Linus Torvalds
<[email protected]> wrote:
>
> The semantics of "nice" are not - and have never been - to put things
> into process scheduling groups of their own.

Again, I obviously understand that - the point is to explore the space
of changes here and consider what would (and wouldn't) break. And
actually, what would improve.

> This is very much documented. People rely on it.

Well, we established my Fedora 14 system doesn't. You said "no one"
uses "nice" interactively. So...that leaves - who? If you were
saying to me something like "I know Yahoo has some code in their data
centers which uses a range of nice values; if we made this change, all
of a sudden they'd get more CPU contention..." Or like, "I'm pretty
sure Maemo uses very low nice values for some UI code". But you so
far haven't really done that, it's just been (mostly)
assertions/handwaving. Now you obviously have a lot more experience
that gives those assertions and handwaving a lot of credibility - but
all we need is one concrete example to shut me up =)

Playing around with Google code search a bit, hits for "nice" were
almost all duplicates of various C library headers/implementations.
"setpriority" was a bit more interesting, it appears Chromium has some
code to bump up the nice value by 5 for "background" processes:

http://google.com/codesearch/p?hl=en#OAMlx_jo-ck/src/base/process_linux.cc&q=setpriority&exact_package=chromium&l=21

But all my Chrome related processes here are 0, so who knows what
that's used for. There are also hits for chromium's copy of embedded
cygwin+perl...terrifying. I assume (hope, desperately) that
Cygwin+Perl is just used for building...

Another hit here in some random X screensaver code:
http://google.com/codesearch/p?hl=en#tJJawb1IJ20/driver/exec.c&q=setpriority%20file:.*.c&l=218

But I can't find a place where it's setting a non-zero value for that.

So...ah, here's one in Android's "development" git:
http://google.com/codesearch/p?hl=en#CRBM04-7BoA/simulator/wrapsim/Init.c&q=setpriority%20file:.*.c&l=91

Except it appears to be unused =/

Oh! Here we go, one in the Android UI code:
http://google.com/codesearch/p?hl=en#uX1GffpyOZk/libs/rs/rsContext.cpp&q=setpriority%20file:.*.c&sa=N&cd=29&ct=rc
Pasting this one so people don't have to follow the link:

void * Context::threadProc(void *vrsc)
{
    ...
    setpriority(PRIO_PROCESS, rsc->mNativeThreadId, ANDROID_PRIORITY_DISPLAY);
}

Where ANDROID_PRIORITY_DISPLAY = -4. Actually the whole enum is interesting:
http://google.com/codesearch/p?hl=en#uX1GffpyOZk/include/utils/threads.h&q=ANDROID_PRIORITY_DISPLAY&l=39

One interesting bit here is that they renice UI that the user is
presently interacting with:

/* threads currently running a UI that the user is interacting with */
ANDROID_PRIORITY_FOREGROUND = -2,

(Something "we" (and by "we" I mean GNOME) don't do, I believe Windows
does though). Though, honestly I could whip up a
gnome-settings-daemon plugin to do this in about 10 minutes. Maybe
after dinner.

So...we've established that important released operating systems do
use negative nice values (not surprising). I can't offhand find any
uses of e.g. ANDROID_PRIORITY_BACKGROUND (i.e. a positive nice value)
in the "base" sources though.

> Different nice levels shouldn't get group scheduled together - they
> should be scheduled *less*.

But it seems obvious (right?) that putting them in one group *will*
ensure they get scheduled less, since that one group has to contend
with all other processes.

> And it's not about "make", since nobody
> really ever uses nice on make anyway, it's about things like
> pulseaudio (that wants higher priorities)

Note that pulse is actually using the RT scheduling class, so (I
think) its actual nice value is irrelevant.

Again using F14, the only things using negative nice besides pulse are
udev and auditd.
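
For reference, entering the RT class looks roughly like the sketch below.
Pulseaudio itself goes through its rtkit machinery rather than making this
call directly, and the direct call needs CAP_SYS_NICE or a suitable
RLIMIT_RTPRIO; this is a simplified illustration, not what pulse ships.

/* Move the calling process into SCHED_RR.  Once in an RT class,
 * SCHED_OTHER nice values no longer affect it at all: RT tasks
 * preempt every nice level. */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 5 };

    if (sched_setscheduler(0, SCHED_RR, &sp)) {
        perror("sched_setscheduler");   /* needs privilege */
        return 1;
    }
    return 0;
}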

> Not very much (because they are mostly useless), but there really are
> people who use it.

Still trying to extract specific examples of "people who use it" from you...

> Do you *really* think that the person who niced the filesystem indexer
> down wants the indexer to get 50% of the CPU, just because it's
> scheduled separately from the parallel make?

Finally, an example! I can work with this. So let's assume I'm using
some JavaScript-intensive website in Firefox in GNOME, and
tracker-miner-fs kicks in after noticing I just saved a Word document
I want to look at later. And an otherwise idle system. You're
suggesting that, now tracker-miner-fs would be using a lot more CPU if
it was in an empty group than it would have before?

That does seem likely to be true. But would it be a *problem*? I
don't know, it's not obvious to me offhand. Especially on any
hardware that's dual-core, where SpiderMonkey can be burning one core
(since that's all it will use, modulo Web Workers), and tracker on
another.

Anyways, I don't have the kernel-fu to make a patch myself here,
especially since the scheduler is probably one of the hardest parts of
the OS. So ultimately I guess, if you just totally disagree, fine.
But I wasn't satisfied with the response - my engineering intuition is
to work through problems and try to really understand what would be
wrong. It's hard to accept "just trust me, that's stupid".

2010-12-05 23:05:33

by Jesper Juhl

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sun, 5 Dec 2010, Colin Walters wrote:

> On Sun, Dec 5, 2010 at 3:47 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > The semantics of "nice" are not - and have never been - to put things
> > into process scheduling groups of their own.
>
> Again, I obviously understand that - the point is to explore the space
> of changes here and consider what would (and wouldn't) break. And
> actually, what would improve.
>
> > This is very much documented. People rely on it.
>
> Well, we established my Fedora 14 system doesn't. You said "no one"
> uses "nice" interactively. So...that leaves - who? If you were
> saying to me something like "I know Yahoo has some code in their data
> centers which uses a range of nice values; if we made this change, all
> of a sudden they'd get more CPU contention..." Or like, "I'm pretty
> sure Maemo uses very low nice values for some UI code". But you so
> far haven't really done that, it's just been (mostly)
> assertions/handwaving. Now you obviously have a lot more experience
> that gives those assertions and handwaving a lot of credibility - but
> all we need is one concrete example to shut me up =)
>
[...]

I'll give you two real-world examples from two (closed source, but
still) apps we develop at my current employer.

The first one is a server/network monitoring app where there are lots of
child processes devoted to performing checks, storing data, displaying
results etc. Most of these processes just run at the default nice level.
One of the processes sometimes has a need for a cryptographic key pair and
it can generate this when it needs it, but it's better if one is readily
available, so we have a separate child process running that maintains a
small pool of new key pairs - this process runs at a high nice level since
it should not take CPU time away from the rest of the processes (it's not
important, it's just a small optimization), the need for key pairs comes
at large intervals, so the pool will almost never be depleted even if
this process doesn't get very much CPU time for a long time and besides,
if the pool ever gets depleted it's no disaster since the consumer will
then just generate a key pair when needed and burn the required CPU.

The second is a backup application where one child process is in charge of
doing background disk scanning, compression and encryption. This process
is not interactive, it must result in minimal interference with whatever
the user is currently using the machine for as his primary task and
time-to-completion is not really that important. So, this process runs at
a rather high nice level to avoid stealing CPU from the user's primary
task(s).
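
Both examples boil down to the same small pattern; a minimal sketch of it,
with the real work elided:

/* Fork a helper that does non-urgent work (key-pair pool, backup
 * scanning) at nice 19, so it only soaks up CPU the important
 * processes aren't using.  At nice 19 its CFS weight is 15 against
 * 1024 for a nice-0 task. */
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

static void background_worker(void)
{
    if (setpriority(PRIO_PROCESS, 0, 19))
        perror("setpriority");

    for (;;) {
        /* ... generate key pairs / scan and compress ... */
        sleep(60);
    }
}

int main(void)
{
    if (fork() == 0)
        background_worker();

    /* parent carries on with the latency-sensitive work at nice 0 */
    return 0;
}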


--
Jesper Juhl <[email protected]> http://www.chaosbits.net/
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.

2010-12-05 23:12:16

by Jesper Juhl

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sun, 5 Dec 2010, Jesper Juhl wrote:

> On Sun, 5 Dec 2010, Colin Walters wrote:
>
> > On Sun, Dec 5, 2010 at 3:47 PM, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > The semantics of "nice" are not - and have never been - to put things
> > > into process scheduling groups of their own.
> >
> > Again, I obviously understand that - the point is to explore the space
> > of changes here and consider what would (and wouldn't) break. And
> > actually, what would improve.
> >
> > > This is very much documented. People rely on it.
> >
> > Well, we established my Fedora 14 system doesn't. You said "no one"
> > uses "nice" interactively. So...that leaves - who? If you were
> > saying to me something like "I know Yahoo has some code in their data
> > centers which uses a range of nice values; if we made this change, all
> > of a sudden they'd get more CPU contention..." Or like, "I'm pretty
> > sure Maemo uses very low nice values for some UI code". But you so
> > far haven't really done that, it's just been (mostly)
> > assertions/handwaving. Now you obviously have a lot more experience
> > that gives those assertions and handwaving a lot of credibility - but
> > all we need is one concrete example to shut me up =)
> >
> [...]
>
> I'll give you two real-world examples from two (closed source, but
> still) apps we develop at my current employer.
>
> The first one is a server/network monitoring app where there are lots of
> child processes devoted to performing checks, storing data, displaying
> results etc. Most of these processes just run at the default nice level.
> One of the processes sometimes has a need for a cryptographic key pair and
> it can generate this when it needs it, but it's better if one is readily
> available, so we have a separate child process running that maintains a
> small pool of new key pairs - this process runs at a high nice level since
> it should not take CPU time away from the rest of the processes (it's not
> important, it's just a small optimization), the need for key pairs comes
> at large intervals, so the pool will almost never be depleted even if
> this process doesn't get very much CPU time for a long time and besides,
> if the pool ever gets depleted it's no disaster since the consumer will
> then just generate a key pair when needed and burn the required CPU.
>
> The second is a backup application where one child process is in charge of
> doing background disk scanning, compression and encryption. This process
> is not interactive, it must result in minimal interference with whatever
> the user is currently using the machine for as his primary task and
> time-to-completion is not really that important. So, this process runs at
> a rather high nice level to avoid stealing CPU from the user's primary
> task(s).
>

Ohh and a third example. On my home laptop I got sufficiently annoyed with
'updatedb' starting up from cron while I was in the middle of something
so that cron job now runs updatedb with 'nice 19' and also uses ionice so
it runs at the 'best effort' class and with priority 7 (lowest).
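
Expressed directly in C rather than via the nice and ionice wrappers, that
cron entry amounts to something like this sketch. ioprio_set has no glibc
wrapper, so it goes through syscall(); the constants mirror those in
linux/ioprio.h.

/* Demote ourselves to CPU nice 19 and I/O best-effort priority 7,
 * then run updatedb -- roughly what
 * "nice -n 19 ionice -c2 -n7 updatedb" does from the shell. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_BE     2       /* best-effort class */
#define IOPRIO_CLASS_SHIFT  13
#define IOPRIO_WHO_PROCESS  1
#define IOPRIO_PRIO_VALUE(cl, data) (((cl) << IOPRIO_CLASS_SHIFT) | (data))

int main(void)
{
    if (setpriority(PRIO_PROCESS, 0, 19))
        perror("setpriority");

    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 7)))
        perror("ioprio_set");

    execlp("updatedb", "updatedb", (char *)NULL);
    perror("execlp");
    return 1;
}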


--
Jesper Juhl <[email protected]> http://www.chaosbits.net/
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.

2010-12-06 00:29:13

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sat, 04 Dec 2010 15:01:16 EST, Colin Walters said:

> Look around...where? On what basis are you making that claim? I did
> a quick web search for "unix background process", and this tutorial
> (in the first page of Google search results) aimed at grad students
> who use Unix at college definitely describes "nice make":
> http://acs.ucsd.edu/info/jobctrl.shtml

The fact that something is documented doesn't mean the documentation actually
is correct.

There exists a Linux guide written by somebody (who has enough of a rep that
you can safely say "should have known better") who didn't understand the
difference between traditional Unix and Linux, nor what the original concept
was, and it documented the proper way to take a system down quickly as:

# sync;sync;sync;halt

Of course, the *original* was:

# sync
# sync
# sync
# halt

And the whole point of 3 syncs was that the typing time of the second and third
syncs chewed up the time till the first sync finished. Of course, sync;sync
doesn't start the first sync and then make you type. And it overlooked that
the Linux sync is a lot more synchronous than the AT&T Unix sync, which returned
as soon as the I/O was scheduled, not completed.



2010-12-06 16:04:40

by Florian Mickler

[permalink] [raw]
Subject: Re: [PATCH v4] Regression: sched: automated per session task groups

On Sun, 5 Dec 2010 19:48:11 +0000
Alan Cox <[email protected]> wrote:

> > Your rant about big bloated GNOME is... well, just a rant. You will
> > never be able to change it. You can just hope that over time the
> > evolutionary aspects of open source development will fix it.
>
> They already are - it's dying slowly but surely.
>
But maybe they will start slimming it down and all will be well...


> > There is nothing wrong with trying to provide ease of use. Graphical
> > interfaces that are well designed are easier to use.
>
> Interesting how you associate ease of use with being bloated and oversize.

Didn't want to imply that. But I think I maybe assumed wrongly that
your usage of 'bloat' and 'oversize' was an overstatement and you were
really ranting about the complexity of graphical user interfaces...

>
> > Most command-line people just can't cope with the unstable nature of
> > interfaces in the graphical world.
>
> Ah of course. How sweet, your response to a point about the arrogance of
> certain desktop attitudes is to lecture, and make bogus pronouncements
> about command-line people. You might want to put the shovel away instead.

You seem to be angry. What's the problem? I find your reaction strange.

Isn't that a valid point?

I really have problems with graphical programs and tools where you have
to click on some 'X' icon on the left side of the window in Ubuntu and
in the top right corner on fluxbox. And on the third desktop setup ratpoison
is running with some super special key combos and all you can do is
ctrl-backspace, except it's disabled.

With CLI programs I don't have those problems. I just ctrl-c most of
the time, if even necessary.

>
> > CLI's are mostly better from an ergonomic view (old people,
> > heavy working hackers and other power users) because they provide
> > a stable focus point.
>
> And a bit of pop psychology to go with it. I assume you are trying to
> talk about internalised and externalised models ?

*head scratch*

>
> > Most hackers probably don't realize this, but images/icons and other
>
> No of course not we are all dim, thank you for using small words. I am
> actually familiar with the real models here btw and there are a couple of
> rather important basics you are missing

I find your tone inappropriate. I was not trying to insult you or
anything.

>
> - Different people have different strengths in different areas - these
> don't specifically line up with the senses. There isn't vast amounts of
> evidence to support computing people are all strong in a particular
> area either. You'll see in studies that some of them prefer to
> diagram and flowchart, some write text, some have kinesthetic models
> (movement and flow). I don't doubt there are others who sense it in
> different ways.

Why do you think I implied something else? I was talking about
humans in general. Language is an abstract concept.
It's not something we intuitively know when we are born; we have to
learn it.

I was getting at the fact that the brain adapts towards its usage.
People doing lots of sport have less trouble learning a new kind of
sport.
People that do hack a lot on computers have less trouble learning to
use a new tool.
A physics professor has less trouble understanding a new theory
about the beginning of the world.

People working a lot with programming languages and on the commandline
are more used to it. Other people find interfaces that resemble real
world items easier to use, because they aren't used to inputting
commands in text form.


> - Visual and textual data communicate different things more effectively
> (as do sounds, smells, movements, ....) and are processed with
> different natural preferences by different people
>
> - Oh and there is no evidence I've ever seen to suggest old people are
> more text oriented.
>
> So any notion of CLI's or GUI's being better for [class of people] is
> generally naïve. It depends what is being done, who is doing it and the
> situation.

Indeed. I didn't try to classify people at all. I'm kind of sad you
would assume that.

>
> If you really want to understand the trade-offs in a graphical world
> watch some good CAD operators for an hour. They'll use the same tools in
> very different ways - some very command line, some heavily
> mouse/trackball, others graphical but with hotkeys, and they'll often
> shift approach according to task.
>
> > graphical interfaces are more similar to real life and are thus easier
> > to use for 'unadapted' people.
>
> If you wish to see an unadapted person you probably want to go look at
> amazonian tribes. Bit different.

I guess that is what I mean.

>
> But then the GUI world is the world that put "logout" under "start" 8)
> and thinks that "Insert OLE Object" is a good thing to put on a menu.

:-)

>
> Alan

Regards,
Flo

2010-12-07 11:33:32

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

Desktop hardware came in today and I can now reproduce the issues
Mike's been seeing; tuning in progress.

On Sat, Dec 4, 2010 at 9:11 PM, Paul Turner <[email protected]> wrote:
> On Sat, Dec 4, 2010 at 3:55 PM, James Courtier-Dutton
> <[email protected]> wrote:
>> On 3 December 2010 05:11, Paul Turner <[email protected]> wrote:
>>>
>>> I actually don't have a desktop setup handy to test "interactivity" (sad but
>> true -- working on grabbing one).  But it looks better under synthetic
>>> load.
>>>
>>
>> What tools are actually used to test "interactivity" ?
>> I posted a tool to the list some time ago, but I don't think anyone noticed.
>> My tool is very simple.
>> When you hold a key down, it should repeat. It should repeat at a
>> constant predictable interval.
>> So, my tool just waits for key presses and times when each one occurred.
>> The tester simply presses a key and holds it down.
>> If the time between each key press is constant, it indicates good
>> "interactivity". If the time between each key press varies a lot, it
>> indicates bad "interactivity".
>> You can reliably test if one kernel is better than the next using
>> actual measurable figures.
>>
>> Kind Regards
>>
>> James
>>
>
> Could you drop me a pointer? I can certainly give it a try. It would
> be extra useful if it included any histogram functionality.
>
> I've been using a combination of various synthetic wakeup and load
> scripts and measuring the received bandwidth / wakeup latency.
>
> They have not succeeded in reproducing the starvation or poor latency
> observed by Mike above however. (Although I've pulled a box to try
> reproducing his exact conditions [ e.g. user environment ] on Monday).
>
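A minimal sketch of such a key-repeat probe, in the spirit of what James
describes above (not his actual tool; the evdev device path is an
assumption, and reading it needs appropriate permissions):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/input.h>

int main(int argc, char **argv)
{
        const char *dev = argc > 1 ? argv[1] : "/dev/input/event0";
        struct input_event ev;
        double prev = 0.0;
        int fd = open(dev, O_RDONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        while (read(fd, &ev, sizeof(ev)) == sizeof(ev)) {
                if (ev.type != EV_KEY || ev.value != 2) /* 2 == autorepeat */
                        continue;
                double now = ev.time.tv_sec + ev.time.tv_usec / 1e6;
                if (prev)
                        printf("repeat interval: %.3f ms\n", (now - prev) * 1e3);
                prev = now;
        }
        return 0;
}

Hold a key and watch the intervals: a steady value suggests good
interactivity, large variance bad. Caveat: where the repeat is generated
(kernel input layer vs. X server) determines how much scheduler jitter
these timestamps can actually capture.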

2010-12-07 18:52:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

On Sun, 2010-12-05 at 12:47 -0800, Linus Torvalds wrote:
> Nice levels are _not_ about group scheduling. They're about
> priorities. And since the cgroup code doesn't even support priority
> levels for the groups, it's a really *horrible* match.

It does, in fact: nice maps to a weight; we then schedule so that each
entity (be it task or group) gets a proportional amount of time relative
to the other entities (of the same parent).

The scheduler basically solves the following differential equation:
dt_i = w_i * dt / \Sum_j w_j


For tasks we map nice to weight like:

static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

For groups we expose the weight directly in cgroupfs://cpu.shares with a
default equivalent to nice-0 (1024).

So 'nice make -j9' will run make and all its children with weight=110;
if this task hierarchy has ~9 runnable tasks it will get about as much
time as a single competing nice-0 task.

[ 9*110 = 990, 1*1024 = 1024, which gives: 49% vs 51% ]


Now group scheduling is in fact closely related to nice, the only thing
group scheduling does is:

w_i = \unit * \Prod_j { w_i,j / \Sum_k w_k,j }, where:

j \elem i and its parents
k \elem entities of group j (where a task is a trivial group)

That is, we compute a task's effective weight (w_i) by scaling \unit by
the relative weight of the task and of each of its ancestors.

Suppose a grouped make -j9 against 1 competing task (all nice-0 or
equivalent), and make's 9 active children [a..i] in the group G:


        R
       / \
      t   G
         / \
        a...i

So w_t = 1024, w_G = 1024 and w_[a..i] = 1024.

Now, per the above the effective weight (weight as in the root group) of
each grouped task is:

w_[a..i] = 1024 * 1024/2048 * 1024/9216 ~= 56
w_t = 1024 * 1024/2048 = 512

[ \Sum w_[a..i] = 512, vs 512 gives: 50% vs 50% ]

So effectively: nice make -j9, and stuffing the make -j9 in a group are
roughly equivalent.

The only difference between groups and nice is the interface: with nice
you set the weight directly; with groups you set it implicitly,
depending on the runnable task state.
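To make the arithmetic concrete, a small userspace sketch (an
illustration only) that reproduces both results above from the quoted
prio_to_weight[] values:

#include <stdio.h>

#define W_NICE_10   110         /* prio_to_weight[30] */
#define W_NICE_0   1024         /* prio_to_weight[20] */

int main(void)
{
        /* 'nice make -j9' vs one nice-0 task, flat hierarchy */
        double make = 9 * W_NICE_10, other = W_NICE_0;
        printf("nice make -j9: %.0f%% vs %.0f%%\n",
               100 * make / (make + other), 100 * other / (make + other));

        /* make -j9 in group G (weight 1024) vs nice-0 task t */
        double w_t = W_NICE_0, w_G = W_NICE_0;
        double w_child = W_NICE_0               /* \unit                  */
                       * (w_G / (w_t + w_G))    /* G's share at the root  */
                       * (1024.0 / 9216.0);     /* one child's share in G */
        printf("grouped child ~= %.1f, group total = %.0f, t = %.0f\n",
               w_child, 9 * w_child, w_t * w_t / (w_t + w_G));
        return 0;
}

This prints 49% vs 51% for the nice case, and ~56.9 per grouped child
(9 * 56.9 ~= 512) against 512 for t -- the 50%/50% split above.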

2010-12-15 12:10:43

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH v4] sched: automated per session task groups

This goose is now cooked.

It turns out the new shares code is doing the "right" thing, but in
the wrong order for tasks with very small time slices. This made it
rather gnarly to track down since at all times the new evaluation /
the old evaluation / and an "OPT" evaluation were all in agreement!

As we hierarchically dequeue we now instantaneously adjust entity
weights to account for the new global state (good). However, when we
then update on the parent (e.g. the group entity owning the
just-adjusted cfs_rq), the accrued unaccounted time is charged at the
new weight for that entity instead of the old.

For longer running processes, the periodic updates hide this.
However, for an interactive process, such as Xorg (which uses many
_small_ timeslices -- e.g. almost all accounting ends up being at
dequeue as opposed to periodic) this results in significant vruntime
over-charging and a loss of fairness. In Xorg's case the loss of
fairness is compounded by the fact that there is only one runnable
thread, which means we transition between NICE_0_LOAD and MIN_SHARES
for the over-charging above.

This is fixed by charging the unaccounted time against the group entity
before we manipulate its weight (as a result of child movement).
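To see the size of the distortion, recall that CFS converts accrued
runtime into vruntime scaled by NICE_0_LOAD/weight. A standalone sketch
of that arithmetic (constants are illustrative, not the kernel source):

#include <stdio.h>

#define NICE_0_LOAD     1024UL
#define MIN_SHARES      2UL

/* vruntime advance for delta_exec ns charged at weight w */
static unsigned long calc_delta(unsigned long delta_exec, unsigned long w)
{
        return delta_exec * NICE_0_LOAD / w;
}

int main(void)
{
        unsigned long delta_exec = 100000;      /* 100us unaccounted */

        /* charged at the old weight: the fair outcome */
        printf("at NICE_0_LOAD: %lu ns\n", calc_delta(delta_exec, NICE_0_LOAD));
        /* charged after the weight already dropped: a 512x over-charge */
        printf("at MIN_SHARES:  %lu ns\n", calc_delta(delta_exec, MIN_SHARES));
        return 0;
}

Charging the accrued time at MIN_SHARES instead of the weight it was
earned at inflates the entity's vruntime by a factor of 512 -- exactly
the kind of over-charge that penalizes a small-timeslice task like Xorg.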

Thanks for your patience while I tracked this down.. it's been a few
sleepless nights while I cranked through a number of dead-end theories
(rather frustrating when the numbers are all right but the results are
all wrong! ;). Cleaned up patch inbound in the morning.

- Paul

On Tue, Dec 7, 2010 at 3:32 AM, Paul Turner <[email protected]> wrote:
> Desktop hardware came in today and I can now reproduce the issues
> Mike's been seeing; tuning in progress.
>
> On Sat, Dec 4, 2010 at 9:11 PM, Paul Turner <[email protected]> wrote:
>> On Sat, Dec 4, 2010 at 3:55 PM, James Courtier-Dutton
>> <[email protected]> wrote:
>>> On 3 December 2010 05:11, Paul Turner <[email protected]> wrote:
>>>>
>>>> I actually don't have a desktop setup handy to test "interactivity" (sad but
>>>> true -- working on grabbing one). But it looks better under synthetic
>>>> load.
>>>>
>>>
>>> What tools are actually used to test "interactivity" ?
>>> I posted a tool to the list some time ago, but I don't think anyone noticed.
>>> My tool is very simple.
>>> When you hold a key down, it should repeat. It should repeat at a
>>> constant predictable interval.
>>> So, my tool just waits for key presses and times when each one occurred.
>>> The tester simply presses a key and holds it down.
>>> If the time between each key press is constant, it indicates good
>>> "interactivity". If the time between each key press varies a lot, it
>>> indicates bad "interactivity".
>>> You can reliably test if one kernel is better than the next using
>>> actual measurable figures.
>>>
>>> Kind Regards
>>>
>>> James
>>>
>>
>> Could you drop me a pointer? I can certainly give it a try. It would
>> be extra useful if it included any histogram functionality.
>>
>> I've been using a combination of various synthetic wakeup and load
>> scripts and measuring the received bandwidth / wakeup latency.
>>
>> They have not succeeded in reproducing the starvation or poor latency
>> observed by Mike above however. (Although I've pulled a box to try
>> reproducing his exact conditions [ e.g. user environment ] on Monday).
>>
>

2010-12-15 17:57:29

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Add 'autogroup' scheduling feature: automated per session task groups

Sorry for the late reply, I am slowly crawling through my mbox...

On 11/30, tip-bot for Mike Galbraith wrote:
>
> Commit-ID: 5091faa449ee0b7d73bc296a93bca9540fc51d0a
> Gitweb: http://git.kernel.org/tip/5091faa449ee0b7d73bc296a93bca9540fc51d0a
> Author: Mike Galbraith <[email protected]>
> AuthorDate: Tue, 30 Nov 2010 14:18:03 +0100
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Tue, 30 Nov 2010 16:03:35 +0100

I assume this is the latest version. In this case I think it needs
minor fixes.

> +#ifdef CONFIG_PROC_FS
> +
> +/* Called with siglock held. */

This is not true, and that is why we can't blindly use kref_get().

> +int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
> +{
> +	static unsigned long next = INITIAL_JIFFIES;
> +	struct autogroup *ag;
> +	int err;
> +
> +	if (*nice < -20 || *nice > 19)
> +		return -EINVAL;
> +
> +	err = security_task_setnice(current, *nice);
> +	if (err)
> +		return err;
> +
> +	if (*nice < 0 && !can_nice(current, *nice))
> +		return -EPERM;
> +
> +	/* this is a heavy operation taking global locks.. */
> +	if (!capable(CAP_SYS_ADMIN) && time_before(jiffies, next))
> +		return -EAGAIN;
> +
> +	next = HZ / 10 + jiffies;
> +	ag = autogroup_kref_get(p->signal->autogroup);

We can race with autogroup_move_group() and use the already freed
->autogroup. We need ->siglock or task_rq_lock() to read it.

IOW, I think we need something like the patch below, but - sorry -
it is completely untested.

And the question,

> +	down_write(&ag->lock);
> +	err = sched_group_set_shares(ag->tg, prio_to_weight[*nice + 20]);

Do we really want this if ag == autogroup_default ? Say, autogroup_create()
fails, now the owner of this process can affect init_task_group. Or admin
can change init_task_group "by accident" (although currently this is hardly
possible, sched_autogroup_detach() has no callers). Just curious.

Oleg.

--- a/kernel/sched_autogroup.c
+++ a/kernel/sched_autogroup.c
@@ -41,6 +41,19 @@ static inline struct autogroup *autogrou
 	return ag;
 }
 
+static struct autogroup *task_get_autogroup(struct task_struct *p)
+{
+	struct autogroup *ag = NULL;
+	unsigned long flags;
+
+	if (lock_task_sighand(p, &flags)) {
+		ag = autogroup_kref_get(p->signal->autogroup);
+		unlock_task_sighand(p, &flags);
+	}
+
+	return ag;
+}
+
 static inline struct autogroup *autogroup_create(void)
 {
 	struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
@@ -149,11 +162,7 @@ EXPORT_SYMBOL(sched_autogroup_detach);
 
 void sched_autogroup_fork(struct signal_struct *sig)
 {
-	struct task_struct *p = current;
-
-	spin_lock_irq(&p->sighand->siglock);
-	sig->autogroup = autogroup_kref_get(p->signal->autogroup);
-	spin_unlock_irq(&p->sighand->siglock);
+	sig->autogroup = task_get_autogroup(current);
 }
 
 void sched_autogroup_exit(struct signal_struct *sig)
@@ -172,7 +181,6 @@ __setup("noautogroup", setup_autogroup);
 
 #ifdef CONFIG_PROC_FS
 
-/* Called with siglock held. */
 int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
 {
 	static unsigned long next = INITIAL_JIFFIES;
@@ -193,9 +201,11 @@ int proc_sched_autogroup_set_nice(struct
 	if (!capable(CAP_SYS_ADMIN) && time_before(jiffies, next))
 		return -EAGAIN;
 
-	next = HZ / 10 + jiffies;
-	ag = autogroup_kref_get(p->signal->autogroup);
+	ag = task_get_autogroup(p);
+	if (!ag)
+		return -ESRCH;
 
+	next = HZ / 10 + jiffies;
 	down_write(&ag->lock);
 	err = sched_group_set_shares(ag->tg, prio_to_weight[*nice + 20]);
 	if (!err)
@@ -209,7 +219,10 @@ int proc_sched_autogroup_set_nice(struct
 
 void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m)
 {
-	struct autogroup *ag = autogroup_kref_get(p->signal->autogroup);
+	struct autogroup *ag = task_get_autogroup(p);
+
+	if (!ag)
+		return;
 
 	down_read(&ag->lock);
 	seq_printf(m, "/autogroup-%ld nice %d\n", ag->id, ag->nice);

2010-12-16 07:54:01

by Mike Galbraith

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Add 'autogroup' scheduling feature: automated per session task groups

On Wed, 2010-12-15 at 18:50 +0100, Oleg Nesterov wrote:

> I assume this is the latest version. In this case I think it needs
> minor fixes.
>
> > +#ifdef CONFIG_PROC_FS
> > +
> > +/* Called with siglock held. */
>
> This is not true, and that is why we can't blindly use kref_get().

I was going to lock it all up, but convinced myself it wasn't necessary.
The comment should have also gone away.

> > +int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
> > +{
> > +	static unsigned long next = INITIAL_JIFFIES;
> > +	struct autogroup *ag;
> > +	int err;
> > +
> > +	if (*nice < -20 || *nice > 19)
> > +		return -EINVAL;
> > +
> > +	err = security_task_setnice(current, *nice);
> > +	if (err)
> > +		return err;
> > +
> > +	if (*nice < 0 && !can_nice(current, *nice))
> > +		return -EPERM;
> > +
> > +	/* this is a heavy operation taking global locks.. */
> > +	if (!capable(CAP_SYS_ADMIN) && time_before(jiffies, next))
> > +		return -EAGAIN;
> > +
> > +	next = HZ / 10 + jiffies;
> > +	ag = autogroup_kref_get(p->signal->autogroup);
>
> We can race with autogroup_move_group() and use the already freed
> ->autogroup. We need ->siglock or task_rq_lock() to read it.

I don't see how/why. I took a reference to the new group before
assignment of p->signal->autogroup, and put the previous group after
it's assigned.

Ponders that.. uhoh.

Mover does atomic write, but signal->autogroup write comes after that,
so can still be in flight when reader dereferences. Game over unless
the reader beats ->autogroup writer to the punch.

Thanks again for your excellent eyeballs. The below should plug that
hole, no? (hope so, seems pointless to lock movement)

> IOW, I think we need something like the patch below, but - sorry -
> it is completely untested.
>
> And the question,
>
> > +	down_write(&ag->lock);
> > +	err = sched_group_set_shares(ag->tg, prio_to_weight[*nice + 20]);
>
> Do we really want this if ag == autogroup_default ? Say, autogroup_create()
> fails, now the owner of this process can affect init_task_group. Or admin
> can change init_task_group "by accident" (although currently this is hardly
> possible, sched_autogroup_detach() has no callers). Just curious.

sched_group_set_shares() does the right thing, says no to changing the
root task group's shares.



sched: fix potential access to freed memory

Oleg pointed out that the /proc interface kref_get() usage may race with
the final put during autogroup_move_group(). A signal->autogroup assignment
may be in flight when the /proc interface dereferences it, leaving the
reader holding a reference to an already dead group.

Reported-by: Oleg Nesterov <[email protected]>
Signed-off-by: Mike Galbraith <[email protected]>

diff --git a/kernel/sched_autogroup.c b/kernel/sched_autogroup.c
index 57a7ac2..713b6c0 100644
--- a/kernel/sched_autogroup.c
+++ b/kernel/sched_autogroup.c
@@ -41,6 +41,12 @@ static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
 	return ag;
 }
 
+static inline struct autogroup *autogroup_task_get(struct task_struct *p)
+{
+	smp_rmb();
+	return autogroup_kref_get(p->signal->autogroup);
+}
+
 static inline struct autogroup *autogroup_create(void)
 {
 	struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
@@ -119,6 +125,7 @@ autogroup_move_group(struct task_struct *p, struct autogroup *ag)
 	}
 
 	p->signal->autogroup = autogroup_kref_get(ag);
+	smp_mb();
 
 	t = p;
 	do {
@@ -172,7 +179,6 @@ __setup("noautogroup", setup_autogroup);
 
 #ifdef CONFIG_PROC_FS
 
-/* Called with siglock held. */
 int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
 {
 	static unsigned long next = INITIAL_JIFFIES;
@@ -194,7 +200,7 @@ int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
 		return -EAGAIN;
 
 	next = HZ / 10 + jiffies;
-	ag = autogroup_kref_get(p->signal->autogroup);
+	ag = autogroup_task_get(p);
 
 	down_write(&ag->lock);
 	err = sched_group_set_shares(ag->tg, prio_to_weight[*nice + 20]);
@@ -209,7 +215,7 @@ int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
 
 void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m)
 {
-	struct autogroup *ag = autogroup_kref_get(p->signal->autogroup);
+	struct autogroup *ag = autogroup_task_get(p);
 
 	down_read(&ag->lock);
 	seq_printf(m, "/autogroup-%ld nice %d\n", ag->id, ag->nice);

2010-12-16 14:09:57

by Mike Galbraith

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Add 'autogroup' scheduling feature: automated per session task groups

On Thu, 2010-12-16 at 08:53 +0100, Mike Galbraith wrote:
> On Wed, 2010-12-15 at 18:50 +0100, Oleg Nesterov wrote:

> Thanks again for your excellent eyeballs. The below should plug that
> hole, no? (hope so, seems pointless to lock movement)

I'd also have to disable interrupts though, so may as well just lock it.

I didn't do the -ESRCH or no display bit. As far as autogroup is
concerned, if you couldn't lock, it's history, so belongs to init.
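As a userspace analogue of why the locked get below is safe where the
bare kref_get() was not (names and types here are illustrative, not the
kernel source): the lock that serializes pointer replacement has to cover
both the pointer load and the refcount bump.

#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct group { atomic_int ref; };

static pthread_mutex_t move_lock = PTHREAD_MUTEX_INITIALIZER;
static struct group *current_group;

static void group_put(struct group *g)
{
        if (atomic_fetch_sub(&g->ref, 1) == 1)
                free(g);                        /* last reference */
}

static struct group *group_get(void)
{
        pthread_mutex_lock(&move_lock);         /* excludes the mover */
        struct group *g = current_group;
        atomic_fetch_add(&g->ref, 1);           /* g can't be freed here */
        pthread_mutex_unlock(&move_lock);
        return g;
}

/* the mover: swap in new_g, which arrives holding a reference */
static void group_move(struct group *new_g)
{
        pthread_mutex_lock(&move_lock);
        struct group *old = current_group;
        current_group = new_g;
        pthread_mutex_unlock(&move_lock);
        group_put(old);                         /* may be the final put */
}

Without the lock around the load-then-get, the mover's final put can free
the group between the reader's load and its increment -- the race Oleg
spotted.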

sched: fix potential access to freed memory

Oleg pointed out that the /proc interface kref_get() usage may race with
the final put during autogroup_move_group(). A signal->autogroup assignment
may be in flight when the /proc interface dereferences it, leaving the
reader holding a reference to an already dead group.

Reported-by: Oleg Nesterov <[email protected]>
Signed-off-by: Mike Galbraith <[email protected]>

diff --git a/kernel/sched_autogroup.c b/kernel/sched_autogroup.c
index 57a7ac2..c80fedc 100644
--- a/kernel/sched_autogroup.c
+++ b/kernel/sched_autogroup.c
@@ -41,6 +41,20 @@ static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
 	return ag;
 }
 
+static inline struct autogroup *autogroup_task_get(struct task_struct *p)
+{
+	struct autogroup *ag;
+	unsigned long flags;
+
+	if (!lock_task_sighand(p, &flags))
+		return autogroup_kref_get(&autogroup_default);
+
+	ag = autogroup_kref_get(p->signal->autogroup);
+	unlock_task_sighand(p, &flags);
+
+	return ag;
+}
+
 static inline struct autogroup *autogroup_create(void)
 {
 	struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
@@ -149,11 +163,7 @@ EXPORT_SYMBOL(sched_autogroup_detach);
 
 void sched_autogroup_fork(struct signal_struct *sig)
 {
-	struct task_struct *p = current;
-
-	spin_lock_irq(&p->sighand->siglock);
-	sig->autogroup = autogroup_kref_get(p->signal->autogroup);
-	spin_unlock_irq(&p->sighand->siglock);
+	sig->autogroup = autogroup_task_get(current);
 }
 
 void sched_autogroup_exit(struct signal_struct *sig)
@@ -172,7 +182,6 @@ __setup("noautogroup", setup_autogroup);
 
 #ifdef CONFIG_PROC_FS
 
-/* Called with siglock held. */
 int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
 {
 	static unsigned long next = INITIAL_JIFFIES;
@@ -194,7 +203,7 @@ int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
 		return -EAGAIN;
 
 	next = HZ / 10 + jiffies;
-	ag = autogroup_kref_get(p->signal->autogroup);
+	ag = autogroup_task_get(p);
 
 	down_write(&ag->lock);
 	err = sched_group_set_shares(ag->tg, prio_to_weight[*nice + 20]);
@@ -209,7 +218,7 @@ int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
 
 void proc_sched_autogroup_show_task(struct task_struct *p, struct seq_file *m)
 {
-	struct autogroup *ag = autogroup_kref_get(p->signal->autogroup);
+	struct autogroup *ag = autogroup_task_get(p);
 
 	down_read(&ag->lock);
 	seq_printf(m, "/autogroup-%ld nice %d\n", ag->id, ag->nice);

2010-12-16 15:15:43

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Add 'autogroup' scheduling feature: automated per session task groups

On 12/16, Mike Galbraith wrote:
>
> I'd also have to disable interrupts though,

Even irq_disable can't help, I think.

> so may as well just lock it.

this also looks simpler.

> I didn't do the -ESRCH or no display bit. As far as autogroup is
> concerned, if you couldn't lock, it's history, so belongs to init.

Agreed. I considered this option too, but I was worried about
sched_group_set_shares(root).

However, as you pointed out,

> sched_group_set_shares() does the right thing, says no to changing the
> root task group's shares.

Aha, I see, thanks.


I believe the patch is correct and closes the hole.

Oleg.

2010-12-20 13:08:22

by Bharata B Rao

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Add 'autogroup' scheduling feature: automated per session task groups

On Tue, Nov 30, 2010 at 9:09 PM, tip-bot for Mike Galbraith
<[email protected]> wrote:
> Commit-ID:  5091faa449ee0b7d73bc296a93bca9540fc51d0a
> Gitweb:     http://git.kernel.org/tip/5091faa449ee0b7d73bc296a93bca9540fc51d0a
> Author:     Mike Galbraith <[email protected]>
> AuthorDate: Tue, 30 Nov 2010 14:18:03 +0100
> Committer:  Ingo Molnar <[email protected]>
> CommitDate: Tue, 30 Nov 2010 16:03:35 +0100
>
> sched: Add 'autogroup' scheduling feature: automated per session task groups
>
> diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
> index e95b774..1dfae3d 100644
> --- a/kernel/sched_debug.c
> +++ b/kernel/sched_debug.c
> @@ -54,8 +54,7 @@ static unsigned long nsec_low(unsigned long long nsec)
>  #define SPLIT_NS(x) nsec_high(x), nsec_low(x)
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -static void print_cfs_group_stats(struct seq_file *m, int cpu,
> -		struct task_group *tg)
> +static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group *tg)
>  {
>  	struct sched_entity *se = tg->se[cpu];
>  	if (!se)
> @@ -110,16 +109,6 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
>  		0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L);
>  #endif
>
> -#ifdef CONFIG_CGROUP_SCHED
> -	{
> -		char path[64];
> -
> -		rcu_read_lock();
> -		cgroup_path(task_group(p)->css.cgroup, path, sizeof(path));
> -		rcu_read_unlock();
> -		SEQ_printf(m, " %s", path);
> -	}
> -#endif
>  	SEQ_printf(m, "\n");
>  }
>
> @@ -147,19 +136,6 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
>  	read_unlock_irqrestore(&tasklist_lock, flags);
>  }
>
> -#if defined(CONFIG_CGROUP_SCHED) && \
> -	(defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
> -static void task_group_path(struct task_group *tg, char *buf, int buflen)
> -{
> -	/* may be NULL if the underlying cgroup isn't fully-created yet */
> -	if (!tg->css.cgroup) {
> -		buf[0] = '\0';
> -		return;
> -	}
> -	cgroup_path(tg->css.cgroup, buf, buflen);
> -}
> -#endif
> -
>  void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>  {
>  	s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
> @@ -168,16 +144,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>  	struct sched_entity *last;
>  	unsigned long flags;
>
> -#if defined(CONFIG_CGROUP_SCHED) && defined(CONFIG_FAIR_GROUP_SCHED)
> -	char path[128];
> -	struct task_group *tg = cfs_rq->tg;
> -
> -	task_group_path(tg, path, sizeof(path));
> -
> -	SEQ_printf(m, "\ncfs_rq[%d]:%s\n", cpu, path);
> -#else
>  	SEQ_printf(m, "\ncfs_rq[%d]:\n", cpu);
> -#endif
>  	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "exec_clock",
>  			SPLIT_NS(cfs_rq->exec_clock));
>
> @@ -215,7 +182,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>  	SEQ_printf(m, "  .%-30s: %ld\n", "load_contrib",
>  			cfs_rq->load_contribution);
>  	SEQ_printf(m, "  .%-30s: %d\n", "load_tg",
> -			atomic_read(&tg->load_weight));
> +			atomic_read(&cfs_rq->tg->load_weight));
>  #endif
>
>  	print_cfs_group_stats(m, cpu, cfs_rq->tg);
> @@ -224,17 +191,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>
>  void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
>  {
> -#if defined(CONFIG_CGROUP_SCHED) && defined(CONFIG_RT_GROUP_SCHED)
> -	char path[128];
> -	struct task_group *tg = rt_rq->tg;
> -
> -	task_group_path(tg, path, sizeof(path));
> -
> -	SEQ_printf(m, "\nrt_rq[%d]:%s\n", cpu, path);
> -#else
>  	SEQ_printf(m, "\nrt_rq[%d]:\n", cpu);
> -#endif
> -

The above change, as well as the recent changes from the tg_shares_up
improvements, has two (undesirable?) side effects on the
/proc/sched_debug output.

The autogroup patchset removes the display of cgroup name from
sched_debug output.

On a 16 CPU system, with 2 groups having one task each and one task in
root group, the difference in o/p appears like this:

$ grep while1 sched_debug-2.6.37-rc5
R  while1  2208  13610.855787  1960  120  13610.855787  19272.857661  0.000000  /2
R  while1  2207  20255.605110  3160  120  20255.605110  31572.634065  0.000000  /1
R  while1  2209  63913.721827  1273  120  63913.721827  12604.411880  0.000000  /

$ grep while1 sched_debug-2.6.37-rc5-tip
R  while1  2173  17603.479529  2754  120  17603.479529  25818.279858  4.560010
R  while1  2174  11435.667691  1669  120  11435.667691  16456.476663  0.000000
R  while1  2175  10074.709060  1019  120  10074.709060  10075.915495  0.000000

So you can see in the latter case, it becomes difficult to see which
task belongs to which group.

The group names are also missing from per-CPU rq information. Hence in
the above example of 2 groups, I see 2 blocks of data for 2 cfs_rqs,
but it is not possible to know which group they represent.

Also, with tg_shares_up improvements, the leaf cfs_rqs are maintained
on rq->leaf_cfs_rq_list only if they carry any load. But the code to
display cfs_rq information for sched_debug isn't updated and hence
information from a few cfs_rqs are missing from sched_debug.

$ grep cfs_rq sched_debug-2.6.37-rc5
cfs_rq[0]:/2
cfs_rq[0]:/1
cfs_rq[0]:/
cfs_rq[1]:/2
cfs_rq[1]:/1
cfs_rq[1]:/
cfs_rq[2]:/2
cfs_rq[2]:/1
cfs_rq[2]:/
cfs_rq[3]:/2
cfs_rq[3]:/1
cfs_rq[3]:/
cfs_rq[4]:/2
cfs_rq[4]:/1
cfs_rq[4]:/
cfs_rq[5]:/2
cfs_rq[5]:/1
cfs_rq[5]:/
cfs_rq[6]:/2
cfs_rq[6]:/1
cfs_rq[6]:/
cfs_rq[7]:/2
cfs_rq[7]:/1
cfs_rq[7]:/
cfs_rq[8]:/2
cfs_rq[8]:/1
cfs_rq[8]:/
cfs_rq[9]:/2
cfs_rq[9]:/1
cfs_rq[9]:/
cfs_rq[10]:/2
cfs_rq[10]:/1
cfs_rq[10]:/
cfs_rq[11]:/2
cfs_rq[11]:/1
cfs_rq[11]:/
cfs_rq[12]:/2
cfs_rq[12]:/1
cfs_rq[12]:/
cfs_rq[13]:/2
cfs_rq[13]:/1
cfs_rq[13]:/
cfs_rq[14]:/2
cfs_rq[14]:/1
cfs_rq[14]:/
cfs_rq[15]:/2
cfs_rq[15]:/1
cfs_rq[15]:/

$ grep cfs_rq sched_debug-2.6.37-rc5-tip
cfs_rq[0]:
cfs_rq[0]:
cfs_rq[1]:
cfs_rq[2]:
cfs_rq[3]:
cfs_rq[4]:
cfs_rq[4]:
cfs_rq[5]:
cfs_rq[5]:
cfs_rq[6]:
cfs_rq[7]:
cfs_rq[8]:
cfs_rq[9]:
cfs_rq[10]:
cfs_rq[11]:
cfs_rq[11]:
cfs_rq[12]:
cfs_rq[12]:
cfs_rq[13]:
cfs_rq[13]:
cfs_rq[14]:
cfs_rq[14]:
cfs_rq[15]:
cfs_rq[15]:

Regards,
Bharata.

2010-12-20 13:19:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Add 'autogroup' scheduling feature: automated per session task groups

On Mon, 2010-12-20 at 18:38 +0530, Bharata B Rao wrote:
> The autogroup patchset removes the display of cgroup name from
> sched_debug output.

Hrmph.. that wasn't supposed to happen, care to send a patch to fix that
up?


> Also, with tg_shares_up improvements, the leaf cfs_rqs are maintained
> on rq->leaf_cfs_rq_list only if they carry any load. But the code to
> display cfs_rq information for sched_debug isn't updated and hence
> information from a few cfs_rqs are missing from sched_debug.

Well, that's a _good_ thing, right?

I mean, if we know they're empty and don't contribute to scheduling, why
bother displaying them?

2010-12-20 15:46:51

by Bharata B Rao

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Add 'autogroup' scheduling feature: automated per session task groups

On Mon, Dec 20, 2010 at 6:49 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2010-12-20 at 18:38 +0530, Bharata B Rao wrote:
>> The autogroup patchset removes the display of cgroup name from
>> sched_debug output.
>
> Hrmph.. that wasn't supposed to happen, care to send a patch to fix that
> up?

There are two aspects here:

- Printing cgroup name for per-CPU cfs_rqs shouldn't be affected by
autogroup and the old code should work here.
- Printing the cgroup name for tasks depends on task_group(), which has
been changed by the autogroup patch. I haven't really looked deeply into
the autogroup patch, but from what I can gather, Mike had a reason
to remove this bit from sched_debug. The task groups created for
autogroups don't have cgroups associated with them, and hence no
dentries and hence no pathnames.

I guess we could fix this as shown in the attached patch.

>
>
>> Also, with tg_shares_up improvements, the leaf cfs_rqs are maintained
>> on rq->leaf_cfs_rq_list only if they carry any load. But the code to
>> display cfs_rq information for sched_debug isn't updated and hence
>> information from a few cfs_rqs are missing from sched_debug.
>
> Well, that's a _good_ thing, right?
>
> I mean, if we know they're empty and don't contribute to scheduling, why
> bother displaying them?

In addition to tasks, we display other details pertaining to each cfs_rq.
I thought having a complete view of all the cfs_rqs in the system would
be better and more consistent than seeing a different set of cfs_rqs on
different captures of /proc/sched_debug.

Regards,
Bharata.
--
http://bharata.sulekha.com/blog/posts.htm, http://raobharata.wordpress.com/


Attachments:
fix-sched-debug.patch (2.57 kB)

2010-12-20 15:53:12

by Bharata B Rao

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Add 'autogroup' scheduling feature: automated per session task groups

On Mon, Dec 20, 2010 at 9:16 PM, Bharata B Rao <[email protected]> wrote:
> On Mon, Dec 20, 2010 at 6:49 PM, Peter Zijlstra <[email protected]> wrote:
>> On Mon, 2010-12-20 at 18:38 +0530, Bharata B Rao wrote:
>>> The autogroup patchset removes the display of cgroup name from
>>> sched_debug output.
>>
>> Hrmph.. that wasn't supposed to happen, care to send a patch to fix that
>> up?
>
> There are two aspects here:
>
> - Printing cgroup name for per-CPU cfs_rqs shouldn't be affected by
> ?autogroup and the old code should work here.
> - Printing cgroup name for tasks depends on task_group(), which has
> been changed by autogroup patch. I haven't really looked deep into
> autogroup patch, but from whatever I can gather, Mike had a reason
> to remove this bit from sched_debug. The task groups created for
> autogroups don't have cgroups associated with them and hence no
> dentries and hence no pathnames.
>
> I guess we could fix this as shown in the attached patch.

I missed adding the similar bits for RT in sched_debug.c. If this
approach is reasonable, I can send the next version with the RT changes
included.

Regards,
Bharata.

2010-12-20 16:39:54

by Mike Galbraith

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Add 'autogroup' scheduling feature: automated per session task groups

On Mon, 2010-12-20 at 21:16 +0530, Bharata B Rao wrote:
> On Mon, Dec 20, 2010 at 6:49 PM, Peter Zijlstra <[email protected]> wrote:
> > On Mon, 2010-12-20 at 18:38 +0530, Bharata B Rao wrote:
> >> The autogroup patchset removes the display of cgroup name from
> >> sched_debug output.
> >
> > Hrmph.. that wasn't supposed to happen, care to send a patch to fix that
> > up?
>
> There are two aspects here:
>
> - Printing cgroup name for per-CPU cfs_rqs shouldn't be affected by
> autogroup and the old code should work here.
> - Printing cgroup name for tasks depends on task_group(), which has
> been changed by autogroup patch. I haven't really looked deep into
> autogroup patch, but from whatever I can gather, Mike had a reason
> to remove this bit from sched_debug. The task groups created for
> autogroups don't have cgroups associated with them and hence no
> dentries and hence no pathnames.

Mike didn't remove it, but _was_ supposed to get around to it.

> I guess we could fix this as shown in the attached patch.


+#ifdef CONFIG_CGROUP_SCHED
+	{
+		char path[64];
+

...

+#if defined(CONFIG_CGROUP_SCHED) && defined(CONFIG_FAIR_GROUP_SCHED)
+	char path[128];


One reason it was removed was path[64/128].

-Mike

2010-12-21 05:04:29

by Bharata B Rao

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Add 'autogroup' scheduling feature: automated per session task groups

On Mon, Dec 20, 2010 at 10:09 PM, Mike Galbraith <[email protected]> wrote:
> On Mon, 2010-12-20 at 21:16 +0530, Bharata B Rao wrote:
>> On Mon, Dec 20, 2010 at 6:49 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Mon, 2010-12-20 at 18:38 +0530, Bharata B Rao wrote:
>> >> The autogroup patchset removes the display of cgroup name from
>> >> sched_debug output.
>> >
>> > Hrmph.. that wasn't supposed to happen, care to send a patch to fix that
>> > up?
>>
>> There are two aspects here:
>>
>> - Printing cgroup name for per-CPU cfs_rqs shouldn't be affected by
>>   autogroup and the old code should work here.
>> - Printing cgroup name for tasks depends on task_group(), which has
>> been changed by autogroup patch. I haven't really looked deep into
>> autogroup patch, but from whatever I can gather, Mike had a reason
>> to remove this bit from sched_debug. The task groups created for
>> autogroups don't have cgroups associated with them and hence no
>> dentries and hence no pathnames.
>
> Mike didn't remove it, but _was_ supposed to get around to it.
>
>> I guess we could fix this as shown in the attached patch.
>
>
> +#ifdef CONFIG_CGROUP_SCHED
> +	{
> +		char path[64];
> +
>
> ...
>
> +#if defined(CONFIG_CGROUP_SCHED) && defined(CONFIG_FAIR_GROUP_SCHED)
> +	char path[128];
>
>
> One reason it was removed was path[64/128].

Other callers of cgroup_path use PATH_MAX=4096 here. I believe the
original reason for these short path sizes was to be light on the stack
as well as to avoid allocation. Can we have some reasonable length
(256 or 512?) and live with truncation (if that ever happens)?

Also, while displaying group name with tasks, does it make sense to
display autogroup-<id> (the one shown in /proc/<pid>/autogroup) ?
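For what it's worth, one possible shape for such a fallback (a sketch
only, not the attached patch; autogroup_path() here is a hypothetical
helper that would print the /proc/<pid>/autogroup style name):

static void task_group_path(struct task_group *tg, char *buf, int buflen)
{
#ifdef CONFIG_SCHED_AUTOGROUP
	/* autogroups have no cgroup/dentry; print /autogroup-<id> instead */
	if (autogroup_path(tg, buf, buflen))
		return;
#endif
	/* may be NULL if the underlying cgroup isn't fully-created yet */
	if (!tg->css.cgroup) {
		buf[0] = '\0';
		return;
	}
	cgroup_path(tg->css.cgroup, buf, buflen);
}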

Regards,
Bharata.

2010-12-21 05:50:15

by Mike Galbraith

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Add 'autogroup' scheduling feature: automated per session task groups

On Tue, 2010-12-21 at 10:34 +0530, Bharata B Rao wrote:
> On Mon, Dec 20, 2010 at 10:09 PM, Mike Galbraith <[email protected]> wrote:
> > On Mon, 2010-12-20 at 21:16 +0530, Bharata B Rao wrote:
> >> On Mon, Dec 20, 2010 at 6:49 PM, Peter Zijlstra <[email protected]> wrote:
> >> > On Mon, 2010-12-20 at 18:38 +0530, Bharata B Rao wrote:
> >> >> The autogroup patchset removes the display of cgroup name from
> >> >> sched_debug output.
> >> >
> >> > Hrmph.. that wasn't supposed to happen, care to send a patch to fix that
> >> > up?
> >>
> >> There are two aspects here:
> >>
> >> - Printing cgroup name for per-CPU cfs_rqs shouldn't be affected by
> >> autogroup and the old code should work here.
> >> - Printing cgroup name for tasks depends on task_group(), which has
> >> been changed by autogroup patch. I haven't really looked deep into
> >> autogroup patch, but from whatever I can gather, Mike had a reason
> >> to remove this bit from sched_debug. The task groups created for
> >> autogroups don't have cgroups associated with them and hence no
> >> dentries and hence no pathnames.
> >
> > Mike didn't remove it, but _was_ supposed to get around to it.
> >
> >> I guess we could fix this as shown in the attached patch.
> >
> >
> > +#ifdef CONFIG_CGROUP_SCHED
> > +	{
> > +		char path[64];
> > +
> >
> > ...
> >
> > +#if defined(CONFIG_CGROUP_SCHED) && defined(CONFIG_FAIR_GROUP_SCHED)
> > +	char path[128];
> >
> >
> > One reason it was removed was path[64/128].
>
> Other callers of cgroup_path use PATH_MAX=4096 here. I believe the
> original reason for these short path sizes was to be light on the stack
> as well as to avoid allocation. Can we have some reasonable length
> (256 or 512?) and live with truncation (if that ever happens)?

I don't see why not, unlikely situation in non-critical path.

> Also, while displaying group name with tasks, does it make sense to
> display autogroup-<id> (the one shown in /proc/<pid>/autogroup) ?

It did to me, obviously :) I'm fine with it not showing up, though if it
survives and evolves, maybe it'll want visibility. ATM, you can see what's
in what group via ps -o foo,session.

-Mike

2010-12-21 08:33:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Add 'autogroup' scheduling feature: automated per session task groups

On Mon, 2010-12-20 at 21:23 +0530, Bharata B Rao wrote:
>
> I missed adding the similar bits for RT in sched_debug.c. If this
> approach is reasonable, I can send the next version with the RT changes
> included.

No, that code needs a serious cleanup; I think you even (re-)introduced a
NULL pointer deref (but I didn't look too closely).

Simply re-instating the code that was removed isn't sufficient.

Just write your patch as if it's a new feature (and never even look at
the old code); you'll very probably end up with a much saner patch.