2010-12-14 03:48:51

by Rik van Riel

Subject: [RFC -v2 PATCH 0/3] directed yield for Pause Loop Exiting

When running SMP virtual machines, it is possible for one VCPU to be
spinning on a spinlock, while the VCPU that holds the spinlock is not
currently running, because the host scheduler preempted it to run
something else.

Both Intel and AMD CPUs have a feature that detects when a virtual
CPU is spinning on a lock and will trap to the host.
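
For reference, a rough sketch of how such a trap reaches the code this
series touches; the handler below is modelled on the VMX pause loop exit
path of this era, and its exact shape is an assumption rather than part
of this series:

/*
 * Sketch (assumption): on a pause loop exit, the arch handler skips
 * the guest's PAUSE instruction and lets the generic code decide what
 * to do about the spinning VCPU.
 */
static int handle_pause(struct kvm_vcpu *vcpu)
{
        skip_emulated_instruction(vcpu);
        kvm_vcpu_on_spin(vcpu);         /* reworked by patch 3/3 */

        return 1;                       /* re-enter the guest */
}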

The current KVM code sleeps for a bit whenever that happens, which
results in, e.g., a 64 VCPU Windows guest taking forever and a bit to
boot up. This is because the VCPU holding the lock is actually
running, not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with
spinlock detection resulting in one guest giving up its CPU time
to the others. Instead of spinning, it ends up simply not running
much at all.

This patch series aims to fix that by having a spinning VCPU give
the remainder of its timeslice to another VCPU in the same guest
before yielding the CPU: one that is runnable but got preempted,
hopefully the lock holder.

v2:
- make lots of cleanups and improvements suggested
- do not implement timeslice scheduling or fairness stuff
yet, since it is not entirely clear how to do that right
(suggestions welcome)

--
All rights reversed.


2010-12-14 03:48:45

by Rik van Riel

Subject: [RFC -v2 PATCH 1/3] kvm: keep track of which task is running a KVM vcpu

Keep track of which task is running a KVM vcpu. This helps us
figure out later what task to wake up if we want to boost a
vcpu that got preempted.

Unfortunately there are no guarantees that the same task
always keeps the same vcpu, so we can only track the task
across a single "run" of the vcpu.

Signed-off-by: Rik van Riel <[email protected]>
---
- move vcpu->task manipulation as suggested by Chris Wright

include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a055742..180085b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -81,6 +81,7 @@ struct kvm_vcpu {
#endif
int vcpu_id;
struct mutex mutex;
+ struct task_struct *task;
int cpu;
atomic_t guest_mode;
struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5225052..c95bad1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2248,6 +2248,7 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
{
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

+ vcpu->task = NULL;
kvm_arch_vcpu_load(vcpu, cpu);
}

@@ -2256,6 +2257,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
{
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

+ vcpu->task = current;
kvm_arch_vcpu_put(vcpu);
}

2010-12-14 03:48:55

by Rik van Riel

Subject: [RFC -v2 PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin

Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic
slowdowns of certain workloads, use yield_to to hand the rest of
our timeslice to another vcpu in the same KVM guest.

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 180085b..af11701 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -92,6 +92,7 @@ struct kvm_vcpu {
int fpu_active;
int guest_fpu_loaded, guest_xcr0_loaded;
wait_queue_head_t wq;
+ int spinning;
int sigset_active;
sigset_t sigset;
struct kvm_vcpu_stat stat;
@@ -187,6 +188,7 @@ struct kvm {
#endif
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
atomic_t online_vcpus;
+ int last_boosted_vcpu;
struct list_head vm_list;
struct mutex lock;
struct kvm_io_bus *buses[KVM_NR_BUSES];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c95bad1..17c6c25 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1289,18 +1289,50 @@ void kvm_resched(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_resched);

-void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
+void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
-        ktime_t expires;
-        DEFINE_WAIT(wait);
+        struct kvm *kvm = me->kvm;
+        struct kvm_vcpu *vcpu;
+        int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+        int yielded = 0;
+        int pass;
+        int i;

-        prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
+        me->spinning = 1;

-        /* Sleep for 100 us, and hope lock-holder got scheduled */
-        expires = ktime_add_ns(ktime_get(), 100000UL);
-        schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+        /*
+         * We boost the priority of a VCPU that is runnable but not
+         * currently running, because it got preempted by something
+         * else and called schedule in __vcpu_run. Hopefully that
+         * VCPU is holding the lock that we need and will release it.
+         * We approximate round-robin by starting at the last boosted VCPU.
+         */
+        for (pass = 0; pass < 2 && !yielded; pass++) {
+                kvm_for_each_vcpu(i, vcpu, kvm) {
+                        struct task_struct *task = vcpu->task;
+                        if (!pass && i < last_boosted_vcpu) {
+                                i = last_boosted_vcpu;
+                                continue;
+                        } else if (pass && i > last_boosted_vcpu)
+                                break;
+                        if (vcpu == me)
+                                continue;
+                        if (vcpu->spinning)
+                                continue;
+                        if (!task)
+                                continue;
+                        if (waitqueue_active(&vcpu->wq))
+                                continue;
+                        if (task->flags & PF_VCPU)
+                                continue;
+                        kvm->last_boosted_vcpu = i;
+                        yielded = 1;
+                        yield_to(task);
+                        break;
+                }
+        }

-        finish_wait(&vcpu->wq, &wait);
+        me->spinning = 0;
 }
EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);

2010-12-14 03:48:48

by Rik van Riel

Subject: [RFC -v2 PATCH 2/3] sched: add yield_to function

Add a yield_to function to the scheduler code, allowing us to
give the remainder of our timeslice to another thread.

We may want to use this to provide a sys_yield_to system call
one day.

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
---
- move to a per sched class yield_to
- fix the locking

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c79e92..408326f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1086,6 +1086,8 @@ struct sched_class {
#ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_move_group) (struct task_struct *p, int on_rq);
#endif
+
+ void (*yield_to) (struct rq *rq, struct task_struct *p);
};

struct load_weight {
@@ -1947,6 +1949,7 @@ extern void set_user_nice(struct task_struct *p, long nice);
extern int task_prio(const struct task_struct *p);
extern int task_nice(const struct task_struct *p);
extern int can_nice(const struct task_struct *p, const int nice);
+extern void requeue_task(struct rq *rq, struct task_struct *p);
extern int task_curr(const struct task_struct *p);
extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
@@ -2020,6 +2023,10 @@ extern int wake_up_state(struct task_struct *tsk, unsigned int state);
extern int wake_up_process(struct task_struct *tsk);
extern void wake_up_new_task(struct task_struct *tsk,
unsigned long clone_flags);
+
+extern u64 slice_remain(struct task_struct *);
+extern void yield_to(struct task_struct *);
+
#ifdef CONFIG_SMP
extern void kick_process(struct task_struct *tsk);
#else
diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..6399641 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5166,6 +5166,46 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
return ret;
}

+/*
+ * Yield the CPU, giving the remainder of our time slice to task p.
+ * Typically used to hand CPU time to another thread inside the same
+ * process, eg. when p holds a resource other threads are waiting for.
+ * Giving priority to p may help get that resource released sooner.
+ */
+void yield_to(struct task_struct *p)
+{
+        unsigned long flags;
+        struct rq *rq, *p_rq;
+
+        local_irq_save(flags);
+        rq = this_rq();
+again:
+        p_rq = task_rq(p);
+        double_rq_lock(rq, p_rq);
+        if (p_rq != task_rq(p)) {
+                double_rq_unlock(rq, p_rq);
+                goto again;
+        }
+
+        /* We can't yield to a process that doesn't want to run. */
+        if (!p->se.on_rq)
+                goto out;
+
+        /*
+         * We can only yield to a runnable task, in the same schedule class
+         * as the current task, if the schedule class implements yield_to_task.
+         */
+        if (!task_running(rq, p) && current->sched_class == p->sched_class &&
+            current->sched_class->yield_to)
+                current->sched_class->yield_to(rq, p);
+
+out:
+        double_rq_unlock(rq, p_rq);
+        local_irq_restore(flags);
+        yield();
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
/**
* sys_sched_yield - yield the current processor to other threads.
*
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 00ebd76..d8c4116 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -980,6 +980,25 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
* CFS operations on tasks:
*/

+u64 slice_remain(struct task_struct *p)
+{
+        unsigned long flags;
+        struct sched_entity *se = &p->se;
+        struct cfs_rq *cfs_rq;
+        struct rq *rq;
+        u64 slice, ran;
+        s64 delta;
+
+        rq = task_rq_lock(p, &flags);
+        cfs_rq = cfs_rq_of(se);
+        slice = sched_slice(cfs_rq, se);
+        ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+        delta = slice - ran;
+        task_rq_unlock(rq, &flags);
+
+        return max(delta, 0LL);
+}
+
#ifdef CONFIG_SCHED_HRTICK
static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
@@ -1126,6 +1145,20 @@ static void yield_task_fair(struct rq *rq)
se->vruntime = rightmost->vruntime + 1;
}

+static void yield_to_fair(struct rq *rq, struct task_struct *p)
+{
+        struct sched_entity *se = &p->se;
+        struct cfs_rq *cfs_rq = cfs_rq_of(se);
+        u64 remain = slice_remain(current);
+
+        dequeue_task(rq, p, 0);
+        se->vruntime -= remain;
+        if (se->vruntime < cfs_rq->min_vruntime)
+                se->vruntime = cfs_rq->min_vruntime;
+        enqueue_task(rq, p, 0);
+        check_preempt_curr(rq, p, 0);
+}
+
#ifdef CONFIG_SMP

static void task_waking_fair(struct rq *rq, struct task_struct *p)
@@ -3962,6 +3995,8 @@ static const struct sched_class fair_sched_class = {
#ifdef CONFIG_FAIR_GROUP_SCHED
.task_move_group = task_move_group_fair,
#endif
+
+ .yield_to = yield_to_fair,
};

#ifdef CONFIG_SCHED_DEBUG

2010-12-14 06:08:23

by Mike Galbraith

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Mon, 2010-12-13 at 22:46 -0500, Rik van Riel wrote:

> diff --git a/kernel/sched.c b/kernel/sched.c
> index dc91a4d..6399641 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5166,6 +5166,46 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
> return ret;
> }
>
> +/*
> + * Yield the CPU, giving the remainder of our time slice to task p.
> + * Typically used to hand CPU time to another thread inside the same
> + * process, eg. when p holds a resource other threads are waiting for.
> + * Giving priority to p may help get that resource released sooner.
> + */
> +void yield_to(struct task_struct *p)
> +{
> + unsigned long flags;
> + struct rq *rq, *p_rq;
> +
> + local_irq_save(flags);
> + rq = this_rq();
> +again:
> + p_rq = task_rq(p);
> + double_rq_lock(rq, p_rq);
> + if (p_rq != task_rq(p)) {
> + double_rq_unlock(rq, p_rq);
> + goto again;
> + }
> +
> + /* We can't yield to a process that doesn't want to run. */
> + if (!p->se.on_rq)
> + goto out;
> +
> + /*
> + * We can only yield to a runnable task, in the same schedule class
> + * as the current task, if the schedule class implements yield_to_task.
> + */
> + if (!task_running(rq, p) && current->sched_class == p->sched_class &&
> + current->sched_class->yield_to)
> + current->sched_class->yield_to(rq, p);
> +
> +out:
> + double_rq_unlock(rq, p_rq);
> + local_irq_restore(flags);
> + yield();
> +}
> +EXPORT_SYMBOL_GPL(yield_to);

That part looks ok, except for the yield cross cpu bit. Trying to yield
a resource you don't have doesn't make much sense to me.

> /**
> * sys_sched_yield - yield the current processor to other threads.
> *
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 00ebd76..d8c4116 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -980,6 +980,25 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
> * CFS operations on tasks:
> */
>
> +u64 slice_remain(struct task_struct *p)
> +{
> + unsigned long flags;
> + struct sched_entity *se = &p->se;
> + struct cfs_rq *cfs_rq;
> + struct rq *rq;
> + u64 slice, ran;
> + s64 delta;
> +
> + rq = task_rq_lock(p, &flags);
> + cfs_rq = cfs_rq_of(se);
> + slice = sched_slice(cfs_rq, se);
> + ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> + delta = slice - ran;
> + task_rq_unlock(rq, &flags);
> +
> + return max(delta, 0LL);
> +}

<ramble>
slice_remain() measures the distance to your last preemption, which has
no relationship with entitlement. sched_slice() is not used to issue
entitlement, it's only a ruler.

You have entitlement on your current runqueue only, that entitlement
being the instantaneous distance to min_vruntime in a closed and fluid
system. You can't inject some instantaneous relationship from one
closed system into another without making the math go kind of fuzzy,
so you need tight constraints on how fuzzy it can get.

We do that with migrations, inject fuzz. There is no global fair-stick,
but we invent one by injecting little bits of fuzz. It's constrained by
chaos and the magnitude constraints of the common engine. The more you
migrate, the more tightly you couple systems. As long as we stay fairly
well balanced, we can migrate without the fuzz getting out of hand, and
end up with a globally ~fair system.

What you're injecting isn't instantaneously irrelevant lag-fuzz that,
distributed over time, becomes relevant; you're inventing entitlement out
of the void. Likely not a big hairy deal unless you do it frequently,
but you're doing something completely bogus and seemingly unconstrained.
</ramble>

> #ifdef CONFIG_SCHED_HRTICK
> static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
> {
> @@ -1126,6 +1145,20 @@ static void yield_task_fair(struct rq *rq)
> se->vruntime = rightmost->vruntime + 1;
> }
>
> +static void yield_to_fair(struct rq *rq, struct task_struct *p)
> +{
> + struct sched_entity *se = &p->se;
> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
> + u64 remain = slice_remain(current);
> +
> + dequeue_task(rq, p, 0);
> + se->vruntime -= remain;
> + if (se->vruntime < cfs_rq->min_vruntime)
> + se->vruntime = cfs_rq->min_vruntime;

This has an excellent chance of moving the recipient rightward.. and the
yielding task didn't yield anything. This may achieve the desired
result or may just create a nasty latency spike... but it makes no
arithmetic sense.

-Mike

2010-12-14 10:24:16

by Srivatsa Vaddagiri

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Tue, Dec 14, 2010 at 07:08:16AM +0100, Mike Galbraith wrote:
> > +/*
> > + * Yield the CPU, giving the remainder of our time slice to task p.
> > + * Typically used to hand CPU time to another thread inside the same
> > + * process, eg. when p holds a resource other threads are waiting for.
> > + * Giving priority to p may help get that resource released sooner.
> > + */
> > +void yield_to(struct task_struct *p)
> > +{
> > + unsigned long flags;
> > + struct rq *rq, *p_rq;
> > +
> > + local_irq_save(flags);
> > + rq = this_rq();
> > +again:
> > + p_rq = task_rq(p);
> > + double_rq_lock(rq, p_rq);
> > + if (p_rq != task_rq(p)) {
> > + double_rq_unlock(rq, p_rq);
> > + goto again;
> > + }
> > +
> > + /* We can't yield to a process that doesn't want to run. */
> > + if (!p->se.on_rq)
> > + goto out;
> > +
> > + /*
> > + * We can only yield to a runnable task, in the same schedule class
> > + * as the current task, if the schedule class implements yield_to_task.
> > + */
> > + if (!task_running(rq, p) && current->sched_class == p->sched_class &&
> > + current->sched_class->yield_to)
> > + current->sched_class->yield_to(rq, p);
> > +
> > +out:
> > + double_rq_unlock(rq, p_rq);
> > + local_irq_restore(flags);
> > + yield();
> > +}
> > +EXPORT_SYMBOL_GPL(yield_to);
>
> That part looks ok, except for the yield cross cpu bit. Trying to yield
> a resource you don't have doesn't make much sense to me.

So another (crazy) idea is to move the "yieldee" task on another cpu over to
the yielding task's cpu, let it run till the end of the yielding task's slice,
and then let it go back to the original cpu at the same vruntime position!

- vatsa

2010-12-14 11:04:08

by Mike Galbraith

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Tue, 2010-12-14 at 15:54 +0530, Srivatsa Vaddagiri wrote:
> On Tue, Dec 14, 2010 at 07:08:16AM +0100, Mike Galbraith wrote:

> > That part looks ok, except for the yield cross cpu bit. Trying to yield
> > a resource you don't have doesn't make much sense to me.
>
> So another (crazy) idea is to move the "yieldee" task on another cpu over to
> yielding task's cpu, let it run till the end of yielding tasks slice and then
> let it go back to the original cpu at the same vruntime position!

Yeah, pulling the intended recipient makes fine sense. If he doesn't
preempt you, you can try to swap vruntimes or whatever makes arithmetic
sense and will help. Dunno how you tell him how long he can keep the
cpu though, and him somehow going back home needs to be a plain old
migration, no fancy restoration of ancient history vruntime.

-Mike

2010-12-14 11:27:03

by Srivatsa Vaddagiri

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Tue, Dec 14, 2010 at 12:03:58PM +0100, Mike Galbraith wrote:
> On Tue, 2010-12-14 at 15:54 +0530, Srivatsa Vaddagiri wrote:
> > On Tue, Dec 14, 2010 at 07:08:16AM +0100, Mike Galbraith wrote:
>
> > > That part looks ok, except for the yield cross cpu bit. Trying to yield
> > > a resource you don't have doesn't make much sense to me.
> >
> > So another (crazy) idea is to move the "yieldee" task on another cpu over to
> > yielding task's cpu, let it run till the end of yielding tasks slice and then
> > let it go back to the original cpu at the same vruntime position!
>
> Yeah, pulling the intended recipient makes fine sense. If he doesn't
> preempt you, you can try to swap vruntimes or whatever makes arithmetic
> sense and will help. Dunno how you tell him how long he can keep the
> cpu though,

Can't we adjust the new task's [prev_]sum_exec_runtime a bit so that it is
preempted at the end of the yielding task's timeslice?

> and him somehow going back home needs to be a plain old
> migration, no fancy restoration of ancient history vruntime.

What is the issue if it gets queued at the old vruntime (assuming the fair
stick is still behind that)? Without that it will hurt fairness for the
yieldee (and perhaps for the overall VM in this case).

- vatsa

2010-12-14 12:22:53

by Peter Zijlstra

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Mon, 2010-12-13 at 22:46 -0500, Rik van Riel wrote:


> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2c79e92..408326f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1086,6 +1086,8 @@ struct sched_class {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> void (*task_move_group) (struct task_struct *p, int on_rq);
> #endif
> +
> + void (*yield_to) (struct rq *rq, struct task_struct *p);
> };
>
> struct load_weight {
> @@ -1947,6 +1949,7 @@ extern void set_user_nice(struct task_struct *p, long nice);
> extern int task_prio(const struct task_struct *p);
> extern int task_nice(const struct task_struct *p);
> extern int can_nice(const struct task_struct *p, const int nice);
> +extern void requeue_task(struct rq *rq, struct task_struct *p);

That definitely doesn't want to be a globally visible symbol.

> extern int task_curr(const struct task_struct *p);
> extern int idle_cpu(int cpu);
> extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
> @@ -2020,6 +2023,10 @@ extern int wake_up_state(struct task_struct *tsk, unsigned int state);
> extern int wake_up_process(struct task_struct *tsk);
> extern void wake_up_new_task(struct task_struct *tsk,
> unsigned long clone_flags);
> +
> +extern u64 slice_remain(struct task_struct *);

idem.


> +void yield_to(struct task_struct *p)
> +{
> + unsigned long flags;
> + struct rq *rq, *p_rq;
> +
> + local_irq_save(flags);
> + rq = this_rq();
> +again:
> + p_rq = task_rq(p);
> + double_rq_lock(rq, p_rq);
> + if (p_rq != task_rq(p)) {
> + double_rq_unlock(rq, p_rq);
> + goto again;
> + }
> +
> + /* We can't yield to a process that doesn't want to run. */
> + if (!p->se.on_rq)
> + goto out;
> +
> + /*
> + * We can only yield to a runnable task, in the same schedule class
> + * as the current task, if the schedule class implements yield_to_task.
> + */
> + if (!task_running(rq, p) && current->sched_class == p->sched_class &&
> + current->sched_class->yield_to)
> + current->sched_class->yield_to(rq, p);

rq and p don't match, see below.

> +
> +out:
> + double_rq_unlock(rq, p_rq);
> + local_irq_restore(flags);
> + yield();

That wants to be plain: schedule(), possibly conditional on having
called sched_class::yield_to.

> +}
> +EXPORT_SYMBOL_GPL(yield_to);

> +u64 slice_remain(struct task_struct *p)
> +{
> + unsigned long flags;
> + struct sched_entity *se = &p->se;
> + struct cfs_rq *cfs_rq;
> + struct rq *rq;
> + u64 slice, ran;
> + s64 delta;
> +
> + rq = task_rq_lock(p, &flags);

You're calling this from
yield_to()->sched_class::yield_to()->yield_to_fair()->slice_remain(),
yield_to() already holds p's rq lock.

> + cfs_rq = cfs_rq_of(se);
> + slice = sched_slice(cfs_rq, se);
> + ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> + delta = slice - ran;
> + task_rq_unlock(rq, &flags);
> +
> + return max(delta, 0LL);
> +}

Like Mike said, the returned figure doesn't really mean anything; it's
definitely not the remaining time of a slice. It might qualify for a
weak random number generator though.. :-)

> +static void yield_to_fair(struct rq *rq, struct task_struct *p)
> +{
> + struct sched_entity *se = &p->se;
> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
> + u64 remain = slice_remain(current);
> +
> + dequeue_task(rq, p, 0);

Here you assume @p lives on @rq, but you passed:

+ current->sched_class->yield_to(rq, p);

and rq = this_rq(), so this will go splat.

> + se->vruntime -= remain;

You cannot simply subtract wall-time from virtual time, see the usage of
calc_delta_fair() in the proposal below.
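
For reference, a rough sketch of what calc_delta_fair() does in the CFS
code of this era; it scales a wall clock delta by the entity's load weight,
so only a nice-0 entity sees vruntime advance at wall speed. Treat the
exact form below as an assumption:

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
        /*
         * Scale the wall clock delta by NICE_0_LOAD / se->load.weight:
         * heavier (lower nice) entities accrue vruntime more slowly.
         */
        if (unlikely(se->load.weight != NICE_0_LOAD))
                delta = calc_delta_mine(delta, NICE_0_LOAD, &se->load);

        return delta;
}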

> + if (se->vruntime < cfs_rq->min_vruntime)
> + se->vruntime = cfs_rq->min_vruntime;

Then clipping it to min_vruntime doesn't make any sense at all.

> + enqueue_task(rq, p, 0);
> + check_preempt_curr(rq, p, 0);
> +}

Also, modifying the vruntime of one task without also modifying the
vruntime of the other task breaks stuff. You're injecting time into p
without taking time out of current.


Maybe something like:

static void yield_to_fair(struct rq *p_rq, struct task_struct *p)
{
        struct rq *rq = this_rq();
        struct sched_entity *se = &current->se;
        struct cfs_rq *cfs_rq = cfs_rq_of(se);
        struct sched_entity *pse = &p->se;
        struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);

        /*
         * Transfer wakeup_gran worth of time from current to @p,
         * this should ensure current is no longer eligible to run.
         */
        unsigned long wakeup_gran = ACCESS_ONCE(sysctl_sched_wakeup_granularity);

        update_rq_clock(rq);
        update_curr(cfs_rq);

        if (pse != p_cfs_rq->curr) {
                __dequeue_entity(p_cfs_rq, pse);
        } else {
                update_rq_clock(p_rq);
                update_curr(p_cfs_rq);
        }

        se->vruntime += calc_delta_fair(wakeup_gran, se);
        pse->vruntime -= calc_delta_fair(wakeup_gran, pse);

        clear_buddies(cfs_rq, se);

        if (pse != p_cfs_rq->curr) {
                __enqueue_entity(p_cfs_rq, pse);
                check_preempt_curr(p_rq, p, 0);
        }
}

This isn't strictly correct for the group scheduling case though, that
wants a for_each_sched_entity() loop for both se and pse, but I'd have
to like actually think about that ;-)

A quick hack might be simply dis-allowing yield_to between different
groups, add something like the below to the above function:

#ifdef CONFIG_FAIR_GROUP_SCHED
        if (cfs_rq->tg != p_cfs_rq->tg)
                return;
#endif

2010-12-14 12:47:39

by Mike Galbraith

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Tue, 2010-12-14 at 16:56 +0530, Srivatsa Vaddagiri wrote:
> On Tue, Dec 14, 2010 at 12:03:58PM +0100, Mike Galbraith wrote:
> > On Tue, 2010-12-14 at 15:54 +0530, Srivatsa Vaddagiri wrote:
> > > On Tue, Dec 14, 2010 at 07:08:16AM +0100, Mike Galbraith wrote:
> >
> > > > That part looks ok, except for the yield cross cpu bit. Trying to yield
> > > > a resource you don't have doesn't make much sense to me.
> > >
> > > So another (crazy) idea is to move the "yieldee" task on another cpu over to
> > > yielding task's cpu, let it run till the end of yielding tasks slice and then
> > > let it go back to the original cpu at the same vruntime position!
> >
> > Yeah, pulling the intended recipient makes fine sense. If he doesn't
> > preempt you, you can try to swap vruntimes or whatever makes arithmetic
> > sense and will help. Dunno how you tell him how long he can keep the
> > cpu though,
>
> can't we adjust the new task's [prev_]sum_exec_runtime a bit so that it is
> preempted at the end of yielding task's timeslice?

And dork up accounting. Why? Besides, it won't work because you have
no idea who may preempt whom, when, and for how long.

(Why do people keep talking about timeslice? The only thing that exists
is lag that changes the instant anyone does anything of interest.)

> > and him somehow going back home needs to be a plain old
> > migration, no fancy restoration of ancient history vruntime.
>
> What is the issue if it gets queued at the old vruntime (assuming fair stick is
> still behind that)? Without that it will hurt fairness for the yieldee (and
> perhaps of the overall VM in this case).

Who all are you placing this task in front of or behind based upon a
non-existent relationship?

Your recipient may well have been preempted, and is now further behind
than the stored vruntime (which is completely irrelevant to the current
situation) would indicate, so why would you want to move it rightward?
Certainly not in the interest of fairness.

-Mike

2010-12-16 19:49:30

by Rik van Riel

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/14/2010 01:08 AM, Mike Galbraith wrote:
> On Mon, 2010-12-13 at 22:46 -0500, Rik van Riel wrote:
>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index dc91a4d..6399641 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -5166,6 +5166,46 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
>> return ret;
>> }
>>
>> +/*
>> + * Yield the CPU, giving the remainder of our time slice to task p.
>> + * Typically used to hand CPU time to another thread inside the same
>> + * process, eg. when p holds a resource other threads are waiting for.
>> + * Giving priority to p may help get that resource released sooner.
>> + */
>> +void yield_to(struct task_struct *p)
>> +{
>> + unsigned long flags;
>> + struct rq *rq, *p_rq;
>> +
>> + local_irq_save(flags);
>> + rq = this_rq();
>> +again:
>> + p_rq = task_rq(p);
>> + double_rq_lock(rq, p_rq);
>> + if (p_rq != task_rq(p)) {
>> + double_rq_unlock(rq, p_rq);
>> + goto again;
>> + }
>> +
>> + /* We can't yield to a process that doesn't want to run. */
>> + if (!p->se.on_rq)
>> + goto out;
>> +
>> + /*
>> + * We can only yield to a runnable task, in the same schedule class
>> + * as the current task, if the schedule class implements yield_to_task.
>> + */
>> + if (!task_running(rq, p)&& current->sched_class == p->sched_class&&
>> + current->sched_class->yield_to)
>> + current->sched_class->yield_to(rq, p);
>> +
>> +out:
>> + double_rq_unlock(rq, p_rq);
>> + local_irq_restore(flags);
>> + yield();
>> +}
>> +EXPORT_SYMBOL_GPL(yield_to);
>
> That part looks ok, except for the yield cross cpu bit. Trying to yield
> a resource you don't have doesn't make much sense to me.

The current task just donated the rest of its timeslice.

Surely that makes it a reasonable idea to call yield, and
get one of the other tasks on the current CPU running for
a bit?

> <ramble>
> slice_remain() measures the distance to your last preemption, which has
> no relationship with entitlement. sched_slice() is not used to issue
> entitlement, it's only a ruler.
>
> You have entitlement on your current runqueue only, that entitlement
> being the instantaneous distance to min_vruntime in a closed and fluid
> system. You can't inject some instantaneous relationship from one
> closed system into an another without making the math go kind of fuzzy,
> so you need tight constraints on how fuzzy it can get.
>
> We do that with migrations, inject fuzz. There is no global fair-stick,
> but we invent one by injecting little bits of fuzz. It's constrained by
> chaos and the magnitude constraints of the common engine. The more you
> migrate, the more tightly you couple systems. As long as we stay fairly
> well balanced, we can migrate without the fuzz getting out of hand, and
> end up with a globally ~fair system.
>
> What you're injecting isn't instantaneously irrelevant lag-fuzz, which
> distributed over time becomes relevant. you're inventing entitlement out
> of the void. Likely not a big hairy deal unless you do it frequently,
> but you're doing something completely bogus and seemingly unconstrained.
> </ramble>

I'm open to suggestions on what to do instead.

>> +static void yield_to_fair(struct rq *rq, struct task_struct *p)
>> +{
>> + struct sched_entity *se =&p->se;
>> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> + u64 remain = slice_remain(current);
>> +
>> + dequeue_task(rq, p, 0);
>> + se->vruntime -= remain;
>> + if (se->vruntime< cfs_rq->min_vruntime)
>> + se->vruntime = cfs_rq->min_vruntime;
>
> This has an excellent chance of moving the recipient rightward.. and the
> yielding task didn't yield anything. This may achieve the desired
> result or may just create a nasty latency spike... but it makes no
> arithmetic sense.

Good point, the current task calls yield() in the function
that calls yield_to_fair, but I seem to have lost the code
that penalizes the current task's runtime...

I'll reinstate that.

--
All rights reversed

2010-12-17 06:57:09

by Mike Galbraith

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Thu, 2010-12-16 at 14:49 -0500, Rik van Riel wrote:
> On 12/14/2010 01:08 AM, Mike Galbraith wrote:

> >> +EXPORT_SYMBOL_GPL(yield_to);
> >
> > That part looks ok, except for the yield cross cpu bit. Trying to yield
> > a resource you don't have doesn't make much sense to me.
>
> The current task just donated the rest of its timeslice.

It doesn't have a tangible slice to donate.

> Surely that makes it a reasonable idea to call yield, and
> get one of the other tasks on the current CPU running for
> a bit?

There's nothing wrong with trying to give up the cpu. It's the concept
of a cross cpu yield_to() that I find mighty strange.

> I'm open to suggestions on what to do instead.

If you want to yield_to(task), task needs to move to the resource you're
sitting on. That's the only thing that makes any sense to me. Pull the
task, maybe do some vruntime twiddling (likely very bad idea) to improve
success chances, and schedule.

Dunno how effective that would be at solving real problems though. You
might get him to run, might not, you'll be fighting load balancing, and
maybe just inflicting cache misses on everyone as tasks bounce around.

> >> +static void yield_to_fair(struct rq *rq, struct task_struct *p)
> >> +{
> >> + struct sched_entity *se =&p->se;
> >> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >> + u64 remain = slice_remain(current);
> >> +
> >> + dequeue_task(rq, p, 0);
> >> + se->vruntime -= remain;
> >> + if (se->vruntime< cfs_rq->min_vruntime)
> >> + se->vruntime = cfs_rq->min_vruntime;
> >
> > This has an excellent chance of moving the recipient rightward.. and the
> > yielding task didn't yield anything. This may achieve the desired
> > result or may just create a nasty latency spike... but it makes no
> > arithmetic sense.
>
> Good point, the current task calls yield() in the function
> that calls yield_to_fair, but I seem to have lost the code
> that penalizes the current task's runtime...
>
> I'll reinstate that.

See comment in parentheses above :)

-Mike

2010-12-17 07:15:50

by Mike Galbraith

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Fri, 2010-12-17 at 07:57 +0100, Mike Galbraith wrote:
> On Thu, 2010-12-16 at 14:49 -0500, Rik van Riel wrote:

> > >> +static void yield_to_fair(struct rq *rq, struct task_struct *p)
> > >> +{
> > >> + struct sched_entity *se =&p->se;
> > >> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > >> + u64 remain = slice_remain(current);
> > >> +
> > >> + dequeue_task(rq, p, 0);
> > >> + se->vruntime -= remain;
> > >> + if (se->vruntime< cfs_rq->min_vruntime)
> > >> + se->vruntime = cfs_rq->min_vruntime;
> > >
> > > This has an excellent chance of moving the recipient rightward.. and the
> > > yielding task didn't yield anything. This may achieve the desired
> > > result or may just create a nasty latency spike... but it makes no
> > > arithmetic sense.
> >
> > Good point, the current task calls yield() in the function
> > that calls yield_to_fair, but I seem to have lost the code
> > that penalizes the current task's runtime...
> >
> > I'll reinstate that.
>
> See comment in parentheses above :)

BTW, with this vruntime donation thingy, what prevents a task from
forking off accomplices who do nothing but wait for a wakeup and
yield_to(exploit)?

Even swapping vruntimes in the same cfs_rq is dangerous as hell, because
one party is going backward.

-Mike

2010-12-17 15:09:53

by Avi Kivity

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/17/2010 08:56 AM, Mike Galbraith wrote:
> > Surely that makes it a reasonable idea to call yield, and
> > get one of the other tasks on the current CPU running for
> > a bit?
>
> There's nothing wrong with trying to give up the cpu. It's the concept
> of a cross cpu yield_to() that I find mighty strange.

What's so strange about it? From a high level there are N runnable
tasks contending for M cpus. If task X really needs task Y to run, what
does it matter if task Y last ran on the same cpu as task X or not?

Do I correctly read between the lines that CFS maintains complete
fairness only on a cpu, but not globally? I hope I'm wrong, but

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
8648 avi 20 0 106m 1092 148 R 80.4 0.0 0:26.03 1 bash
8656 avi 20 0 106m 1080 136 R 47.3 0.0 0:14.73 0 bash
8652 avi 20 0 106m 1080 136 R 47.0 0.0 0:15.36 0 bash

doesn't look too good (three infinite loops in bash started at the
roughly same time)

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-12-17 19:51:56

by Mike Galbraith

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Fri, 2010-12-17 at 17:09 +0200, Avi Kivity wrote:
> On 12/17/2010 08:56 AM, Mike Galbraith wrote:
> > > Surely that makes it a reasonable idea to call yield, and
> > > get one of the other tasks on the current CPU running for
> > > a bit?
> >
> > There's nothing wrong with trying to give up the cpu. It's the concept
> > of a cross cpu yield_to() that I find mighty strange.
>
> What's so strange about it? From a high level there are N runnable
> tasks contending for M cpus. If task X really needs task Y to run, what
> does it matter if task Y last ran on the same cpu as task X or not?

Task X wants control of when runnable task Y gets the cpu. Task X
clearly wants to be the scheduler. This isn't about _yielding_ diddly
spit, it's about individual tasks wanting to make scheduling decisions,
so calling it a yield is high grade horse-pookey. You're trying to give
the scheduler a hint, the stronger that hint, the happier you'll be.

I can see the problem, and I'm not trying to be Mr. Negative here, I'm
only trying to point out problems I see with what's been proposed.

If the yielding task had a concrete fee he could pay, that would be
fine, but he does not.

If he did have something, how often do you think it should be possible
for task X to bribe the scheduler into selecting task Y? Will his
pockets be deep enough to actually solve the problem? Once he's
yielded, he's out of the picture for a while if he really gave anything
up. What happens to donated entitlement when the recipient goes to
sleep? If you try to give it back, what happens if the donor exited?
Where did the entitlement come from if task A running alone on cpu A
tosses some entitlement over the fence to his pal task B on cpu B.. and
keeps on trucking on cpu A? Where does that leave task C, B's
competition?

> Do I correctly read between the lines that CFS maintains complete
> fairness only on a cpu, but not globally?

Nothing between the lines about it. There are N individual engines,
coupled via load balancing.

-Mike

2010-12-18 14:50:48

by Rik van Riel

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/14/2010 07:22 AM, Peter Zijlstra wrote:

... fixed all the obvious stuff. No idea what the hell I was
thinking while doing that "cleanup" - probably too busy looking
at the tests that I was running on a previous codebase :(

For the next version of the patches, I have switched to your
version of yield_to_fair :)

--
All rights reversed

2010-12-18 17:03:14

by Avi Kivity

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/17/2010 09:51 PM, Mike Galbraith wrote:
> On Fri, 2010-12-17 at 17:09 +0200, Avi Kivity wrote:
> > On 12/17/2010 08:56 AM, Mike Galbraith wrote:
> > > > Surely that makes it a reasonable idea to call yield, and
> > > > get one of the other tasks on the current CPU running for
> > > > a bit?
> > >
> > > There's nothing wrong with trying to give up the cpu. It's the concept
> > > of a cross cpu yield_to() that I find mighty strange.
> >
> > What's so strange about it? From a high level there are N runnable
> > tasks contending for M cpus. If task X really needs task Y to run, what
> > does it matter if task Y last ran on the same cpu as task X or not?
>
> Task X wants control of when runnable task Y gets the cpu. Task X
> clearly wants to be the scheduler. This isn't about _yielding_ diddly
> spit, it's about individual tasks wanting to make scheduling decisions,
> so calling it a yield is high grade horse-pookey. You're trying to give
> the scheduler a hint, the stronger that hint, the happier you'll be.

Please suggest a better name then.

> I can see the problem, and I'm not trying to be Mr. Negative here, I'm
> only trying to point out problems I see with what's been proposed.
>
> If the yielding task had a concrete fee he could pay, that would be
> fine, but he does not.

It does. The yielding task is entitled to its fair share of the cpu, as
modified by priority and group scheduling. The yielding task is willing
to give up some of this cpu, in return for increasing another task's
share. Other tasks would not be negatively affected by this.

> If he did have something, how often do you think it should be possible
> for task X to bribe the scheduler into selecting task Y?

In extreme cases, very often. Say 100KHz.

> Will his
> pockets be deep enough to actually solve the problem? Once he's
> yielded, he's out of the picture for a while if he really gave anything
> up.

Unless the other task donates some cpu share back. This is exactly what
will happen in those extreme cases.

> What happens to donated entitlement when the recipient goes to
> sleep?

Nothing.

> If you try to give it back, what happens if the donor exited?

It's lost, too bad.

> Where did the entitlement come from if task A running alone on cpu A
> tosses some entitlement over the fence to his pal task B on cpu B.. and
> keeps on trucking on cpu A? Where does that leave task C, B's
> competition?

Eventually C would replace A, since its share will be exhausted. If C
is pinned... good question. How does fairness work with pinned tasks?

> > Do I correctly read between the lines that CFS maintains complete
> > fairness only on a cpu, but not globally?
>
> Nothing between the lines about it. There are N individual engines,
> coupled via load balancing.

Is this not seen as a major deficiency?

I can understand intra-cpu scheduling decisions at 300 Hz and inter-cpu
decisions at 10 Hz (or even lower, with some intermediate rate for
intra-socket scheduling). But this looks like a major deviation from
fairness - instead of 33%/33%/33% you get 50%/25%/25% depending on
random placement.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-12-18 17:08:40

by Avi Kivity

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/17/2010 09:15 AM, Mike Galbraith wrote:
> BTW, with this vruntime donation thingy, what prevents a task from
> forking off accomplices who do nothing but wait for a wakeup and
> yield_to(exploit)?
>

What's the difference between that and forking off accomplices who
run(exploit) directly?

Many threads dominating the scheduler has been solved by group
scheduling. We need to make sure directed yield doesn't violate that,
but I don't see new problems.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-12-18 18:06:49

by Mike Galbraith

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Sat, 2010-12-18 at 19:02 +0200, Avi Kivity wrote:
> On 12/17/2010 09:51 PM, Mike Galbraith wrote:
> > On Fri, 2010-12-17 at 17:09 +0200, Avi Kivity wrote:
> > > On 12/17/2010 08:56 AM, Mike Galbraith wrote:
> > > > > Surely that makes it a reasonable idea to call yield, and
> > > > > get one of the other tasks on the current CPU running for
> > > > > a bit?
> > > >
> > > > There's nothing wrong with trying to give up the cpu. It's the concept
> > > > of a cross cpu yield_to() that I find mighty strange.
> > >
> > > What's so strange about it? From a high level there are N runnable
> > > tasks contending for M cpus. If task X really needs task Y to run, what
> > > does it matter if task Y last ran on the same cpu as task X or not?
> >
> > Task X wants control of when runnable task Y gets the cpu. Task X
> > clearly wants to be the scheduler. This isn't about _yielding_ diddly
> > spit, it's about individual tasks wanting to make scheduling decisions,
> > so calling it a yield is high grade horse-pookey. You're trying to give
> > the scheduler a hint, the stronger that hint, the happier you'll be.
>
> Please suggest a better name then.

Don't have one handy.

> > I can see the problem, and I'm not trying to be Mr. Negative here, I'm
> > only trying to point out problems I see with what's been proposed.
> >
> > If the yielding task had a concrete fee he could pay, that would be
> > fine, but he does not.
>
> It does. The yielding task is entitled to its fair share of the cpu, as
> modified by priority and group scheduling. The yielding task is willing
> to give up some of this cpu, in return for increasing another task's
> share. Other tasks would not be negatively affected by this.

The donor does absolutely NOT have any claim to any specific quantum of
any cpu at any given time. There is no reservation, only a running
clock. If you let one task move another task's clock backward, you open
a can of starvation worms. This is exactly why vruntimes are monotonic.

> > If he did have something, how often do you think it should be possible
> > for task X to bribe the scheduler into selecting task Y?
>
> In extreme cases, very often. Say 100KHz.

Hm, so it needs to be very cheap, and highly repeatable.

What if: so you're trying to get spinners out of the way right? You
somehow know they're spinning, so instead of trying to boost some task,
can you do a directed yield in terms of directing a spinner that you
have the right to diddle to yield. Drop his lag, and resched him. He's
not accomplishing anything anyway.

If the only thing running is virtualization, and nobody else can use the
interface being invented, all is fair, but this passing of vruntime
around is problematic when innocent bystanders may want to play too.
Forcing a spinning task to parity doesn't have the same problems.

> > Will his
> > pockets be deep enough to actually solve the problem? Once he's
> > yielded, he's out of the picture for a while if he really gave anything
> > up.
>
> Unless the other task donates some cpu share back. This is exactly what
> will happen in those extreme cases.

So vruntime donation won't work.

> > What happens to donated entitlement when the recipient goes to
> > sleep?
>
> Nothing.

It's vaporized by the sleep.

> > If you try to give it back, what happens if the donor exited?
>
> It's lost, too bad.

Yep, so much for accounting.

> > Where did the entitlement come from if task A running alone on cpu A
> > tosses some entitlement over the fence to his pal task B on cpu B.. and
> > keeps on trucking on cpu A? Where does that leave task C, B's
> > competition?
>
> Eventually C would replace A, since its share will be exhausted. If C
> is pinned... good question. How does fairness work with pinned tasks?

In the case I described, C had its pocket picked by A.

> > > Do I correctly read between the lines that CFS maintains complete
> > > fairness only on a cpu, but not globally?
> >
> > Nothing between the lines about it. There are N individual engines,
> > coupled via load balancing.
>
> Is this not seen as a major deficiency?

Doesn't seem to be. That's what SMP-nice was all about. It's not
perfect, but seems to work well.

> I can understand intra-cpu scheduling decisions at 300 Hz and inter-cpu
> decisions at 10 Hz (or even lower, with some intermediate rate for
> intra-socket scheduling). But this looks like a major deviation from
> fairness - instead of 33%/33%/33% you get 50%/25%/25% depending on
> random placement.

Yeah, you have to bounce tasks around regularly to make that work out.

-Mike

2010-12-18 18:13:56

by Mike Galbraith

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Sat, 2010-12-18 at 19:08 +0200, Avi Kivity wrote:
> On 12/17/2010 09:15 AM, Mike Galbraith wrote:
> > BTW, with this vruntime donation thingy, what prevents a task from
> > forking off accomplices who do nothing but wait for a wakeup and
> > yield_to(exploit)?
> >
>
> What's the difference between that and forking off accomplices who
> run(exploit) directly?

The clock still advances. What I described is a clock stopper.

> Many threads dominating the scheduler has been solved by group
> scheduling. We need to make sure directed yield doesn't violate that,
> but I don't see new problems.

Let it become possible to stop clocks, and you surely will :)

-Mike

2010-12-19 06:08:58

by Avi Kivity

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/18/2010 09:13 PM, Mike Galbraith wrote:
> On Sat, 2010-12-18 at 19:08 +0200, Avi Kivity wrote:
> > On 12/17/2010 09:15 AM, Mike Galbraith wrote:
> > > BTW, with this vruntime donation thingy, what prevents a task from
> > > forking off accomplices who do nothing but wait for a wakeup and
> > > yield_to(exploit)?
> > >
> >
> > What's the difference between that and forking off accomplices who
> > run(exploit) directly?
>
> The clock still advances. What I described is a clock stopper.

That's scheduler jargon, I'm not familiar with scheduler internals.
Let's talk about this at a high level, since whenever I describe it in
scheduler terms I'm likely to get it wrong.

If there are N tasks on the machine, each is entitled to 1/N of the
machine's cpu resources (ignoring nice and cgroups for the moment).
What I'd like is for one task to temporarily pass a portion of its
entitlement to another. No other task's entitlement is affected; they
still get their 1/N.

If a task donates its entitlement and immediately commits suicide, we
still won't have fundamentally changed anything; the task could have kept
itself alive and consumed that entitlement, so the scheduler was already
prepared to give it this entitlement, and nothing was gained by the
forking task. The problem of fork() creating new entitlements out of
thin air has nothing to do with directed yield, and is solved by cgroups.

We already do something similar with priority inheritance. This doesn't
involve cpu entitlement, but we have the same situation where a task's
ability to make progress depends on another task receiving cpu time.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-12-19 06:21:31

by Avi Kivity

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/18/2010 09:06 PM, Mike Galbraith wrote:
> > > I can see the problem, and I'm not trying to be Mr. Negative here, I'm
> > > only trying to point out problems I see with what's been proposed.
> > >
> > > If the yielding task had a concrete fee he could pay, that would be
> > > fine, but he does not.
> >
> > It does. The yielding task is entitled to its fair share of the cpu, as
> > modified by priority and group scheduling. The yielding task is willing
> > to give up some of this cpu, in return for increasing another task's
> > share. Other tasks would not be negatively affected by this.
>
> The donor does absolutely NOT have any claim to any specific quantum of
> any cpu at any given time. There is no reservation, only a running
> clock.

Of course not, but we aren't interested in a specific quantum of time.
All we want is to boost task A, and to unboost the donor in order to
maintain fairness. We don't want A to run now (though it would be
nice); all we want is to run it before B.

> If you let one task move another task's clock backward, you open
> a can of starvation worms. This is exactly why vruntimes are monotonic.
>
> > > If he did have something, how often do you think it should be possible
> > > for task X to bribe the scheduler into selecting task Y?
> >
> > In extreme cases, very often. Say 100KHz.
>
> Hm, so it needs to be very cheap, and highly repeatable.
>
> What if: so you're trying to get spinners out of the way right? You
> somehow know they're spinning, so instead of trying to boost some task,
> can you do a directed yield in terms of directing a spinner that you
> have the right to diddle to yield. Drop his lag, and resched him. He's
> not accomplishing anything anyway.

There are a couple of problems with this approach:

- current yield() is a no-op
- even if it weren't, the process (containing the spinner and the
lock-holder) would yield as a whole. If it yielded for exactly the time
needed (until the lock holder releases the lock), it wouldn't matter,
since the spinner isn't accomplishing anything, but we don't know what
the exact time is. So we want to preserve our entitlement.

With a pure yield implementation the process would get less than its
fair share, even discounting spin time, which we'd be happy to donate to
the rest of the system.

> If the only thing running is virtualization, and nobody else can use the
> interface being invented, all is fair, but this passing of vruntime
> around is problematic when innocent bystanders may want to play too.

We definitely want to maintain fairness. Both with a dedicated virt
host and with a mixed workload.

> Forcing a spinning task to parity doesn't have the same problems.

Sorry, parse failure.

> > > Will his
> > > pockets be deep enough to actually solve the problem? Once he's
> > > yielded, he's out of the picture for a while if he really gave anything
> > > up.
> >
> > Unless the other task donates some cpu share back. This is exactly what
> > will happen in those extreme cases.
>
> So vruntime donation won't work.
>
> > > What happens to donated entitlement when the recipient goes to
> > > sleep?
> >
> > Nothing.
>
> It's vaporized by the sleep.
>
> > > If you try to give it back, what happens if the donor exited?
> >
> > It's lost, too bad.
>
> Yep, so much for accounting.

What's the problem exactly? What's the difference, system-wide, with
the donor continuing to run for that same entitlement? Other tasks see
the same thing.

> > > Where did the entitlement come from if task A running alone on cpu A
> > > tosses some entitlement over the fence to his pal task B on cpu B.. and
> > > keeps on trucking on cpu A? Where does that leave task C, B's
> > > competition?
> >
> > Eventually C would replace A, since its share will be exhausted. If C
> > is pinned... good question. How does fairness work with pinned tasks?
>
> In the case I described, C had it's pocket picked by A.

Would that happen if global fairness was maintained?

> > > > Do I correctly read between the lines that CFS maintains complete
> > > > fairness only on a cpu, but not globally?
> > >
> > > Nothing between the lines about it. There are N individual engines,
> > > coupled via load balancing.
> >
> > Is this not seen as a major deficiency?
>
> Doesn't seem to be. That's what SMP-nice was all about. It's not
> perfect, but seems to work well.

I guess random perturbations cause task migrations periodically and
things balance out. But it seems weird to have this devotion to
fairness on a single cpu and completely ignore fairness on a macro level.

> > I can understand intra-cpu scheduling decisions at 300 Hz and inter-cpu
> > decisions at 10 Hz (or even lower, with some intermediate rate for
> > intra-socket scheduling). But this looks like a major deviation from
> > fairness - instead of 33%/33%/33% you get 50%/25%/25% depending on
> > random placement.
>
> Yeah, you have to bounce tasks around regularly to make that work out.

That's fine as long as the bounce period is significantly larger than
the cost of a migration.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-12-19 09:06:12

by Mike Galbraith

Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Sun, 2010-12-19 at 08:21 +0200, Avi Kivity wrote:
> On 12/18/2010 09:06 PM, Mike Galbraith wrote:

> > Hm, so it needs to be very cheap, and highly repeatable.
> >
> > What if: so you're trying to get spinners out of the way right? You
> > somehow know they're spinning, so instead of trying to boost some task,
> > can you do a directed yield in terms of directing a spinner that you
> > have the right to diddle to yield. Drop his lag, and resched him. He's
> > not accomplishing anything anyway.
>
> There are a couple of problems with this approach:
>
> - current yield() is a no-op

That's why you'd drop lag, set to max(se->vruntime, cfs_rq->min_vruntime).
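
Roughly, that could look like the hypothetical helper below; this is only
an illustration of the "drop lag and resched" idea, with an invented name,
and is not part of this patch series:

/*
 * Hypothetical sketch: a spinning task forfeits whatever lag it has
 * accumulated instead of boosting anyone else.  Clamping to
 * min_vruntime means its clock never moves backward.
 */
static void yield_drop_lag(struct rq *rq)
{
        struct sched_entity *se = &rq->curr->se;
        struct cfs_rq *cfs_rq = cfs_rq_of(se);

        se->vruntime = max(se->vruntime, cfs_rq->min_vruntime);
        resched_task(rq->curr); /* give someone else a turn now */
}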

> - even if it weren't, the process (containing the spinner and the
> lock-holder) would yield as a whole.

I don't get this part. How does the whole process yield if one thread
yields?

> If it yielded for exactly the time
> needed (until the lock holder releases the lock), it wouldn't matter,
> since the spinner isn't accomplishing anything, but we don't know what
> the exact time is. So we want to preserve our entitlement.

And that's the hard part. If you can drop lag, you may hurt yourself, but
at least only yourself.

> With a pure yield implementation the process would get less than its
> fair share, even discounting spin time, which we'd be happy to donate to
> the rest of the system.
>
> > If the only thing running is virtualization, and nobody else can use the
> > interface being invented, all is fair, but this passing of vruntime
> > around is problematic when innocent bystanders may want to play too.
>
> We definitely want to maintain fairness. Both with a dedicated virt
> host and with a mixed workload.

That makes it difficult to the point of impossible.

You want a specific task to run NOW for good reasons, but any number of
tasks may want the same godlike power for equally good reasons.

You could create a force select which only godly tasks could use that
didn't try to play games with vruntimes, just let the bugger run, and
let him also eat the latency hit he'll pay for that extra bit of cpu IFF
you didn't care about being able to mix loads.

Or, you could just bump his nice level with an automated return to
previous level on resched.

Any intervention has unavoidable consequences for all comers though.

> > Forcing a spinning task to parity doesn't have the same problems.
>
> Sorry, parse failure.

(dropping lag on the floor)

> > > > Will his
> > > > pockets be deep enough to actually solve the problem? Once he's
> > > > yielded, he's out of the picture for a while if he really gave anything
> > > > up.
> > >
> > > Unless the other task donates some cpu share back. This is exactly what
> > > will happen in those extreme cases.
> >
> > So vruntime donation won't work.
> >
> > > > What happens to donated entitlement when the recipient goes to
> > > > sleep?
> > >
> > > Nothing.
> >
> > It's vaporized by the sleep.
> >
> > > > If you try to give it back, what happens if the donor exited?
> > >
> > > It's lost, too bad.
> >
> > Yep, so much for accounting.
>
> What's the problem exactly? What's the difference, system-wide, with
> the donor continuing to run for that same entitlement? Other tasks see
> the same thing.

SOME tasks receive gifts from the void. The difference is the bias.

> > > > Where did the entitlement come from if task A running alone on cpu A
> > > > tosses some entitlement over the fence to his pal task B on cpu B.. and
> > > > keeps on trucking on cpu A? Where does that leave task C, B's
> > > > competition?
> > >
> > > Eventually C would replace A, since its share will be exhausted. If C
> > > is pinned... good question. How does fairness work with pinned tasks?
> >
> > In the case I described, C had its pocket picked by A.
>
> Would that happen if global fairness was maintained?

What's that? :) No task may run until there are enough of you to fill
the box? God help you when somebody else wakes up Mr. Early-bird? ...

> > > > > Do I correctly read between the lines that CFS maintains complete
> > > > > fairness only on a cpu, but not globally?
> > > >
> > > > Nothing between the lines about it. There are N individual engines,
> > > > coupled via load balancing.
> > >
> > > Is this not seen as a major deficiency?
> >
> > Doesn't seem to be. That's what SMP-nice was all about. It's not
> > perfect, but seems to work well.
>
> I guess random perturbations cause task migrations periodically and
> things balance out. But it seems weird to have this devotion to
> fairness on a single cpu and completely ignore fairness on a macro level.

It doesn't ignore it completely, it just doesn't try to do all the math
continuously (danger Will Robinson: Peter has scary patches). Prodding
it in the right general direction with migrations is cheaper.

-Mike

2010-12-19 09:19:42

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/19/2010 12:05 PM, Mike Galbraith wrote:
> On Sun, 2010-12-19 at 08:21 +0200, Avi Kivity wrote:
> > On 12/18/2010 09:06 PM, Mike Galbraith wrote:
>
> > > Hm, so it needs to be very cheap, and highly repeatable.
> > >
> > > What if: so you're trying to get spinners out of the way right? You
> > > somehow know they're spinning, so instead of trying to boost some task,
> > > can you do a directed yield in terms of directing a spinner that you
> > > have the right to diddle to yield. Drop his lag, and resched him. He's
> > > not accomplishing anything anyway.
> >
> > There are a couple of problems with this approach:
> >
> > - current yield() is a no-op
>
> That's why you'd drop lag, set to max(se->vruntime, cfs_rq->min_vruntime).

Internal scheduler terminology again, don't follow.

> > - even if it weren't, the process (containing the spinner and the
> > lock-holder) would yield as a whole.
>
> I don't get this part. How does the whole process yield if one thread
> yields?

The process is the sum of its threads. If a thread's yield loses 1 msec
of runtime due to the yield, the process loses 1 msec due to the yield.
If the lock is held for, say, 100 usec, it would be better for the
process to spin rather than yield.

With directed yield the process loses nothing by yielding to one of its
threads.

> > If it yielded for exactly the time
> > needed (until the lock holder releases the lock), it wouldn't matter,
> > since the spinner isn't accomplishing anything, but we don't know what
> > the exact time is. So we want to preserve our entitlement.
>
> And that's the hard part. If you can drop lag, you may hurt yourself, but
> at least only yourself.

We already have a "hurt only yourself" thing. We sleep for 100 usec
when we detect spinning. It's awful.

> > With a pure yield implementation the process would get less than its
> > fair share, even discounting spin time, which we'd be happy to donate to
> > the rest of the system.

We aren't happy to donate it to the rest of the system, since it will
cause a guest with lots of internal contention to make very little
forward progress.

> >
> > > If the only thing running is virtualization, and nobody else can use the
> > > interface being invented, all is fair, but this passing of vruntime
> > > around is problematic when innocent bystanders may want to play too.
> >
> > We definitely want to maintain fairness. Both with a dedicated virt
> > host and with a mixed workload.
>
> That makes it difficult to the point of impossible.
>
> You want a specific task to run NOW for good reasons, but any number of
> tasks may want the same godlike power for equally good reasons.

I don't want it to run now. I want it to run before some other task. I
don't care if N other tasks run before both. So no godlike powers
needed, simply a courteous "after you".

> You could create a force select which only godly tasks could use that
> didn't try to play games with vruntimes, just let the bugger run, and
> let him also eat the latency hit he'll pay for that extra bit of cpu IFF
> you didn't care about being able to mix loads.
>
> Or, you could just bump his nice level with an automated return to
> previous level on resched.
>
> Any intervention has unavoidable consequences for all comers though.

Since task A is running now, clearly the scheduler thinks it deserves to
run. What I want to do is take just enough of the "deserves" part to
make it not run any more, and move it to task B.
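
In CFS terms, the cartoon version of that donation would be (this is just
the idea under debate, not any posted patch, and "delta" stands for however
much entitlement we decide to move, which is the hard part):

        current->se.vruntime += delta;  /* donor becomes less entitled        */
        p->se.vruntime       -= delta;  /* recipient goes backward, i.e. it
                                           becomes more entitled to run       */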

> > >
> > > Yep, so much for accounting.
> >
> > What's the problem exactly? What's the difference, system-wide, with
> > the donor continuing to run for that same entitlement? Other tasks see
> > the same thing.
>
> SOME tasks receive gifts from the void. The difference is the bias.

Isn't fork() a gift from the void?

> > > > > Where did the entitlement come from if task A running alone on cpu A
> > > > > tosses some entitlement over the fence to his pal task B on cpu B.. and
> > > > > keeps on trucking on cpu A? Where does that leave task C, B's
> > > > > competition?
> > > >
> > > > Eventually C would replace A, since its share will be exhausted. If C
> > > > is pinned... good question. How does fairness work with pinned tasks?
> > >
> > > In the case I described, C had its pocket picked by A.
> >
> > Would that happen if global fairness was maintained?
>
> What's that? :)

If you run three tasks on a two cpu box, each gets 2/3 of a cpu.


> No task may run until there are enough of you to fill
> the box?

Why is that a consequence of global fairness? three tasks get 100% cpu
on a 4-cpu box, the fourth cpu idles. Is that not fair for some reason?

> God help you when somebody else wakes up Mr. Early-bird? ...

What?

> >
> > I guess random perturbations cause task migrations periodically and
> > things balance out. But it seems weird to have this devotion to
> > fairness on a single cpu and completely ignore fairness on a macro level.
>
> It doesn't ignore it completely, it just doesn't try to do all the math
> continuously (danger Will Robinson: Peter has scary patches). Prodding
> it in the right general direction with migrations is cheaper.

Doesn't seem to work from my brief experiment.

--
error compiling committee.c: too many arguments to function

2010-12-19 10:18:06

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Sun, 2010-12-19 at 11:19 +0200, Avi Kivity wrote:
> On 12/19/2010 12:05 PM, Mike Galbraith wrote:

> > That's why you'd drop lag, set to max(se->vruntime, cfs_rq->min_vruntime).
>
> Internal scheduler terminology again, don't follow.

(distance to the fair stick, worthiness to receive cpu)

> > > - even if it weren't, the process (containing the spinner and the
> > > lock-holder) would yield as a whole.
> >
> > I don't get this part. How does the whole process yield if one thread
> > yields?
>
> The process is the sum of its threads. If a thread's yield loses 1 msec
> of runtime due to the yield, the process loses 1 msec due to the yield.
> If the lock is held for, say, 100 usec, it would be better for the
> process to spin rather than yield.
>
> With directed yield the process loses nothing by yielding to one of its
> threads.
>
> > > If it yielded for exactly the time
> > > needed (until the lock holder releases the lock), it wouldn't matter,
> > > since the spinner isn't accomplishing anything, but we don't know what
> > > the exact time is. So we want to preserve our entitlement.
> >
> > And that's the hard part. If you can drop lag, you may hurt yourself, but
> > at least only yourself.
>
> We already have a "hurt only yourself" thing. We sleep for 100 usec
> when we detect spinning. It's awful.

Wondered about that, awful makes sense.

> > You want a specific task to run NOW for good reasons, but any number of
> > tasks may want the same godlike power for equally good reasons.
>
> I don't want it to run now. I want it to run before some other task. I
> don't care if N other tasks run before both. So no godlike powers
> needed, simply a courteous "after you".

If behaviors are very similar, and tasks are not likely to try to
exploit it (as described), you can likely swap lags without horrible
consequences.

I'm just pointing out the dangers.

> > > What's the problem exactly? What's the difference, system-wide, with
> > > the donor continuing to run for that same entitlement? Other tasks see
> > > the same thing.
> >
> > SOME tasks receive gifts from the void. The difference is the bias.
>
> Isn't fork() a gift from the void?

From the process's perspective, yup. It can't stop time though.

> > > > > > Where did the entitlement come from if task A running alone on cpu A
> > > > > > tosses some entitlement over the fence to his pal task B on cpu B.. and
> > > > > > keeps on trucking on cpu A? Where does that leave task C, B's
> > > > > > competition?
> > > > >
> > > > > Eventually C would replace A, since its share will be exhausted. If C
> > > > > is pinned... good question. How does fairness work with pinned tasks?
> > > >
> > > > In the case I described, C had its pocket picked by A.
> > >
> > > Would that happen if global fairness was maintained?
> >
> > What's that? :)
>
> If you run three tasks on a two cpu box, each gets 2/3 of a cpu.
>
>
> > No task may run until there are enough of you to fill
> > the box?
>
> Why is that a consequence of global fairness? three tasks get 100% cpu
> on a 4-cpu box, the fourth cpu idles. Is that not fair for some reason?

Depends on fair reference frame, but..

> > God help you when somebody else wakes up Mr. Early-bird? ...
>
> What?

..I was just trying to say that "global fairness" is not well defined.

Never mind.

-Mike

2010-12-20 08:39:36

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Sun, 2010-12-19 at 11:19 +0200, Avi Kivity wrote:
> On 12/19/2010 12:05 PM, Mike Galbraith wrote:

> > > We definitely want to maintain fairness. Both with a dedicated virt
> > > host and with a mixed workload.
> >
> > That makes it difficult to the point of impossible.
> >
> > You want a specific task to run NOW for good reasons, but any number of
> > tasks may want the same godlike power for equally good reasons.
>
> I don't want it to run now. I want it to run before some other task. I
> don't care if N other tasks run before both. So no godlike powers
> needed, simply a courteous "after you".

Ponders that...

What if: we test that both tasks are in the same thread group, if so,
use cfs_rq->next to pass the scheduler a HINT of what you would LIKE to
happen. If the current task on that rq is also in your group, resched
it, then IFF the task you would like to run isn't too far right, it'll
be selected. If the current task is not one of yours, tough, you can
set cfs_rq->next and hope it doesn't get overwritten, but you may not
preempt a stranger. If you happen to be sharing an rq, cool, you
accomplished your yield_to(). If not, there's no practical way (I can
think of) to ensure that the target runs before you run again if you try
to yield, but you did your best to try to get him to the cpu sooner, and
in a manner that preserves fairness without dangerous vruntime diddling.
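
In code terms the hint-only part of that is tiny; roughly the following,
with p the task we want to help (paraphrased, locking omitted; the
fleshed-out version is the patch posted later in this thread):

        /* We'd like p to be picked next on its runqueue, nothing more. */
        if (same_thread_group(current, p)) {
                cfs_rq_of(&p->se)->next = &p->se;

                /* If the task currently on p's cpu is also one of ours,
                 * it is fair game to preempt, getting p looked at sooner. */
                if (same_thread_group(current, task_rq(p)->curr))
                        resched_task(task_rq(p)->curr);
        }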

Would that be good enough to stop (or seriously improve) cpu wastage?

-Mike

2010-12-20 08:45:24

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/20/2010 10:39 AM, Mike Galbraith wrote:
> On Sun, 2010-12-19 at 11:19 +0200, Avi Kivity wrote:
> > On 12/19/2010 12:05 PM, Mike Galbraith wrote:
>
> > > > We definitely want to maintain fairness. Both with a dedicated virt
> > > > host and with a mixed workload.
> > >
> > > That makes it difficult to the point of impossible.
> > >
> > > You want a specific task to run NOW for good reasons, but any number of
> > > tasks may want the same godlike power for equally good reasons.
> >
> > I don't want it to run now. I want it to run before some other task. I
> > don't care if N other tasks run before both. So no godlike powers
> > needed, simply a courteous "after you".
>
> Ponders that...
>
> What if: we test that both tasks are in the same thread group, if so,

In my use case, both tasks are in the same thread group, so that works.
I don't see why the requirement is needed though.

> use cfs_rq->next to pass the scheduler a HINT of what you would LIKE to
> happen.

Hint is fine, so long as the scheduler seriously considers it.

> If the current task on that rq is also in your group, resched
> it, then IFF the task you would like to run isn't too far right, it'll
> be selected. If the current task is not one of yours, tough, you can
> set cfs_rq->next and hope it doesn't get overwritten, but you may not
> preempt a stranger. If you happen to be sharing an rq, cool, you
> accomplished your yield_to(). If not, there's no practical way (I can
> think of) to ensure that the target runs before you run again if you try
> to yield, but you did your best to try to get him to the cpu sooner, and
> in a manner that preserves fairness without dangerous vruntime diddling.
>
> Would that be good enough to stop (or seriously improve) cpu wastage?

The cross-cpu limitation is bothersome. Since there are many cpus in
modern machines, particularly ones used for virt, the probability of the
two tasks being on the same cpu is quite low.

--
error compiling committee.c: too many arguments to function

2010-12-20 08:55:08

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Mon, 2010-12-20 at 10:45 +0200, Avi Kivity wrote:
> On 12/20/2010 10:39 AM, Mike Galbraith wrote:
> > On Sun, 2010-12-19 at 11:19 +0200, Avi Kivity wrote:
> > > On 12/19/2010 12:05 PM, Mike Galbraith wrote:
> >
> > > > > We definitely want to maintain fairness. Both with a dedicated virt
> > > > > host and with a mixed workload.
> > > >
> > > > That makes it difficult to the point of impossible.
> > > >
> > > > You want a specific task to run NOW for good reasons, but any number of
> > > > tasks may want the same godlike power for equally good reasons.
> > >
> > > I don't want it to run now. I want it to run before some other task. I
> > > don't care if N other tasks run before both. So no godlike powers
> > > needed, simply a courteous "after you".
> >
> > Ponders that...
> >
> > What if: we test that both tasks are in the same thread group, if so,
>
> In my use case, both tasks are in the same thread group, so that works.
> I don't see why the requirement is needed though.

Because preempting a perfect stranger is not courteous, all tasks have
to play nice.

> > use cfs_rq->next to pass the scheduler a HINT of what you would LIKE to
> > happen.
>
> Hint is fine, so long as the scheduler seriously considers it.

It will take the hint if the target hasn't had too much cpu.

> > If the current task on that rq is also in your group, resched
> > it, then IFF the task you would like to run isn't too far right, it'll
> > be selected. If the current task is not one of yours, tough, you can
> > set cfs_rq->next and hope it doesn't get overwritten, but you may not
> > preempt a stranger. If you happen to be sharing an rq, cool, you
> > accomplished your yield_to(). If not, there's no practical way (I can
> > think of) to ensure that the target runs before you run again if you try
> > to yield, but you did your best to try to get him to the cpu sooner, and
> > in a manner that preserves fairness without dangerous vruntime diddling.
> >
> > Would that be good enough to stop (or seriously improve) cpu wastage?
>
> The cross-cpu limitation is bothersome. Since there are many cpus in
> modern machines, particularly ones used for virt, the probability of the
> two tasks being on the same cpu is quite low.

What would you suggest? There is no global execution timeline, so if
you want to definitely run after this task, you're stuck with moving to
his timezone or moving him to yours. Well, you could sleep a while, but
we know how productive sleeping is.

-Mike


2010-12-20 09:03:39

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/20/2010 10:55 AM, Mike Galbraith wrote:
> > > >
> > > > I don't want it to run now. I want it to run before some other task. I
> > > > don't care if N other tasks run before both. So no godlike powers
> > > > needed, simply a courteous "after you".
> > >
> > > Ponders that...
> > >
> > > What if: we test that both tasks are in the same thread group, if so,
> >
> > In my use case, both tasks are in the same thread group, so that works.
> > I don't see why the requirement is needed though.
>
> Because preempting a perfect stranger is not courteous, all tasks have
> to play nice.

I don't want to preempt anybody, simply make the task run before me.

Further, this is a kernel internal API, so no need for these types of
restrictions. If we expose it to userspace, sure.

> > > use cfs_rq->next to pass the scheduler a HINT of what you would LIKE to
> > > happen.
> >
> > Hint is fine, so long as the scheduler seriously considers it.
>
> It will take the hint if the target hasn't had too much cpu.

Since I'm running and the target isn't, it's clear the scheduler thinks
the target had more cpu than I did [73]. That's why I want to donate
cpu time.

[73] at least it'd be clear if the scheduler were globally fair. As it
is, I might be the only task running on my cpu, therefore in a cpu glut,
while the other task shares the cpu with some other task and is
currently waiting for its turn.

> > > If the current task on that rq is also in your group, resched
> > > it, then IFF the task you would like to run isn't too far right, it'll
> > > be selected. If the current task is not one of yours, tough, you can
> > > set cfs_rq->next and hope it doesn't get overwritten, but you may not
> > > preempt a stranger. If you happen to be sharing an rq, cool, you
> > > accomplished your yield_to(). If not, there's no practical way (I can
> > > think of) to ensure that the target runs before you run again if you try
> > > to yield, but you did your best to try to get him to the cpu sooner, and
> > > in a manner that preserves fairness without dangerous vruntime diddling.
> > >
> > > Would that be good enough to stop (or seriously improve) cpu wastage?
> >
> > The cross-cpu limitation is bothersome. Since there are many cpus in
> > modern machines, particularly ones used for virt, the probability of the
> > two tasks being on the same cpu is quite low.
>
> What would you suggest? There is no global execution timeline, so if
> you want to definitely run after this task, you're stuck with moving to
> his timezone or moving him to yours. Well, you could sleep a while, but
> we know how productive sleeping is.

I don't know. The whole idea of donating runtime was predicated on CFS
being completely fair. Now I find that (a) it isn't (b) donating
runtimes between tasks on different cpus is problematic.

Moving tasks between cpus is expensive and sometimes prohibited by
pinning. I'd like to avoid it if possible, but it's better than nothing.

--
error compiling committee.c: too many arguments to function

2010-12-20 09:33:21

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Mon, 2010-12-20 at 11:03 +0200, Avi Kivity wrote:
> On 12/20/2010 10:55 AM, Mike Galbraith wrote:
> > > > >
> > > > > I don't want it to run now. I want it to run before some other task. I
> > > > > don't care if N other tasks run before both. So no godlike powers
> > > > > needed, simply a courteous "after you".
> > > >
> > > > Ponders that...
> > > >
> > > > What if: we test that both tasks are in the same thread group, if so,
> > >
> > > In my use case, both tasks are in the same thread group, so that works.
> > > I don't see why the requirement is needed though.
> >
> > Because preempting a perfect stranger is not courteous, all tasks have
> > to play nice.
>
> I don't want to preempt anybody, simply make the task run before me.

I thought you wanted to get the target to the cpu asap? You just can't
have "he runs before me" cross cpu.

> Further, this is a kernel internal API, so no need for these types of
> restrictions. If we expose it to userspace, sure.

Doesn't matter whether it's kernel or not afaict. If virtualization has
to coexist peacefully with other loads, it can't just say "my hints are
the only ones that count", and thus shred other loads throughput.

> > > > use cfs_rq->next to pass the scheduler a HINT of what you would LIKE to
> > > > happen.
> > >
> > > Hint is fine, so long as the scheduler seriously considers it.
> >
> > It will take the hint if the target hasn't had too much cpu.
>
> Since I'm running and the target isn't, it's clear the scheduler thinks
> the target had more cpu than I did [73]. That's why I want to donate
> cpu time.

That's not necessarily true, in fact, it's very often false. Last/next
buddy will allow a task to run ahead of leftmost so we don't always
blindly select leftmost and shred cache.
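
For reference, the selection Mike is describing looks roughly like this in
pick_next_entity(), condensed from the sched_fair.c of this era:

        struct sched_entity *left = __pick_next_entity(cfs_rq); /* leftmost */
        struct sched_entity *se = left;

        /* Prefer the "next" (and similarly the "last") buddy, but only
         * if it hasn't run too far ahead of the leftmost entity. */
        if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
                se = cfs_rq->next;

so a ->next hint is honored, but only within the scheduler's normal
wakeup-preemption bounds.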

> [73] at least it'd be clear if the scheduler were globally fair. As it
> is, I might be the only task running on my cpu, therefore in a cpu glut,
> while the other task shares the cpu with some other task and is
> currently waiting for its turn.
>
> > > > If the current task on that rq is also in your group, resched
> > > > it, then IFF the task you would like to run isn't too far right, it'll
> > > > be selected. If the current task is not one of yours, tough, you can
> > > > set cfs_rq->next and hope it doesn't get overwritten, but you may not
> > > > preempt a stranger. If you happen to be sharing an rq, cool, you
> > > > accomplished your yield_to(). If not, there's no practical way (I can
> > > > think of) to ensure that the target runs before you run again if you try
> > > > to yield, but you did your best to try to get him to the cpu sooner, and
> > > > in a manner that preserves fairness without dangerous vruntime diddling.
> > > >
> > > > Would that be good enough to stop (or seriously improve) cpu wastage?
> > >
> > > The cross-cpu limitation is bothersome. Since there are many cpus in
> > > modern machines, particularly ones used for virt, the probability of the
> > > two tasks being on the same cpu is quite low.
> >
> > What would you suggest? There is no global execution timeline, so if
> > you want to definitely run after this task, you're stuck with moving to
> > his timezone or moving him to yours. Well, you could sleep a while, but
> > we know how productive sleeping is.
>
> I don't know. The whole idea of donating runtime was predicated on CFS
> being completely fair. Now I find that (a) it isn't (b) donating
> runtimes between tasks on different cpus is problematic.

True and true. However, would you _want_ the scheduler to hold runnable
tasks hostage, and thus let CPU go to waste in the name of perfect
fairness? Perfect is the enemy of good applies to that idea imho.

> Moving tasks between cpus is expensive and sometimes prohibited by
> pinning. I'd like to avoid it if possible, but it's better than nothing.

Expensive in many ways, so let's try to not do that.

So why do you need this other task to run before you do, even cross cpu?
If he's a lock holder, getting him to the cpu will give him a chance to
drop, no? Isn't that what you want to get done? Drop that lock so you
or someone else can get something other than spinning done?

-Mike

2010-12-20 09:46:43

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/20/2010 11:30 AM, Mike Galbraith wrote:
> > >
> > > Because preempting a perfect stranger is not courteous, all tasks have
> > > to play nice.
> >
> > I don't want to preempt anybody, simply make the task run before me.
>
> I thought you wanted to get the target to the cpu asap? You just can't
> have "he runs before me" cross cpu.

You're right, of course. I'm fine with running in parallel. I'm fine
with him running before or instead of me. I'm not fine with running
while the other guy is waiting.

> > Further, this is a kernel internal API, so no need for these types of
> > restrictions. If we expose it to userspace, sure.
>
> Doesn't matter whether it's kernel or not afaict. If virtualization has
> to coexist peacefully with other loads, it can't just say "my hints are
> the only ones that count", and thus shred other loads throughput.

What does that have to do with being in the same group or not? I want
to maintain fairness (needed for pure virt workloads, one guest cannot
dominate another), but I don't see how being in the same thread group is
relevant.

Again, I don't want more than one entitlement. I want to move part of
my entitlement to another task.

> > > > > use cfs_rq->next to pass the scheduler a HINT of what you would LIKE to
> > > > > happen.
> > > >
> > > > Hint is fine, so long as the scheduler seriously considers it.
> > >
> > > It will take the hint if the target hasn't had too much cpu.
> >
> > Since I'm running and the target isn't, it's clear the scheduler thinks
> > the target had more cpu than I did [73]. That's why I want to donate
> > cpu time.
>
> That's not necessarily true, in fact, it's very often false. Last/next
> buddy will allow a task to run ahead of leftmost so we don't always
> blindly select leftmost and shred cache.

Ok.

> > >
> > > What would you suggest? There is no global execution timeline, so if
> > > you want to definitely run after this task, you're stuck with moving to
> > > his timezone or moving him to yours. Well, you could sleep a while, but
> > > we know how productive sleeping is.
> >
> > I don't know. The whole idea of donating runtime was predicated on CFS
> > being completely fair. Now I find that (a) it isn't (b) donating
> > runtimes between tasks on different cpus is problematic.
>
> True and true. However, would you _want_ the scheduler to hold runnable
> tasks hostage, and thus let CPU go to waste in the name of perfect
> fairness? Perfect is the enemy of good applies to that idea imho.

Sorry, I don't see how it follows.

> > Moving tasks between cpus is expensive and sometimes prohibited by
> > pinning. I'd like to avoid it if possible, but it's better than nothing.
>
> Expensive in many ways, so let's try to not do that.
>
> So why do you need this other task to run before you do, even cross cpu?
> If he's a lock holder, getting him to the cpu will give him a chance to
> drop, no? Isn't that what you want to get done? Drop that lock so you
> or someone else can get something other than spinning done?

Correct. I don't want the other task to run before me, I just don't
want to run before it.

--
error compiling committee.c: too many arguments to function

2010-12-20 10:33:21

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Mon, 2010-12-20 at 11:46 +0200, Avi Kivity wrote:
> On 12/20/2010 11:30 AM, Mike Galbraith wrote:
> > > >
> > > > Because preempting a perfect stranger is not courteous, all tasks have
> > > > to play nice.
> > >
> > > I don't want to preempt anybody, simply make the task run before me.
> >
> > I thought you wanted to get the target to the cpu asap? You just can't
> > have "he runs before me" cross cpu.
>
> You're right, of course. I'm fine with running in parallel. I'm fine
> with him running before or instead of me. I'm not fine with running
> while the other guy is waiting.

Goody, maybe we're headed down a productive path then.

> > > Further, this is a kernel internal API, so no need for these types of
> > > restrictions. If we expose it to userspace, sure.
> >
> > Doesn't matter whether it's kernel or not afaict. If virtualization has
> > to coexist peacefully with other loads, it can't just say "my hints are
> > the only ones that count", and thus shred other loads throughput.
>
> What does that have to do with being in the same group or not? I want
> to maintain fairness (needed for pure virt workloads, one guest cannot
> dominate another), but I don't see how being in the same thread group is
> relevant.

My thought is that you can shred your own throughput, but not some other
concurrent load. I'll have to let that thought stew a bit though.

> Again, I don't want more than one entitlement. I want to move part of
> my entitlement to another task.

Folks can keep trying that, but IMO it's too broken to live.

> > > > > > use cfs_rq->next to pass the scheduler a HINT of what you would LIKE to
> > > > > > happen.
> > > > >
> > > > > Hint is fine, so long as the scheduler seriously considers it.
> > > >
> > > > It will take the hint if the target hasn't had too much cpu.
> > >
> > > Since I'm running and the target isn't, it's clear the scheduler thinks
> > > the target had more cpu than I did [73]. That's why I want to donate
> > > cpu time.
> >
> > That's not necessarily true, in fact, it's very often false. Last/next
> > buddy will allow a task to run ahead of leftmost so we don't always
> > blindly select leftmost and shred cache.
>
> Ok.
>
> > > >
> > > > What would you suggest? There is no global execution timeline, so if
> > > > you want to definitely run after this task, you're stuck with moving to
> > > > his timezone or moving him to yours. Well, you could sleep a while, but
> > > > we know how productive sleeping is.
> > >
> > > I don't know. The whole idea of donating runtime was predicated on CFS
> > > being completely fair. Now I find that (a) it isn't (b) donating
> > > runtimes between tasks on different cpus is problematic.
> >
> > True and true. However, would you _want_ the scheduler to hold runnable
> > tasks hostage, and thus let CPU go to waste in the name of perfect
> > fairness? Perfect is the enemy of good applies to that idea imho.
>
> Sorry, I don't see how it follows.

Let's just forget theoretical views, and concentrate on a forward path.

> > > Moving tasks between cpus is expensive and sometimes prohibited by
> > > pinning. I'd like to avoid it if possible, but it's better than nothing.
> >
> > Expensive in many ways, so let's try to not do that.
> >
> > So why do you need this other task to run before you do, even cross cpu?
> > If he's a lock holder, getting him to the cpu will give him a chance to
> > drop, no? Isn't that what you want to get done? Drop that lock so you
> > or someone else can get something other than spinning done?
>
> Correct. I don't want the other task to run before me, I just don't
> want to run before it.

OK, so what I gather is that if you can preempt another of your own
threads to get the target to cpu, that would be a good thing whether
he's on the same cpu as yield_to() caller or not. If the target is
sharing a cpu with you, that's even better. Correct?

Would a kick/hint option be useful?

-Mike

2010-12-20 10:39:54

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/20/2010 12:33 PM, Mike Galbraith wrote:
> >
> > Correct. I don't want the other task to run before me, I just don't
> > want to run before it.
>
> OK, so what I gather is that if you can preempt another of your own
> threads to get the target to cpu, that would be a good thing whether
> he's on the same cpu as yield_to() caller or not. If the target is
> sharing a cpu with you, that's even better. Correct?

Correct. I'm not interested in scheduling wrt other tasks, just my tasks.

However, if I'm all alone on my cpu, and the other task is runnable but
not running, behind some unrelated task, then I do need that task to be
preempted (or to move tasks around).

> Would a kick/hint option be useful?

Depends on what it does...

--
error compiling committee.c: too many arguments to function

2010-12-20 10:46:52

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Mon, 2010-12-20 at 12:39 +0200, Avi Kivity wrote:
> On 12/20/2010 12:33 PM, Mike Galbraith wrote:
> > >
> > > Correct. I don't want the other task to run before me, I just don't
> > > want to run before it.
> >
> > OK, so what I gather is that if you can preempt another of your own
> > threads to get the target to cpu, that would be a good thing whether
> > he's on the same cpu as yield_to() caller or not. If the target is
> > sharing a cpu with you, that's even better. Correct?
>
> Correct. I'm not interested in scheduling wrt other tasks, just my tasks.
>
> However, if I'm all alone on my cpu, and the other task is runnable but
> not running, behind some unrelated task, then I do need that task to be
> preempted (or to move tasks around).

So in that case, a pull may be advantageous.

> > Would a kick/hint option be useful?
>
> Depends on what it does...

Let you decide whether you only want to drop a hint and leave it at
that, or also attempt a preemption.

-Mike

2010-12-20 10:50:03

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/20/2010 12:46 PM, Mike Galbraith wrote:
> >
> > However, if I'm all alone on my cpu, and the other task is runnable but
> > not running, behind some unrelated task, then I do need that task to be
> > preempted (or to move tasks around).
>
> So in that case, a pull may be advantageous.

Yes.

> > > Would a kick/hint option be useful?
> >
> > Depends on what it does...
>
> Let you decide whether you only want to drop a hint and leave it at
> that, or also attempt a preemption.

Who is "you" in this? the scheduler?

I'm fine with hints so long as they are usually acted upon (if it isn't,
I'll go back to the guest, spin a bit more, and retry).

--
error compiling committee.c: too many arguments to function

2010-12-20 10:50:49

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Mon, 2010-12-20 at 12:49 +0200, Avi Kivity wrote:
> On 12/20/2010 12:46 PM, Mike Galbraith wrote:
> > >
> > > However, if I'm all alone on my cpu, and the other task is runnable but
> > > not running, behind some unrelated task, then I do need that task to be
> > > preempted (or to move tasks around).
> >
> > So in that case, a pull may be advantageous.
>
> Yes.
>
> > > > Would a kick/hint option be useful?
> > >
> > > Depends on what it does...
> >
> > Let you decide whether you only want to drop a hint and leave it at
> > that, or also attempt a preemption.
>
> Who is "you" in this? the scheduler?

The caller.

> I'm fine with hints so long as they are usually acted upon (if it isn't,
> I'll go back to the guest, spin a bit more, and retry).


2010-12-20 11:07:13

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/20/2010 12:50 PM, Mike Galbraith wrote:
> >
> > > > > Would a kick/hint option be useful?
> > > >
> > > > Depends on what it does...
> > >
> > > Let you decide whether you only want to drop a hint and leave it at
> > > that, or also attempt a preemption.
> >
> > Who is "you" in this? the scheduler?
>
> The caller.
>

I guess it works. We can start out without preemption, if we see we
spin too much, preempt.

--
error compiling committee.c: too many arguments to function

2010-12-20 15:40:44

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/17/2010 02:15 AM, Mike Galbraith wrote:

> BTW, with this vruntime donation thingy, what prevents a task from
> forking off accomplices who do nothing but wait for a wakeup and
> yield_to(exploit)?
>
> Even swapping vruntimes in the same cfs_rq is dangerous as hell, because
> one party is going backward.

I just realized the answer to this question.

We only give cpu time to tasks that are runnable, but not
currently running. That ensures one task cannot block others
from running by having time yielded to it constantly.
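
In code terms the rule is just an entry check; a sketch of the guard, using
the same predicates that appear at the top of the yield_to() posted below
(names as in that function):

        /* Only donate to a target that is queued but not on a cpu. */
        if (task_running(p_rq, p) || p->state || !p->se.on_rq)
                return;

A task that is already running, sleeping, or not enqueued gets nothing.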

2010-12-20 16:04:20

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Mon, 2010-12-20 at 10:40 -0500, Rik van Riel wrote:
> On 12/17/2010 02:15 AM, Mike Galbraith wrote:
>
> > BTW, with this vruntime donation thingy, what prevents a task from
> > forking off accomplices who do nothing but wait for a wakeup and
> > yield_to(exploit)?
> >
> > Even swapping vruntimes in the same cfs_rq is dangerous as hell, because
> > one party is going backward.
>
> I just realized the answer to this question.
>
> We only give cpu time to tasks that are runnable, but not
> currently running. That ensures one task cannot block others
> from running by having time yielded to it constantly.

Hm. Don't think that will 100% sure prevent clock stoppage, because the
running task doesn't necessarily advance min_vruntime.

What about something like the below instead? It compiles, but may eat
your first born child.

sched: implement fair class yield_to(task) using cfs_rq->next as a selection hint.

<CHANGELOG>

Not-signed-off-by: Mike Galbraith <[email protected]>
---
include/linux/sched.h | 1
kernel/sched.c | 47 +++++++++++++++++++++++++++++++++++++++++
kernel/sched_fair.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 104 insertions(+)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1056,6 +1056,7 @@ struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
+ int (*yield_to_task) (struct task_struct *p, int preempt);

void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5325,6 +5325,53 @@ void __sched yield(void)
}
EXPORT_SYMBOL(yield);

+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ */
+void __sched yield_to(struct task_struct *p, int preempt)
+{
+ struct task_struct *curr = current;
+ struct rq *rq, *p_rq;
+ unsigned long flags;
+ int yield = 0;
+
+ local_irq_save(flags);
+ rq = this_rq();
+
+again:
+ p_rq = task_rq(p);
+ double_rq_lock(rq, p_rq);
+ while (task_rq(p) != p_rq) {
+ double_rq_unlock(rq, p_rq);
+ goto again;
+ }
+
+ if (task_running(p_rq, p) || p->state || !p->se.on_rq ||
+ !same_thread_group(p, curr) ||
+ !curr->sched_class->yield_to_task ||
+ curr->sched_class != p->sched_class) {
+ goto out;
+ }
+
+ yield = curr->sched_class->yield_to_task(p, preempt);
+
+out:
+ double_rq_unlock(rq, p_rq);
+ local_irq_restore(flags);
+
+ if (yield) {
+ set_current_state(TASK_RUNNING);
+ schedule();
+ }
+}
+EXPORT_SYMBOL(yield_to);
+
+
/*
* This task is about to go to sleep on IO. Increment rq->nr_iowait so
* that process accounting knows that this is a task in IO wait state.
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1320,6 +1320,61 @@ static void yield_task_fair(struct rq *r
}

#ifdef CONFIG_SMP
+static void pull_task(struct rq *src_rq, struct task_struct *p,
+ struct rq *this_rq, int this_cpu);
+#endif
+
+static int yield_to_task_fair(struct task_struct *p, int preempt)
+{
+ struct sched_entity *se = &current->se;
+ struct sched_entity *pse = &p->se;
+ struct sched_entity *curr = &(task_rq(p)->curr)->se;
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
+ int yield = this_rq() == task_rq(p);
+ int want_preempt = preempt;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ if (cfs_rq->tg != p_cfs_rq->tg)
+ return 0;
+
+ /* Preemption only allowed within the same task group. */
+ if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
+ preempt = 0;
+#endif
+ /* Preemption only allowed within the same thread group. */
+ if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
+ preempt = 0;
+
+#ifdef CONFIG_SMP
+ /*
+ * If this yield is important enough to want to preempt instead
+ * of only dropping a ->next hint, we're alone, and the target
+ * is not alone, pull the target to this cpu.
+ */
+ if (want_preempt && !yield && cfs_rq->nr_running == 1 &&
+ cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) {
+ pull_task(task_rq(p), p, this_rq(), smp_processor_id());
+ p_cfs_rq = cfs_rq_of(pse);
+ yield = 1;
+ }
+#endif
+
+ if (yield)
+ clear_buddies(cfs_rq, se);
+ else if (preempt)
+ clear_buddies(p_cfs_rq, curr);
+
+ /* Tell the scheduler that we'd really like pse to run next. */
+ p_cfs_rq->next = pse;
+
+ if (!yield && preempt)
+ resched_task(task_of(p_cfs_rq->curr));
+
+ return yield;
+}
+
+#ifdef CONFIG_SMP

static void task_waking_fair(struct rq *rq, struct task_struct *p)
{
@@ -4126,6 +4181,7 @@ static const struct sched_class fair_sch
.enqueue_task = enqueue_task_fair,
.dequeue_task = dequeue_task_fair,
.yield_task = yield_task_fair,
+ .yield_to_task = yield_to_task_fair,

.check_preempt_curr = check_preempt_wakeup,
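
For completeness, a caller of the yield_to() above would look roughly like
this on the KVM side (sketch only; the real hook-up is patch 3/3 of this
series, vcpu->task is the pointer added earlier in the series and is valid
while that vcpu's task is scheduled out, and other_vcpu stands in for
whatever loop picks the candidate):

        struct task_struct *task = other_vcpu->task;

        /* Hand the rest of our slice to a preempted sibling vcpu.
         * Pass 1 instead of 0 to also request preemption/pulling. */
        if (task && task != current)
                yield_to(task, 0);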


2010-12-28 05:55:08

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Mon, 2010-12-20 at 17:04 +0100, Mike Galbraith wrote:
> On Mon, 2010-12-20 at 10:40 -0500, Rik van Riel wrote:
> > On 12/17/2010 02:15 AM, Mike Galbraith wrote:
> >
> > > BTW, with this vruntime donation thingy, what prevents a task from
> > > forking off accomplices who do nothing but wait for a wakeup and
> > > yield_to(exploit)?
> > >
> > > Even swapping vruntimes in the same cfs_rq is dangerous as hell, because
> > > one party is going backward.
> >
> > I just realized the answer to this question.
> >
> > We only give cpu time to tasks that are runnable, but not
> > currently running. That ensures one task cannot block others
> > from running by having time yielded to it constantly.
>
> Hm. Don't think that will 100% sure prevent clock stoppage, because the
> running task doesn't necessarily advance min_vruntime.
>
> What about something like the below instead? It compiles, but may eat
> your first born child.

(either it's christmas, or nobody cares to try it. ho ho hum:)

> sched: implement fair class yield_to(task) using cfs_rq->next as a selection hint.
>
> <CHANGELOG>
>
> Not-signed-off-by: Mike Galbraith <[email protected]>
> ---
> include/linux/sched.h | 1
> kernel/sched.c | 47 +++++++++++++++++++++++++++++++++++++++++
> kernel/sched_fair.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 104 insertions(+)
>
> Index: linux-2.6/include/linux/sched.h
> ===================================================================
> --- linux-2.6.orig/include/linux/sched.h
> +++ linux-2.6/include/linux/sched.h
> @@ -1056,6 +1056,7 @@ struct sched_class {
> void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
> void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
> void (*yield_task) (struct rq *rq);
> + int (*yield_to_task) (struct task_struct *p, int preempt);
>
> void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -5325,6 +5325,53 @@ void __sched yield(void)
> }
> EXPORT_SYMBOL(yield);
>
> +/**
> + * yield_to - yield the current processor to another thread in
> + * your thread group, or accelerate that thread toward the
> + * processor it's on.
> + *
> + * It's the caller's job to ensure that the target task struct
> + * can't go away on us before we can do any checks.
> + */
> +void __sched yield_to(struct task_struct *p, int preempt)
> +{
> + struct task_struct *curr = current;
> + struct rq *rq, *p_rq;
> + unsigned long flags;
> + int yield = 0;
> +
> + local_irq_save(flags);
> + rq = this_rq();
> +
> +again:
> + p_rq = task_rq(p);
> + double_rq_lock(rq, p_rq);
> + while (task_rq(p) != p_rq) {
> + double_rq_unlock(rq, p_rq);
> + goto again;
> + }
> +
> + if (task_running(p_rq, p) || p->state || !p->se.on_rq ||
> + !same_thread_group(p, curr) ||
> + !curr->sched_class->yield_to_task ||
> + curr->sched_class != p->sched_class) {
> + goto out;
> + }
> +
> + yield = curr->sched_class->yield_to_task(p, preempt);
> +
> +out:
> + double_rq_unlock(rq, p_rq);
> + local_irq_restore(flags);
> +
> + if (yield) {
> + set_current_state(TASK_RUNNING);
> + schedule();
> + }
> +}
> +EXPORT_SYMBOL(yield_to);
> +
> +
> /*
> * This task is about to go to sleep on IO. Increment rq->nr_iowait so
> * that process accounting knows that this is a task in IO wait state.
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -1320,6 +1320,61 @@ static void yield_task_fair(struct rq *r
> }
>
> #ifdef CONFIG_SMP
> +static void pull_task(struct rq *src_rq, struct task_struct *p,
> + struct rq *this_rq, int this_cpu);
> +#endif
> +
> +static int yield_to_task_fair(struct task_struct *p, int preempt)
> +{
> + struct sched_entity *se = &current->se;
> + struct sched_entity *pse = &p->se;
> + struct sched_entity *curr = &(task_rq(p)->curr)->se;
> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
> + struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
> + int yield = this_rq() == task_rq(p);
> + int want_preempt = preempt;
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> + if (cfs_rq->tg != p_cfs_rq->tg)
> + return 0;
> +
> + /* Preemption only allowed within the same task group. */
> + if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
> + preempt = 0;
> +#endif
> + /* Preemption only allowed within the same thread group. */
> + if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
> + preempt = 0;
> +
> +#ifdef CONFIG_SMP
> + /*
> + * If this yield is important enough to want to preempt instead
> + * of only dropping a ->next hint, we're alone, and the target
> + * is not alone, pull the target to this cpu.
> + */
> + if (want_preempt && !yield && cfs_rq->nr_running == 1 &&
> + cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) {
> + pull_task(task_rq(p), p, this_rq(), smp_processor_id());
> + p_cfs_rq = cfs_rq_of(pse);
> + yield = 1;
> + }
> +#endif
> +
> + if (yield)
> + clear_buddies(cfs_rq, se);
> + else if (preempt)
> + clear_buddies(p_cfs_rq, curr);
> +
> + /* Tell the scheduler that we'd really like pse to run next. */
> + p_cfs_rq->next = pse;
> +
> + if (!yield && preempt)
> + resched_task(task_of(p_cfs_rq->curr));
> +
> + return yield;
> +}
> +
> +#ifdef CONFIG_SMP
>
> static void task_waking_fair(struct rq *rq, struct task_struct *p)
> {
> @@ -4126,6 +4181,7 @@ static const struct sched_class fair_sch
> .enqueue_task = enqueue_task_fair,
> .dequeue_task = dequeue_task_fair,
> .yield_task = yield_task_fair,
> + .yield_to_task = yield_to_task_fair,
>
> .check_preempt_curr = check_preempt_wakeup,
>
>
>

2010-12-28 06:08:22

by Gene Heskett

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Tuesday, December 28, 2010, Mike Galbraith wrote:
>On Mon, 2010-12-20 at 17:04 +0100, Mike Galbraith wrote:
>> On Mon, 2010-12-20 at 10:40 -0500, Rik van Riel wrote:
>> > On 12/17/2010 02:15 AM, Mike Galbraith wrote:
>> > > BTW, with this vruntime donation thingy, what prevents a task from
>> > > forking off accomplices who do nothing but wait for a wakeup and
>> > > yield_to(exploit)?
>> > >
>> > > Even swapping vruntimes in the same cfs_rq is dangerous as hell,
>> > > because one party is going backward.
>> >
>> > I just realized the answer to this question.
>> >
>> > We only give cpu time to tasks that are runnable, but not
>> > currently running. That ensures one task cannot block others
>> > from running by having time yielded to it constantly.
>>
>> Hm. Don't think that will 100% sure prevent clock stoppage, because
>> the running task doesn't necessarily advance min_vruntime.
>>
>> What about something like the below instead? It compiles, but may eat
>> your first born child.
>
>(either it's christmas, or nobody cares to try it. ho ho hum:)

Mike, I'd be glad to try it, but the last kernel I was able to make the
nvidia drivers install on was 2.6.36.1-latencyfix, which had your earlier
patches in it, and it is working _very_ well.

>> sched: implement fair class yield_to(task) using cfs_rq->next as a
>> selection hint.
>>
>> <CHANGELOG>
>>
>> Not-signed-off-by: Mike Galbraith <[email protected]>
>> ---
>>
>> include/linux/sched.h | 1
>> kernel/sched.c | 47 +++++++++++++++++++++++++++++++++++++++++
>> kernel/sched_fair.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 104 insertions(+)
>>
>> Index: linux-2.6/include/linux/sched.h
>> ===================================================================
>> --- linux-2.6.orig/include/linux/sched.h
>> +++ linux-2.6/include/linux/sched.h
>> @@ -1056,6 +1056,7 @@ struct sched_class {
>>
>> void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
>> void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
>> void (*yield_task) (struct rq *rq);
>>
>> + int (*yield_to_task) (struct task_struct *p, int preempt);
>>
>> void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
>>
>> Index: linux-2.6/kernel/sched.c
>> ===================================================================
>> --- linux-2.6.orig/kernel/sched.c
>> +++ linux-2.6/kernel/sched.c
>> @@ -5325,6 +5325,53 @@ void __sched yield(void)
>>
>> }
>> EXPORT_SYMBOL(yield);
>>
>> +/**
>> + * yield_to - yield the current processor to another thread in
>> + * your thread group, or accelerate that thread toward the
>> + * processor it's on.
>> + *
>> + * It's the caller's job to ensure that the target task struct
>> + * can't go away on us before we can do any checks.
>> + */
>> +void __sched yield_to(struct task_struct *p, int preempt)
>> +{
>> + struct task_struct *curr = current;
>> + struct rq *rq, *p_rq;
>> + unsigned long flags;
>> + int yield = 0;
>> +
>> + local_irq_save(flags);
>> + rq = this_rq();
>> +
>> +again:
>> + p_rq = task_rq(p);
>> + double_rq_lock(rq, p_rq);
>> + while (task_rq(p) != p_rq) {
>> + double_rq_unlock(rq, p_rq);
>> + goto again;
>> + }
>> +
>> + if (task_running(p_rq, p) || p->state || !p->se.on_rq ||
>> + !same_thread_group(p, curr) ||
>> + !curr->sched_class->yield_to_task ||
>> + curr->sched_class != p->sched_class) {
>> + goto out;
>> + }
>> +
>> + yield = curr->sched_class->yield_to_task(p, preempt);
>> +
>> +out:
>> + double_rq_unlock(rq, p_rq);
>> + local_irq_restore(flags);
>> +
>> + if (yield) {
>> + set_current_state(TASK_RUNNING);
>> + schedule();
>> + }
>> +}
>> +EXPORT_SYMBOL(yield_to);
>> +
>> +
>>
>> /*
>>
>> * This task is about to go to sleep on IO. Increment rq->nr_iowait so
>> * that process accounting knows that this is a task in IO wait state.
>>
>> Index: linux-2.6/kernel/sched_fair.c
>> ===================================================================
>> --- linux-2.6.orig/kernel/sched_fair.c
>> +++ linux-2.6/kernel/sched_fair.c
>> @@ -1320,6 +1320,61 @@ static void yield_task_fair(struct rq *r
>>
>> }
>>
>> #ifdef CONFIG_SMP
>>
>> +static void pull_task(struct rq *src_rq, struct task_struct *p,
>> + struct rq *this_rq, int this_cpu);
>> +#endif
>> +
>> +static int yield_to_task_fair(struct task_struct *p, int preempt)
>> +{
>> + struct sched_entity *se = &current->se;
>> + struct sched_entity *pse = &p->se;
>> + struct sched_entity *curr = &(task_rq(p)->curr)->se;
>> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> + struct cfs_rq *p_cfs_rq = cfs_rq_of(pse);
>> + int yield = this_rq() == task_rq(p);
>> + int want_preempt = preempt;
>> +
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> + if (cfs_rq->tg != p_cfs_rq->tg)
>> + return 0;
>> +
>> + /* Preemption only allowed within the same task group. */
>> + if (preempt && cfs_rq->tg != cfs_rq_of(curr)->tg)
>> + preempt = 0;
>> +#endif
>> + /* Preemption only allowed within the same thread group. */
>> + if (preempt && !same_thread_group(current, task_of(p_cfs_rq->curr)))
>> + preempt = 0;
>> +
>> +#ifdef CONFIG_SMP
>> + /*
>> + * If this yield is important enough to want to preempt instead
>> + * of only dropping a ->next hint, we're alone, and the target
>> + * is not alone, pull the target to this cpu.
>> + */
>> + if (want_preempt && !yield && cfs_rq->nr_running == 1 &&
>> + cpumask_test_cpu(smp_processor_id(), &p->cpus_allowed)) {
>> + pull_task(task_rq(p), p, this_rq(), smp_processor_id());
>> + p_cfs_rq = cfs_rq_of(pse);
>> + yield = 1;
>> + }
>> +#endif
>> +
>> + if (yield)
>> + clear_buddies(cfs_rq, se);
>> + else if (preempt)
>> + clear_buddies(p_cfs_rq, curr);
>> +
>> + /* Tell the scheduler that we'd really like pse to run next. */
>> + p_cfs_rq->next = pse;
>> +
>> + if (!yield && preempt)
>> + resched_task(task_of(p_cfs_rq->curr));
>> +
>> + return yield;
>> +}
>> +
>> +#ifdef CONFIG_SMP
>>
>> static void task_waking_fair(struct rq *rq, struct task_struct *p)
>> {
>>
>> @@ -4126,6 +4181,7 @@ static const struct sched_class fair_sch
>>
>> .enqueue_task = enqueue_task_fair,
>> .dequeue_task = dequeue_task_fair,
>> .yield_task = yield_task_fair,
>>
>> + .yield_to_task = yield_to_task_fair,
>>
>> .check_preempt_curr = check_preempt_wakeup,
>


--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I have more humility in my little finger than you have in your whole
____BODY!
-- from "Cerebus" #82

2010-12-28 06:17:01

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Tue, 2010-12-28 at 01:08 -0500, Gene Heskett wrote:
> On Tuesday, December 28, 2010, Mike Galbraith wrote:

> >(either it's christmas, or nobody cares to try it. ho ho hum:)
>
> Mike, I'd be glad to try it, but the last kernel I was able to make the
> nvidia drivers install on was 2.6.36.1-latencyfix, which had your earier
> patches in it, and it is working _very_ well.

It wouldn't do diddly spit by itself anyway, needs someone to stick
yield_to(amigo) in interesting spots. Might not do diddly spit even
then :)

-Mike

2010-12-28 16:18:20

by Gene Heskett

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On Tuesday, December 28, 2010, Mike Galbraith wrote:
>On Tue, 2010-12-28 at 01:08 -0500, Gene Heskett wrote:
>> On Tuesday, December 28, 2010, Mike Galbraith wrote:
>> >(either it's christmas, or nobody cares to try it. ho ho hum:)
>>
>> Mike, I'd be glad to try it, but the last kernel I was able to make the
>> nvidia drivers install on was 2.6.36.1-latencyfix, which had your
>> earlier patches in it, and it is working _very_ well.
>
>It wouldn't do diddly spit by itself anyway, needs someone to stick
>yield_to(amigo) in interesting spots. Might not do diddly spit even
>then :)
>
> -Mike

Oh, in that case it will eventually filter down to me anyway after the rest
of this list beats on it to see what falls out. But I occasionally like to
be part of the filtering media as long as I don't bleed too much. ;-)

Thanks Mike.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Without facts, the decision cannot be made logically. You must rely on
your human intuition.
-- Spock, "Assignment: Earth", stardate unknown

2010-12-28 22:35:04

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC -v2 PATCH 2/3] sched: add yield_to function

On 12/28/2010 12:54 AM, Mike Galbraith wrote:
> On Mon, 2010-12-20 at 17:04 +0100, Mike Galbraith wrote:
>> On Mon, 2010-12-20 at 10:40 -0500, Rik van Riel wrote:
>>> On 12/17/2010 02:15 AM, Mike Galbraith wrote:
>>>
>>>> BTW, with this vruntime donation thingy, what prevents a task from
>>>> forking off accomplices who do nothing but wait for a wakeup and
>>>> yield_to(exploit)?
>>>>
>>>> Even swapping vruntimes in the same cfs_rq is dangerous as hell, because
>>>> one party is going backward.
>>>
>>> I just realized the answer to this question.
>>>
>>> We only give cpu time to tasks that are runnable, but not
>>> currently running. That ensures one task cannot block others
>>> from running by having time yielded to it constantly.
>>
>> Hm. Don't think that will 100% sure prevent clock stoppage, because the
>> running task doesn't necessarily advance min_vruntime.
>>
>> What about something like the below instead? It compiles, but may eat
>> your first born child.
>
> (either it's christmas, or nobody cares to try it. ho ho hum:)

It's christmas.

I've replaced my original patch with your patch in my
series, just haven't gotten around to running Avi's
latest test on it yet :)

--
All rights reversed