Log since RFC:
1. Small fix (thanks to Namhyung).
2. Remove the logical branch which will bind two task on
same cpu (thanks to Mike).
wake_affine() stuff is trying to bind related tasks closely, but it doesn't
work well according to the test on 'perf bench sched pipe' (thanks to Peter).
Besides, pgbench show that blindly using wake_affine() will eat a lot of
performance, the whole stuff need to be used more wisely.
Thus, we need a new solution, it should detect the tasks related to each
other, bind them closely, take care the balance, latency and performance.
And wakeup buddy seems like a good solution (thanks to Mike for the hint).
The feature introduced waker, wakee pointer and their ref count, along with
the new knob sysctl_sched_wakeup_buddy_ref.
So in select_task_rq_fair(), when wakeup p (task A) and current (task B) is
running, if match:
1. A->waker == B && A->wakee == B
2. A->waker_ref > sysctl_sched_wakeup_buddy_ref
3. A->wakee_ref > sysctl_sched_wakeup_buddy_ref
then A is the wakeup buddy of B, which means A and B is likely to utilize
the memory of each other.
Thus, if B is also the wakeup buddy of A, which means no other task has
destroyed their relationship, make them running closely is likely to gain
benefit.
This patch add the feature wakeup buddy, reorganized the logical of
wake_affine() stuff with the new feature to make the decision more wisely,
by doing these, pgbench perform better.
Test:
Test with 12 cpu X86 server and tip 3.8.0-rc7.
pgbench result:
prev post
| db_size | clients | tps | | tps |
+---------+---------+-------+ +-------+
| 22 MB | 1 | 10794 | | 10842 |
| 22 MB | 2 | 21567 | | 21737 |
| 22 MB | 4 | 41621 | | 42844 |
| 22 MB | 8 | 53883 | | 62486 | +15.97%
| 22 MB | 12 | 50818 | | 58732 | +15.57%
| 22 MB | 16 | 50463 | | 60131 | +19.16%
| 22 MB | 24 | 46698 | | 64037 | +37.13%
| 22 MB | 32 | 43404 | | 63024 | +45.20%
| 7484 MB | 1 | 7974 | | 8398 |
| 7484 MB | 2 | 19341 | | 19686 |
| 7484 MB | 4 | 36808 | | 38138 |
| 7484 MB | 8 | 47821 | | 51944 | +8.62%
| 7484 MB | 12 | 45913 | | 52011 | +13.28%
| 7484 MB | 16 | 46478 | | 54891 | +18.10%
| 7484 MB | 24 | 42793 | | 56756 | +32.63%
| 7484 MB | 32 | 36329 | | 55300 | +52.22%
| 15 GB | 1 | 7636 | | 8221 |
| 15 GB | 2 | 19195 | | 19641 |
| 15 GB | 4 | 35975 | | 37562 |
| 15 GB | 8 | 47919 | | 51402 | +7.27%
| 15 GB | 12 | 45397 | | 51126 | +12.62%
| 15 GB | 16 | 45926 | | 53577 | +16.66%
| 15 GB | 24 | 42184 | | 55453 | +31.46%
| 15 GB | 32 | 35983 | | 54946 | +52.70%
Signed-off-by: Michael Wang <[email protected]>
---
include/linux/sched.h | 8 +++++
kernel/sched/fair.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sysctl.c | 10 ++++++
3 files changed, 97 insertions(+), 1 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d211247..c5a02b3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1235,6 +1235,10 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};
+#ifdef CONFIG_SMP
+extern unsigned int sysctl_sched_wakeup_buddy_ref;
+#endif
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1245,6 +1249,10 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
+ struct task_struct *waker;
+ struct task_struct *wakee;
+ unsigned int waker_ref;
+ unsigned int wakee_ref;
#endif
int on_rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fa536..1b81cc3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3173,6 +3173,75 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
}
/*
+ * Reduce sysctl_sched_wakeup_buddy_ref will reduce the preparation time
+ * to active the wakeup buddy feature, and make it agile, however, this
+ * will increase the risk of misidentify.
+ *
+ * Check wakeup_buddy() for the usage.
+ */
+unsigned int sysctl_sched_wakeup_buddy_ref = 8U;
+
+/*
+ * wakeup_buddy() help to check whether p1 is the wakeup buddy of p2.
+ *
+ * Return 1 for yes, 0 for no.
+*/
+static inline int wakeup_buddy(struct task_struct *p1, struct task_struct *p2)
+{
+ if (p1->waker != p2 || p1->wakee != p2)
+ return 0;
+
+ if (p1->waker_ref < sysctl_sched_wakeup_buddy_ref)
+ return 0;
+
+ if (p1->wakee_ref < sysctl_sched_wakeup_buddy_ref)
+ return 0;
+
+ return 1;
+}
+
+/*
+ * wakeup_related() help to check whether bind p close to current will
+ * benefit the system.
+ *
+ * If p and current are wakeup buddy of each other, usually that means
+ * they utilize the memory of each other, and current cached some data
+ * interested by p.
+ *
+ * Return 1 for yes, 0 for no.
+ */
+static inline int wakeup_related(struct task_struct *p)
+{
+ if (wakeup_buddy(p, current)) {
+ /*
+ * Now check whether current still focus on his buddy.
+ */
+ if (wakeup_buddy(current, p))
+ return 1;
+ }
+
+ return 0;
+}
+
+/*
+ * wakeup_ref() help to record the ref when current wakeup p
+ */
+static inline void wakeup_ref(struct task_struct *p)
+{
+ if (p->waker != current) {
+ p->waker_ref = 0;
+ p->waker = current;
+ } else
+ p->waker_ref++;
+
+ if (current->wakee != p) {
+ current->wakee_ref = 0;
+ current->wakee = p;
+ } else
+ current->wakee_ref++;
+}
+
+/*
* find_idlest_group finds and returns the least busy CPU group within the
* domain.
*/
@@ -3351,7 +3420,13 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
}
if (affine_sd) {
- if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+ /*
+ * If current and p are wakeup related, and balance is
+ * guaranteed, we will try to make them running closely
+ * to gain cache benefit.
+ */
+ if (cpu != prev_cpu && wakeup_related(p) &&
+ wake_affine(affine_sd, p, sync))
prev_cpu = cpu;
new_cpu = select_idle_sibling(p, prev_cpu);
@@ -3399,6 +3474,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
unlock:
rcu_read_unlock();
+ if (sd_flag & SD_BALANCE_WAKE)
+ wakeup_ref(p);
+
return new_cpu;
}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c88878d..6845d24 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -424,6 +424,16 @@ static struct ctl_table kern_table[] = {
.extra1 = &one,
},
#endif
+#ifdef CONFIG_SMP
+ {
+ .procname = "sched_wakeup_buddy_ref",
+ .data = &sysctl_sched_wakeup_buddy_ref,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
--
1.7.4.1
On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
> wake_affine() stuff is trying to bind related tasks closely, but it doesn't
> work well according to the test on 'perf bench sched pipe' (thanks to Peter).
so sched-pipe is a poor benchmark for this..
Ideally we'd write a new benchmark that has some actual data footprint
and we'd measure the cost of tasks being apart on the various cache
metrics and see what affine wakeup does for it.
Before doing something like what you're proposing, I'd have a hard look
at WF_SYNC, it is possible we should disable/fix select_idle_sibling
for sync wakeups.
The idea behind sync wakeups is that we try and detect the case where
we wakeup up one task only to go to sleep ourselves and try and avoid
the regular ping-pong this would otherwise create on account of the
waking task still being alive and so the current cpu isn't actually
idle yet but we know its going to be idle soon.
On Thu, 2013-03-07 at 09:36 +0100, Peter Zijlstra wrote:
> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>
> > wake_affine() stuff is trying to bind related tasks closely, but it doesn't
> > work well according to the test on 'perf bench sched pipe' (thanks to Peter).
>
> so sched-pipe is a poor benchmark for this..
>
> Ideally we'd write a new benchmark that has some actual data footprint
> and we'd measure the cost of tasks being apart on the various cache
> metrics and see what affine wakeup does for it.
>
> Before doing something like what you're proposing, I'd have a hard look
> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
> for sync wakeups.
If nobody beats me to it, I'm going to try tracking shortest round trip
to idle, and use a multiple of that to shut select_idle_sibling() down.
If avg_idle approaches round trip time, there's no win to be had, we're
just wasting cycles.
-Mike
Hi, Peter
Thanks for your reply.
On 03/07/2013 04:36 PM, Peter Zijlstra wrote:
> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>
>> wake_affine() stuff is trying to bind related tasks closely, but it doesn't
>> work well according to the test on 'perf bench sched pipe' (thanks to Peter).
>
> so sched-pipe is a poor benchmark for this..
>
> Ideally we'd write a new benchmark that has some actual data footprint
> and we'd measure the cost of tasks being apart on the various cache
> metrics and see what affine wakeup does for it.
I think sched-pipe is still somewhat capable, the problem is that the
select_idle_sibling() doesn't take care the wakeup related case, it
doesn't contain the logical to locate an idle cpu closely.
So even we detect the relationship successfully, select_idle_sibling()
can only help to make sure the target cpu won't be outside of the
current package, it's a package level bind, not mc or smp level.
>
> Before doing something like what you're proposing, I'd have a hard look
> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
> for sync wakeups.
The patch is supposed to stop using wake_affine() blindly, not improve
the wake_affine() stuff itself, the whole stuff still works, but since
we rely on select_idle_sibling() to make the choice, the benefit is not
so significant, especially on my one node box...
>
> The idea behind sync wakeups is that we try and detect the case where
> we wakeup up one task only to go to sleep ourselves and try and avoid
> the regular ping-pong this would otherwise create on account of the
> waking task still being alive and so the current cpu isn't actually
> idle yet but we know its going to be idle soon.
Are you suggesting that we should separate the process of wakeup related
case, not just pass current cpu to select_idle_sibling()?
Regards,
Michael Wang
>
>
>
On Thu, 2013-03-07 at 17:46 +0800, Michael Wang wrote:
> On 03/07/2013 04:36 PM, Peter Zijlstra wrote:
> > On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
> >
> >> wake_affine() stuff is trying to bind related tasks closely, but it doesn't
> >> work well according to the test on 'perf bench sched pipe' (thanks to Peter).
> >
> > so sched-pipe is a poor benchmark for this..
> >
> > Ideally we'd write a new benchmark that has some actual data footprint
> > and we'd measure the cost of tasks being apart on the various cache
> > metrics and see what affine wakeup does for it.
>
> I think sched-pipe is still somewhat capable,
Yeah, its not entirely crap for this, but its not ideal either. The
very big difference you see between it running on a single cpu and on
say two threads of a single core is mostly due to preemption
'artifacts' though. Not because of cache.
So we have 2 tasks -- lets call then A and B -- involved in a single
word ping-pong. So we're both doing write(); read(); loops. Now what
happens on a single cpu is that A's write()->wakeup() of B makes B
preempt A before A hits read() and blocks. This in turn ensures that
B's write()->wakeup() of A finds an already running A and doesn't
actually need to do the full (and expensive) wakeup thing (and vice
versa).
So by constantly preempting one another they avoid the expensive bit of
going to sleep and waking up again.
wake_affine() OTOH still has a (supposed) benefit if it gets the tasks
running 'closer' (in a cache hierarchy sense) since then the data
sharing is less expensive.
> the problem is that the
> select_idle_sibling() doesn't take care the wakeup related case, it
> doesn't contain the logical to locate an idle cpu closely.
I'm not entirely sure if I understand what you mean, do you mean to say
its idea of 'closely' is not quite correct? If so, I tend to agree, see
further down.
> So even we detect the relationship successfully, select_idle_sibling()
> can only help to make sure the target cpu won't be outside of the
> current package, it's a package level bind, not mc or smp level.
That is the entire point of select_idle_sibling(), selecting a cpu
'near' the target cpu that is currently idle.
Not too long ago we had a bit of a discussion on the unholy mess that
is select_idle_sibling() and if it actually does the right thing.
Arguably it doesn't for machines that have an effective L2 cache. The
issue is that the arch<->sched interface only knows about
last-level-cache (L3 on anything modern) and SMT.
Expanding the topology description in a way that makes sense (and
doesn't make it a bigger mess) is somewhere on the todo-list.
> > Before doing something like what you're proposing, I'd have a hard look
> > at WF_SYNC, it is possible we should disable/fix select_idle_sibling
> > for sync wakeups.
>
> The patch is supposed to stop using wake_affine() blindly, not improve
> the wake_affine() stuff itself, the whole stuff still works, but since
> we rely on select_idle_sibling() to make the choice, the benefit is not
> so significant, especially on my one node box...
OK, I'll have to go read the actual patch for that, I'll get back to
you on that :-)
> > The idea behind sync wakeups is that we try and detect the case where
> > we wakeup up one task only to go to sleep ourselves and try and avoid
> > the regular ping-pong this would otherwise create on account of the
> > waking task still being alive and so the current cpu isn't actually
> > idle yet but we know its going to be idle soon.
>
> Are you suggesting that we should separate the process of wakeup related
> case, not just pass current cpu to select_idle_sibling()?
Depends a bit on what you're trying to fix, so far I'm just trying to
write down what I remember about stuff and reacting to half-read
changelogs ;-)
On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
> +static inline int wakeup_related(struct task_struct *p)
> +{
> + if (wakeup_buddy(p, current)) {
> + /*
> + * Now check whether current still focus on his buddy.
> + */
> + if (wakeup_buddy(current, p))
> + return 1;
> + }
> +
> + return 0;
> +}
Not commenting on the thing in general, but:
static inline bool wakeup_related(struct task_struct *p)
{
return wakeup_buddy(p, current) && wakeup_buddy(current, p);
}
is far shorter and easier to read :-)
On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
> @@ -3351,7 +3420,13 @@ select_task_rq_fair(struct task_struct *p, int
> sd_flag, int wake_flags)
> }
>
> if (affine_sd) {
> - if (cpu != prev_cpu && wake_affine(affine_sd, p,
> sync))
> + /*
> + * If current and p are wakeup related, and balance is
> + * guaranteed, we will try to make them running
> closely
> + * to gain cache benefit.
> + */
> + if (cpu != prev_cpu && wakeup_related(p) &&
> + wake_affine(affine_sd, p,
> sync))
> prev_cpu = cpu;
OK, so there's two issues I have with all this are:
- it completely wrecks task placement for things like interrupts (sadly
I don't
have a good idea about a benchmark where this matters).
- yet another random number.. :/
Also, I'm starting to dislike the buddy name; its somewhat over-used.
On 03/08/2013 12:52 AM, Peter Zijlstra wrote:
> On Thu, 2013-03-07 at 17:46 +0800, Michael Wang wrote:
>
>> On 03/07/2013 04:36 PM, Peter Zijlstra wrote:
>>> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>>>
>>>> wake_affine() stuff is trying to bind related tasks closely, but it doesn't
>>>> work well according to the test on 'perf bench sched pipe' (thanks to Peter).
>>>
>>> so sched-pipe is a poor benchmark for this..
>>>
>>> Ideally we'd write a new benchmark that has some actual data footprint
>>> and we'd measure the cost of tasks being apart on the various cache
>>> metrics and see what affine wakeup does for it.
>>
>> I think sched-pipe is still somewhat capable,
>
> Yeah, its not entirely crap for this, but its not ideal either. The
> very big difference you see between it running on a single cpu and on
> say two threads of a single core is mostly due to preemption
> 'artifacts' though. Not because of cache.
>
> So we have 2 tasks -- lets call then A and B -- involved in a single
> word ping-pong. So we're both doing write(); read(); loops. Now what
> happens on a single cpu is that A's write()->wakeup() of B makes B
> preempt A before A hits read() and blocks. This in turn ensures that
> B's write()->wakeup() of A finds an already running A and doesn't
> actually need to do the full (and expensive) wakeup thing (and vice
> versa).
Exactly, I used to think that make them running on same cpu only gain
benefit when one is going to sleep, but you are right, in that case,
they get latency and performance at same time, amazing ;-)
One concern in my mind is that this cooperation is somewhat fragile, if
another task C join the fight on that cpu, it will be broken, but if we
have a way to detect that in select_task_rq_fair(), we will be able to
win the gamble.
>
> So by constantly preempting one another they avoid the expensive bit of
> going to sleep and waking up again.
Amazing point :)
>
> wake_affine() OTOH still has a (supposed) benefit if it gets the tasks
> running 'closer' (in a cache hierarchy sense) since then the data
> sharing is less expensive.
>
>> the problem is that the
>> select_idle_sibling() doesn't take care the wakeup related case, it
>> doesn't contain the logical to locate an idle cpu closely.
>
> I'm not entirely sure if I understand what you mean, do you mean to say
> its idea of 'closely' is not quite correct? If so, I tend to agree, see
> further down.
>
>> So even we detect the relationship successfully, select_idle_sibling()
>> can only help to make sure the target cpu won't be outside of the
>> current package, it's a package level bind, not mc or smp level.
>
> That is the entire point of select_idle_sibling(), selecting a cpu
> 'near' the target cpu that is currently idle.
>
> Not too long ago we had a bit of a discussion on the unholy mess that
> is select_idle_sibling() and if it actually does the right thing.
> Arguably it doesn't for machines that have an effective L2 cache. The
> issue is that the arch<->sched interface only knows about
> last-level-cache (L3 on anything modern) and SMT.
Yeah, that's the point I concerned, we make sure that the cpu returned
by select_idle_sibling() at least share the L3 cache with target, but
there is a chance to miss the better candidate which share L2 with the
target.
>
> Expanding the topology description in a way that makes sense (and
> doesn't make it a bigger mess) is somewhere on the todo-list.
>
>>> Before doing something like what you're proposing, I'd have a hard look
>>> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
>>> for sync wakeups.
>>
>> The patch is supposed to stop using wake_affine() blindly, not improve
>> the wake_affine() stuff itself, the whole stuff still works, but since
>> we rely on select_idle_sibling() to make the choice, the benefit is not
>> so significant, especially on my one node box...
>
> OK, I'll have to go read the actual patch for that, I'll get back to
> you on that :-)
>
>>> The idea behind sync wakeups is that we try and detect the case where
>>> we wakeup up one task only to go to sleep ourselves and try and avoid
>>> the regular ping-pong this would otherwise create on account of the
>>> waking task still being alive and so the current cpu isn't actually
>>> idle yet but we know its going to be idle soon.
>>
>> Are you suggesting that we should separate the process of wakeup related
>> case, not just pass current cpu to select_idle_sibling()?
>
> Depends a bit on what you're trying to fix, so far I'm just trying to
> write down what I remember about stuff and reacting to half-read
> changelogs ;-)
I see, and I get some good point here for me to think deeper, that's
great ;-)
Regards,
Michael Wang
>
On 03/08/2013 01:21 AM, Peter Zijlstra wrote:
> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>> +static inline int wakeup_related(struct task_struct *p)
>> +{
>> + if (wakeup_buddy(p, current)) {
>> + /*
>> + * Now check whether current still focus on his buddy.
>> + */
>> + if (wakeup_buddy(current, p))
>> + return 1;
>> + }
>> +
>> + return 0;
>> +}
>
> Not commenting on the thing in general, but:
>
> static inline bool wakeup_related(struct task_struct *p)
> {
> return wakeup_buddy(p, current) && wakeup_buddy(current, p);
> }
>
> is far shorter and easier to read :-)
Right, I will correct it :)
Regards,
Michael Wang
>
On 03/07/2013 05:43 PM, Mike Galbraith wrote:
> On Thu, 2013-03-07 at 09:36 +0100, Peter Zijlstra wrote:
>> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>>
>>> wake_affine() stuff is trying to bind related tasks closely, but it doesn't
>>> work well according to the test on 'perf bench sched pipe' (thanks to Peter).
>>
>> so sched-pipe is a poor benchmark for this..
>>
>> Ideally we'd write a new benchmark that has some actual data footprint
>> and we'd measure the cost of tasks being apart on the various cache
>> metrics and see what affine wakeup does for it.
>>
>> Before doing something like what you're proposing, I'd have a hard look
>> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
>> for sync wakeups.
>
> If nobody beats me to it, I'm going to try tracking shortest round trip
> to idle, and use a multiple of that to shut select_idle_sibling() down.
> If avg_idle approaches round trip time, there's no win to be had, we're
> just wasting cycles.
That's great if we have it, I'm a little doubt whether it is possible to
find a better way to replace the select_idle_sibling() (look at the way
it locates idle cpu...) in some cases, but I'm looking forward it ;-)
Regards,
Michael Wang
>
> -Mike
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
On 03/08/2013 01:27 AM, Peter Zijlstra wrote:
> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>> @@ -3351,7 +3420,13 @@ select_task_rq_fair(struct task_struct *p, int
>> sd_flag, int wake_flags)
>> }
>>
>> if (affine_sd) {
>> - if (cpu != prev_cpu && wake_affine(affine_sd, p,
>> sync))
>> + /*
>> + * If current and p are wakeup related, and balance is
>> + * guaranteed, we will try to make them running
>> closely
>> + * to gain cache benefit.
>> + */
>> + if (cpu != prev_cpu && wakeup_related(p) &&
>> + wake_affine(affine_sd, p,
>> sync))
>> prev_cpu = cpu;
>
>
> OK, so there's two issues I have with all this are:
>
> - it completely wrecks task placement for things like interrupts (sadly
> I don't
> have a good idea about a benchmark where this matters).
I don't get this point...could you please give more details?
> - yet another random number.. :/
Correct...well, but that also means flexibility, I suppose different
system and workload will need some tuning on this knob to gain more
benefit, by default, they will gain some benefit, small or big.
>
> Also, I'm starting to dislike the buddy name; its somewhat over-used.
>
I have to agree :), any suggestions?
Regards,
Michael Wang
On Fri, 2013-03-08 at 10:37 +0800, Michael Wang wrote:
> On 03/07/2013 05:43 PM, Mike Galbraith wrote:
> > On Thu, 2013-03-07 at 09:36 +0100, Peter Zijlstra wrote:
> >> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
> >>
> >>> wake_affine() stuff is trying to bind related tasks closely, but it doesn't
> >>> work well according to the test on 'perf bench sched pipe' (thanks to Peter).
> >>
> >> so sched-pipe is a poor benchmark for this..
> >>
> >> Ideally we'd write a new benchmark that has some actual data footprint
> >> and we'd measure the cost of tasks being apart on the various cache
> >> metrics and see what affine wakeup does for it.
> >>
> >> Before doing something like what you're proposing, I'd have a hard look
> >> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
> >> for sync wakeups.
> >
> > If nobody beats me to it, I'm going to try tracking shortest round trip
> > to idle, and use a multiple of that to shut select_idle_sibling() down.
> > If avg_idle approaches round trip time, there's no win to be had, we're
> > just wasting cycles.
>
> That's great if we have it, I'm a little doubt whether it is possible to
> find a better way to replace the select_idle_sibling() (look at the way
> it locates idle cpu...) in some cases, but I'm looking forward it ;-)
I'm not going to replace it, only stop it from wasting cycles when
there's very likely nothing to gain. Save task wakeup time, if delta
rq->clock - p->last_wakeup < N*shortest_idle or some such very cheap
metric. Wake ultra switchers L2 affine if allowed, only go hunting for
an idle L3 if the thing is on another package.
In general, I think things would work better if we'd just rate limit how
frequently we can wakeup migrate each individual task. We want
jabbering tasks to share L3, but we don't really want to trash L2 at an
awesome rate.
-Mike
On 03/08/2013 02:44 PM, Mike Galbraith wrote:
> On Fri, 2013-03-08 at 10:37 +0800, Michael Wang wrote:
>> On 03/07/2013 05:43 PM, Mike Galbraith wrote:
>>> On Thu, 2013-03-07 at 09:36 +0100, Peter Zijlstra wrote:
>>>> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>>>>
>>>>> wake_affine() stuff is trying to bind related tasks closely, but it doesn't
>>>>> work well according to the test on 'perf bench sched pipe' (thanks to Peter).
>>>>
>>>> so sched-pipe is a poor benchmark for this..
>>>>
>>>> Ideally we'd write a new benchmark that has some actual data footprint
>>>> and we'd measure the cost of tasks being apart on the various cache
>>>> metrics and see what affine wakeup does for it.
>>>>
>>>> Before doing something like what you're proposing, I'd have a hard look
>>>> at WF_SYNC, it is possible we should disable/fix select_idle_sibling
>>>> for sync wakeups.
>>>
>>> If nobody beats me to it, I'm going to try tracking shortest round trip
>>> to idle, and use a multiple of that to shut select_idle_sibling() down.
>>> If avg_idle approaches round trip time, there's no win to be had, we're
>>> just wasting cycles.
>>
>> That's great if we have it, I'm a little doubt whether it is possible to
>> find a better way to replace the select_idle_sibling() (look at the way
>> it locates idle cpu...) in some cases, but I'm looking forward it ;-)
>
> I'm not going to replace it, only stop it from wasting cycles when
> there's very likely nothing to gain. Save task wakeup time, if delta
> rq->clock - p->last_wakeup < N*shortest_idle or some such very cheap
> metric. Wake ultra switchers L2 affine if allowed, only go hunting for
> an idle L3 if the thing is on another package.
>
> In general, I think things would work better if we'd just rate limit how
> frequently we can wakeup migrate each individual task.
Isn't the wakeup buddy already limit the rate? and by turning the knob,
we could change the rate on our demand.
We want
> jabbering tasks to share L3, but we don't really want to trash L2 at an
> awesome rate.
I don't get it..., it's a task which has 'sleep' for some time, unless
there is no task running on prev_cpu when it's sleeping, otherwise
whatever the new cpu is, we will trash L2, isn't it?
Regards,
Michael Wang
>
> -Mike
>
On Fri, 2013-03-08 at 15:30 +0800, Michael Wang wrote:
> On 03/08/2013 02:44 PM, Mike Galbraith wrote:
> > In general, I think things would work better if we'd just rate limit how
> > frequently we can wakeup migrate each individual task.
>
> Isn't the wakeup buddy already limit the rate? and by turning the knob,
> we could change the rate on our demand.
I was referring to the existing kernel, not as altered.
> We want
> > jabbering tasks to share L3, but we don't really want to trash L2 at an
> > awesome rate.
>
> I don't get it..., it's a task which has 'sleep' for some time, unless
> there is no task running on prev_cpu when it's sleeping, otherwise
> whatever the new cpu is, we will trash L2, isn't it?
I'm thinking if you wake it to it's old home after a microscopic sleep,
it has a good chance of evicting the current resident, rescuing its L2.
If tasks which do microscopic sleep can't move around at a high rate,
they'll poke holes in fewer preempt victims. If they're _really_ fast
switchers, always wake affine. They can't hurt much, they don't do much
other than schedule off.
-Mike
On 03/08/2013 04:26 PM, Mike Galbraith wrote:
> On Fri, 2013-03-08 at 15:30 +0800, Michael Wang wrote:
>> On 03/08/2013 02:44 PM, Mike Galbraith wrote:
>
>>> In general, I think things would work better if we'd just rate limit how
>>> frequently we can wakeup migrate each individual task.
>>
>> Isn't the wakeup buddy already limit the rate? and by turning the knob,
>> we could change the rate on our demand.
>
> I was referring to the existing kernel, not as altered.
>
>> We want
>>> jabbering tasks to share L3, but we don't really want to trash L2 at an
>>> awesome rate.
>>
>> I don't get it..., it's a task which has 'sleep' for some time, unless
>> there is no task running on prev_cpu when it's sleeping, otherwise
>> whatever the new cpu is, we will trash L2, isn't it?
>
> I'm thinking if you wake it to it's old home after a microscopic sleep,
> it has a good chance of evicting the current resident, rescuing its L2.
> If tasks which do microscopic sleep can't move around at a high rate,
> they'll poke holes in fewer preempt victims. If they're _really_ fast
> switchers, always wake affine. They can't hurt much, they don't do much
> other than schedule off.
I get your point, it's about the task which sleep frequently for a very
short time, you are right, keep them around prev_cpu could gain some
benefit.
There are still many factors need to be considered, for example, if the
cpu around prev_cpu are busier than those around curr_cpu, then pull
from prev to curr still likely to be a good choice...for the consider of
future.
Also for sure, that depends on what's the workload in the system, and
how they cooperate with each other.
IMHO, I think wakeup buddy is a compromising solution for this case,
before we could figure out the correct formular (and a efficient way to
collect all the info we rely on), such a flexible optimization is not bad.
Regards,
Michael Wang
>
> -Mike
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
* Peter Zijlstra <[email protected]> wrote:
> On Wed, 2013-03-06 at 15:06 +0800, Michael Wang wrote:
>
> > wake_affine() stuff is trying to bind related tasks closely, but it
> > doesn't work well according to the test on 'perf bench sched pipe'
> > (thanks to Peter).
>
> so sched-pipe is a poor benchmark for this..
>
> Ideally we'd write a new benchmark that has some actual data footprint
> and we'd measure the cost of tasks being apart on the various cache
> metrics and see what affine wakeup does for it.
Ideally we'd offer applications a new, lightweight vsyscall:
void sys_sched_work_tick(void)
Or, to speed up adoption, a new, vsyscall-accelerated prctrl():
prctl(PR_WORK_TICK);
which applications could call in each basic work unit they are performing.
Sysbench would call it for every transaction completed, sched-pipe would
call it for every pipe message sent, hackbench would call it for messages,
etc. etc.
This is a minimal application level change, but gives *huge* information
to the scheduler: we could balance tasks to maximize their observed work
rate.
The scheduler could also do other things, like observe the wakeup/sleep
patterns within a 'work atom', observe execution overlap between work
atoms and place tasks accordingly, etc. etc.
Today we approximate work atoms by saying that scheduling atoms == work
atoms. But that approximation breaks down in a number of important cases.
If we had such a design we'd be able to fix pretty much everything,
without the catch-22 problems we are facing normally.
An added bonus would be increased instrumentation: we could trace, time,
profile work atom rates and could collect work atom profiles. We see work
atom execution histograms, etc. etc. - stuff that is simply not possible
today without extensive application-dependent instrumentation.
We could also use utrace scripts to define work atoms without modifying
the application: for many applications we know which particular function
call means that a basic work unit was completed.
I have actually written the prctl() approach before, for instrumentation
purposes, and it does wonders to system analysis.
Any objections?
Thanks,
Ingo
Hi, Ingo
On 03/11/2013 04:21 PM, Ingo Molnar wrote:
[snip]
>
> I have actually written the prctl() approach before, for instrumentation
> purposes, and it does wonders to system analysis.
The idea sounds great, we could get many new info to implement more
smart scheduler, that's amazing :)
>
> Any objections?
Just one concern, may be I have misunderstand you, but will it cause
trouble if the prctl() was indiscriminately used by some applications,
will we get fake data?
Regards,
Michael Wang
>
> Thanks,
>
> Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
* Michael Wang <[email protected]> wrote:
> Hi, Ingo
>
> On 03/11/2013 04:21 PM, Ingo Molnar wrote:
> [snip]
> >
> > I have actually written the prctl() approach before, for instrumentation
> > purposes, and it does wonders to system analysis.
>
> The idea sounds great, we could get many new info to implement more
> smart scheduler, that's amazing :)
>
> >
> > Any objections?
>
> Just one concern, may be I have misunderstand you, but will it cause
> trouble if the prctl() was indiscriminately used by some applications,
> will we get fake data?
It's their problem: overusing it will increase their CPU overhead. The two
boundary worst-cases are that they either call it too frequently or too
rarely:
- too frequently: it approximates the current cpu-runtime work metric
- too infrequently: we just ignore it and fall back to a runtime metric
if it does not change.
It's not like it can be used to get preferential treatment - we don't ever
balance other tasks against these tasks based on work throughput, we try
to maximize this workload's work throughput.
What could happen is if an app is 'optimized' for a buggy scheduler by
changing the work metric frequency. We offer no guarantee - apps will be
best off (and users will be least annoyed) if apps honestly report their
work metric.
Instrumentation/stats/profiling will also double check the correctness of
this data: if developers/users start relying on the work metric as a
substitute benchmark number, then app writers will have an additional
incentive to make them correct.
Thanks,
Ingo
On Fri, 2013-03-08 at 10:50 +0800, Michael Wang wrote:
> > OK, so there's two issues I have with all this are:
> >
> > - it completely wrecks task placement for things like interrupts (sadly I don't
> > have a good idea about a benchmark where this matters).
>
> I don't get this point...could you please give more details?
Take for instance a network workload, the hardirq notifies us there's
data buffered, then softirq does a ton of network layer demuxing and
eventually wakes one (or more) tasks to process data. You want these
tasks to move to the waking cpu, it has lots of this data in cache.
Now, neither hard- nor softirq runs in task context (except for -rt) so
it completely fails on what you propose.
We could simply add something like in_softirq() || in_irq() etc.. to
re-enable wake_affine() for those cases unconditionally, but not sure
that's the right thing either.
> > - yet another random number.. :/
>
> Correct...well, but that also means flexibility, I suppose different
> system and workload will need some tuning on this knob to gain more
> benefit, by default, they will gain some benefit, small or big.
Nah, it just means another knob nobody knows how to twiddle.
> > Also, I'm starting to dislike the buddy name; its somewhat over-used.
>
> I have to agree :), any suggestions?
Nah.. I suppose it depends a bit on the shape the final solution takes,
but I'll think about it.
On 03/11/2013 06:36 PM, Peter Zijlstra wrote:
> On Fri, 2013-03-08 at 10:50 +0800, Michael Wang wrote:
>
>>> OK, so there's two issues I have with all this are:
>>>
>>> - it completely wrecks task placement for things like interrupts (sadly I don't
>>> have a good idea about a benchmark where this matters).
>>
>> I don't get this point...could you please give more details?
>
> Take for instance a network workload, the hardirq notifies us there's
> data buffered, then softirq does a ton of network layer demuxing and
> eventually wakes one (or more) tasks to process data. You want these
> tasks to move to the waking cpu, it has lots of this data in cache.
>
> Now, neither hard- nor softirq runs in task context (except for -rt) so
> it completely fails on what you propose.
I got it, exactly, the new feature with current ref limit will miss this
optimize timing, if the data not related to current...
However, we are still gambling here, and some workload suffered.
Actually this feature's purpose is to provide a way for user who want to
balance the risk and benefit on self demand, it providing flexible, not
restriction...
So what about make the default sysctl_sched_wakeup_buddy_ref to be 0 and
make every thing just like the old world.
And for those who want to make the decision more carefully, they could
turn the knob to make it bigger, and gain the benefit.
>
> We could simply add something like in_softirq() || in_irq() etc.. to
> re-enable wake_affine() for those cases unconditionally, but not sure
> that's the right thing either.
>
>>> - yet another random number.. :/
>>
>> Correct...well, but that also means flexibility, I suppose different
>> system and workload will need some tuning on this knob to gain more
>> benefit, by default, they will gain some benefit, small or big.
>
> Nah, it just means another knob nobody knows how to twiddle.
Right, and we already have many such kind of knob...since some formular
is impossible to be written down.
Mysterious knob, I twiddle it and wait for something to drop from sky,
candy, cake or rock, whatever, there are adventures there ;-)
And I believe if once the cake drop from sky, there will be more
adventurers, and it won't be a mysterious knob any more.
>
>>> Also, I'm starting to dislike the buddy name; its somewhat over-used.
>>
>> I have to agree :), any suggestions?
>
> Nah.. I suppose it depends a bit on the shape the final solution takes,
What about default sysctl_sched_wakeup_buddy_ref = 0 and keep the old
logical?
Regards,
Michael Wang
> but I'll think about it.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
On 03/11/2013 05:40 PM, Ingo Molnar wrote:
>
> * Michael Wang <[email protected]> wrote:
>
>> Hi, Ingo
>>
>> On 03/11/2013 04:21 PM, Ingo Molnar wrote:
>> [snip]
>>>
>>> I have actually written the prctl() approach before, for instrumentation
>>> purposes, and it does wonders to system analysis.
>>
>> The idea sounds great, we could get many new info to implement more
>> smart scheduler, that's amazing :)
>>
>>>
>>> Any objections?
>>
>> Just one concern, may be I have misunderstand you, but will it cause
>> trouble if the prctl() was indiscriminately used by some applications,
>> will we get fake data?
>
> It's their problem: overusing it will increase their CPU overhead. The two
> boundary worst-cases are that they either call it too frequently or too
> rarely:
>
> - too frequently: it approximates the current cpu-runtime work metric
>
> - too infrequently: we just ignore it and fall back to a runtime metric
> if it does not change.
>
> It's not like it can be used to get preferential treatment - we don't ever
> balance other tasks against these tasks based on work throughput, we try
> to maximize this workload's work throughput.
>
> What could happen is if an app is 'optimized' for a buggy scheduler by
> changing the work metric frequency. We offer no guarantee - apps will be
> best off (and users will be least annoyed) if apps honestly report their
> work metric.
>
> Instrumentation/stats/profiling will also double check the correctness of
> this data: if developers/users start relying on the work metric as a
> substitute benchmark number, then app writers will have an additional
> incentive to make them correct.
I see, I could not figure out how to wisely using the info currently,
but I have the feeling that it will make scheduler very different ;-)
May be we could implement the API and get those info ready firstly
(along with the new sched-pipe which provide work tick info), then think
about the way to use them in scheduler, is there any patches on the way?
Regards,
Michael Wang
>
> Thanks,
>
> Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
* Michael Wang <[email protected]> wrote:
> On 03/11/2013 05:40 PM, Ingo Molnar wrote:
> >
> > * Michael Wang <[email protected]> wrote:
> >
> >> Hi, Ingo
> >>
> >> On 03/11/2013 04:21 PM, Ingo Molnar wrote:
> >> [snip]
> >>>
> >>> I have actually written the prctl() approach before, for instrumentation
> >>> purposes, and it does wonders to system analysis.
> >>
> >> The idea sounds great, we could get many new info to implement more
> >> smart scheduler, that's amazing :)
> >>
> >>>
> >>> Any objections?
> >>
> >> Just one concern, may be I have misunderstand you, but will it cause
> >> trouble if the prctl() was indiscriminately used by some applications,
> >> will we get fake data?
> >
> > It's their problem: overusing it will increase their CPU overhead. The two
> > boundary worst-cases are that they either call it too frequently or too
> > rarely:
> >
> > - too frequently: it approximates the current cpu-runtime work metric
> >
> > - too infrequently: we just ignore it and fall back to a runtime metric
> > if it does not change.
> >
> > It's not like it can be used to get preferential treatment - we don't ever
> > balance other tasks against these tasks based on work throughput, we try
> > to maximize this workload's work throughput.
> >
> > What could happen is if an app is 'optimized' for a buggy scheduler by
> > changing the work metric frequency. We offer no guarantee - apps will be
> > best off (and users will be least annoyed) if apps honestly report their
> > work metric.
> >
> > Instrumentation/stats/profiling will also double check the correctness of
> > this data: if developers/users start relying on the work metric as a
> > substitute benchmark number, then app writers will have an additional
> > incentive to make them correct.
>
> I see, I could not figure out how to wisely using the info currently,
> but I have the feeling that it will make scheduler very different ;-)
>
> May be we could implement the API and get those info ready firstly
> (along with the new sched-pipe which provide work tick info), then think
> about the way to use them in scheduler, is there any patches on the way?
Absolutely.
Beyond the new prctl no new API is needed: a perf soft event could be
added, and/or a tracepoint. Then perf stat and perf record could be used
with it. 'perf bench' could be extended to generate the work tick in its
'perf bench sched ...' workloads - and for 'perf bench mem numa' as well.
vsyscall-accelerating it could be a separate, more complex step: it needs
a per thread writable vsyscall data area to make the overhead to
applications near zero. Performance critical apps won't call an extra
syscall.
Thanks,
Ingo
On 03/12/2013 04:48 PM, Ingo Molnar wrote:
[snip]
>>>
>>> Instrumentation/stats/profiling will also double check the correctness of
>>> this data: if developers/users start relying on the work metric as a
>>> substitute benchmark number, then app writers will have an additional
>>> incentive to make them correct.
>>
>> I see, I could not figure out how to wisely using the info currently,
>> but I have the feeling that it will make scheduler very different ;-)
>>
>> May be we could implement the API and get those info ready firstly
>> (along with the new sched-pipe which provide work tick info), then think
>> about the way to use them in scheduler, is there any patches on the way?
>
> Absolutely.
>
> Beyond the new prctl no new API is needed: a perf soft event could be
> added, and/or a tracepoint. Then perf stat and perf record could be used
> with it. 'perf bench' could be extended to generate the work tick in its
> 'perf bench sched ...' workloads - and for 'perf bench mem numa' as well.
Nice :)
>
> vsyscall-accelerating it could be a separate, more complex step: it needs
> a per thread writable vsyscall data area to make the overhead to
> applications near zero. Performance critical apps won't call an extra
> syscall.
If it's really bring benefit, I think they will consider about it,
whatever, that's the developer/users decision, what we need to do is
just make the stuff attractively.
Regards,
Michael Wang
>
> Thanks,
>
> Ingo
>
Does something simple like a per-task throttle of wake_affine() gain
similar benefits? Say something like only do wake_affine() once every
10 ms or so (counting on the wakee, not waker).
The rationale being that wake_affine() relies on load-balance
statistics that aren't updated that much faster, so calling it more
often than that seems pointless.
Something like that should take a lot less lines to implement.
Just wondering..
On 03/12/2013 06:08 PM, Peter Zijlstra wrote:
> Does something simple like a per-task throttle of wake_affine() gain
> similar benefits? Say something like only do wake_affine() once every
> 10 ms or so (counting on the wakee, not waker).
>
> The rationale being that wake_affine() relies on load-balance
> statistics that aren't updated that much faster, so calling it more
> often than that seems pointless.
That could works, if we do not have any logical to rely on and just want
to limit the rate of pull, this could be a good solution.
However, we already figure out the logical that wakeup related task
could benefit from closely running, this could promise us somewhat
reliable benefit.
It's just like, tell me how many times that two task continuously wakeup
each other means they related, I will try to pull them together since
they have a big chance to benefit from cache.
IMHO, that sounds a little easier for users than to make the decision on
tell me how long to pull tasks together, they may be confused...
In summary, I think we have two point here need to be considered:
1. what about the missed optimize timing, that may benefit
some workload (although we haven't found such workload yet).
2. how many benefit could wake_affine() stuff bring to us,
if limit rate benefit us, why don't we remove it?
Point 1 could be solved by disable wakeup buddy with 0 limitation, and
when users complain about their database performance, we just say "Calm
down and take a try on this knob ;-)".
Point 2 is about the wake_affine() stuff itself.
Later we could try to make the stuff better, but I have the feeling that
some key info is not there (may be we need Ingo's work atom idea here,
just wondering...), whatever, I think we still need a knob finally,
since it doesn't sounds like a general optimization which benefit all
the cases.
And I don't agree to remove the stuff since we have many theories that
this could benefit us, but before it really show the benefit in all the
cases, provide a way to keep it quiet sounds necessary...
Regards,
Michael Wang
>
> Something like that should take a lot less lines to implement.
>
> Just wondering..
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
On Wed, 2013-03-13 at 11:07 +0800, Michael Wang wrote:
> However, we already figure out the logical that wakeup related task
> could benefit from closely running, this could promise us somewhat
> reliable benefit.
I'm not convinced that the 2 task wakeup scenario is the only sane
scenario. Imagine a ring of 3 tasks that wakes each other, if the
machine is loaded enough, those 3 tasks might fit a single cpu just
fine -- esp. if one of those is always idle.
But your strict 1:1 wakeup relation thing will completely fail to
detect this.
> IMHO, that sounds a little easier for users than to make the decision on
> tell me how long to pull tasks together, they may be confused...
Users shouldn't ever need/want/etc.. rely on this. They should just run
their programs and be (reasonably) happy.
> In summary, I think we have two point here need to be considered:
>
> 1. what about the missed optimize timing, that may benefit
> some workload (although we haven't found such workload yet).
Missed optimize; as in not calling wake_affine() due to the throttle?
If we're calling it at such a high frequency it is very likely the next
call isn't very far away.
> 2. how many benefit could wake_affine() stuff bring to us,
> if limit rate benefit us, why don't we remove it?
It could bring the same benefit but at lower overhead, what's the point
of computing the same value over and over again? Also, the rate limit
thing naturally works for the soft/hard-irq case.
Now, there's another detail I thought up, one could only limit the
wake_affine() calls once it starts returning false.
On 03/14/2013 06:58 PM, Peter Zijlstra wrote:
> On Wed, 2013-03-13 at 11:07 +0800, Michael Wang wrote:
>
>> However, we already figure out the logical that wakeup related task
>> could benefit from closely running, this could promise us somewhat
>> reliable benefit.
>
> I'm not convinced that the 2 task wakeup scenario is the only sane
> scenario. Imagine a ring of 3 tasks that wakes each other, if the
> machine is loaded enough, those 3 tasks might fit a single cpu just
> fine -- esp. if one of those is always idle.
>
> But your strict 1:1 wakeup relation thing will completely fail to
> detect this.
Hmm...yeah, I see your point here, some optimize timing will always
be missed whatever how we twiddle the knob besides 0.
You are right, that's a problem, although currently there are no
workload to prove it, but we have the theory...
>
>> IMHO, that sounds a little easier for users than to make the decision on
>> tell me how long to pull tasks together, they may be confused...
>
> Users shouldn't ever need/want/etc.. rely on this. They should just run
> their programs and be (reasonably) happy.
>
>> In summary, I think we have two point here need to be considered:
>>
>> 1. what about the missed optimize timing, that may benefit
>> some workload (although we haven't found such workload yet).
>
> Missed optimize; as in not calling wake_affine() due to the throttle?
> If we're calling it at such a high frequency it is very likely the next
> call isn't very far away.
>
>> 2. how many benefit could wake_affine() stuff bring to us,
>> if limit rate benefit us, why don't we remove it?
>
> It could bring the same benefit but at lower overhead, what's the point
> of computing the same value over and over again? Also, the rate limit
> thing naturally works for the soft/hard-irq case.
Just try to confirm my understanding, so we are going to do something
like:
if (now - wakee->last > time_limit) && wakeup_affine()
wakee->last = now
select_idle_sibling(curr_cpu)
else
select_idle_sibling(prev_cpu)
And time_limit is some static value respect to the rate of load balance,
is that correct?
Currently I haven't found regression by reduce the rate, but if we found
such benchmark, we may still need a way (knob or CONFIG) to disable this
limitation.
>
> Now, there's another detail I thought up, one could only limit the
> wake_affine() calls once it starts returning false.
Hmm..if wake_affine() keep succeed, then there will be no difference?
I do believe pgbench match the case, since wake_affine() stuff make it
suffered...and the more it suffered, means the more often wake_affine()
succeed and pull none related tasks together.
I really can't see how could it do help... did I miss something?
Regards,
Michael Wang
>
[snip]
>> It could bring the same benefit but at lower overhead, what's the point
>> of computing the same value over and over again? Also, the rate limit
>> thing naturally works for the soft/hard-irq case.
>
> Just try to confirm my understanding, so we are going to do something
> like:
>
> if (now - wakee->last > time_limit) && wakeup_affine()
> wakee->last = now
> select_idle_sibling(curr_cpu)
> else
> select_idle_sibling(prev_cpu)
>
> And time_limit is some static value respect to the rate of load balance,
> is that correct?
>
> Currently I haven't found regression by reduce the rate, but if we found
> such benchmark, we may still need a way (knob or CONFIG) to disable this
> limitation.
I've done some fast tests on this proposal, on my 12 cpu box, the
pgbench 32 clients test, for a 1000ms time_limit, the benefit is just
like the 8 ref wakeup buddy, when adopt 10ms time_limit, the benefit
dropped half, when time_limit is 1ms, the benefit is less than 10%.
tps
original 43404
wakeup-buddy 63024 +45.20%
1s-limit 62359 +43.67%
100ms-limit 57547 +32.58%
10ms-limit 52258 +20.40%
1ms-limit 46535 +7.21%
Other test items of pgbench are corresponding, and other benchmarks
still inert to the changes.
I'm planning to make a new patch for this approach later, in which
time_limit is a knob with the default value 1ms (usually the initial
value of balance_interval and the value of min_interval), that will
based on the latest tip tree.
Regards,
Michael Wang