LinuxLists.cc - [PATCH RFC] sched: add notifier for process migration

2009-10-09 21:02:43

Subject: [PATCH RFC] sched: add notifier for process migration

Hi,

I'm working on adding vsyscall (vread) support for
arch/x86/kernel/pvclock.c. The algorithm needs to look up per-cpu tsc
parameters (aka pvclock_vcpu_time_info) so that it can compute global
system time from the tsc. To do this, it needs to grab a consistent
snapshot of (tsc, time_info).

Obviously this is all racy from usermode, because there are two levels
of scheduling going on the virtual case: kernel scheduling of tasks to
vcpus, and hypervisor scheduling of vcpus to pcpus. The latter is dealt
with a version number in the tsc parameter structure to indicate changes
in the params (which could be due to scheduling, power events, etc).

To deal with kernel scheduling I want a second version number to let
usermode know they've been migrated to a new (v)cpu and need to try
again with updated time parameters. Specifically, update the version on
the "from" vcpu so that usermode (vsyscall) code holding an old pointer
can see the number change and reload the cpu number and get a pointer to
the new cpu's time_info.

Initially I was doing this with a preempt notifier on sched_out, but Avi
pointed out that this was a pessimistic approximation of what I really
want, which is notification on cross-cpu migration. And since migration
is an inherently expensive operation, the overhead of a notifier here
should be negligible. (Aside from that, the preempt notifier mechanism
isn't intended to be enabled on every process on the system.)

So I'm proposing this patch. My questions are:

1. Does this look generally reasonable?
2. Will this notifier actually be called every time a task gets
migrated between CPUs? Are there cases where migration may happen
via some other path? (Though for my particular case I only care
about migration when the task is actually preempted; if it goes to
sleep on one cpu and happens to wake on another then it wasn't in
the middle of getting time so it doesn't matter.)
3. Or is there a better way to achieve what I want?

This might also be a generally useful extension to vgetcpu() caching so
that usermode can definitively tell whether the cpu number has changed
under its feet and needs to be reloaded via lsl/rdtscp, rather than
having to rely on a jiffies-based approximation.

Thanks,
J

[PATCH] sched: add notifier for cross-cpu migrations

It can be useful to know when a task has migrated to another cpu (to invalidate some
per-cpu per-task cache, for example).

Signed-off-by: Jeremy Fitzhardinge <[email protected]>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0f1ea4a..a1c843a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -141,6 +141,13 @@ extern unsigned long nr_iowait(void);
extern void calc_global_load(void);
extern u64 cpu_nr_migrations(int cpu);

+struct migration_notifier {
+ struct task_struct *task;
+ int from_cpu;
+ int to_cpu;
+};
+extern void register_migration_notifier(struct notifier_block *n);
+
extern unsigned long get_parent_ip(unsigned long addr);

struct seq_file;
diff --git a/kernel/sched.c b/kernel/sched.c
index 1b59e26..b998504 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7005,6 +7005,13 @@ out:
}
EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);

+static ATOMIC_NOTIFIER_HEAD(migration_notifications);
+
+void register_migration_notifier(struct notifier_block *n)
+{
+ atomic_notifier_chain_register(&migration_notifications, n);
+}
+
/*
* Move (not current) task off this cpu, onto dest cpu. We're doing
* this because either it can't run here any more (set_cpus_allowed()
@@ -7020,6 +7027,7 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
{
struct rq *rq_dest, *rq_src;
int ret = 0, on_rq;
+ struct migration_notifier mn;

if (unlikely(!cpu_active(dest_cpu)))
return ret;
@@ -7044,6 +7052,13 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
activate_task(rq_dest, p, 0);
check_preempt_curr(rq_dest, p, 0);
}
+
+ mn.task = p;
+ mn.from_cpu = src_cpu;
+ mn.to_cpu = dest_cpu;
+
+ atomic_notifier_call_chain(&migration_notifications, 0, &mn);
+
done:
ret = 1;
fail:

2009-10-09 22:03:09

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: add notifier for process migration

On Fri, 2009-10-09 at 14:01 -0700, Jeremy Fitzhardinge wrote:

> I'm working on adding vsyscall (vread) support for
> arch/x86/kernel/pvclock.c. The algorithm needs to look up per-cpu tsc
> parameters (aka pvclock_vcpu_time_info) so that it can compute global
> system time from the tsc. To do this, it needs to grab a consistent
> snapshot of (tsc, time_info).

time_info as in gettimeofday()?, that's supposed to be globally
consistent, so get that first and then get the tsc and you're as race
free as you're ever going to get from userspace.

> Obviously this is all racy from usermode, because there are two levels
> of scheduling going on the virtual case: kernel scheduling of tasks to
> vcpus, and hypervisor scheduling of vcpus to pcpus. The latter is dealt
> with a version number in the tsc parameter structure to indicate changes
> in the params (which could be due to scheduling, power events, etc).
>
> To deal with kernel scheduling I want a second version number to let
> usermode know they've been migrated to a new (v)cpu and need to try
> again with updated time parameters. Specifically, update the version on
> the "from" vcpu so that usermode (vsyscall) code holding an old pointer
> can see the number change and reload the cpu number and get a pointer to
> the new cpu's time_info.

/me utterly confused.

> Initially I was doing this with a preempt notifier on sched_out, but Avi
> pointed out that this was a pessimistic approximation of what I really
> want, which is notification on cross-cpu migration. And since migration
> is an inherently expensive operation, the overhead of a notifier here
> should be negligible. (Aside from that, the preempt notifier mechanism
> isn't intended to be enabled on every process on the system.)

And here you're utterly failing to explain what you want such a notifier
would do.

> So I'm proposing this patch. My questions are:
>
> 1. Does this look generally reasonable?

I'm generally confused and not at all clear as to how things would work.
Afaik the vdso is a global entity and does not contain per-cpu or
per-task state.

If you're proposing to increment a global seq count on every task
migration, then I think its a terribly bad idea.

> 2. Will this notifier actually be called every time a task gets
> migrated between CPUs? Are there cases where migration may happen
> via some other path? (Though for my particular case I only care
> about migration when the task is actually preempted; if it goes to
> sleep on one cpu and happens to wake on another then it wasn't in
> the middle of getting time so it doesn't matter.)

No, you've missed quite a lot of cases.

> 3. Or is there a better way to achieve what I want?
>
> This might also be a generally useful extension to vgetcpu() caching so
> that usermode can definitively tell whether the cpu number has changed
> under its feet and needs to be reloaded via lsl/rdtscp, rather than
> having to rely on a jiffies-based approximation.

I've got no idea how vgetcpu() works, but since the vdso page is global
and not per-task, I can't really see how it could work sanely.

2009-10-09 22:44:16

by Jeremy Fitzhardinge

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: add notifier for process migration

On 10/09/09 15:02, Peter Zijlstra wrote:
>> I'm working on adding vsyscall (vread) support for
>> arch/x86/kernel/pvclock.c. The algorithm needs to look up per-cpu tsc
>> parameters (aka pvclock_vcpu_time_info) so that it can compute global
>> system time from the tsc. To do this, it needs to grab a consistent
>> snapshot of (tsc, time_info).
>>
> time_info as in gettimeofday()?, that's supposed to be globally
> consistent, so get that first and then get the tsc and you're as race
> free as you're ever going to get from userspace.
>

pvclock_vcpu_time_info is a structure which is part of the hypervisor
ABI (implemented by both Xen and KVM at the moment), which provides all
the info needed to compute a global system time (ns from host boot, for
example) from the tsc, even if the CPU's tscs are not synced, or running
at the same frequency, or can stop/change at any moment. The hypervisor
provides it for each guest virtual CPU containing the parameters for the
underlying physical CPU the vCPU is currently running on.

This is all done in pvclock_clocksource_read, which is then used as the
input for all the rest of the kernel's gettimeofday/clock_gettime/etc
functions. I'm extending this so I can also implement
pvclock_clocksource_vread and do the same thing from userspace.

>> Obviously this is all racy from usermode, because there are two levels
>> of scheduling going on the virtual case: kernel scheduling of tasks to
>> vcpus, and hypervisor scheduling of vcpus to pcpus. The latter is dealt
>> with a version number in the tsc parameter structure to indicate changes
>> in the params (which could be due to scheduling, power events, etc).
>>
>> To deal with kernel scheduling I want a second version number to let
>> usermode know they've been migrated to a new (v)cpu and need to try
>> again with updated time parameters. Specifically, update the version on
>> the "from" vcpu so that usermode (vsyscall) code holding an old pointer
>> can see the number change and reload the cpu number and get a pointer to
>> the new cpu's time_info.
>>
> /me utterly confused.
>

OK, concretely:

1. allocate a page and fixmap it into userspace
2. keep an array of structures containing tsc->cycle_t
(pvclock_vcpu_time_info) params, indexed by cpu
3. register those structures with the hypervisor so it can update
them as either the pcpus change freq and/or the vcpus get moved to
different pcpus
4. associate a "migration_count" with each structure (ie, how many
times this cpu has had tasks migrated off it)

The algorithm is basically:

do {
cpu = vgetcpu(); /* get current cpu */
ti = &timeinfo[cpu]; /* get scaling+offset for tsc */

/* !!! migration race */

migration_count = ti->migration_count;
version = ti->version;

barrier();

local_time_info = *ti;

tsc = rdtsc();
cycles = compute_cycles_from_tsc(tsc, &local_time_info);

barrier();

cpu1 = vgetcpu();

/* loop if anything changed under our feet:
- we changed cpus (if we got migrated at "!!! migration race" above
then the migration_count test won't pick it up)
- the time info changed
- we got migrated to a different cpu (we need to check this as well
as cpu != cpu1 in case we got migrated from A->B->A)
*/

} while(unlikely(cpu1 != cpu ||
timeinfo->version != version ||
timeinfo->migration_count != migration_count));

return cycles;

This is executed in usermode as part of vsyscall gettimeofday via the
clocksource.vread function.

>> Initially I was doing this with a preempt notifier on sched_out, but Avi
>> pointed out that this was a pessimistic approximation of what I really
>> want, which is notification on cross-cpu migration. And since migration
>> is an inherently expensive operation, the overhead of a notifier here
>> should be negligible. (Aside from that, the preempt notifier mechanism
>> isn't intended to be enabled on every process on the system.)
>>
> And here you're utterly failing to explain what you want such a notifier
> would do.
>

Referring to the code above, it does:

time_info[from_cpu]->migrate_count++;

so that usermode can see that it has been moved to a different cpu and
needs to try again.

>> So I'm proposing this patch. My questions are:
>>
>> 1. Does this look generally reasonable?
>>
> I'm generally confused and not at all clear as to how things would work.
> Afaik the vdso is a global entity and does not contain per-cpu or
> per-task state.
>

Yep. For this implementation I fixmap another page of per-cpu time info
into usermode for use by the pvclock vread implementation. This is
mapped RO and global, of course.

> If you're proposing to increment a global seq count on every task
> migration, then I think its a terribly bad idea.
>

A per-cpu counter on each migration.

>> 2. Will this notifier actually be called every time a task gets
>> migrated between CPUs? Are there cases where migration may happen
>> via some other path? (Though for my particular case I only care
>> about migration when the task is actually preempted; if it goes to
>> sleep on one cpu and happens to wake on another then it wasn't in
>> the middle of getting time so it doesn't matter.)
>>
> No, you've missed quite a lot of cases.
>

Could you expand on that? Where else would we need to catch a migration?

> I've got no idea how vgetcpu() works, but since the vdso page is global
> and not per-task, I can't really see how it could work sanely.
>

It works either by using "lsl" to fetch cpu+node number info encoded in
a segment limit, or via rdtscp which encodes cpu+node in the TSC_AUX
register.

Thanks,
J

2009-10-10 07:15:14

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: add notifier for process migration

On Fri, 2009-10-09 at 15:43 -0700, Jeremy Fitzhardinge wrote:

> OK, concretely:
>
> 1. allocate a page and fixmap it into userspace
> 2. keep an array of structures containing tsc->cycle_t
> (pvclock_vcpu_time_info) params, indexed by cpu
> 3. register those structures with the hypervisor so it can update
> them as either the pcpus change freq and/or the vcpus get moved to
> different pcpus
> 4. associate a "migration_count" with each structure (ie, how many
> times this cpu has had tasks migrated off it)
>
> The algorithm is basically:
>
> do {
> cpu = vgetcpu(); /* get current cpu */
> ti = &timeinfo[cpu]; /* get scaling+offset for tsc */
>
> /* !!! migration race */
>
> migration_count = ti->migration_count;
> version = ti->version;
>
> barrier();
>
> local_time_info = *ti;
>
> tsc = rdtsc();
> cycles = compute_cycles_from_tsc(tsc, &local_time_info);
>
> barrier();
>
> cpu1 = vgetcpu();
>
> /* loop if anything changed under our feet:
> - we changed cpus (if we got migrated at "!!! migration race" above
> then the migration_count test won't pick it up)
> - the time info changed
> - we got migrated to a different cpu (we need to check this as well
> as cpu != cpu1 in case we got migrated from A->B->A)
> */
>
> } while(unlikely(cpu1 != cpu ||
> timeinfo->version != version ||
> timeinfo->migration_count != migration_count));
>
> return cycles;
>
>
> This is executed in usermode as part of vsyscall gettimeofday via the
> clocksource.vread function.

Why not do something like:

struct {
u64 tsc;
u32 aux;
} tscp = rdtscp();

local_time_info = timeinfo[tscp_cpu(tscp)];

/* yay, consistent tsc and timeinfo !! */

?

2009-10-10 09:06:12

by Avi Kivity

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: add notifier for process migration

On 10/10/2009 09:14 AM, Peter Zijlstra wrote:
> Why not do something like:
>
> struct {
> u64 tsc;
> u32 aux;
> } tscp = rdtscp();
>
> local_time_info = timeinfo[tscp_cpu(tscp)];
>
> /* yay, consistent tsc and timeinfo !! */
>

First, not all processors support rdtscp. Second, timeinfo might change
at any time due to cpu frequency changes or the entire cpu being
migrated, so we need to loop in any case.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2009-10-10 09:25:10

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: add notifier for process migration

On Sat, 2009-10-10 at 11:05 +0200, Avi Kivity wrote:
> On 10/10/2009 09:14 AM, Peter Zijlstra wrote:
> > Why not do something like:
> >
> > struct {
> > u64 tsc;
> > u32 aux;
> > } tscp = rdtscp();
> >
> > local_time_info = timeinfo[tscp_cpu(tscp)];
> >
> > /* yay, consistent tsc and timeinfo !! */
> >
>
> First, not all processors support rdtscp.

Is that a real issue? If its not supported by the early hardware virt
chips, tough luck, they sucked anyway :-)

> Second, timeinfo might change
> at any time due to cpu frequency changes or the entire cpu being
> migrated, so we need to loop in any case.

Sure.

2009-10-10 09:37:33

by Jeremy Fitzhardinge

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: add notifier for process migration

On 10/10/09 02:24, Peter Zijlstra wrote:
> Is that a real issue? If its not supported by the early hardware virt
> chips, tough luck, they sucked anyway :-)
>

Nehalem is the first Intel chip to support it, and Xen doesn't rely on
hardware virt support anyway.

J

2009-10-10 10:13:00

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: add notifier for process migration

On Sat, 2009-10-10 at 02:36 -0700, Jeremy Fitzhardinge wrote:
> On 10/10/09 02:24, Peter Zijlstra wrote:
> > Is that a real issue? If its not supported by the early hardware virt
> > chips, tough luck, they sucked anyway :-)
> >
>
> Nehalem is the first Intel chip to support it, and Xen doesn't rely on
> hardware virt support anyway.

Ah, nehalem is too limited indeed :-/

A well, look at set_task_cpu(): new_rq->nr_migrations_in++;

2009-10-13 21:25:38

by Jeremy Fitzhardinge

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: add notifier for process migration

On 10/10/09 03:12, Peter Zijlstra wrote:
> A well, look at set_task_cpu(): new_rq->nr_migrations_in++;
>

How does this look?

From: Jeremy Fitzhardinge <[email protected]>
Date: Wed, 7 Oct 2009 13:43:31 -0700
Subject: [PATCH] sched: add notifier for cross-cpu migrations

It can be useful to know when a task has migrated to another cpu (to invalidate some
per-cpu per-task cache, for example).

Signed-off-by: Jeremy Fitzhardinge <[email protected]>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0f1ea4a..5186dd9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -141,6 +141,14 @@ extern unsigned long nr_iowait(void);
extern void calc_global_load(void);
extern u64 cpu_nr_migrations(int cpu);

+/* Notifier for when a task gets migrated to a new CPU */
+struct task_migration_notifier {
+ struct task_struct *task;
+ int from_cpu;
+ int to_cpu;
+};
+extern void register_task_migration_notifier(struct notifier_block *n);
+
extern unsigned long get_parent_ip(unsigned long addr);

struct seq_file;
diff --git a/kernel/sched.c b/kernel/sched.c
index 1b59e26..3982e8e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1951,6 +1951,12 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
return delta < (s64)sysctl_sched_migration_cost;
}

+static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
+
+void register_task_migration_notifier(struct notifier_block *n)
+{
+ atomic_notifier_chain_register(&task_migration_notifier, n);
+}

void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
@@ -1973,6 +1979,8 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
p->se.block_start -= clock_offset;
#endif
if (old_cpu != new_cpu) {
+ struct task_migration_notifier tmn;
+
p->se.nr_migrations++;
new_rq->nr_migrations_in++;
#ifdef CONFIG_SCHEDSTATS
@@ -1981,6 +1989,12 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
#endif
perf_swcounter_event(PERF_COUNT_SW_CPU_MIGRATIONS,
1, 1, NULL, 0);
+
+ tmn.task = p;
+ tmn.from_cpu = old_cpu;
+ tmn.to_cpu = new_cpu;
+
+ atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
}
p->se.vruntime -= old_cfsrq->min_vruntime -
new_cfsrq->min_vruntime;

2009-10-14 07:06:04

by Ingo Molnar

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: add notifier for process migration

* Jeremy Fitzhardinge <[email protected]> wrote:

> @@ -1981,6 +1989,12 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> #endif
> perf_swcounter_event(PERF_COUNT_SW_CPU_MIGRATIONS,
> 1, 1, NULL, 0);
> +
> + tmn.task = p;
> + tmn.from_cpu = old_cpu;
> + tmn.to_cpu = new_cpu;
> +
> + atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);

We already have one event notifier there - look at the
perf_swcounter_event() callback. Why add a second one for essentially
the same thing?

We should only put a single callback there - a tracepoint defined via
TRACE_EVENT() - and any secondary users can register a callback to the
tracepoint itself.

There's many similar places in the kernel - with notifier chains and
also with a need to get tracepoints there. The fastest (and most
consistent) solution is to add just a single event callback facility.

Ingo

2009-10-14 09:27:17

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: add notifier for process migration

On Wed, 2009-10-14 at 09:05 +0200, Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[email protected]> wrote:
>
> > @@ -1981,6 +1989,12 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> > #endif
> > perf_swcounter_event(PERF_COUNT_SW_CPU_MIGRATIONS,
> > 1, 1, NULL, 0);
> > +
> > + tmn.task = p;
> > + tmn.from_cpu = old_cpu;
> > + tmn.to_cpu = new_cpu;
> > +
> > + atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
>
> We already have one event notifier there - look at the
> perf_swcounter_event() callback. Why add a second one for essentially
> the same thing?
>
> We should only put a single callback there - a tracepoint defined via
> TRACE_EVENT() - and any secondary users can register a callback to the
> tracepoint itself.
>
> There's many similar places in the kernel - with notifier chains and
> also with a need to get tracepoints there. The fastest (and most
> consistent) solution is to add just a single event callback facility.

But that would basically mandate tracepoints to be always enabled, do we
want to go there?

I don't think the overhead of tracepoints is understood well enough,
Jason you poked at that, do you have anything solid on that?

Also, I can imagine the embedded people to not want that.

I really like perf and tracepoints to not become co-dependent until
tracepoint become mandatory for all configurations.

2009-10-14 10:38:06

by Avi Kivity

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: add notifier for process migration

On 10/14/2009 06:26 PM, Peter Zijlstra wrote:
>> We already have one event notifier there - look at the
>> perf_swcounter_event() callback. Why add a second one for essentially
>> the same thing?
>>
>> We should only put a single callback there - a tracepoint defined via
>> TRACE_EVENT() - and any secondary users can register a callback to the
>> tracepoint itself.
>>
>> There's many similar places in the kernel - with notifier chains and
>> also with a need to get tracepoints there. The fastest (and most
>> consistent) solution is to add just a single event callback facility.
>>
> But that would basically mandate tracepoints to be always enabled, do we
> want to go there?
>
> I don't think the overhead of tracepoints is understood well enough,
> Jason you poked at that, do you have anything solid on that?
>
> Also, I can imagine the embedded people to not want that.
>
> I really like perf and tracepoints to not become co-dependent until
> tracepoint become mandatory for all configurations.
>

It would be cleanest to have both pvclock and tracepoints select
migration notifiers, defaulting to off. Similarly both perf and kvm
should use preemption notifiers (they do the same thing - switch
per-task values into and out of cpu registers).

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-10-14 14:42:54

by Jason Baron

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: add notifier for process migration

On Wed, Oct 14, 2009 at 11:26:10AM +0200, Peter Zijlstra wrote:
> On Wed, 2009-10-14 at 09:05 +0200, Ingo Molnar wrote:
> > * Jeremy Fitzhardinge <[email protected]> wrote:
> >
> > > @@ -1981,6 +1989,12 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> > > #endif
> > > perf_swcounter_event(PERF_COUNT_SW_CPU_MIGRATIONS,
> > > 1, 1, NULL, 0);
> > > +
> > > + tmn.task = p;
> > > + tmn.from_cpu = old_cpu;
> > > + tmn.to_cpu = new_cpu;
> > > +
> > > + atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
> >
> > We already have one event notifier there - look at the
> > perf_swcounter_event() callback. Why add a second one for essentially
> > the same thing?
> >
> > We should only put a single callback there - a tracepoint defined via
> > TRACE_EVENT() - and any secondary users can register a callback to the
> > tracepoint itself.
> >
> > There's many similar places in the kernel - with notifier chains and
> > also with a need to get tracepoints there. The fastest (and most
> > consistent) solution is to add just a single event callback facility.
>
> But that would basically mandate tracepoints to be always enabled, do we
> want to go there?
>
> I don't think the overhead of tracepoints is understood well enough,
> Jason you poked at that, do you have anything solid on that?
>

Currently, the cost of the tracepoint is the global memory read, and
compare, and then a jump. On x86 systems that I've tested this can average
anywhere b/w 40 - 100 cycles per tracepoints. Plus, there is the
icache overhead of the extra instructions that we skip over. I'm not
sure how to measure that beyond looking at their size.

I've proposed a 'jump label' set of patches, which essentially hard
codes a jump around the disabled code (avoiding the memory reference).
However, this introduces a high 'write' cost in that we code patch the
jmp to a 'jmp 0' to enable the code.

Along with this optimization I'm also looking into a method for moving
the disabled text to a 'cold' text section, to reduce the icache
overhead. Using these techniques we can reduce the disabled case to
essentially a couple of cycles per tracepoint.

In this case, where the tracepoint is always on, we wouldn't want to
move the tracepoint text to a cold section. Thus, I could introduce a
default enabled/disabled bias to the tracepoint.

However, in introducing such a feature, we are essentially forcing an
always on, or always off usage pattern, since the switch cost is high.
So I want to be careful not limit usefullness of tracepoints with such
an optimization.

thanks,

-Jason

2009-10-14 16:16:54

by Jeremy Fitzhardinge

[permalink] [raw]

Subject: Re: [PATCH RFC] sched: add notifier for process migration

On 10/14/09 00:05, Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[email protected]> wrote:
>
>
>> @@ -1981,6 +1989,12 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>> #endif
>> perf_swcounter_event(PERF_COUNT_SW_CPU_MIGRATIONS,
>> 1, 1, NULL, 0);
>> +
>> + tmn.task = p;
>> + tmn.from_cpu = old_cpu;
>> + tmn.to_cpu = new_cpu;
>> +
>> + atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
>>
> We already have one event notifier there - look at the
> perf_swcounter_event() callback. Why add a second one for essentially
> the same thing?
>
> We should only put a single callback there - a tracepoint defined via
> TRACE_EVENT() - and any secondary users can register a callback to the
> tracepoint itself.
>
> There's many similar places in the kernel - with notifier chains and
> also with a need to get tracepoints there. The fastest (and most
> consistent) solution is to add just a single event callback facility.
>

My specific use case for this notifier is to provide a "you've been
migrated" counter to usermode via a fixmap page, as part of the work to
extend kernel/pvclock.c to implement vread for vsyscall use. I probably
should have referred to that explicitly in the comment for the patch to
give a concrete motivation and rationale.

This means that on applicable systems - ie, running virtualized under
Xen or KVM - this will be something that will be installed early in boot
and called for the entire uptime of the system. Since we don't want a
strong permanent coupling between that particular piece of
arch-independent scheduler code and an arch-specific piece of
functionality, it seemed like a notifier is a good fit.

(Note that this callback is generally useful on all systems for the
vgetcpu vsyscall; it would allow us to use the "tcache" parameter to
provide results which are both fast and 100% accurate, by deferring the
use of expensive lsl/rdtscp instructions until it *knows* the cpu has
changed.)

I tend to view the intent of tracepoints as more a diagnostic tool which
are inserted and removed dynamically as a way of instrumenting a running
system, and the tracepoints themselves don't have side-effects required
for correct running of the system.

More handwavingly, I see the semantics of a tracepoint is basically a
flag-fall showing that a particular piece of kernel code has been
called, whereas notifications are that a particular event has occurred
(which may not be associated with any specific piece of code being
executed). This notion of "task X has been migrated from cpu A to B"
seems like a fairly high-level concept; the fact that it can be
implemented by hooking a single piece of code is side-effect of the
modularity of the scheduler rather than anything relating to the event
itself.

Functionally, tracepoints and notifiers do have broad similarities.
Should they be unified? I don't know, but they do seem to serve
distinct roles.

J