2009-04-20 01:44:28

by Zhao Lei

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] ftrace, workqueuetrace: Make workqueuetracepoints use TRACE_EVENT macro

* From: "KOSAKI Motohiro" <[email protected]>
>> Thanks for your suggest.
>> I read distinction between workqueue and worklet.
>> In my schedule, this patch is first step of our target, my image of steps is:
>> 1: Move current workqueuetracepoints into TRACEEVENT (this patch)
>> 2: Make workqueuetrace support per-worklet output (doing)
>> 3: Add time information to workqueuetrace's worklet stat (need above new TRACEPOINT)
>>
>> So, i prepared to add new worklet tracepoints in step3.
>> What's your opinion?
>

Hello, Kosaki-san

> Zhao-san, Maintainer's easy reviewability is very important.
I agree it.

> Can you switch 2 and 3 ?
Sorry for my poor English.
I means 2 is to add worklet stat, 3 is to add more information to worklet stat,
3 is based on 2, switch 2 and 3 means merge 2 and 3.

Thanks
Zhaolei
>
> it cause Ingo can review the patch fill his request or not directly.
>
>
>
>????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?


2009-04-20 01:49:24

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] ftrace, workqueuetrace: Make workqueuetracepoints use TRACE_EVENT macro

> * From: "KOSAKI Motohiro" <[email protected]>
> >> Thanks for your suggest.
> >> I read distinction between workqueue and worklet.
> >> In my schedule, this patch is first step of our target, my image of steps is:
> >> 1: Move current workqueuetracepoints into TRACEEVENT (this patch)
> >> 2: Make workqueuetrace support per-worklet output (doing)
> >> 3: Add time information to workqueuetrace's worklet stat (need above new TRACEPOINT)
> >>
> >> So, i prepared to add new worklet tracepoints in step3.
> >> What's your opinion?
> >
>
> Hello, Kosaki-san
>
> > Zhao-san, Maintainer's easy reviewability is very important.
> I agree it.
>
> > Can you switch 2 and 3 ?
> Sorry for my poor English.
> I means 2 is to add worklet stat, 3 is to add more information to worklet stat,
> 3 is based on 2, switch 2 and 3 means merge 2 and 3.

OK.

thus, I recoomend to pending 2 and 3. and change 1 as Ingo suggested.
if not, reviewer might forgot this thread.


2009-04-20 08:47:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] ftrace, workqueuetrace: Make workqueuetracepoints use TRACE_EVENT macro


* KOSAKI Motohiro <[email protected]> wrote:

> OK.
>
> thus, I recoomend to pending 2 and 3. and change 1 as Ingo
> suggested. if not, reviewer might forgot this thread.

Ok - but even the original order of patches would be fine - as long
as the end result is agreed on and is reached.

I've asked Frederic (the original author of the workqueue tracer
plugin) to collect these patches into a topic branch in his tree.
Once it's a complete story we can pull it into tip:tracing/core.

Oleg seemed not to disagree and Andrew is in hiding.

Andrew, Oleg: if you plan to make unhappy noises about this level of
instrumentation in the workqueue code, please do it sooner rather
than later, as there's quite some effort injected into this already.
A tentative non-NAK now (patches are still being sorted out) and an
Ack on the final topic tree from you (once we send it and if it's
good) and general happiness would be the ideal outcome :)

Thanks,

Ingo

2009-04-20 22:31:08

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] ftrace, workqueuetrace: Make workqueuetracepoints use TRACE_EVENT macro

On 04/20, Ingo Molnar wrote:
>
> Andrew, Oleg: if you plan to make unhappy noises about this level of
> instrumentation in the workqueue code, please do it sooner rather
> than later, as there's quite some effort injected into this already.
> A tentative non-NAK now (patches are still being sorted out) and an
> Ack on the final topic tree from you (once we send it and if it's
> good) and general happiness would be the ideal outcome :)

Imho, this info is useful, and the changes are fine.

But I didn't study kernel/trace/trace_workqueue.c yet... Sorry, I was
going to do this, but I didn't.

At first glance, with or without these new changes some parts of
trace_workqueue.c looks suspicious.

For example, I don't understand cpu_workqueue_stats->pid. Why it is
needed? Why can't we just record wq_thread itself? And if we copy
wq_thread->comm to cpu_workqueue_stats, we don't even need get/put
task_struct, afaics.

probe_workqueue_destruction() does cpumask_first(cwq->cpus_allowed),
this doesn't look right. When workqueue_cpu_callback() calls
cleanup_workqueue_thread(), wq_thread is no longer affine to CPU it
was created on. This means probe_workqueue_destruction() can't always
find the correct cpu_workqueue_stats in workqueue_cpu_stat(cpu), no?

Oleg.

2009-04-20 23:48:52

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] ftrace, workqueuetrace: Make workqueuetracepoints use TRACE_EVENT macro

On Tue, Apr 21, 2009 at 12:25:14AM +0200, Oleg Nesterov wrote:
> On 04/20, Ingo Molnar wrote:
> >
> > Andrew, Oleg: if you plan to make unhappy noises about this level of
> > instrumentation in the workqueue code, please do it sooner rather
> > than later, as there's quite some effort injected into this already.
> > A tentative non-NAK now (patches are still being sorted out) and an
> > Ack on the final topic tree from you (once we send it and if it's
> > good) and general happiness would be the ideal outcome :)
>
> Imho, this info is useful, and the changes are fine.
>
> But I didn't study kernel/trace/trace_workqueue.c yet... Sorry, I was
> going to do this, but I didn't.
>
> At first glance, with or without these new changes some parts of
> trace_workqueue.c looks suspicious.
>
> For example, I don't understand cpu_workqueue_stats->pid. Why it is
> needed? Why can't we just record wq_thread itself? And if we copy
> wq_thread->comm to cpu_workqueue_stats, we don't even need get/put
> task_struct, afaics.


We record the pid because of a race condition that can happen
against the stat tracing framework.

For example,

- the user opens the stat file
- then the entries are loaded and sorted into memory
- one of the workqueues is destroyed
- the user reads the file. We try to get the task struct
of each workqueues that were recorded using their pid.
If the task is destroyed then this is fine, we don't get
any struct. If we were using the task_struct as an identifier,
me would have derefenced a freed pointer.


> probe_workqueue_destruction() does cpumask_first(cwq->cpus_allowed),
> this doesn't look right. When workqueue_cpu_callback() calls
> cleanup_workqueue_thread(), wq_thread is no longer affine to CPU it
> was created on. This means probe_workqueue_destruction() can't always
> find the correct cpu_workqueue_stats in workqueue_cpu_stat(cpu), no?


Ah, at this stage of cpu_down() the cpu is cleared on
tsk->cpu_allowed ?

Thanks for looking at this.

Frederic.


> Oleg.
>

2009-04-21 15:34:17

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] ftrace, workqueuetrace: Make workqueuetracepoints use TRACE_EVENT macro

On 04/21, Frederic Weisbecker wrote:
>
> On Tue, Apr 21, 2009 at 12:25:14AM +0200, Oleg Nesterov wrote:
> >
> > For example, I don't understand cpu_workqueue_stats->pid. Why it is
> > needed? Why can't we just record wq_thread itself? And if we copy
> > wq_thread->comm to cpu_workqueue_stats, we don't even need get/put
> > task_struct, afaics.
>
>
> We record the pid because of a race condition that can happen
> against the stat tracing framework.
>
> For example,
>
> - the user opens the stat file
> - then the entries are loaded and sorted into memory
> - one of the workqueues is destroyed

so, its pid is freed and can be re-used,

> - the user reads the file. We try to get the task struct
> of each workqueues that were recorded using their pid.

if it is already re-used when workqueue_stat_show() is called, we
can get a wrong a wrong tsk.

Another problem with pid_t, it needs a costly find_get_pid(). And please
note that find_get_pid() uses current->nsproxy->pid_ns, this means we
can get a wrong tsk or NULL if we read debugfs from sub-namespace, but
this is minor.

Using "struct pid* pid" is better than "pid_t pid" and solves these
problems. But I still think we don't need it at all.

> If the task is destroyed then this is fine, we don't get
> any struct. If we were using the task_struct as an identifier,
> me would have derefenced a freed pointer.

Please note I said "get/put task_struct" above. probe_workqueue_creation()
does get_task_struct(), and probe_workqueue_destruction() put_task_struct().

But, unless I missed something we don't need even this. Why do we
need to dereference this task_struct? The only reason afaics is
that we read tsk->comm. But we can copy wq_thread->comm to
cpu_workqueue_stats when probe_workqueue_creation() is called . In that
case we don't need get/put, we only use cws->task (or whatever) as a
"void *", just to check "if (node->task == wq_thread)" in
probe_workqueue_insertion/etc.

But it is very possible I missed something, please correct me.

> > probe_workqueue_destruction() does cpumask_first(cwq->cpus_allowed),
> > this doesn't look right. When workqueue_cpu_callback() calls
> > cleanup_workqueue_thread(), wq_thread is no longer affine to CPU it
> > was created on. This means probe_workqueue_destruction() can't always
> > find the correct cpu_workqueue_stats in workqueue_cpu_stat(cpu), no?
>
>
> Ah, at this stage of cpu_down() the cpu is cleared on
> tsk->cpu_allowed ?

Yes. Well not exactly...

More precisely, at this stage tsk->cpu_allowed is not bound, it has
"all CPUS", == cpu_possible_mask. This means that cpumask_first()
returns the "random" value. The fix is simple, I think you should
just add "int cpu" argument to cleanup_workqueue_thread().


Hmm... why cpu_workqueue_stats->inserted is atomic_t? We increment it
in do_workqueue_insertion() (should be static, no?), and it is always
called with cpu_workqueue_struct->lock held.

Oleg.

2009-04-21 15:55:29

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] ftrace, workqueuetrace: Make workqueuetracepoints use TRACE_EVENT macro

On 04/21, Oleg Nesterov wrote:
>
> On 04/21, Frederic Weisbecker wrote:
> >
> > For example,
> >
> > - the user opens the stat file
> > - then the entries are loaded and sorted into memory
> > - one of the workqueues is destroyed

Hmm. And what protects cws when, say, workqueue_stat_show() is called?
Can't we race with probe_workqueue_destruction()->kfree() ?

Oleg.

2009-04-21 18:28:21

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] ftrace, workqueuetrace: Make workqueuetracepoints use TRACE_EVENT macro

On Tue, Apr 21, 2009 at 05:28:43PM +0200, Oleg Nesterov wrote:
> On 04/21, Frederic Weisbecker wrote:
> >
> > On Tue, Apr 21, 2009 at 12:25:14AM +0200, Oleg Nesterov wrote:
> > >
> > > For example, I don't understand cpu_workqueue_stats->pid. Why it is
> > > needed? Why can't we just record wq_thread itself? And if we copy
> > > wq_thread->comm to cpu_workqueue_stats, we don't even need get/put
> > > task_struct, afaics.
> >
> >
> > We record the pid because of a race condition that can happen
> > against the stat tracing framework.
> >
> > For example,
> >
> > - the user opens the stat file
> > - then the entries are loaded and sorted into memory
> > - one of the workqueues is destroyed
>
> so, its pid is freed and can be re-used,



Indeed.



> > - the user reads the file. We try to get the task struct
> > of each workqueues that were recorded using their pid.
>
> if it is already re-used when workqueue_stat_show() is called, we
> can get a wrong a wrong tsk.
>
> Another problem with pid_t, it needs a costly find_get_pid(). And please
> note that find_get_pid() uses current->nsproxy->pid_ns, this means we
> can get a wrong tsk or NULL if we read debugfs from sub-namespace, but
> this is minor.
>
> Using "struct pid* pid" is better than "pid_t pid" and solves these
> problems. But I still think we don't need it at all.



Ok, I think you're right. I will manage to remove the pid and
use the task_struct as the identifier.



> > If the task is destroyed then this is fine, we don't get
> > any struct. If we were using the task_struct as an identifier,
> > me would have derefenced a freed pointer.
>
> Please note I said "get/put task_struct" above. probe_workqueue_creation()
> does get_task_struct(), and probe_workqueue_destruction() put_task_struct().
>
> But, unless I missed something we don't need even this. Why do we
> need to dereference this task_struct? The only reason afaics is
> that we read tsk->comm. But we can copy wq_thread->comm to
> cpu_workqueue_stats when probe_workqueue_creation() is called . In that
> case we don't need get/put, we only use cws->task (or whatever) as a
> "void *", just to check "if (node->task == wq_thread)" in
> probe_workqueue_insertion/etc.
>
> But it is very possible I missed something, please correct me.



No it's a very good idea!
Such a way to proceed will solve a lot of the racy issues that you are
reporting.


> > > probe_workqueue_destruction() does cpumask_first(cwq->cpus_allowed),
> > > this doesn't look right. When workqueue_cpu_callback() calls
> > > cleanup_workqueue_thread(), wq_thread is no longer affine to CPU it
> > > was created on. This means probe_workqueue_destruction() can't always
> > > find the correct cpu_workqueue_stats in workqueue_cpu_stat(cpu), no?
> >
> >
> > Ah, at this stage of cpu_down() the cpu is cleared on
> > tsk->cpu_allowed ?
>
> Yes. Well not exactly...
>
> More precisely, at this stage tsk->cpu_allowed is not bound, it has
> "all CPUS", == cpu_possible_mask. This means that cpumask_first()
> returns the "random" value. The fix is simple, I think you should
> just add "int cpu" argument to cleanup_workqueue_thread().
>



Ok.
But where is it done? I mean, by looking at workqueue_cpu_callback()
in workqueue.c, I only see this part related to _cpu_down():

case CPU_POST_DEAD:
cleanup_workqueue_thread(cwq);
break;



> Hmm... why cpu_workqueue_stats->inserted is atomic_t? We increment it
> in do_workqueue_insertion() (should be static, no?), and it is always
> called with cpu_workqueue_struct->lock held.


Ah indeed. Good catch.

Your comments are very useful. I will fix all that on top of Zhaolei
patches.

Thanks a lot Oleg!

Frederic.

2009-04-21 18:33:40

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] ftrace, workqueuetrace: Make workqueuetracepoints use TRACE_EVENT macro

On Tue, Apr 21, 2009 at 05:50:32PM +0200, Oleg Nesterov wrote:
> On 04/21, Oleg Nesterov wrote:
> >
> > On 04/21, Frederic Weisbecker wrote:
> > >
> > > For example,
> > >
> > > - the user opens the stat file
> > > - then the entries are loaded and sorted into memory
> > > - one of the workqueues is destroyed
>
> Hmm. And what protects cws when, say, workqueue_stat_show() is called?
> Can't we race with probe_workqueue_destruction()->kfree() ?
>
> Oleg.
>


Indeed. I plan to implement a waitqueue on the stat tracing framework
to fix it (per tracer of course).

Thanks.

2009-04-21 19:42:59

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v3 1/1] ftrace, workqueuetrace: Make workqueuetracepoints use TRACE_EVENT macro

On 04/21, Frederic Weisbecker wrote:
>
> On Tue, Apr 21, 2009 at 05:28:43PM +0200, Oleg Nesterov wrote:
> > More precisely, at this stage tsk->cpu_allowed is not bound, it has
> > "all CPUS", == cpu_possible_mask. This means that cpumask_first()
> > returns the "random" value. The fix is simple, I think you should
> > just add "int cpu" argument to cleanup_workqueue_thread().
>
> Ok.
> But where is it done? I mean, by looking at workqueue_cpu_callback()
> in workqueue.c, I only see this part related to _cpu_down():
>
> case CPU_POST_DEAD:
> cleanup_workqueue_thread(cwq);
> break;

There are two case. Please note that create_workqueue_thread() does not
bind the thread, we delay kthread_bind() untill start_workqueue_thread().
So, if we have CPU_UP_CANCELED (after CPU_UP_PREPARE) we call
cleanup_workqueue_thread() but the thread still has ->cpu_allowed ==
cpu_all_mask.

Another case is a normal cpu_down(). When CPU_POST_DEAD happens, this
CPU is really dead. All tasks were alredy moved from the dead cpu.
And since wq_thread was affine to that CPU, it gets the "full" cpu mask.

OK, I don't think it is possible to parse the text above ;) Please
look at /* No more Mr. Nice Guy. */ in move_task_off_dead_cpu().

Oleg.