2024-02-16 19:06:06

by Wen Yang

[permalink] [raw]
Subject: [PATCH] coredump debugging: add a tracepoint to report the coredumping

From: Wen Yang <[email protected]>

Currently coredump_task_exit() takes some time to wait for the generation
of the dump file. But if the user-space wants to receive a notification
as soon as possible it maybe inconvenient.

Add the new trace_sched_process_coredump() into coredump_task_exit(),
this way a user-space monitor could easily wait for the exits and
potentially make some preparations in advance.

Signed-off-by: Wen Yang <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
---
include/trace/events/sched.h | 7 +++++++
kernel/exit.c | 1 +
2 files changed, 8 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index dbb01b4b7451..ce7448065986 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -334,6 +334,13 @@ DEFINE_EVENT(sched_process_template, sched_process_exit,
TP_PROTO(struct task_struct *p),
TP_ARGS(p));

+/*
+ * Tracepoint for a task coredumping:
+ */
+DEFINE_EVENT(sched_process_template, sched_process_coredump,
+ TP_PROTO(struct task_struct *p),
+ TP_ARGS(p));
+
/*
* Tracepoint for waiting on task to unschedule:
*/
diff --git a/kernel/exit.c b/kernel/exit.c
index 493647fd7c07..c11e12d73f4e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -425,6 +425,7 @@ static void coredump_task_exit(struct task_struct *tsk)
self.next = xchg(&core_state->dumper.next, &self);
else
self.task = NULL;
+ trace_sched_process_coredump(tsk);
/*
* Implies mb(), the result of xchg() must be visible
* to core_state->dumper.
--
2.25.1



2024-02-17 10:50:59

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping

On 02/17, [email protected] wrote:
>
> From: Wen Yang <[email protected]>
>
> Currently coredump_task_exit() takes some time to wait for the generation
> of the dump file. But if the user-space wants to receive a notification
> as soon as possible it maybe inconvenient.
>
> Add the new trace_sched_process_coredump() into coredump_task_exit(),
> this way a user-space monitor could easily wait for the exits and
> potentially make some preparations in advance.

Can't comment, I never know when the new tracepoint will make sense.

Stupid question. Can we simply shift trace_sched_process_exit() up
before coredump_task_exit() ?

Oleg.


> Signed-off-by: Wen Yang <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: [email protected]
> ---
> include/trace/events/sched.h | 7 +++++++
> kernel/exit.c | 1 +
> 2 files changed, 8 insertions(+)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index dbb01b4b7451..ce7448065986 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -334,6 +334,13 @@ DEFINE_EVENT(sched_process_template, sched_process_exit,
> TP_PROTO(struct task_struct *p),
> TP_ARGS(p));
>
> +/*
> + * Tracepoint for a task coredumping:
> + */
> +DEFINE_EVENT(sched_process_template, sched_process_coredump,
> + TP_PROTO(struct task_struct *p),
> + TP_ARGS(p));
> +
> /*
> * Tracepoint for waiting on task to unschedule:
> */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 493647fd7c07..c11e12d73f4e 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -425,6 +425,7 @@ static void coredump_task_exit(struct task_struct *tsk)
> self.next = xchg(&core_state->dumper.next, &self);
> else
> self.task = NULL;
> + trace_sched_process_coredump(tsk);
> /*
> * Implies mb(), the result of xchg() must be visible
> * to core_state->dumper.
> --
> 2.25.1
>


2024-02-18 15:35:11

by Wen Yang

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping


On 2024/2/17 18:49, Oleg Nesterov wrote:
> On 02/17, [email protected] wrote:
>> From: Wen Yang <[email protected]>
>>
>> Currently coredump_task_exit() takes some time to wait for the generation
>> of the dump file. But if the user-space wants to receive a notification
>> as soon as possible it maybe inconvenient.
>>
>> Add the new trace_sched_process_coredump() into coredump_task_exit(),
>> this way a user-space monitor could easily wait for the exits and
>> potentially make some preparations in advance.
> Can't comment, I never know when the new tracepoint will make sense.
>
> Stupid question.
> Oleg.

Thanks for your help.

trace_sched_process_exit() is located after the PF_EXITING flag is set,
so it could not be moved to there.
Could we make the following modifications?

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index dbb01b4b7451..53e9420540dc 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -334,6 +334,13 @@ DEFINE_EVENT(sched_process_template,
sched_process_exit,
             TP_PROTO(struct task_struct *p),
             TP_ARGS(p));

+/*
+ * Tracepoint for killing a task by a signal:
+ */
+DEFINE_EVENT(sched_process_template, sched_process_kill,
+            TP_PROTO(struct task_struct *p),
+            TP_ARGS(p));
+
 /*
  * Tracepoint for waiting on task to unschedule:
  */
diff --git a/kernel/signal.c b/kernel/signal.c
index 9b40109f0c56..571342799824 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2866,6 +2866,7 @@ bool get_signal(struct ksignal *ksig)
                 * Anything else is fatal, maybe with a core dump.
                 */
                current->flags |= PF_SIGNALED;
+               trace_sched_process_kill(current);

                if (sig_kernel_coredump(signr)) {
                        if (print_fatal_signals)

--

Best wishes,

Wen


>
>> Signed-off-by: Wen Yang <[email protected]>
>> Cc: Oleg Nesterov <[email protected]>
>> Cc: Steven Rostedt <[email protected]>
>> Cc: Masami Hiramatsu <[email protected]>
>> Cc: Mathieu Desnoyers <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: [email protected]
>> ---
>> include/trace/events/sched.h | 7 +++++++
>> kernel/exit.c | 1 +
>> 2 files changed, 8 insertions(+)
>>
>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>> index dbb01b4b7451..ce7448065986 100644
>> --- a/include/trace/events/sched.h
>> +++ b/include/trace/events/sched.h
>> @@ -334,6 +334,13 @@ DEFINE_EVENT(sched_process_template, sched_process_exit,
>> TP_PROTO(struct task_struct *p),
>> TP_ARGS(p));
>>
>> +/*
>> + * Tracepoint for a task coredumping:
>> + */
>> +DEFINE_EVENT(sched_process_template, sched_process_coredump,
>> + TP_PROTO(struct task_struct *p),
>> + TP_ARGS(p));
>> +
>> /*
>> * Tracepoint for waiting on task to unschedule:
>> */
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 493647fd7c07..c11e12d73f4e 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -425,6 +425,7 @@ static void coredump_task_exit(struct task_struct *tsk)
>> self.next = xchg(&core_state->dumper.next, &self);
>> else
>> self.task = NULL;
>> + trace_sched_process_coredump(tsk);
>> /*
>> * Implies mb(), the result of xchg() must be visible
>> * to core_state->dumper.
>> --
>> 2.25.1
>>


2024-02-18 17:53:41

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping

On 02/18, Wen Yang wrote:
>
> On 2024/2/17 18:49, Oleg Nesterov wrote:
> >On 02/17, [email protected] wrote:
> >>From: Wen Yang <[email protected]>
> >>
> >>Currently coredump_task_exit() takes some time to wait for the generation
> >>of the dump file. But if the user-space wants to receive a notification
> >>as soon as possible it maybe inconvenient.
> >>
> >>Add the new trace_sched_process_coredump() into coredump_task_exit(),
> >>this way a user-space monitor could easily wait for the exits and
> >>potentially make some preparations in advance.
> >Can't comment, I never know when the new tracepoint will make sense.
> >
> >Stupid question.
> >Oleg.
>
> Thanks for your help.

Well thanks, but no, I can't help. As I said I can't really comment this
patch.

> trace_sched_process_exit()?is located after the PF_EXITING flag is set

Yes,

> so it could not be moved to there.

Why? DECLARE_EVENT_CLASS(sched_process_template) doesn't report task->flags.

Again, again, I am not arguing. But I think that the changelog should
explain why we can't move trace_sched_process_exit() in more details.

> Could we make the following modifications?
..
>
> @@ -2866,6 +2866,7 @@ bool get_signal(struct ksignal *ksig)
> ???????????????? * Anything else is fatal, maybe with a core dump.
> ???????????????? */
> ??????????????? current->flags |= PF_SIGNALED;
> +?????????????? trace_sched_process_kill(current);

Another case when I can't comment the intent.

We already have trace_signal_deliver() in get_signal(). I'm afraid you
need to explain why do you think userspace needs yet another tracepoint.

Oleg.


2024-02-19 16:28:03

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping

On Sat, 17 Feb 2024 11:49:24 +0100
Oleg Nesterov <[email protected]> wrote:

> On 02/17, [email protected] wrote:
> >
> > From: Wen Yang <[email protected]>
> >
> > Currently coredump_task_exit() takes some time to wait for the generation
> > of the dump file. But if the user-space wants to receive a notification
> > as soon as possible it maybe inconvenient.
> >
> > Add the new trace_sched_process_coredump() into coredump_task_exit(),
> > this way a user-space monitor could easily wait for the exits and
> > potentially make some preparations in advance.
>
> Can't comment, I never know when the new tracepoint will make sense.
>
> Stupid question. Can we simply shift trace_sched_process_exit() up
> before coredump_task_exit() ?

Reading the rest of the thread and looking at the code, we do have this:

void __noreturn do_exit(long code)
{
struct task_struct *tsk = current;
int group_dead;

[...]
acct_collect(code, group_dead);
if (group_dead)
tty_audit_exit();
audit_free(tsk);

tsk->exit_code = code;
taskstats_exit(tsk, group_dead);

exit_mm();

if (group_dead)
acct_process();
trace_sched_process_exit(tsk);

There's a lot that happens before we trigger the above event. I could
imagine that there are users expecting those actions to have taken place by
the time the event triggered. Like the "exit_mm()" call, as well as many
others.

I would be leery of moving that tracepoint.

-- Steve

2024-02-19 17:02:12

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping

On 02/19, Steven Rostedt wrote:
>
> On Sat, 17 Feb 2024 11:49:24 +0100
> Oleg Nesterov <[email protected]> wrote:
>
> > On 02/17, [email protected] wrote:
> > >
> > > From: Wen Yang <[email protected]>
> > >
> > > Currently coredump_task_exit() takes some time to wait for the generation
> > > of the dump file. But if the user-space wants to receive a notification
> > > as soon as possible it maybe inconvenient.
> > >
> > > Add the new trace_sched_process_coredump() into coredump_task_exit(),
> > > this way a user-space monitor could easily wait for the exits and
> > > potentially make some preparations in advance.
> >
> > Can't comment, I never know when the new tracepoint will make sense.
> >
> > Stupid question. Can we simply shift trace_sched_process_exit() up
> > before coredump_task_exit() ?
>
> Reading the rest of the thread and looking at the code, we do have this:
>
> void __noreturn do_exit(long code)
> {
> struct task_struct *tsk = current;
> int group_dead;
>
> [...]
> acct_collect(code, group_dead);
> if (group_dead)
> tty_audit_exit();
> audit_free(tsk);
>
> tsk->exit_code = code;
> taskstats_exit(tsk, group_dead);
>
> exit_mm();
>
> if (group_dead)
> acct_process();
> trace_sched_process_exit(tsk);
>
> There's a lot that happens before we trigger the above event.

and a lot after.

To me the current placement of trace_sched_process_exit() looks absolutely
random.

> I could
> imagine that there are users expecting those actions to have taken place by
> the time the event triggered. Like the "exit_mm()" call, as well as many
> others.
>
> I would be leery of moving that tracepoint.

And I agree. I am always scared of every user-visible change, simply
because it is user-visbible.

If it was not clear, I didn't try to nack this patch. I simply do not know
how people use the tracepoints and for what. Apart from debugging.

But if we add the new one into coredump_task_exit(), then we probably want
another one in ptrace_event(PTRACE_EVENT_EXIT) ? It too can "take some time"
before the exiting task actually exits.

So I think this needs some discussion, and the changelog should probably say
more.

In short: I am glad you are here, I leave this to you and Wen ;)

Oleg.


2024-02-19 17:41:04

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping

On Mon, 19 Feb 2024 18:00:38 +0100
Oleg Nesterov <[email protected]> wrote:

> > void __noreturn do_exit(long code)
> > {
> > struct task_struct *tsk = current;
> > int group_dead;
> >
> > [...]
> > acct_collect(code, group_dead);
> > if (group_dead)
> > tty_audit_exit();
> > audit_free(tsk);
> >
> > tsk->exit_code = code;
> > taskstats_exit(tsk, group_dead);
> >
> > exit_mm();
> >
> > if (group_dead)
> > acct_process();
> > trace_sched_process_exit(tsk);
> >
> > There's a lot that happens before we trigger the above event.
>
> and a lot after.

True. There really isn't a meaningful location here is there?

I actually use this tracepoint in my pid tracing.

The set_ftrace_pid and set_event_pid from /sys/kernel/tracing will add and
remove PIDs if the options function-fork or event-fork are set respectively.

I hook to the sched_process_fork tracepoint to add new PIDs if the parent
pid is already in one of the files, and remove a PID via the
sched_process_exit function.

Honestly, if anything, it should probably be moved down right next to
perf_event_exit_task() (I never understood why perf needed its own hooks
and not just use tracepoints).

>
> To me the current placement of trace_sched_process_exit() looks absolutely
> random.

Agreed.

>
> > I could
> > imagine that there are users expecting those actions to have taken place by
> > the time the event triggered. Like the "exit_mm()" call, as well as many
> > others.
> >
> > I would be leery of moving that tracepoint.
>
> And I agree. I am always scared of every user-visible change, simply
> because it is user-visbible.
>
> If it was not clear, I didn't try to nack this patch. I simply do not know
> how people use the tracepoints and for what. Apart from debugging.
>
> But if we add the new one into coredump_task_exit(), then we probably want
> another one in ptrace_event(PTRACE_EVENT_EXIT) ? It too can "take some time"
> before the exiting task actually exits.
>
> So I think this needs some discussion, and the changelog should probably say
> more.
>
> In short: I am glad you are here, I leave this to you and Wen ;)

I still would like to have your input too ;-)

-- Steve

2024-02-19 18:16:49

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping

On 2024-02-19 12:28, Steven Rostedt wrote:
> On Mon, 19 Feb 2024 18:00:38 +0100
> Oleg Nesterov <[email protected]> wrote:
>
>>> void __noreturn do_exit(long code)
>>> {
>>> struct task_struct *tsk = current;
>>> int group_dead;
>>>
>>> [...]
>>> acct_collect(code, group_dead);
>>> if (group_dead)
>>> tty_audit_exit();
>>> audit_free(tsk);
>>>
>>> tsk->exit_code = code;
>>> taskstats_exit(tsk, group_dead);
>>>
>>> exit_mm();
>>>
>>> if (group_dead)
>>> acct_process();
>>> trace_sched_process_exit(tsk);
>>>
>>> There's a lot that happens before we trigger the above event.
>>
>> and a lot after.
>
> True. There really isn't a meaningful location here is there?
>
> I actually use this tracepoint in my pid tracing.
>
> The set_ftrace_pid and set_event_pid from /sys/kernel/tracing will add and
> remove PIDs if the options function-fork or event-fork are set respectively.
>
> I hook to the sched_process_fork tracepoint to add new PIDs if the parent
> pid is already in one of the files, and remove a PID via the
> sched_process_exit function.

No ? Those hook on sched_process_free, which is the actual point where the
task is freed (AFAIR after it's been a zombie and then waited for by another
task).

kernel/trace/trace_events.c:

void trace_event_follow_fork(struct trace_array *tr, bool enable)
{
if (enable) {
register_trace_prio_sched_process_fork(event_filter_pid_sched_process_fork,
tr, INT_MIN);
register_trace_prio_sched_process_free(event_filter_pid_sched_process_exit,
tr, INT_MAX);
} else {
unregister_trace_sched_process_fork(event_filter_pid_sched_process_fork,
tr);
unregister_trace_sched_process_free(event_filter_pid_sched_process_exit,
tr);
}
}

kernel/trace/ftrace.c:

void ftrace_pid_follow_fork(struct trace_array *tr, bool enable)
{
if (enable) {
register_trace_sched_process_fork(ftrace_pid_follow_sched_process_fork,
tr);
register_trace_sched_process_free(ftrace_pid_follow_sched_process_exit,
tr);
} else {
unregister_trace_sched_process_fork(ftrace_pid_follow_sched_process_fork,
tr);
unregister_trace_sched_process_free(ftrace_pid_follow_sched_process_exit,
tr);
}
}

AFAIU, "sched_process_exit" is issued close to the point where the task exits
(it should not go back to userspace after that). "sched_process_free" is done
when the task is really being removed.

Between "sched_process_exit" and "sched_process_free", the task can still be
observed by a trace analysis looking at sched and signal events: it's a zombie at
that stage.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-02-20 15:09:11

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping

On Mon, 19 Feb 2024 13:01:16 -0500
Mathieu Desnoyers <[email protected]> wrote:

> No ? Those hook on sched_process_free, which is the actual point where the
> task is freed (AFAIR after it's been a zombie and then waited for by another
> task).

Bah, you're correct. It *used to* be attached to sched_process_exit, and
the function callback still has that name. It was commit afcab63665742
("tracing: Use trace_sched_process_free() instead of exit() for pid
tracing") that changed it.
>
> AFAIU, "sched_process_exit" is issued close to the point where the task exits
> (it should not go back to userspace after that). "sched_process_free" is done
> when the task is really being removed.
>
> Between "sched_process_exit" and "sched_process_free", the task can still be
> observed by a trace analysis looking at sched and signal events: it's a zombie at
> that stage.

Right, thanks for reminding me what I did ;-) I guess I'm starting to get to "that age".

-- Steve

2024-02-21 15:56:30

by Wen Yang

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping



On 2024/2/20 01:00, Oleg Nesterov wrote:
> On 02/19, Steven Rostedt wrote:
>>
>> On Sat, 17 Feb 2024 11:49:24 +0100
>> Oleg Nesterov <[email protected]> wrote:
>>
>>> On 02/17, [email protected] wrote:
>>>>
>>>> From: Wen Yang <[email protected]>
>>>>
>>>> Currently coredump_task_exit() takes some time to wait for the generation
>>>> of the dump file. But if the user-space wants to receive a notification
>>>> as soon as possible it maybe inconvenient.
>>>>
>>>> Add the new trace_sched_process_coredump() into coredump_task_exit(),
>>>> this way a user-space monitor could easily wait for the exits and
>>>> potentially make some preparations in advance.
>>>
>>> Can't comment, I never know when the new tracepoint will make sense.
>>>
>>> Stupid question. Can we simply shift trace_sched_process_exit() up
>>> before coredump_task_exit() ?
>>
>> Reading the rest of the thread and looking at the code, we do have this:
>>
>> void __noreturn do_exit(long code)
>> {
>> struct task_struct *tsk = current;
>> int group_dead;
>>
>> [...]
>> acct_collect(code, group_dead);
>> if (group_dead)
>> tty_audit_exit();
>> audit_free(tsk);
>>
>> tsk->exit_code = code;
>> taskstats_exit(tsk, group_dead);
>>
>> exit_mm();
>>
>> if (group_dead)
>> acct_process();
>> trace_sched_process_exit(tsk);
>>
>> There's a lot that happens before we trigger the above event.
>
> and a lot after.
>
> To me the current placement of looks absolutely
> random.
>
>> I could
>> imagine that there are users expecting those actions to have taken place by
>> the time the event triggered. Like the "exit_mm()" call, as well as many
>> others.
>>
>> I would be leery of moving that tracepoint.
>
> And I agree. I am always scared of every user-visible change, simply
> because it is user-visbible.
>
> If it was not clear, I didn't try to nack this patch. I simply do not know
> how people use the tracepoints and for what. Apart from debugging.
>
> But if we add the new one into coredump_task_exit(), then we probably want
> another one in ptrace_event(PTRACE_EVENT_EXIT) ? It too can "take some time"
> before the exiting task actually exits.
>
> So I think this needs some discussion, and the changelog should probably say
> more.
>
> In short: I am glad you are here, I leave this to you and Wen ;)
>

Thank you Oleg, thank you Steven,

We could first put trace_sched_process_exit() aside and take a look at
these three patches:

2d4bcf886e42f0f4846a3d9bdc3a90d278903a2e ("exit: Remove
profile_task_exit & profile_munmap")

586b58cac8b4683eb58a1446fbc399de18974e40 (“exit: Move preemption fixup
up, move blocking operations down”)

And the original: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2


Could we add a new TP similar to profile_task_exit()?

--
Best wishes,
Wen






2024-02-21 16:14:24

by Wen Yang

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping



On 2024/2/20 01:28, Steven Rostedt wrote:
> On Mon, 19 Feb 2024 18:00:38 +0100
> Oleg Nesterov <[email protected]> wrote:
>
>>> void __noreturn do_exit(long code)
>>> {
>>> struct task_struct *tsk = current;
>>> int group_dead;
>>>
>>> [...]
>>> acct_collect(code, group_dead);
>>> if (group_dead)
>>> tty_audit_exit();
>>> audit_free(tsk);
>>>
>>> tsk->exit_code = code;
>>> taskstats_exit(tsk, group_dead);
>>>
>>> exit_mm();
>>>
>>> if (group_dead)
>>> acct_process();
>>> trace_sched_process_exit(tsk);
>>>
>>> There's a lot that happens before we trigger the above event.
>>
>> and a lot after.
>
> True. There really isn't a meaningful location here is there?
>
> I actually use this tracepoint in my pid tracing.
>
> The set_ftrace_pid and set_event_pid from /sys/kernel/tracing will add and
> remove PIDs if the options function-fork or event-fork are set respectively.
>
> I hook to the sched_process_fork tracepoint to add new PIDs if the parent
> pid is already in one of the files, and remove a PID via the
> sched_process_exit function.
>
> Honestly, if anything, it should probably be moved down right next to
> () (I never understood why needed its own hooks
> and not just use tracepoints).
>

Perhaps it's just because perf appeared earlier, and it doesn't rely on
TRACEPOINTS.
It is indeed reasonable to replace perf_event_exit_task() with
TRACEPOINT, and we are willing to try to modify it. It will require some
work and time.

--
Best wishes,
Wen

>>
>> To me the current placement of trace_sched_process_exit() looks absolutely
>> random.
>
> Agreed.
>
>>
>>> I could
>>> imagine that there are users expecting those actions to have taken place by
>>> the time the event triggered. Like the "exit_mm()" call, as well as many
>>> others.
>>>
>>> I would be leery of moving that tracepoint.
>>
>> And I agree. I am always scared of every user-visible change, simply
>> because it is user-visbible.
>>
>> If it was not clear, I didn't try to nack this patch. I simply do not know
>> how people use the tracepoints and for what. Apart from debugging.
>>
>> But if we add the new one into coredump_task_exit(), then we probably want
>> another one in ptrace_event(PTRACE_EVENT_EXIT) ? It too can "take some time"
>> before the exiting task actually exits.
>>
>> So I think this needs some discussion, and the changelog should probably say
>> more.
>>
>> In short: I am glad you are here, I leave this to you and Wen ;)
>
> I still would like to have your input too ;-)
>
> -- Steve


2024-02-21 17:46:51

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping

On Wed, 21 Feb 2024 23:45:58 +0800
Wen Yang <[email protected]> wrote:

> Thank you Oleg, thank you Steven,
>
> We could first put trace_sched_process_exit() aside and take a look at
> these three patches:
>
> 2d4bcf886e42f0f4846a3d9bdc3a90d278903a2e ("exit: Remove
> profile_task_exit & profile_munmap")
>
> 586b58cac8b4683eb58a1446fbc399de18974e40 (“exit: Move preemption fixup
> up, move blocking operations down”)
>
> And the original: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
>
>
> Could we add a new TP similar to profile_task_exit()?

I have no problem with adding that. But others may have other opinions on
the subject matter.

-- Steve

2024-02-21 17:52:49

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping

On Thu, 22 Feb 2024 00:00:55 +0800
Wen Yang <[email protected]> wrote:

> > Honestly, if anything, it should probably be moved down right next to
> > () (I never understood why needed its own hooks
> > and not just use tracepoints).
> >
>
> Perhaps it's just because perf appeared earlier, and it doesn't rely on
> TRACEPOINTS.

tracepoints existed in 2008 and perf was added in 2009, so time frame was
not a factor.

> It is indeed reasonable to replace perf_event_exit_task() with
> TRACEPOINT, and we are willing to try to modify it. It will require some
> work and time.

I think that would be worth while, but I guess Peter will need to approve that.

-- Steve

2024-02-23 14:24:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping

On Mon, 19 Feb 2024 13:01:16 -0500
Mathieu Desnoyers <[email protected]> wrote:

> Between "sched_process_exit" and "sched_process_free", the task can still be
> observed by a trace analysis looking at sched and signal events: it's a zombie at
> that stage.

Looking at the history of this tracepoint, it was added in 2008 by commit
0a16b60758433 ("tracing, sched: LTTng instrumentation - scheduler").
Hmm, LLTng? I wonder who the author was?

Author: Mathieu Desnoyers <[email protected]>

:-D

Mathieu, I would say it's your call on where the tracepoint can be located.
You added it, you own it!

-- Steve

2024-02-23 16:54:52

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping

On 2024-02-23 09:26, Steven Rostedt wrote:
> On Mon, 19 Feb 2024 13:01:16 -0500
> Mathieu Desnoyers <[email protected]> wrote:
>
>> Between "sched_process_exit" and "sched_process_free", the task can still be
>> observed by a trace analysis looking at sched and signal events: it's a zombie at
>> that stage.
>
> Looking at the history of this tracepoint, it was added in 2008 by commit
> 0a16b60758433 ("tracing, sched: LTTng instrumentation - scheduler").
> Hmm, LLTng? I wonder who the author was?

[ common typo: LLTng -> LTTng ;-) ]

>
> Author: Mathieu Desnoyers <[email protected]>
>
> :-D
>
> Mathieu, I would say it's your call on where the tracepoint can be located.
> You added it, you own it!

Wow! that's now 16 years ago :)

I've checked with Matthew Khouzam (maintainer of Trace Compass)
which care about this tracepoint, and we have not identified any
significant impact of moving it on its model of the scheduler, other
than slightly changing its timing.

I've also checked quickly in lttng-analyses and have not found
any code that care about its specific placement.

So I would say go ahead and move it earlier in do_exit(), it's
fine by me.

If you are interested in a bit of archeology, "sched_process_free"
originated from my ltt-experimental 0.1.99.13 kernel patch against
2.6.12-rc4-mm2 back in September 2005 (that's 19 years ago). It was
a precursor to the LTTng 0.x kernel patchset.

https://lttng.org/files/ltt-experimental/patch-2.6.12-rc4-mm2-ltt-exp-0.1.99.13.gz

Index: kernel/exit.c
===================================================================
--- a/kernel/exit.c (.../trunk/kernel/linux-2.6.12-rc4-mm2) (revision 41)
+++ b/kernel/exit.c (.../branches/mathieu/linux-2.6.12-rc4-mm2) (revision 41)
@@ -4,6 +4,7 @@
* Copyright (C) 1991, 1992 Linus Torvalds
*/

+#include <linux/ltt/ltt-facility-process.h>
#include <linux/config.h>
#include <linux/mm.h>
#include <linux/slab.h>
@@ -55,6 +56,7 @@ static void __unhash_process(struct task
}

REMOVE_LINKS(p);
+ trace_process_free(p->pid);
}

void release_task(struct task_struct * p)
@@ -832,6 +834,8 @@ fastcall NORET_TYPE void do_exit(long co
}
exit_mm(tsk);

+ trace_process_exit(tsk->pid);
+
exit_sem(tsk);
__exit_files(tsk);
__exit_fs(tsk);

This was a significant improvement over the prior LTT which only
had the equivalent of "sched_process_exit", which caused issues
with the Linux scheduler model in LTTV due to zombie processes.

Here is where it appeared in LTT back in 1999:

http://www.opersys.com/ftp/pub/LTT/TracePackage-0.9.0.tgz

patch-ltt-2.2.13-991118

diff -urN linux/kernel/exit.c linux-2.2.13/kernel/exit.c
--- linux/kernel/exit.c Tue Oct 19 20:14:02 1999
+++ linux-2.2.13/kernel/exit.c Sun Nov 7 23:49:17 1999
@@ -14,6 +14,8 @@
#include <linux/acct.h>
#endif

+#include <linux/trace.h>
+
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/mmu_context.h>
@@ -386,6 +388,8 @@
del_timer(&tsk->real_timer);
end_bh_atomic();

+ TRACE_PROCESS(TRACE_EV_PROCESS_EXIT, 0, 0);
+
lock_kernel();
fake_volatile:
#ifdef CONFIG_BSD_PROCESS_ACCT

And it was moved to its current location (after exit_mm()) a bit
later (2001):

http://www.opersys.com/ftp/pub/LTT/TraceToolkit-0.9.5pre2.tgz

Patches/patch-ltt-linux-2.4.5-vanilla-010909-1.10

diff -urN linux/kernel/exit.c /ext2/home/karym/kernel/linux-2.4.5/kernel/exit.c
--- linux/kernel/exit.c Fri May 4 17:44:06 2001
+++ /ext2/home/karym/kernel/linux-2.4.5/kernel/exit.c Wed Jun 20 12:39:24 2001
@@ -14,6 +14,8 @@
#include <linux/acct.h>
#endif

+#include <linux/trace.h>
+
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/mmu_context.h>
@@ -439,6 +441,8 @@
#endif
__exit_mm(tsk);

+ TRACE_PROCESS(TRACE_EV_PROCESS_EXIT, 0, 0);
+
lock_kernel();
sem_exit();
__exit_files(tsk);

So this sched_process_exit placement was actually decided
by Karim Yaghmour back in the LTT days (2001). I don't think
he will mind us moving it around some 23 years later. ;)

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-02-23 17:01:37

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping

On Fri, 23 Feb 2024 11:54:30 -0500
Mathieu Desnoyers <[email protected]> wrote:

> So this sched_process_exit placement was actually decided
> by Karim Yaghmour back in the LTT days (2001). I don't think
> he will mind us moving it around some 23 years later. ;)

And I wonder how many people are involved in this that are younger than
that change :-p

I'm starting to feel really old.

-- Steve

2024-02-23 17:14:49

by Karim Yaghmour

[permalink] [raw]
Subject: Re: [PATCH] coredump debugging: add a tracepoint to report the coredumping


On 2/23/24 11:54, Mathieu Desnoyers wrote:
..
> So this sched_process_exit placement was actually decided
> by Karim Yaghmour back in the LTT days (2001). I don't think
> he will mind us moving it around some 23 years later. ;)

Gee ... the shadows are longer than I thought :)

It's all yours. You guys have been doing a fantastic job and I'm happy
to be on the consumer side of it all these days :D

Cheers,

--
Karim Yaghmour
CEO - Opersys inc. / http://www.opersys.com
http://twitter.com/karimyaghmour