2010-04-12 16:56:36

by Masami Hiramatsu

[permalink] [raw]
Subject: [PATCH -mm v6] tracepoint: Add signal coredump tracepoint

Add signal coredump tracepoint which shows signal number,
mm->flags, core file size limitation, the result of
coredump, and core file name.

This tracepoint requirement comes mainly from the viewpoint of
administrators. Since now we have introduced many coredump
configurations (e.g. dumpable, coredump_filter, core_pattern,
etc) and some of them can be modified by users, it will be hard
to know what was actually dumped (or not dumped) after some
problem happened on the system. For example, a process didn't
generated core, coredump doesn't have some sections, etc.
In those cases, the coredump tracepoint can help us to know
why the core file is so big or small, or not generated, by
recording all configurations for all processes on the system.
That will reduce system-administration cost.

Changes in v6:
- Update against the latest -mm.

Changes in v5:
- Just update against the latest -tip.

Changes in v4:
- Rename limit trace-argument to core_size_limit.

Changes in v3:
- Move tracepoint at the end of do_coredump() for tracing
the result of coredump.
- Use retval to record error-code at every failure points
for passing the result of coredump to tracepoint.
- Trace retval instead of cprm->file for recording the
result of coredump.

Changes in v2:
- Fix a bug to clear file local variable when
call_usermodehelper_pipe() is failed.

Signed-off-by: Masami Hiramatsu <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
---

fs/exec.c | 50 ++++++++++++++++++++++++++++++++---------
include/trace/events/signal.h | 48 +++++++++++++++++++++++++++++++++++++++
2 files changed, 87 insertions(+), 11 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 725d7ef..07e0d3e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
+#include <trace/events/signal.h>

#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -1833,7 +1834,7 @@ static int umh_pipe_setup(struct subprocess_info *info)
void do_coredump(long signr, int exit_code, struct pt_regs *regs)
{
struct core_state core_state;
- char corename[CORENAME_MAX_SIZE + 1];
+ char corename[CORENAME_MAX_SIZE + 1] = "";
struct mm_struct *mm = current->mm;
struct linux_binfmt * binfmt;
const struct cred *old_cred;
@@ -1844,6 +1845,7 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
static atomic_t core_dump_count = ATOMIC_INIT(0);
struct coredump_params cprm = {
.signr = signr,
+ .file = NULL,
.regs = regs,
.limit = rlimit(RLIMIT_CORE),
/*
@@ -1857,14 +1859,19 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
audit_core_dumps(signr);

binfmt = mm->binfmt;
- if (!binfmt || !binfmt->core_dump)
+ if (!binfmt || !binfmt->core_dump) {
+ retval = -ENOSYS;
goto fail;
+ }
if (!__get_dumpable(cprm.mm_flags))
+ /* This is not an error. retval should be 0 */
goto fail;

cred = prepare_creds();
- if (!cred)
+ if (!cred) {
+ retval = -EPERM;
goto fail;
+ }
/*
* We cannot trust fsuid as being the "true" uid of the
* process nor do we know its entire history. We only know it
@@ -1919,12 +1926,14 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
"Process %d(%s) has RLIMIT_CORE set to 1\n",
task_tgid_vnr(current), current->comm);
printk(KERN_WARNING "Aborting core\n");
+ retval = -EINVAL;
goto fail_unlock;
}
cprm.limit = RLIM_INFINITY;

dump_count = atomic_inc_return(&core_dump_count);
if (core_pipe_limit && (core_pipe_limit < dump_count)) {
+ retval = -EFBIG;
printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
task_tgid_vnr(current), current->comm);
printk(KERN_WARNING "Skipping core dump\n");
@@ -1933,6 +1942,7 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)

helper_argv = argv_split(GFP_KERNEL, corename+1, NULL);
if (!helper_argv) {
+ retval = -ENOMEM;
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
goto fail_dropcount;
@@ -1950,35 +1960,50 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
} else {
struct inode *inode;

- if (cprm.limit < binfmt->min_coredump)
+ if (cprm.limit < binfmt->min_coredump) {
+ retval = -EFBIG;
goto fail_unlock;
+ }

cprm.file = filp_open(corename,
O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
0600);
- if (IS_ERR(cprm.file))
+ if (IS_ERR(cprm.file)) {
+ retval = (int)PTR_ERR(cprm.file);
goto fail_unlock;
+ }

inode = cprm.file->f_path.dentry->d_inode;
- if (inode->i_nlink > 1)
+ if (inode->i_nlink > 1) {
+ retval = -EMLINK;
goto close_fail;
- if (d_unhashed(cprm.file->f_path.dentry))
+ }
+ if (d_unhashed(cprm.file->f_path.dentry)) {
+ retval = -EBADF;
goto close_fail;
+ }
/*
* AK: actually i see no reason to not allow this for named
* pipes etc, but keep the previous behaviour for now.
*/
- if (!S_ISREG(inode->i_mode))
+ if (!S_ISREG(inode->i_mode)) {
+ retval = -EBADF;
goto close_fail;
+ }
/*
* Dont allow local users get cute and trick others to coredump
* into their pre-created files.
*/
- if (inode->i_uid != current_fsuid())
+ if (inode->i_uid != current_fsuid()) {
+ retval = -EPERM;
goto close_fail;
- if (!cprm.file->f_op || !cprm.file->f_op->write)
+ }
+ if (!cprm.file->f_op || !cprm.file->f_op->write) {
+ retval = -EINVAL;
goto close_fail;
- if (do_truncate(cprm.file->f_path.dentry, 0, 0, cprm.file))
+ }
+ retval = do_truncate(cprm.file->f_path.dentry, 0, 0, cprm.file);
+ if (retval)
goto close_fail;
}

@@ -2000,5 +2025,8 @@ fail_unlock:
fail_creds:
put_cred(cred);
fail:
+ /* Trace coredump parameters and return value */
+ trace_signal_coredump(&cprm, corename, retval);
+
return;
}
diff --git a/include/trace/events/signal.h b/include/trace/events/signal.h
index a510b75..6dbc856 100644
--- a/include/trace/events/signal.h
+++ b/include/trace/events/signal.h
@@ -4,8 +4,10 @@
#if !defined(_TRACE_SIGNAL_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_SIGNAL_H

+#include <linux/err.h>
#include <linux/signal.h>
#include <linux/sched.h>
+#include <linux/binfmts.h>
#include <linux/tracepoint.h>

#define TP_STORE_SIGINFO(__entry, info) \
@@ -167,6 +169,52 @@ TRACE_EVENT(signal_lose_info,
TP_printk("sig=%d group=%d errno=%d code=%d",
__entry->sig, __entry->group, __entry->errno, __entry->code)
);
+
+/**
+ * signal_coredump - called when dumping core by signal
+ * @cprm: pointer to struct coredump_params
+ * @core_name: core-name string
+ * @retval: return value of binfmt->coredump or error-code
+ *
+ * Current process dumps core file to 'core_name' file, because 'cprm->signr'
+ * signal is delivered.
+ * 'retval' is an error code or 0/1. retval == 1 means the core file was
+ * dumped successfully and retval == 0 means binfmt->coredump failed to dump.
+ * If retval < 0, this means do_coredump() failed to dump core file before
+ * calling binfmt->coredump.
+ */
+TRACE_EVENT(signal_coredump,
+
+ TP_PROTO(struct coredump_params *cprm, const char *core_name,
+ int retval),
+
+ TP_ARGS(cprm, core_name, retval),
+
+ TP_STRUCT__entry(
+ __field( int, sig )
+ __field( unsigned long, core_size_limit )
+ __field( unsigned long, flags )
+ __field( int, retval )
+ __string( name, core_name )
+ ),
+
+
+ TP_fast_assign(
+ __entry->sig = (int)cprm->signr;
+ __entry->core_size_limit = cprm->limit;
+ __entry->flags = cprm->mm_flags;
+ __entry->retval = retval;
+ __assign_str(name, core_name);
+ ),
+
+ TP_printk("sig=%d core_size_limit=%lu dumpable=0x%lx dump_filter=0x%lx"
+ " corename=\"%s\" retval=%d",
+ __entry->sig, __entry->core_size_limit,
+ __entry->flags & MMF_DUMPABLE_MASK,
+ (__entry->flags & MMF_DUMP_FILTER_MASK) >>
+ MMF_DUMP_FILTER_SHIFT,
+ __get_str(name), __entry->retval)
+);
#endif /* _TRACE_SIGNAL_H */

/* This part must be outside protection */


--
Masami Hiramatsu
e-mail: [email protected]


2010-04-21 00:56:31

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH -mm v6] tracepoint: Add signal coredump tracepoint

> Add signal coredump tracepoint which shows signal number,
> mm->flags, core file size limitation, the result of
> coredump, and core file name.

The "retval" encoding seems a bit arcane. I wonder if it might be better
to just have separate tracepoints for successful and failed dump attempts.
Note that whether or not the dump succeeded is already available in
(task->signal->group_exit_code & 0x80) as seen at exit or death tracing
events.

> This tracepoint requirement comes mainly from the viewpoint of
> administrators. [...]

The purposes you mention seem to be served well enough by this tracepoint.
But I recall having the impression that one of the original motivating
interests for tracing core-dump details was to understand when a giant core
dump was responsible for huge amounts of i/o and/or memory thrashing.
(Once you notice that happening, you might adjust coredump_filter settings
to reduce the problem.) Your new tracepoint doesn't help directly with
tracking those sorts of issues, because it only happens after all the work
is done. If you are monitoring trace_signal_deliver, then you can filter
those for SIG_DFL cases of sig_kernel_coredump() signals and recognize that
as the beginning of the coredump. Still, it might be preferable to have
explicit start-core-dump and end-core-dump tracepoints.

Furthermore, I can see potential use for tracepoints before and after
coredump_wait(), which synchronizes other threads before actually starting
to calculate and write the dump. The window after coredump_wait() and
before the post-dump tracepoint is where the actual work of writing the
core file takes place, in case you want to monitor i/o load between those
marks or suchlike.

> - char corename[CORENAME_MAX_SIZE + 1];
> + char corename[CORENAME_MAX_SIZE + 1] = "";

This initialization of a stack array means the same as:

memset(corename, '\0', sizeof corename);

So the compiler generates that because those are the semantics.
All you need is:

corename[0] = '\0';

since this is a string.


Thanks,
Roland

2010-04-21 14:31:55

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH -mm v6] tracepoint: Add signal coredump tracepoint

Hi Roland,

Roland McGrath wrote:
>> Add signal coredump tracepoint which shows signal number,
>> mm->flags, core file size limitation, the result of
>> coredump, and core file name.
>
> The "retval" encoding seems a bit arcane. I wonder if it might be better
> to just have separate tracepoints for successful and failed dump attempts.
> Note that whether or not the dump succeeded is already available in
> (task->signal->group_exit_code & 0x80) as seen at exit or death tracing
> events.

OK, please read the previous discussion and tell me what you think about...
https://patchwork.kernel.org/patch/64345/
---
> > What do you think this tracepoint's use case?
>
> Frankly to say, our first attempt was tracing mm->flags because
> it can be changed by users without asking, and they sometimes
> ask why core is not perfect or why core file is so big.
>
> Perhaps, covering all of those failure cases and succeed cases,
> gives better information for them. In that case, we might better
> tweak execution(goto) path to leave some error code on retval.

This is enough acceptable to me.
---

>
>> This tracepoint requirement comes mainly from the viewpoint of
>> administrators. [...]
>
> The purposes you mention seem to be served well enough by this tracepoint.
> But I recall having the impression that one of the original motivating
> interests for tracing core-dump details was to understand when a giant core
> dump was responsible for huge amounts of i/o and/or memory thrashing.
> (Once you notice that happening, you might adjust coredump_filter settings
> to reduce the problem.) Your new tracepoint doesn't help directly with
> tracking those sorts of issues, because it only happens after all the work
> is done. If you are monitoring trace_signal_deliver, then you can filter
> those for SIG_DFL cases of sig_kernel_coredump() signals and recognize that
> as the beginning of the coredump. Still, it might be preferable to have
> explicit start-core-dump and end-core-dump tracepoints.

No, that's not our interests. We are just interested in recording
coredump parameters when the coredump is done. After dumping core
(or failing to dump), we'd like to check what was wrong (wrong filter setting?),
or correctly dumped (and removed after).

> Furthermore, I can see potential use for tracepoints before and after
> coredump_wait(), which synchronizes other threads before actually starting
> to calculate and write the dump. The window after coredump_wait() and
> before the post-dump tracepoint is where the actual work of writing the
> core file takes place, in case you want to monitor i/o load between those
> marks or suchlike.

Hmm, currently, we don't mention that.

>
>> - char corename[CORENAME_MAX_SIZE + 1];
>> + char corename[CORENAME_MAX_SIZE + 1] = "";
>
> This initialization of a stack array means the same as:
>
> memset(corename, '\0', sizeof corename);
>
> So the compiler generates that because those are the semantics.
> All you need is:
>
> corename[0] = '\0';
>
> since this is a string.

hm, ok. that's nice to fix.

Thank you,

>
>
> Thanks,
> Roland

--
Masami Hiramatsu
e-mail: [email protected]

2010-04-24 00:51:35

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH -mm v6] tracepoint: Add signal coredump tracepoint

Roland McGrath wrote:
>>> The "retval" encoding seems a bit arcane. I wonder if it might be better
>>> to just have separate tracepoints for successful and failed dump attempts.
>>> Note that whether or not the dump succeeded is already available in
>>> (task->signal->group_exit_code & 0x80) as seen at exit or death tracing
>>> events.
>>
>> OK, please read the previous discussion and tell me what you think about...
>
> I don't have a strong opinion about this particular aspect.

I think retval decoding will help us to find which condition caused
failing the coredump, by reading the source code.
So, I'd like to leave it.

>>> The purposes you mention seem to be served well enough by this tracepoint.
>>> But I recall having the impression that one of the original motivating
>>> interests for tracing core-dump details was to understand when a giant core
>>> dump was responsible for huge amounts of i/o and/or memory thrashing.
>>> (Once you notice that happening, you might adjust coredump_filter settings
>>> to reduce the problem.) Your new tracepoint doesn't help directly with
>>> tracking those sorts of issues, because it only happens after all the work
>>> is done. If you are monitoring trace_signal_deliver, then you can filter
>>> those for SIG_DFL cases of sig_kernel_coredump() signals and recognize that
>>> as the beginning of the coredump. Still, it might be preferable to have
>>> explicit start-core-dump and end-core-dump tracepoints.
>>
>> No, that's not our interests. We are just interested in recording
>> coredump parameters when the coredump is done. After dumping core
>> (or failing to dump), we'd like to check what was wrong (wrong filter setting?),
>> or correctly dumped (and removed after).
>>
>>> Furthermore, I can see potential use for tracepoints before and after
>>> coredump_wait(), which synchronizes other threads before actually starting
>>> to calculate and write the dump. The window after coredump_wait() and
>>> before the post-dump tracepoint is where the actual work of writing the
>>> core file takes place, in case you want to monitor i/o load between those
>>> marks or suchlike.
>>
>> Hmm, currently, we don't mention that.
>
> I know that's not your interest now. But I do recall someone somewhere at
> some point asking about tracepoints in the context of observing unexpected
> i/o and paging loads and tracking down that they were due to core dumping.
> It's also certainly the case that there have been people spending time
> diagnosing long delays due to coredump_wait() logic and/or actual deadlocks
> due to bugs in the exit path interactions with coredump_wait(). Having
> separate before-wait, after-wait, and after-dump tracepoints could well
> make that sort of diagnosis trivial in the future, especially when doing
> "flight recorder" style analysis.

Hmm, indeed. it seems that those tracepoints are useful for finding
unexpected delays from coredump...
OK, I'll try to add those tracepoints. Would you have any recommended data
which those tracepoints should record?

Thank you!


--
Masami Hiramatsu
e-mail: [email protected]

2010-04-24 00:56:13

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH -mm v6] tracepoint: Add signal coredump tracepoint

> > The "retval" encoding seems a bit arcane. I wonder if it might be better
> > to just have separate tracepoints for successful and failed dump attempts.
> > Note that whether or not the dump succeeded is already available in
> > (task->signal->group_exit_code & 0x80) as seen at exit or death tracing
> > events.
>
> OK, please read the previous discussion and tell me what you think about...

I don't have a strong opinion about this particular aspect.

> > The purposes you mention seem to be served well enough by this tracepoint.
> > But I recall having the impression that one of the original motivating
> > interests for tracing core-dump details was to understand when a giant core
> > dump was responsible for huge amounts of i/o and/or memory thrashing.
> > (Once you notice that happening, you might adjust coredump_filter settings
> > to reduce the problem.) Your new tracepoint doesn't help directly with
> > tracking those sorts of issues, because it only happens after all the work
> > is done. If you are monitoring trace_signal_deliver, then you can filter
> > those for SIG_DFL cases of sig_kernel_coredump() signals and recognize that
> > as the beginning of the coredump. Still, it might be preferable to have
> > explicit start-core-dump and end-core-dump tracepoints.
>
> No, that's not our interests. We are just interested in recording
> coredump parameters when the coredump is done. After dumping core
> (or failing to dump), we'd like to check what was wrong (wrong filter setting?),
> or correctly dumped (and removed after).
>
> > Furthermore, I can see potential use for tracepoints before and after
> > coredump_wait(), which synchronizes other threads before actually starting
> > to calculate and write the dump. The window after coredump_wait() and
> > before the post-dump tracepoint is where the actual work of writing the
> > core file takes place, in case you want to monitor i/o load between those
> > marks or suchlike.
>
> Hmm, currently, we don't mention that.

I know that's not your interest now. But I do recall someone somewhere at
some point asking about tracepoints in the context of observing unexpected
i/o and paging loads and tracking down that they were due to core dumping.
It's also certainly the case that there have been people spending time
diagnosing long delays due to coredump_wait() logic and/or actual deadlocks
due to bugs in the exit path interactions with coredump_wait(). Having
separate before-wait, after-wait, and after-dump tracepoints could well
make that sort of diagnosis trivial in the future, especially when doing
"flight recorder" style analysis.


Thanks,
Roland

2010-04-26 18:00:22

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH -mm v6] tracepoint: Add signal coredump tracepoint

> I think retval decoding will help us to find which condition caused
> failing the coredump, by reading the source code.
> So, I'd like to leave it.

Having a proper -ERR* code for the cases that have one in your patch is
certainly good. What I meant was using the 0/1 values to distinguish
success vs failure from the binfmt dumper. If there were separate
tracepoints for success vs failure, then the failure one should certainly
get an error code, which would be 0 when the error (or refusal to dump) was
due to some decision made by the binfmt code rather than a write error.

> Hmm, indeed. it seems that those tracepoints are useful for finding
> unexpected delays from coredump...
> OK, I'll try to add those tracepoints. Would you have any recommended data
> which those tracepoints should record?

Whatever is handy, I suppose. i.e. of the things you pass into the
tracepoint now, give each tracepoint the subset that makes sense for its
case. For the tracepoint after synchronization and before dumping, I think
it should be more or less right after format_corename() and it can pass the
ispipe, corename, cprm.limit and binfmt->min_coredump values that affect
the tests immediately thereafter (as well as the full cprm and binfmt
pointers).


Thanks,
Roland