2015-08-24 14:33:18

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

Hi Peter and Ingo and everybody,

Here's an update of my extended error reporting patchset, addressing
review comments and adding a few more error messages to PT and BTS
drivers.

Changes since v1:
* addressed Peter's comments,
* added perf_err() annotation to intel_pt and intel_bts drivers;
especially the former is rich in EINVALs in event validation path,
so it should be immediately useful;
* changed the error code to EINVAL for the case when perf_err() is
supplied a non-constart error code;
* dropped RFC, added sign-offs.

This time around, I employed a linker trick to convert the structures
containing extended error information into integers, which are then
made to look just like normal error codes so that IS_ERR_VALUE() and
friends would still work correctly on them. So no extra pointers in
the struct perf_event or anywhere else; the extended error codes are
passed around like normal error codes. They only need to be converted
in syscalls' topmost return statements. This is done in 1/6.

Then, 2/6 illustrates how error sites can be extended to include more
information such as file names and line numbers [1]. Side note, it can
be dropped without impact on other patches.

The other patches add perf_err() annotation to a few semi-randomly
picked places in perf core (3/6), x86 bits (4/6), intel_pt (5/6) and
intel_bts (6/6).

[1] http://marc.info/?l=linux-kernel&m=141471816115946

Alexander Shishkin (6):
perf: Introduce extended syscall error reporting
perf: Add file name and line number to perf extended error reports
perf: Annotate some of the error codes with perf_err()
perf/x86: Annotate some of the error codes with perf_err()
perf/x86/intel/pt: Use extended error reporting in event
initialization
perf/x86/intel/bts: Use extended error reporting in event
initialization

arch/x86/kernel/cpu/perf_event.c | 8 +-
arch/x86/kernel/cpu/perf_event_intel_bts.c | 4 +-
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_pt.c | 29 +++++---
include/linux/perf_event.h | 82 +++++++++++++++++++++
include/uapi/linux/perf_event.h | 8 +-
kernel/events/core.c | 113 +++++++++++++++++++++++++----
7 files changed, 215 insertions(+), 31 deletions(-)

--
2.5.0


2015-08-24 14:33:22

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v2 1/6] perf: Introduce extended syscall error reporting

It has been pointed several times out that perf syscall error reporting
leaves a lot to be desired [1].

This patch introduces a fairly simple extension that allows call sites
to annotate their error codes with arbitrary strings, which will then
be copied to userspace (if they asked for it) along with the module
name that produced the error message in JSON format. This way, we can
provide both human-readable and machine-parsable information to user and
leave room for extensions in the future, such as file name and line
number if kernel debugging is enabled.

Each error "site" is referred to by its index, which is folded into an
integer error value within the range of [-PERF_ERRNO, -MAX_ERRNO], where
PERF_ERRNO is chosen to be below any known error codes, but still leaving
enough room to enumerate error sites. This way, all the traditional macros
will still handle these as error codes and we'd only have to convert them
to their original values right before returning to userspace. This way we
also don't have to worry about keeping a separate pointer to the error
message inside a perf_event.

[1] http://marc.info/?l=linux-kernel&m=141470811013082

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 71 ++++++++++++++++++++++++++++++++++
include/uapi/linux/perf_event.h | 8 +++-
kernel/events/core.c | 86 ++++++++++++++++++++++++++++++++++++++---
3 files changed, 159 insertions(+), 6 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2027809433..8e37939d21 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -56,6 +56,77 @@ struct perf_guest_info_callbacks {
#include <linux/cgroup.h>
#include <asm/local.h>

+#ifndef PERF_MODNAME
+#define PERF_MODNAME "perf"
+#endif
+
+/*
+ * Extended error reporting: annotate an error code with a string
+ * and a module name to help users diagnase problems with their
+ * attributes and whatnot.
+ */
+struct perf_err_site {
+ const char *message;
+ const char *owner;
+ const int code;
+};
+
+#ifdef CONFIG_PERF_EVENTS
+
+/*
+ * Place all occurrences of struct perf_err_site into a special section,
+ * so that we can find out their offsets, which we'll use to refer back
+ * to the error sites.
+ */
+extern const struct perf_err_site __start___perf_err[], __stop___perf_err[];
+
+#define __perf_err(__c, __m) ({ \
+ static struct perf_err_site \
+ __attribute__ ((__section__("__perf_err"))) \
+ __err_site = { \
+ .message = (__m), \
+ .owner = PERF_MODNAME, \
+ .code = __builtin_constant_p((__c)) ? \
+ (__c) : -EINVAL, \
+ }; \
+ &__err_site; \
+})
+
+/*
+ * Use part of the [-1, -MAX_ERRNO] errno range for perf's extended error
+ * reporting. Anything within [-PERF_ERRNO, -MAX_ERRNO] is an index of a
+ * perf_err_site structure within __perf_err section. 3k should be enough
+ * for everybody, but let's add a boot-time warning just in case it overflows
+ * one day.
+ */
+#define PERF_ERRNO 1024
+
+static inline int perf_errno(const struct perf_err_site *site)
+{
+ unsigned long err = site - __start___perf_err;
+
+ return -(int)err - PERF_ERRNO;
+}
+
+static inline const struct perf_err_site *perf_errno_to_site(int err)
+{
+ return __start___perf_err - err - PERF_ERRNO;
+}
+
+#ifdef MODULE
+/*
+ * Module support is a tad trickier, but far from rocket surgery. Let's
+ * bypass it for now.
+ */
+#define perf_err(__c, __m) (__c)
+#else
+#define perf_err(__c, __m) (perf_errno(__perf_err((__c), (__m))))
+#endif
+
+#define PERF_ERR_PTR(__e, __m) (ERR_PTR(perf_err(__e, __m)))
+
+#endif
+
struct perf_callchain_entry {
__u64 nr;
__u64 ip[PERF_MAX_STACK_DEPTH];
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 2881145cda..73b6919954 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -264,6 +264,7 @@ enum perf_event_read_format {
/* add: sample_stack_user */
#define PERF_ATTR_SIZE_VER4 104 /* add: sample_regs_intr */
#define PERF_ATTR_SIZE_VER5 112 /* add: aux_watermark */
+#define PERF_ATTR_SIZE_VER6 120 /* add: perf_err */

/*
* Hardware event_id to monitor via a performance monitoring event:
@@ -375,7 +376,12 @@ struct perf_event_attr {
* Wakeup watermark for AUX area
*/
__u32 aux_watermark;
- __u32 __reserved_2; /* align to __u64 */
+
+ /*
+ * Extended error reporting buffer
+ */
+ __u32 perf_err_size;
+ __u64 perf_err;
};

#define perf_flags(attr) (*(&(attr)->read_format + 1))
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ae16867670..c6f43d1d5f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,6 +49,79 @@

#include <asm/irq_regs.h>

+static bool extended_reporting_enabled(struct perf_event_attr *attr)
+{
+ if (attr->size >= PERF_ATTR_SIZE_VER6 &&
+ attr->perf_err_size > 0)
+ return true;
+
+ return false;
+}
+
+/*
+ * Provide a JSON formatted error report to the user if they asked for it.
+ */
+static void perf_error_report_site(struct perf_event_attr *attr,
+ const struct perf_err_site *site)
+{
+ unsigned long len;
+ char *buffer;
+
+ if (!site || !extended_reporting_enabled(attr))
+ return;
+
+ /* in case of nested perf_err()s, which you shouldn't really do */
+ while (site->code <= -PERF_ERRNO)
+ site = perf_errno_to_site(site->code);
+
+ buffer = kasprintf(GFP_KERNEL,
+ "{\n"
+ "\t\"code\": %d,\n"
+ "\t\"module\": \"%s\",\n"
+ "\t\"message\": \"%s\"\n"
+ "}\n",
+ site->code, site->owner, site->message
+ );
+ if (!buffer)
+ return;
+
+ /* trim the buffer to the supplied boundary */
+ len = strlen(buffer);
+ if (len >= attr->perf_err_size) {
+ len = attr->perf_err_size - 1;
+ buffer[len] = 0;
+ }
+
+ if (copy_to_user((void __user *)attr->perf_err, buffer, len + 1)) {
+ /* if we failed to copy once, don't bother later */
+ attr->perf_err_size = 0;
+ }
+
+ kfree(buffer);
+}
+
+/*
+ * Synchronous version of perf_err(), for the paths where we return immediately
+ * back to userspace.
+ */
+#define perf_err_sync(__attr, __c, __m) ({ \
+ perf_error_report_site(__attr, __perf_err((__c), (__m))); \
+ (__c); \
+})
+
+static int perf_error_report(struct perf_event_attr *attr, int err)
+{
+ const struct perf_err_site *site;
+
+ if (err > -PERF_ERRNO)
+ return err;
+
+ site = perf_errno_to_site(err);
+ perf_error_report_site(attr, site);
+
+ return site->code;
+}
+
static struct workqueue_struct *perf_wq;

typedef int (*remote_function_f)(void *);
@@ -3904,7 +3977,7 @@ perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
ret = perf_read_hw(event, buf, count);
perf_event_ctx_unlock(event, ctx);

- return ret;
+ return perf_error_report(&event->attr, ret);
}

static unsigned int perf_poll(struct file *file, poll_table *wait)
@@ -4152,7 +4225,7 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
ret = _perf_ioctl(event, cmd, arg);
perf_event_ctx_unlock(event, ctx);

- return ret;
+ return perf_error_report(&event->attr, ret);
}

#ifdef CONFIG_COMPAT
@@ -4752,7 +4825,7 @@ aux_unlock:
if (event->pmu->event_mapped)
event->pmu->event_mapped(event);

- return ret;
+ return perf_error_report(&event->attr, ret);
}

static int perf_fasync(int fd, struct file *filp, int on)
@@ -4766,7 +4839,7 @@ static int perf_fasync(int fd, struct file *filp, int on)
mutex_unlock(&inode->i_mutex);

if (retval < 0)
- return retval;
+ return perf_error_report(&event->attr, retval);

return 0;
}
@@ -8352,7 +8425,7 @@ err_group_fd:
fdput(group);
err_fd:
put_unused_fd(event_fd);
- return err;
+ return perf_error_report(&attr, err);
}

/**
@@ -9130,6 +9203,9 @@ void __init perf_event_init(void)
*/
BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
!= 1024);
+
+ /* Too many error sites; see the comment at PERF_ERRNO definition */
+ WARN_ON(__stop___perf_err - __start___perf_err > MAX_ERRNO - PERF_ERRNO);
}

ssize_t perf_event_sysfs_show(struct device *dev, struct device_attribute *attr,
--
2.5.0

2015-08-24 14:33:39

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v2 2/6] perf: Add file name and line number to perf extended error reports

If kernel debugging is enabled, include additional information to extended
syscall error reports: file name and line number, to make error hunting even
easier.

And in really desperate times, this allows for such things (a variation on
Hugh Dickins' trick [1]):

#undef EINVAL
#define EINVAL perf_err(-22, "")

However, CONFIG_DEBUG_KERNEL might not be the right option for this as I
notice that, for example, debian enables it in its kernels.

[1] http://marc.info/?l=linux-kernel&m=141215166009542

Signed-off-by: Alexander Shishkin <[email protected]>
---
include/linux/perf_event.h | 11 +++++++++++
kernel/events/core.c | 7 +++++++
2 files changed, 18 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8e37939d21..f6dd3d6071 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -68,11 +68,21 @@ struct perf_guest_info_callbacks {
struct perf_err_site {
const char *message;
const char *owner;
+#ifdef CONFIG_DEBUG_KERNEL
+ const char *file;
+ const int line;
+#endif
const int code;
};

#ifdef CONFIG_PERF_EVENTS

+#ifdef CONFIG_DEBUG_KERNEL
+#define DEBUG_PERF_ERR .file = __FILE__, .line = __LINE__,
+#else
+#define DEBUG_PERF_ERR
+#endif
+
/*
* Place all occurrences of struct perf_err_site into a special section,
* so that we can find out their offsets, which we'll use to refer back
@@ -88,6 +98,7 @@ extern const struct perf_err_site __start___perf_err[], __stop___perf_err[];
.owner = PERF_MODNAME, \
.code = __builtin_constant_p((__c)) ? \
(__c) : -EINVAL, \
+ DEBUG_PERF_ERR \
}; \
&__err_site; \
})
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c6f43d1d5f..3ff28fc8bd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -76,10 +76,17 @@ static void perf_error_report_site(struct perf_event_attr *attr,

buffer = kasprintf(GFP_KERNEL,
"{\n"
+#ifdef CONFIG_DEBUG_KERNEL
+ "\t\"file\": \"%s\",\n"
+ "\t\"line\": %d,\n"
+#endif
"\t\"code\": %d,\n"
"\t\"module\": \"%s\",\n"
"\t\"message\": \"%s\"\n"
"}\n",
+#ifdef CONFIG_DEBUG_KERNEL
+ site->file, site->line,
+#endif
site->code, site->owner, site->message
);
if (!buffer)
--
2.5.0

2015-08-24 14:35:38

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v2 3/6] perf: Annotate some of the error codes with perf_err()

This patch annotates a few semi-random error paths in perf core to
illustrate the extended error reporting facility. Most of them can
be triggered from perf tools.

Signed-off-by: Alexander Shishkin <[email protected]>
---
kernel/events/core.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3ff28fc8bd..7beab37ea6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3382,10 +3382,10 @@ find_lively_task_by_vpid(pid_t vpid)
rcu_read_unlock();

if (!task)
- return ERR_PTR(-ESRCH);
+ return PERF_ERR_PTR(-ESRCH, "task not found");

/* Reuse ptrace permission checks for now. */
- err = -EACCES;
+ err = perf_err(-EACCES, "insufficient permissions for tracing this task");
if (!ptrace_may_access(task, PTRACE_MODE_READ))
goto errout;

@@ -3413,7 +3413,8 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
if (!task) {
/* Must be root to operate on a CPU event: */
if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
- return ERR_PTR(-EACCES);
+ return PERF_ERR_PTR(-EACCES,
+ "must be root to operate on a CPU event");

/*
* We could be clever and allow to attach a event to an
@@ -3421,7 +3422,7 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
* that's for later.
*/
if (!cpu_online(cpu))
- return ERR_PTR(-ENODEV);
+ return PERF_ERR_PTR(-ENODEV, "cpu is offline");

cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
ctx = &cpuctx->ctx;
@@ -8134,15 +8135,16 @@ SYSCALL_DEFINE5(perf_event_open,

if (!attr.exclude_kernel) {
if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
- return -EACCES;
+ return perf_err_sync(&attr, -EACCES,
+ "kernel tracing forbidden for the unprivileged");
}

if (attr.freq) {
if (attr.sample_freq > sysctl_perf_event_sample_rate)
- return -EINVAL;
+ return perf_err_sync(&attr, -EINVAL, "sample_freq too high");
} else {
if (attr.sample_period & (1ULL << 63))
- return -EINVAL;
+ return perf_err_sync(&attr, -EINVAL, "sample_period too high");
}

/*
@@ -8152,14 +8154,14 @@ SYSCALL_DEFINE5(perf_event_open,
* cgroup.
*/
if ((flags & PERF_FLAG_PID_CGROUP) && (pid == -1 || cpu == -1))
- return -EINVAL;
+ return perf_err_sync(&attr, -EINVAL, "pid and cpu need to be set in cgroup mode");

if (flags & PERF_FLAG_FD_CLOEXEC)
f_flags |= O_CLOEXEC;

event_fd = get_unused_fd_flags(f_flags);
if (event_fd < 0)
- return event_fd;
+ return perf_err_sync(&attr, event_fd, "can't obtain a file descriptor");

if (group_fd != -1) {
err = perf_fget_light(group_fd, &group);
--
2.5.0

2015-08-24 14:35:45

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v2 4/6] perf/x86: Annotate some of the error codes with perf_err()

This patch annotates a few x86-specific error paths with perf's extended
error reporting facility.

Signed-off-by: Alexander Shishkin <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 8 ++++++--
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 2 +-
2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index f56cf074d0..b3b531beee 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -12,6 +12,8 @@
* For licencing details see kernel-base/COPYING
*/

+#define PERF_MODNAME "perf/x86"
+
#include <linux/perf_event.h>
#include <linux/capability.h>
#include <linux/notifier.h>
@@ -426,11 +428,13 @@ int x86_setup_perfctr(struct perf_event *event)

/* BTS is currently only allowed for user-mode. */
if (!attr->exclude_kernel)
- return -EOPNOTSUPP;
+ return perf_err(-EOPNOTSUPP,
+ "BTS sampling not allowed for kernel space");

/* disallow bts if conflicting events are present */
if (x86_add_exclusive(x86_lbr_exclusive_lbr))
- return -EBUSY;
+ return perf_err(-EBUSY,
+ "LBR conflicts with active events");

event->destroy = hw_perf_lbr_event_destroy;
}
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index b2c9475b7f..222b259c5e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -607,7 +607,7 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
* no LBR on this PMU
*/
if (!x86_pmu.lbr_nr)
- return -EOPNOTSUPP;
+ return perf_err(-EOPNOTSUPP, "LBR is not supported by this cpu");

/*
* setup SW LBR filter
--
2.5.0

2015-08-24 14:35:50

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v2 5/6] perf/x86/intel/pt: Use extended error reporting in event initialization

Event attribute validation is one of the biggest sources of EINVALs around
perf_event_open() syscall. This patch annotates error checks with extended
error reporting macros to provide the user with more details on what they
are doing wrong.

Signed-off-by: Alexander Shishkin <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_pt.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_pt.c b/arch/x86/kernel/cpu/perf_event_intel_pt.c
index 4216928344..9a09c89928 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_pt.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_pt.c
@@ -18,6 +18,8 @@

#undef DEBUG

+#define PERF_MODNAME "perf/x86/intel/pt"
+
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/types.h>
@@ -198,29 +200,29 @@ fail:
RTIT_CTL_CYC_PSB | \
RTIT_CTL_MTC)

-static bool pt_event_valid(struct perf_event *event)
+static int pt_event_valid(struct perf_event *event)
{
u64 config = event->attr.config;
u64 allowed, requested;

if ((config & PT_CONFIG_MASK) != config)
- return false;
+ return perf_err(-EINVAL, "PT config: disallowed bits");

if (config & RTIT_CTL_CYC_PSB) {
if (!pt_cap_get(PT_CAP_psb_cyc))
- return false;
+ return perf_err(-EINVAL, "PT config: PSB/CYC not supported");

allowed = pt_cap_get(PT_CAP_psb_periods);
requested = (config & RTIT_CTL_PSB_FREQ) >>
RTIT_CTL_PSB_FREQ_OFFSET;
if (requested && (!(allowed & BIT(requested))))
- return false;
+ return perf_err(-EINVAL, "PT config: unsupported PSB period");

allowed = pt_cap_get(PT_CAP_cycle_thresholds);
requested = (config & RTIT_CTL_CYC_THRESH) >>
RTIT_CTL_CYC_THRESH_OFFSET;
if (requested && (!(allowed & BIT(requested))))
- return false;
+ return perf_err(-EINVAL, "PT config: unsupported CYC threshold");
}

if (config & RTIT_CTL_MTC) {
@@ -232,20 +234,20 @@ static bool pt_event_valid(struct perf_event *event)
* CPUID is 0 will #GP, so better safe than sorry.
*/
if (!pt_cap_get(PT_CAP_mtc))
- return false;
+ return perf_err(-EINVAL, "PT config: MTC not supported");

allowed = pt_cap_get(PT_CAP_mtc_periods);
if (!allowed)
- return false;
+ return perf_err(-EINVAL, "PT config: MTC periods not supported");

requested = (config & RTIT_CTL_MTC_RANGE) >>
RTIT_CTL_MTC_RANGE_OFFSET;

if (!(allowed & BIT(requested)))
- return false;
+ return perf_err(-EINVAL, "PT config: unsupported MTC period");
}

- return true;
+ return 0;
}

/*
@@ -1111,14 +1113,17 @@ static void pt_event_destroy(struct perf_event *event)

static int pt_event_init(struct perf_event *event)
{
+ int err;
+
if (event->attr.type != pt_pmu.pmu.type)
return -ENOENT;

- if (!pt_event_valid(event))
- return -EINVAL;
+ err = pt_event_valid(event);
+ if (err)
+ return err;

if (x86_add_exclusive(x86_lbr_exclusive_pt))
- return -EBUSY;
+ return perf_err(-EBUSY, "PT conflicts with active events");

event->destroy = pt_event_destroy;

--
2.5.0

2015-08-24 14:35:48

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH v2 6/6] perf/x86/intel/bts: Use extended error reporting in event initialization

If intel_bts events would conflict with other events that are active in
the system, provide an extended error message to the user along with the
usual EBUSY.

Signed-off-by: Alexander Shishkin <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_bts.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_bts.c b/arch/x86/kernel/cpu/perf_event_intel_bts.c
index 54690e8857..c6a8da8f5a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_bts.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_bts.c
@@ -16,6 +16,8 @@

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

+#define PERF_MODNAME "perf/x86/intel/bts"
+
#include <linux/bitops.h>
#include <linux/types.h>
#include <linux/slab.h>
@@ -492,7 +494,7 @@ static int bts_event_init(struct perf_event *event)
return -ENOENT;

if (x86_add_exclusive(x86_lbr_exclusive_bts))
- return -EBUSY;
+ return perf_err(-EBUSY, "BTS conflicts with active events");

ret = x86_reserve_hardware();
if (ret) {
--
2.5.0

2015-08-25 08:22:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting


* Alexander Shishkin <[email protected]> wrote:

> Hi Peter and Ingo and everybody,
>
> Here's an update of my extended error reporting patchset, addressing
> review comments and adding a few more error messages to PT and BTS
> drivers.
>
> Changes since v1:
> * addressed Peter's comments,
> * added perf_err() annotation to intel_pt and intel_bts drivers;
> especially the former is rich in EINVALs in event validation path,
> so it should be immediately useful;
> * changed the error code to EINVAL for the case when perf_err() is
> supplied a non-constart error code;
> * dropped RFC, added sign-offs.
>
> This time around, I employed a linker trick to convert the structures
> containing extended error information into integers, which are then
> made to look just like normal error codes so that IS_ERR_VALUE() and
> friends would still work correctly on them. So no extra pointers in
> the struct perf_event or anywhere else; the extended error codes are
> passed around like normal error codes. They only need to be converted
> in syscalls' topmost return statements. This is done in 1/6.

Nice trick!

> Then, 2/6 illustrates how error sites can be extended to include more
> information such as file names and line numbers [1]. Side note, it can
> be dropped without impact on other patches.
>
> The other patches add perf_err() annotation to a few semi-randomly
> picked places in perf core (3/6), x86 bits (4/6), intel_pt (5/6) and
> intel_bts (6/6).
>
> [1] http://marc.info/?l=linux-kernel&m=141471816115946
>
> Alexander Shishkin (6):
> perf: Introduce extended syscall error reporting
> perf: Add file name and line number to perf extended error reports
> perf: Annotate some of the error codes with perf_err()
> perf/x86: Annotate some of the error codes with perf_err()
> perf/x86/intel/pt: Use extended error reporting in event
> initialization
> perf/x86/intel/bts: Use extended error reporting in event
> initialization
>
> arch/x86/kernel/cpu/perf_event.c | 8 +-
> arch/x86/kernel/cpu/perf_event_intel_bts.c | 4 +-
> arch/x86/kernel/cpu/perf_event_intel_lbr.c | 2 +-
> arch/x86/kernel/cpu/perf_event_intel_pt.c | 29 +++++---
> include/linux/perf_event.h | 82 +++++++++++++++++++++
> include/uapi/linux/perf_event.h | 8 +-
> kernel/events/core.c | 113 +++++++++++++++++++++++++----
> 7 files changed, 215 insertions(+), 31 deletions(-)

Ok, this looks pretty good to me.

Mind adding support on the perf tooling side as well, so that it can be tested and
applied in a single set of patches?

Thanks,

Ingo

2015-08-25 08:52:32

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

On Mon, 2015-08-24 at 17:32 +0300, Alexander Shishkin wrote:

> This time around, I employed a linker trick to convert the structures
> containing extended error information into integers, which are then
> made to look just like normal error codes so that IS_ERR_VALUE() and
> friends would still work correctly on them. So no extra pointers in
> the struct perf_event or anywhere else; the extended error codes are
> passed around like normal error codes. They only need to be converted
> in syscalls' topmost return statements. This is done in 1/6.
>

For the record, as we discussed separately, I'd love to see this move
to more general infrastructure. In wireless (nl80211), for example, we
have a few hundred (!) callsites returning -EINVAL, mostly based on
malformed netlink attributes, and it can be very difficult to figure
out what went wrong; debugging mostly employs a variation of Hugh's
trick.

It would be absolutely necessary to support modules, however that could
get tricky: unless we pass a THIS_MODULE through in some way to
netlink_ack() (in the nl80211 case) we'd have to store the actual index
inside the err_site and assign it on module load somehow. Passing the
module pointer seems better, but has even more wrinkles like what to do
when the error came from a nested call (e.g. nla_parse()) that's in a
different modules ...


Unrelated to all that - maybe perf_errno_to_site() should have some
range checking?

johannes

2015-08-25 09:03:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting


* Johannes Berg <[email protected]> wrote:

> On Mon, 2015-08-24 at 17:32 +0300, Alexander Shishkin wrote:
>
> > This time around, I employed a linker trick to convert the structures
> > containing extended error information into integers, which are then
> > made to look just like normal error codes so that IS_ERR_VALUE() and
> > friends would still work correctly on them. So no extra pointers in
> > the struct perf_event or anywhere else; the extended error codes are
> > passed around like normal error codes. They only need to be converted
> > in syscalls' topmost return statements. This is done in 1/6.
>
> For the record, as we discussed separately, I'd love to see this move to more
> general infrastructure. In wireless (nl80211), for example, we have a few
> hundred (!) callsites returning -EINVAL, mostly based on malformed netlink
> attributes, and it can be very difficult to figure out what went wrong;
> debugging mostly employs a variation of Hugh's trick.

Absolutely, I suggested this as well earlier today, as the scheduler would like to
make use of it in syscalls with extensible ABIs, such as sched_setattr().

If people really like this then we could go farther as well and add a standalone
'extended errors system call' as well (SyS_errno_extended_get()), which would
allow the recovery of error strings even for system calls that are not easily
extensible. We could cache the last error description in the task struct.

Thanks,

Ingo

2015-08-25 09:17:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting


* Ingo Molnar <[email protected]> wrote:

>
> * Johannes Berg <[email protected]> wrote:
>
> > On Mon, 2015-08-24 at 17:32 +0300, Alexander Shishkin wrote:
> >
> > > This time around, I employed a linker trick to convert the structures
> > > containing extended error information into integers, which are then made to
> > > look just like normal error codes so that IS_ERR_VALUE() and friends would
> > > still work correctly on them. So no extra pointers in the struct perf_event
> > > or anywhere else; the extended error codes are passed around like normal
> > > error codes. They only need to be converted in syscalls' topmost return
> > > statements. This is done in 1/6.
> >
> > For the record, as we discussed separately, I'd love to see this move to more
> > general infrastructure. In wireless (nl80211), for example, we have a few
> > hundred (!) callsites returning -EINVAL, mostly based on malformed netlink
> > attributes, and it can be very difficult to figure out what went wrong;
> > debugging mostly employs a variation of Hugh's trick.
>
> Absolutely, I suggested this as well earlier today, as the scheduler would like
> to make use of it in syscalls with extensible ABIs, such as sched_setattr().
>
> If people really like this then we could go farther as well and add a standalone
> 'extended errors system call' as well (SyS_errno_extended_get()), which would
> allow the recovery of error strings even for system calls that are not easily
> extensible. We could cache the last error description in the task struct.

If we do that then we don't even have to introduce per system call error code
conversion, but could unconditionally save the last extended error info in the
task struct and continue - this could be done very cheaply with the linker trick
driven integer ID.

I.e. system calls could opt in to do:

return err_str(-EBUSY, "perf/x86: BTS conflicts with active events");

and the overhead of this would be minimal, we'd essentially do something like this
to save the error:

current->err_code = code;

where 'code' is a build time constant in essence.

We could use this even in system calls where the error path is performance
critical, as all the string recovery and copying overhead would be triggered by
applications that opt in via the new system call:

struct err_desc {
const char *message;
const char *owner;
const int code;
};

SyS_err_get_desc(struct err_desc *err_desc __user);

[ Which could perhaps be a prctl() extension as well (PR_GET_ERR_DESC): finally
some truly matching functionality for prctl(). ]

Hm?

Thanks,

Ingo

2015-08-25 09:34:19

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

On Tue, 2015-08-25 at 11:17 +0200, Ingo Molnar wrote:
>
> If we do that then we don't even have to introduce per system call error code
> conversion, but could unconditionally save the last extended error info in the
> task struct and continue - this could be done very cheaply with the linker trick
> driven integer ID.
>
> I.e. system calls could opt in to do:
>
> > return err_str(-EBUSY, "perf/x86: BTS conflicts with active events");
>
> and the overhead of this would be minimal, we'd essentially do something like this
> to save the error:
>
> > current->err_code = code;
>
> where 'code' is a build time constant in essence.
>
> We could use this even in system calls where the error path is performance
> critical, as all the string recovery and copying overhead would be triggered by
> applications that opt in via the new system call:
>
> > struct err_desc {
> > const char *message;
> > const char *owner;
> > const int code;
> > };
>
> > SyS_err_get_desc(struct err_desc *err_desc __user);
>
> [ Which could perhaps be a prctl() extension as well (PR_GET_ERR_DESC): finally
> some truly matching functionality for prctl(). ]
>
> Hm?

That's neat in a way, but doesn't work in general I think.

Considering the wifi case, or more generally any netlink based
protocol, the syscall (sendmsg) won't return an error, but a subsequent
recvmsg() (which also won't return an error) returns an error message
[in the sense of a protocol message, not a human readable message] to a
buffer provided by the application.
However, this message can be extended relatively easily to include the
string information, but the syscall/prctl wouldn't work since the
syscalls didn't actually fail.

However, it could possibly help with the namespace/module issue if you
also store THIS_MODULE (or perhaps instead a pointer to the module's
error table) in the task. Again not in the netlink case though, I
think, that will always require special handling [although there it
could be stored away in the socket or so, similar to the task]

johannes

2015-08-25 10:07:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting


* Johannes Berg <[email protected]> wrote:

> On Tue, 2015-08-25 at 11:17 +0200, Ingo Molnar wrote:
> >
> > If we do that then we don't even have to introduce per system call error code
> > conversion, but could unconditionally save the last extended error info in the
> > task struct and continue - this could be done very cheaply with the linker trick
> > driven integer ID.
> >
> > I.e. system calls could opt in to do:
> >
> > > return err_str(-EBUSY, "perf/x86: BTS conflicts with active events");
> >
> > and the overhead of this would be minimal, we'd essentially do something like this
> > to save the error:
> >
> > > current->err_code = code;
> >
> > where 'code' is a build time constant in essence.
> >
> > We could use this even in system calls where the error path is performance
> > critical, as all the string recovery and copying overhead would be triggered by
> > applications that opt in via the new system call:
> >
> > > struct err_desc {
> > > const char *message;
> > > const char *owner;
> > > const int code;
> > > };
> >
> > > SyS_err_get_desc(struct err_desc *err_desc __user);
> >
> > [ Which could perhaps be a prctl() extension as well (PR_GET_ERR_DESC): finally
> > some truly matching functionality for prctl(). ]
> >
> > Hm?
>
> That's neat in a way, but doesn't work in general I think.

Ok, I see the netlink problem - but it would work in the perf and scheduler cases,
except for the small wart that it's not signal safe by default. (Apps could either
save/restore it themselves in their signal handlers, via PR_SET_ERR_DESC, or we
could extend the signal frame with the code.)

Having a separate syscall has two (big!) appeals:

- we wouldn't have to touch existing system calls at all.

- extended error reporting would be available for any system call that opts to
use it. (The current scheme as submitted is only available to system calls
using the perf-style flexible attribute ABI.)

Regarding netlink:

> Considering the wifi case, or more generally any netlink based
> protocol, the syscall (sendmsg) won't return an error, but a subsequent
> recvmsg() (which also won't return an error) returns an error message
> [in the sense of a protocol message, not a human readable message] to a
> buffer provided by the application.
> However, this message can be extended relatively easily to include the
> string information, but the syscall/prctl wouldn't work since the
> syscalls didn't actually fail.

Ok. So assuming we can make a 1:1 mapping between the 'extended error code'
integer space and the message:owner strings, it would be enough for netlink to
pass along the integer code itself, not the full strings?

That would simplify things and make the scheme more robust from a security POV I
suspect.

> However, it could possibly help with the namespace/module issue if you
> also store THIS_MODULE (or perhaps instead a pointer to the module's
> error table) in the task. Again not in the netlink case though, I
> think, that will always require special handling [although there it
> could be stored away in the socket or so, similar to the task]

So my hope would be that we can represent this all with a single 'large' error
code integer space. That integer would be constant and translateable (as long as
the module is loaded).

That way the error passing mechanism wouldn't have to be specifically module-aware
- during build we generate the integer space, with all possible modules
considered.

Thanks,

Ingo

2015-08-25 10:20:37

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

On Tue, 2015-08-25 at 12:07 +0200, Ingo Molnar wrote:
> Having a separate syscall has two (big!) appeals:
>
> - we wouldn't have to touch existing system calls at all.
>
> - extended error reporting would be available for any system call that opts to
> use it. (The current scheme as submitted is only available to system calls
> using the perf-style flexible attribute ABI.)

Yeah, I agree this is nice. However, more generally, I think we need to
actually think more about the module problem then since while syscalls
can't be implemented in modules (I think) they can still end up calling
into modules.

Of course a first iteration could be exactly like what Alexander
posted.

The other issue with this is namespacing - can all syscalls, and
everything that eventually gets called, really use a single error code
namespace with its 3k limit? On the one hand I'm thinking "3k strings
are so big ... we don't want more", but on the other hand all kinds of
drivers etc. might start getting annotations?

> > Considering the wifi case, or more generally any netlink based
> > protocol, the syscall (sendmsg) won't return an error, but a subsequent
> > recvmsg() (which also won't return an error) returns an error message
> > [in the sense of a protocol message, not a human readable message] to a
> > buffer provided by the application.
> > However, this message can be extended relatively easily to include the
> > string information, but the syscall/prctl wouldn't work since the
> > syscalls didn't actually fail.
>
> Ok. So assuming we can make a 1:1 mapping between the 'extended error code'
> integer space and the message:owner strings, it would be enough for netlink to
> pass along the integer code itself, not the full strings?

Considering that this would likely have to be opt-in at the netlink
level (e.g. through a flag in the request message), perhaps. I'd say
it'd still be easier for the message to carry the intended error code
(e.g. -EINVAL) and the actual message in the ACK message [where
requested]. That way, applications that actually behave depending on
the error code can far more easily be extended.

> That would simplify things and make the scheme more robust from a security POV I
> suspect.

You could also argue the other way around, in that being able to look
up (from userspace) arbitrary extended error IDs, even those that
haven't ever been used, could be an information leak of sorts.

> So my hope would be that we can represent this all with a single 'large' error
> code integer space. That integer would be constant and translateable (as long as
> the module is loaded).

Ok, I wasn't really what I was assuming. As I said above, on the one
hand I agree, but on the other I'm looking at the reality of a few
hundred (!) -EINVAL callsites in net/wireless/nl80211.c alone, so
having an overall 3k limit seems somewhat low.

> That way the error passing mechanism wouldn't have to be specifically module-aware
> - during build we generate the integer space, with all possible modules
> considered.
>

That would be no improvement for me as I work heavily with (upstream)
modules that are compiled out-of-tree, so I'm not all inclined to spend
much time on it if that ends up being the solution ;)

There are, however, are other ways it seems: for example assigning each
module an offset and taking that into account if the code was built
modular. That needs a small allocator to put the module range somewhere
into the global range, of course:

#ifdef MODULE
#define mod_ext_err(errno, string)
struct ext_err { ... } _my_err __section(errors);
return mod_offset < 0 ?
errno :
mod_ext_err_offset + &_my_err - __start_errors;
#else
...
#endif

or something like that - that leaves the possibility of allocation
failures (e.g. due to error space fragmentation) which sets the offset
to -1. The only additional wrinkle would be that a lookup would have to
walk some kind of extents tree that says which module (table) to look
at.

johannes

2015-08-26 04:49:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting


* Johannes Berg <[email protected]> wrote:

> On Tue, 2015-08-25 at 12:07 +0200, Ingo Molnar wrote:
> > Having a separate syscall has two (big!) appeals:
> >
> > - we wouldn't have to touch existing system calls at all.
> >
> > - extended error reporting would be available for any system call that opts to
> > use it. (The current scheme as submitted is only available to system calls
> > using the perf-style flexible attribute ABI.)
>
> Yeah, I agree this is nice. However, more generally, I think we need to
> actually think more about the module problem then since while syscalls
> can't be implemented in modules (I think) they can still end up calling
> into modules.
>
> Of course a first iteration could be exactly like what Alexander
> posted.
>
> The other issue with this is namespacing - can all syscalls, and
> everything that eventually gets called, really use a single error code
> namespace with its 3k limit? [...]

No, the current MAX_ERRNO is probably not big enough if this scheme is successful,
and I don't see any reason why it wouldn't be successful: I think this feature
would be the biggest usability feature added to Linux system calls and to Linux
system tooling in the last 10 years or so.

> [...] On the one hand I'm thinking "3k strings are so big ... we don't want
> more", but on the other hand all kinds of drivers etc. might start getting
> annotations?

We could extend it with some arch work. The per arch work involves making sure
there's no valid kernel address at [-MAX_ERRNO...-1].

So I wouldn't worry about it too much, let's agree on a good ABI and let's just
start using it, and if we grow out of -4K we can extend things step by step.

> > Ok. So assuming we can make a 1:1 mapping between the 'extended error code'
> > integer space and the message:owner strings, it would be enough for netlink to
> > pass along the integer code itself, not the full strings?
>
> Considering that this would likely have to be opt-in at the netlink level (e.g.
> through a flag in the request message), perhaps. I'd say it'd still be easier
> for the message to carry the intended error code (e.g. -EINVAL) and the actual
> message in the ACK message [where requested]. That way, applications that
> actually behave depending on the error code can far more easily be extended.

Ok. I think we should include the extended error code as well, in case an app
wants to pass it to some more generic library.

> > That would simplify things and make the scheme more robust from a security POV
> > I suspect.
>
> You could also argue the other way around, in that being able to look up (from
> userspace) arbitrary extended error IDs, even those that haven't ever been used,
> could be an information leak of sorts.

The fact is that kernel<->tooling error reporting sucks big time here and today,
in large part due to errno limitations, and arguing that it's somehow helping
security is the Stockholm Syndrome at its best.

> > So my hope would be that we can represent this all with a single 'large' error
> > code integer space. That integer would be constant and translateable (as long
> > as the module is loaded).
>
> Ok, I wasn't really what I was assuming. As I said above, on the one
> hand I agree, but on the other I'm looking at the reality of a few
> hundred (!) -EINVAL callsites in net/wireless/nl80211.c alone, so
> having an overall 3k limit seems somewhat low.

Agreed - but it's not a hard limit really.

> > That way the error passing mechanism wouldn't have to be specifically
> > module-aware - during build we generate the integer space, with all possible
> > modules considered.
>
> That would be no improvement for me as I work heavily with (upstream) modules
> that are compiled out-of-tree, so I'm not all inclined to spend much time on it
> if that ends up being the solution ;)

Perhaps, as long as the number allocation is dynamic and non-ABI there's no reason
why this couldn't be added later on.

Thanks,

Ingo

2015-08-26 07:02:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting


* Linus Torvalds <[email protected]> wrote:

> On Aug 25, 2015 21:49, "Ingo Molnar" <[email protected]> wrote:
> >
> > No, the current MAX_ERRNO is probably not big enough if this scheme is
> > successful, and I don't see any reason why it wouldn't be successful: I think
> > this feature would be the biggest usability feature added to Linux system
> > calls and to Linux system tooling in the last 10 years or so.
>
> Don't be silly. It's a horrible idea. People would want to internationalize the
> strings etc, and nobody would use the extended versions anyway, since nobody
> uses raw system calls.

So the prctl() suggestion would address that worry, which would make it available
essentially immediately, for any tool that cares. (And this would IMHO be a
prctl() that kind of fits the interface, it does not feelt bolted on.)

Internationalization could be done easily in a user-space library, by hash-tabling
the English strings - for anyone who cares. It could be a simple free-form
string->string translation library that gets strings added, it doesn't have to
know about any context.

> We've had this before. Some extension that is Linux-specific, and improved on
> some small detail, and never gets used, and just cause pain.

I think this time is different, especially with another interface variant we could
use that I think addresses (most of your) concerns:

> And the extended errors would be painful even in the kernel. We do compare for
> specific error values. As does user space. There is a reason those values are
> limited to a fairly small set of standard values, and system calls come with
> documentation on which errors they can return.

So my very first interface suggestion two years ago when this first came up was to
decouple the error code from the string, i.e. to allow:

return err_code(-EINVAL, "x86/perf: CPU does not support precise sampling");

... which would return -EINVAL all the way - but would side-store the error
string, for user-space that requests it. There would be no 'extended errno' space
at all, dynamic or static, just the regular errno, and an optional string for
user-space that wants to use it.

This would make error codes still tightly clustered around a handful of main
categories and there would be no change whatsoever to current error codes.

Would you be fine with such an approach?

> It may work for perf, but don't start thinking it works anywhere else

Ok, will keep it perf (and scheduler) only.

Thanks,

Ingo

2015-08-26 07:06:49

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

On Tue, 2015-08-25 at 22:07 -0700, Linus Torvalds wrote:

> > No, the current MAX_ERRNO is probably not big enough if this scheme is successful,
> > and I don't see any reason why it wouldn't be successful: I think this feature
> > would be the biggest usability feature added to Linux system calls and to Linux
> > system tooling in the last 10 years or so.
> Don't be silly. It's a horrible idea. People would want to
> internationalize the strings etc, and nobody would use the extended
> versions anyway, since nobody uses raw system calls.

That's a good point, and think that least in the netlink case it'd be
much better to say which attribute was the one that had an issue, and
that has an obvious binary encoding rather than encoding that in a
string.

Perhaps that's something I can play with separately.

johannes

2015-08-26 07:20:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting


* Johannes Berg <[email protected]> wrote:

> On Tue, 2015-08-25 at 22:07 -0700, Linus Torvalds wrote:
>
> > > No, the current MAX_ERRNO is probably not big enough if this scheme is successful,
> > > and I don't see any reason why it wouldn't be successful: I think this feature
> > > would be the biggest usability feature added to Linux system calls and to Linux
> > > system tooling in the last 10 years or so.
> > Don't be silly. It's a horrible idea. People would want to
> > internationalize the strings etc, and nobody would use the extended
> > versions anyway, since nobody uses raw system calls.
>
> That's a good point, and think that least in the netlink case it'd be much
> better to say which attribute was the one that had an issue, and that has an
> obvious binary encoding rather than encoding that in a string.

So in older discussions about this I suggested a solution for that: also returning
(in a channel separate from errnos) the byte offset to the field that caused the
error, plus a string - and leaving errnos alone.

This only matters for those (few) system calls that have a large attribute space:
perf and some of the scheduler syscalls are such.

With this scheme arbitrarily granular error handling can be implemented:

- the laziest can just use the errno like usual, which catches 90% of the apps.

- the somewhat sophisticated would print the human readable string (or a
translation thereof). Would cover another 9%. (This percentage might increase
over time, as the strings become more widely used.)

- tools with a case of obsessive-compulsive perfectionism would use the structure
offset to programmatically react to the error condition, and would use the
human-readable string to explain the precise reason. Would cover another 1% of
tools.

... but back then I didn't feel like complicating an error recovery ABI for the
needs of the 1%, robust error handling is all about simplicity: if it's not
simple, tools won't use it.

Thanks,

Ingo

2015-08-26 07:27:02

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting


* Ingo Molnar <[email protected]> wrote:

> ... but back then I didn't feel like complicating an error recovery ABI for the
> needs of the 1%, robust error handling is all about simplicity: if it's not
> simple, tools won't use it.

And note that it needs to be 'simple' in two places for usage to grow naturally:

- the usage site in the kernel
- the tooling side that recovers the information.

That's why I think that such a form:

return err_str(-EINVAL, "x86/perf: CPU does not support precise sampling");

is obviously simple on the kernel side as it returns -EINVAL, and is very simple
on the tooling side as well, if we are allowed to extend prctl().

Thanks,

Ingo

2015-08-26 07:37:03

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

On Wed, 2015-08-26 at 09:20 +0200, Ingo Molnar wrote:
> > That's a good point, and think that least in the netlink case it'd be much
> > better to say which attribute was the one that had an issue, and that has an
> > obvious binary encoding rather than encoding that in a string.
>
> So in older discussions about this I suggested a solution for that: also returning
> (in a channel separate from errnos) the byte offset to the field that caused the
> error, plus a string - and leaving errnos alone.

I was considering, at least in this case, to forgo the string entirely
- that makes it use less space in the kernel (no need for all those
strings) and completely avoids the discussion about translation etc.,
while still being entirely sufficient for the debugging I have in mind.

> This only matters for those (few) system calls that have a large attribute space:
> perf and some of the scheduler syscalls are such.

As I'm saying, netlink is that in a way as well :) Except it's not a
syscall per se, since it's layered behind a message passing interface.

> With this scheme arbitrarily granular error handling can be implemented:
>
> - the laziest can just use the errno like usual, which catches 90% of the apps.
>
> - the somewhat sophisticated would print the human readable string (or a
> translation thereof). Would cover another 9%. (This percentage might increase
> over time, as the strings become more widely used.)
>

Well, if the apps were to be extended trivially to print, in the
netlink case, the attribute with an error, that'd help debugging
significantly - not much need for a string in that case.

But I'll agree that it's a more special case than the perf case you're
looking at where you have things like "your hardware doesn't support
this" which obviously isn't really tied to some argument directly.

johannes

2015-08-26 11:37:46

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

Ingo Molnar <[email protected]> writes:

> * Ingo Molnar <[email protected]> wrote:
>
>>
>> * Johannes Berg <[email protected]> wrote:
>>
>> > On Mon, 2015-08-24 at 17:32 +0300, Alexander Shishkin wrote:
>> >
>> > > This time around, I employed a linker trick to convert the structures
>> > > containing extended error information into integers, which are then made to
>> > > look just like normal error codes so that IS_ERR_VALUE() and friends would
>> > > still work correctly on them. So no extra pointers in the struct perf_event
>> > > or anywhere else; the extended error codes are passed around like normal
>> > > error codes. They only need to be converted in syscalls' topmost return
>> > > statements. This is done in 1/6.
>> >
>> > For the record, as we discussed separately, I'd love to see this move to more
>> > general infrastructure. In wireless (nl80211), for example, we have a few
>> > hundred (!) callsites returning -EINVAL, mostly based on malformed netlink
>> > attributes, and it can be very difficult to figure out what went wrong;
>> > debugging mostly employs a variation of Hugh's trick.
>>
>> Absolutely, I suggested this as well earlier today, as the scheduler would like
>> to make use of it in syscalls with extensible ABIs, such as sched_setattr().
>>
>> If people really like this then we could go farther as well and add a standalone
>> 'extended errors system call' as well (SyS_errno_extended_get()), which would
>> allow the recovery of error strings even for system calls that are not easily
>> extensible. We could cache the last error description in the task struct.
>
> If we do that then we don't even have to introduce per system call error code
> conversion, but could unconditionally save the last extended error info in the
> task struct and continue - this could be done very cheaply with the linker trick
> driven integer ID.
>
> I.e. system calls could opt in to do:
>
> return err_str(-EBUSY, "perf/x86: BTS conflicts with active events");
>
> and the overhead of this would be minimal, we'd essentially do something like this
> to save the error:
>
> current->err_code = code;
>
> where 'code' is a build time constant in essence.

I'd propose a mixed approach here: err_str() would still return an
integer in the [-EXT_ERRNO, -MAX_ERRNO] range which would index the
err_site struct and upon returning to userspace we'd do

current->err_code = code;
return ext_errno(code); /* the traditional errno */

Reason: the lifetime of this extended error code would be exactly the
same as that of the traditional error value so that we'd always return
the most recent error and wouldn't be prone to something overwriting the
error code under us.

The problem with code checking for different types of errors has two
sides to it:
* most of those error codes that are check for shouldn't really be
annotated at all and should rather remain like they are;
* with the ones that actually do need to be checked for, the checks
would change from "if (err == EINTR)" to "if (ext_errno(err) ==
EINTR)", which doesn't seem like a big deal (with ext_errno() being a
O(1) lookup).

Side note: we should also make sure that only the userspace-visible
errors ever get annotated like that to prevent the error message creep
(which would be even a bigger problem if we go ahead to store the
extended error code in task_struct right at the topmost return
statement). Perf example: pretty much all errors that happen around
event scheduling, including stuff that pmu callbacks return, needn't and
shouldn't be annotated at all.

> We could use this even in system calls where the error path is performance
> critical, as all the string recovery and copying overhead would be triggered by
> applications that opt in via the new system call:
>
> struct err_desc {
> const char *message;
> const char *owner;
> const int code;
> };
>
> SyS_err_get_desc(struct err_desc *err_desc __user);
>
> [ Which could perhaps be a prctl() extension as well (PR_GET_ERR_DESC): finally
> some truly matching functionality for prctl(). ]
>
> Hm?

I like this.

Regards,
--
Alex

2015-08-26 16:59:01

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

Ingo Molnar <[email protected]> writes:

> * Ingo Molnar <[email protected]> wrote:
>
>> ... but back then I didn't feel like complicating an error recovery ABI for the
>> needs of the 1%, robust error handling is all about simplicity: if it's not
>> simple, tools won't use it.
>
> And note that it needs to be 'simple' in two places for usage to grow naturally:
>
> - the usage site in the kernel
> - the tooling side that recovers the information.
>
> That's why I think that such a form:
>
> return err_str(-EINVAL, "x86/perf: CPU does not support precise sampling");
>
> is obviously simple on the kernel side as it returns -EINVAL, and is very simple
> on the tooling side as well, if we are allowed to extend prctl().

So I hacked stuff a bit [1] to accomodate some of the above
ideas. The below diff shows how these ideas integrate with perf. The
rest is in my github tree.

- this exterr implementation allows its users to add arbitrary
information to the call site structures and also pretty print them on
the way out; in the example, perf stores perf_event_attr field name
that is the source of trouble; it's a string rather than offsetof(),
because half of our event attribute is a bit field;
- the "way out" doesn't have to be syscall return path (although in the
example it is);
- userspace can fetch the extended error reports via prctl() like you
suggested above;
- error codes are still passed around in the [-EXT_ERRNO..-MAX_ERRNO]
range until they are passed to userspace (which is where ext_err_code()
converts them back to traditional errno.h values).

# perf record -e branches -c1 ls
kernel says (0/95): {
"file": "/home/ash/work/linux/arch/x86/kernel/cpu/perf_event.c",
"line": 432,
"code": -95,
"module": "perf/x86",
"message": "BTS sampling not allowed for kernel space"
, "attr_field": "exclude_kernel"
}

Error:
No hardware sampling interrupt available.
No APIC? If so then you can boot the kernel with the "lapic" boot parameter to force-enable it.
#

[1] https://github.com/virtuoso/linux-perf/commits/exterr

>From 1338fe417223cc1d8bc7588d95c6a913993d966f Mon Sep 17 00:00:00 2001
From: Alexander Shishkin <[email protected]>
Date: Wed, 26 Aug 2015 18:36:14 +0300
Subject: [PATCH] perf: Use extended syscall error reporting somewhat

Signed-off-by: Alexander Shishkin <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 5 ++++-
include/linux/perf_event.h | 14 ++++++++++++++
kernel/events/core.c | 17 ++++++++++++++++-
3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index f56cf074d0..5e8f2edb2c 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -12,6 +12,8 @@
* For licencing details see kernel-base/COPYING
*/

+#define EXTERR_MODNAME "perf/x86"
+
#include <linux/perf_event.h>
#include <linux/capability.h>
#include <linux/notifier.h>
@@ -426,7 +428,8 @@ int x86_setup_perfctr(struct perf_event *event)

/* BTS is currently only allowed for user-mode. */
if (!attr->exclude_kernel)
- return -EOPNOTSUPP;
+ return perf_err(-EOPNOTSUPP, exclude_kernel,
+ "BTS sampling not allowed for kernel space");

/* disallow bts if conflicting events are present */
if (x86_add_exclusive(x86_lbr_exclusive_lbr))
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2027809433..eb63074012 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -16,6 +16,20 @@

#include <uapi/linux/perf_event.h>

+#include <linux/exterr.h>
+
+struct perf_ext_err_site {
+ struct ext_err_site site;
+ const char *attr_field;
+};
+
+#define perf_err(__c, __a, __m) \
+ ({ /* make sure it's a real field before stringifying it */ \
+ struct perf_event_attr __x; (void)__x.__a; \
+ ext_err(perf, __c, __m, \
+ .attr_field = __stringify(__a)); \
+ })
+
/*
* Kernel-internal data types and definitions:
*/
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ae16867670..5523c623c4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9,6 +9,8 @@
* For licensing details see kernel-base/COPYING
*/

+#define EXTERR_MODNAME "perf"
+
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/cpu.h>
@@ -44,11 +46,24 @@
#include <linux/compat.h>
#include <linux/bpf.h>
#include <linux/filter.h>
+#include <linux/exterr.h>

#include "internal.h"

#include <asm/irq_regs.h>

+static char *perf_exterr_format(void *site)
+{
+ struct perf_ext_err_site *psite = site;
+ char *output;
+
+ output = kasprintf(GFP_KERNEL, ",\t\"attr_field\": \"%s\"\n",
+ psite->attr_field);
+ return output;
+}
+
+DECLARE_EXTERR_DOMAIN(perf, perf_exterr_format);
+
static struct workqueue_struct *perf_wq;

typedef int (*remote_function_f)(void *);
@@ -8352,7 +8367,7 @@ err_group_fd:
fdput(group);
err_fd:
put_unused_fd(event_fd);
- return err;
+ return ext_err_errno(err);
}

/**
--
2.5.0

2015-08-26 18:41:14

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

On Wed, 26 Aug 2015 09:26:56 +0200 Ingo Molnar <[email protected]> wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> > ... but back then I didn't feel like complicating an error recovery ABI for the
> > needs of the 1%, robust error handling is all about simplicity: if it's not
> > simple, tools won't use it.
>
> And note that it needs to be 'simple' in two places for usage to grow naturally:
>
> - the usage site in the kernel
> - the tooling side that recovers the information.
>
> That's why I think that such a form:
>
> return err_str(-EINVAL, "x86/perf: CPU does not support precise sampling");
>
> is obviously simple on the kernel side as it returns -EINVAL, and is very simple
> on the tooling side as well, if we are allowed to extend prctl().
>

Is this whole thing overkill? As far as I can see, the problem which is
being addressed only occurs in a couple of places (perf, wifi netlink
handling) and could be addressed with some local pr_debug statements. ie,

#define err_str(e, s) ({
if (debugging)
pr_debug("%s:%d: error %d (%s)", __FILE__, __LINE__, e, s);
e;
})

(And I suppose that if this is later deemed inadequate, err_str() could
be made more fancy).


IOW, do we really need some grand kernel-wide infrastructural thing to
adequately address this problem?

2015-08-26 20:05:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

On Wed, Aug 26, 2015 at 11:41:11AM -0700, Andrew Morton wrote:
> On Wed, 26 Aug 2015 09:26:56 +0200 Ingo Molnar <[email protected]> wrote:
>
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > ... but back then I didn't feel like complicating an error recovery ABI for the
> > > needs of the 1%, robust error handling is all about simplicity: if it's not
> > > simple, tools won't use it.
> >
> > And note that it needs to be 'simple' in two places for usage to grow naturally:
> >
> > - the usage site in the kernel
> > - the tooling side that recovers the information.
> >
> > That's why I think that such a form:
> >
> > return err_str(-EINVAL, "x86/perf: CPU does not support precise sampling");
> >
> > is obviously simple on the kernel side as it returns -EINVAL, and is very simple
> > on the tooling side as well, if we are allowed to extend prctl().
> >
>
> Is this whole thing overkill? As far as I can see, the problem which is
> being addressed only occurs in a couple of places (perf, wifi netlink
> handling) and could be addressed with some local pr_debug statements. ie,
>
> #define err_str(e, s) ({
> if (debugging)
> pr_debug("%s:%d: error %d (%s)", __FILE__, __LINE__, e, s);
> e;
> })
>
> (And I suppose that if this is later deemed inadequate, err_str() could
> be made more fancy).

Not really. That is something that's limited to root. Whereas the
problem is very much wider than that.

If you set one bit wrong in the pretty large perf_event_attr you've got
a fair chance of getting -EINVAL on trying to create the event. Good
luck finding what you did wrong.

Any user can create events (for their own tasks), this does not require
root.

Allowing users to flip your @debugging flag would be an insta DoS.

Furthermore, its very unfriendly in that you have to (manually) go
correlate random dmesg output with some program action.

2015-08-26 20:22:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

On Wed, 26 Aug 2015 22:05:13 +0200 Peter Zijlstra <[email protected]> wrote:

> > Is this whole thing overkill? As far as I can see, the problem which is
> > being addressed only occurs in a couple of places (perf, wifi netlink
> > handling) and could be addressed with some local pr_debug statements. ie,
> >
> > #define err_str(e, s) ({
> > if (debugging)
> > pr_debug("%s:%d: error %d (%s)", __FILE__, __LINE__, e, s);
> > e;
> > })
> >
> > (And I suppose that if this is later deemed inadequate, err_str() could
> > be made more fancy).
>
> Not really. That is something that's limited to root. Whereas the
> problem is very much wider than that.
>
> If you set one bit wrong in the pretty large perf_event_attr you've got
> a fair chance of getting -EINVAL on trying to create the event. Good
> luck finding what you did wrong.
>
> Any user can create events (for their own tasks), this does not require
> root.
>
> Allowing users to flip your @debugging flag would be an insta DoS.
>
> Furthermore, its very unfriendly in that you have to (manually) go
> correlate random dmesg output with some program action.

It depends on who the audience is. If it's developers who are writing
userspace perf tooling then all the above won't be an issue. If it's
aimed at end users of that tooling then yes.

IOW, we're in the usual situation of discussing implementation before
anyone has explained the requirements.


Also... we're talking only of perf, so perhaps some perf-specific
reporting scheme would be better, rather than a kernel-wide thing.

2015-08-26 20:56:07

by Vince Weaver

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

On Wed, 26 Aug 2015, Andrew Morton wrote:

> On Wed, 26 Aug 2015 22:05:13 +0200 Peter Zijlstra <[email protected]> wrote:
>
> > > Is this whole thing overkill? As far as I can see, the problem which is
> > > being addressed only occurs in a couple of places (perf, wifi netlink
> > > handling) and could be addressed with some local pr_debug statements. ie,
> > >
> > > #define err_str(e, s) ({
> > > if (debugging)
> > > pr_debug("%s:%d: error %d (%s)", __FILE__, __LINE__, e, s);
> > > e;
> > > })
> > >
> > > (And I suppose that if this is later deemed inadequate, err_str() could
> > > be made more fancy).
> >
> > Not really. That is something that's limited to root. Whereas the
> > problem is very much wider than that.
> >
> > If you set one bit wrong in the pretty large perf_event_attr you've got
> > a fair chance of getting -EINVAL on trying to create the event. Good
> > luck finding what you did wrong.
> >
> > Any user can create events (for their own tasks), this does not require
> > root.
> >
> > Allowing users to flip your @debugging flag would be an insta DoS.
> >
> > Furthermore, its very unfriendly in that you have to (manually) go
> > correlate random dmesg output with some program action.
>
> It depends on who the audience is. If it's developers who are writing
> userspace perf tooling then all the above won't be an issue. If it's
> aimed at end users of that tooling then yes.

As a developer of tools that use the perf_event interface directly (PAPI
and such) I can say this is a common problem (getting unexplained EINVAL
results) and yes, telling the user to recompile their kernel to enable
debugging is usually not an option.

I often have to resort to sprinkling the kernel with printks to find the
source of errors, which is a pain. It's even more fun when the user's
setup is slightly different enough that I can't reproduce the issue on a
local machine, which happens often (due to different kernels, distros
backporting perf fixes, different hardware, different security settings,
etc).

Vince

2015-08-26 20:56:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

On Wed, 26 Aug 2015 16:50:33 -0400 (EDT) Vince Weaver <[email protected]> wrote:

> On Wed, 26 Aug 2015, Andrew Morton wrote:
>
> > On Wed, 26 Aug 2015 22:05:13 +0200 Peter Zijlstra <[email protected]> wrote:
> >
> > > > Is this whole thing overkill? As far as I can see, the problem which is
> > > > being addressed only occurs in a couple of places (perf, wifi netlink
> > > > handling) and could be addressed with some local pr_debug statements. ie,
> > > >
> > > > #define err_str(e, s) ({
> > > > if (debugging)
> > > > pr_debug("%s:%d: error %d (%s)", __FILE__, __LINE__, e, s);
> > > > e;
> > > > })
> > > >
> > > > (And I suppose that if this is later deemed inadequate, err_str() could
> > > > be made more fancy).
> > >
> > > Not really. That is something that's limited to root. Whereas the
> > > problem is very much wider than that.
> > >
> > > If you set one bit wrong in the pretty large perf_event_attr you've got
> > > a fair chance of getting -EINVAL on trying to create the event. Good
> > > luck finding what you did wrong.
> > >
> > > Any user can create events (for their own tasks), this does not require
> > > root.
> > >
> > > Allowing users to flip your @debugging flag would be an insta DoS.
> > >
> > > Furthermore, its very unfriendly in that you have to (manually) go
> > > correlate random dmesg output with some program action.
> >
> > It depends on who the audience is. If it's developers who are writing
> > userspace perf tooling then all the above won't be an issue. If it's
> > aimed at end users of that tooling then yes.
>
> As a developer of tools that use the perf_event interface directly (PAPI
> and such) I can say this is a common problem (getting unexplained EINVAL
> results) and yes, telling the user to recompile their kernel to enable
> debugging is usually not an option.

Users wouldn't need to recompile.

> I often have to resort to sprinkling the kernel with printks to find the
> source of errors, which is a pain. It's even more fun when the user's
> setup is slightly different enough that I can't reproduce the issue on a
> local machine, which happens often (due to different kernels, distros
> backporting perf fixes, different hardware, different security settings,
> etc).

Suppose you were to tell them "please do `echo 1 > /proc/whatever' then
send me the kernel logs". Would this be good enough?

2015-08-26 20:58:30

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

Em Wed, Aug 26, 2015 at 07:56:47PM +0300, Alexander Shishkin escreveu:
> Ingo Molnar <[email protected]> writes:
>
> > * Ingo Molnar <[email protected]> wrote:
> >
> >> ... but back then I didn't feel like complicating an error recovery ABI for the
> >> needs of the 1%, robust error handling is all about simplicity: if it's not
> >> simple, tools won't use it.
> >
> > And note that it needs to be 'simple' in two places for usage to grow naturally:
> >
> > - the usage site in the kernel
> > - the tooling side that recovers the information.
> >
> > That's why I think that such a form:
> >
> > return err_str(-EINVAL, "x86/perf: CPU does not support precise sampling");
> >
> > is obviously simple on the kernel side as it returns -EINVAL, and is very simple
> > on the tooling side as well, if we are allowed to extend prctl().
>
> So I hacked stuff a bit [1] to accomodate some of the above
> ideas. The below diff shows how these ideas integrate with perf. The
> rest is in my github tree.
>
> - this exterr implementation allows its users to add arbitrary
> information to the call site structures and also pretty print them on
> the way out; in the example, perf stores perf_event_attr field name
> that is the source of trouble; it's a string rather than offsetof(),
> because half of our event attribute is a bit field;
> - the "way out" doesn't have to be syscall return path (although in the
> example it is);
> - userspace can fetch the extended error reports via prctl() like you
> suggested above;
> - error codes are still passed around in the [-EXT_ERRNO..-MAX_ERRNO]
> range until they are passed to userspace (which is where ext_err_code()
> converts them back to traditional errno.h values).

Hey, can we see the builtin-record.c patch please?

- Arnaldo

> # perf record -e branches -c1 ls
> kernel says (0/95): {
> "file": "/home/ash/work/linux/arch/x86/kernel/cpu/perf_event.c",
> "line": 432,
> "code": -95,
> "module": "perf/x86",
> "message": "BTS sampling not allowed for kernel space"
> , "attr_field": "exclude_kernel"
> }
>
> Error:
> No hardware sampling interrupt available.
> No APIC? If so then you can boot the kernel with the "lapic" boot parameter to force-enable it.
> #
>
> [1] https://github.com/virtuoso/linux-perf/commits/exterr
>
> >From 1338fe417223cc1d8bc7588d95c6a913993d966f Mon Sep 17 00:00:00 2001
> From: Alexander Shishkin <[email protected]>
> Date: Wed, 26 Aug 2015 18:36:14 +0300
> Subject: [PATCH] perf: Use extended syscall error reporting somewhat
>
> Signed-off-by: Alexander Shishkin <[email protected]>
> ---
> arch/x86/kernel/cpu/perf_event.c | 5 ++++-
> include/linux/perf_event.h | 14 ++++++++++++++
> kernel/events/core.c | 17 ++++++++++++++++-
> 3 files changed, 34 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index f56cf074d0..5e8f2edb2c 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -12,6 +12,8 @@
> * For licencing details see kernel-base/COPYING
> */
>
> +#define EXTERR_MODNAME "perf/x86"
> +
> #include <linux/perf_event.h>
> #include <linux/capability.h>
> #include <linux/notifier.h>
> @@ -426,7 +428,8 @@ int x86_setup_perfctr(struct perf_event *event)
>
> /* BTS is currently only allowed for user-mode. */
> if (!attr->exclude_kernel)
> - return -EOPNOTSUPP;
> + return perf_err(-EOPNOTSUPP, exclude_kernel,
> + "BTS sampling not allowed for kernel space");
>
> /* disallow bts if conflicting events are present */
> if (x86_add_exclusive(x86_lbr_exclusive_lbr))
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 2027809433..eb63074012 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -16,6 +16,20 @@
>
> #include <uapi/linux/perf_event.h>
>
> +#include <linux/exterr.h>
> +
> +struct perf_ext_err_site {
> + struct ext_err_site site;
> + const char *attr_field;
> +};
> +
> +#define perf_err(__c, __a, __m) \
> + ({ /* make sure it's a real field before stringifying it */ \
> + struct perf_event_attr __x; (void)__x.__a; \
> + ext_err(perf, __c, __m, \
> + .attr_field = __stringify(__a)); \
> + })
> +
> /*
> * Kernel-internal data types and definitions:
> */
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index ae16867670..5523c623c4 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -9,6 +9,8 @@
> * For licensing details see kernel-base/COPYING
> */
>
> +#define EXTERR_MODNAME "perf"
> +
> #include <linux/fs.h>
> #include <linux/mm.h>
> #include <linux/cpu.h>
> @@ -44,11 +46,24 @@
> #include <linux/compat.h>
> #include <linux/bpf.h>
> #include <linux/filter.h>
> +#include <linux/exterr.h>
>
> #include "internal.h"
>
> #include <asm/irq_regs.h>
>
> +static char *perf_exterr_format(void *site)
> +{
> + struct perf_ext_err_site *psite = site;
> + char *output;
> +
> + output = kasprintf(GFP_KERNEL, ",\t\"attr_field\": \"%s\"\n",
> + psite->attr_field);
> + return output;
> +}
> +
> +DECLARE_EXTERR_DOMAIN(perf, perf_exterr_format);
> +
> static struct workqueue_struct *perf_wq;
>
> typedef int (*remote_function_f)(void *);
> @@ -8352,7 +8367,7 @@ err_group_fd:
> fdput(group);
> err_fd:
> put_unused_fd(event_fd);
> - return err;
> + return ext_err_errno(err);
> }
>
> /**
> --
> 2.5.0

2015-08-26 21:05:36

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

Em Wed, Aug 26, 2015 at 11:41:11AM -0700, Andrew Morton escreveu:
> On Wed, 26 Aug 2015 09:26:56 +0200 Ingo Molnar <[email protected]> wrote:
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > ... but back then I didn't feel like complicating an error recovery ABI for the
> > > needs of the 1%, robust error handling is all about simplicity: if it's not
> > > simple, tools won't use it.
> >
> > And note that it needs to be 'simple' in two places for usage to grow naturally:
> >
> > - the usage site in the kernel
> > - the tooling side that recovers the information.
> >
> > That's why I think that such a form:
> >
> > return err_str(-EINVAL, "x86/perf: CPU does not support precise sampling");
> >
> > is obviously simple on the kernel side as it returns -EINVAL, and is very simple
> > on the tooling side as well, if we are allowed to extend prctl().
>
> Is this whole thing overkill? As far as I can see, the problem which is
> being addressed only occurs in a couple of places (perf, wifi netlink
> handling) and could be addressed with some local pr_debug statements. ie,
>
> #define err_str(e, s) ({
> if (debugging)
> pr_debug("%s:%d: error %d (%s)", __FILE__, __LINE__, e, s);
> e;
> })
>
> (And I suppose that if this is later deemed inadequate, err_str() could
> be made more fancy).
>
> IOW, do we really need some grand kernel-wide infrastructural thing to
> adequately address this problem?

For perf tooling we already ask the user to look at dmesg sometimes, but
that is very ugly and fragile, for instance multiple users ask for some
complex combination of features (look at the ones already marked by
Alexander), and there are tons of combos possible.

Mapping back from -EINVAL or -ENOTSUPP to some sensible message that
helps the user to make sense of _what_ is the problem is difficult for
developers, let alone for users.

I'd say lets try this with perf and leave the rest of the world alone,
if the experience proves fruitful, it will be left as an example for
others, hopefully a good one :-)

- Arnaldo

2015-08-26 21:12:35

by Vince Weaver

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting

On Wed, 26 Aug 2015, Andrew Morton wrote:

> On Wed, 26 Aug 2015 16:50:33 -0400 (EDT) Vince Weaver <[email protected]> wrote:
> > I often have to resort to sprinkling the kernel with printks to find the
> > source of errors, which is a pain. It's even more fun when the user's
> > setup is slightly different enough that I can't reproduce the issue on a
> > local machine, which happens often (due to different kernels, distros
> > backporting perf fixes, different hardware, different security settings,
> > etc).
>
> Suppose you were to tell them "please do `echo 1 > /proc/whatever' then
> send me the kernel logs". Would this be good enough?

would /proc/whatever require CAP_SYS_ADMIN?

If so, then no, probably not good enough. Many of the users are trying to
run things on large computing clusters, etc, and won't have root
permissions. They quite possibly won't have access to the syslog either.

I realize that the userbase affected by this is very tiny compared to the
amount of bloat introduced to fix it.

It would have been easier if event validation were done in userspace (ala
perfmon2) rather than having everything in the kernel like perf_event
does, but too late for that. Although at some point once you start
re-using the limited number of error return codes more than once things
can get confusing very quickly.

Vince


2015-08-28 10:07:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] perf: Introduce extended syscall error reporting


* Andrew Morton <[email protected]> wrote:

> On Wed, 26 Aug 2015 22:05:13 +0200 Peter Zijlstra <[email protected]> wrote:
>
> > > Is this whole thing overkill? As far as I can see, the problem which is
> > > being addressed only occurs in a couple of places (perf, wifi netlink
> > > handling) and could be addressed with some local pr_debug statements. ie,
> > >
> > > #define err_str(e, s) ({
> > > if (debugging)
> > > pr_debug("%s:%d: error %d (%s)", __FILE__, __LINE__, e, s);
> > > e;
> > > })
> > >
> > > (And I suppose that if this is later deemed inadequate, err_str() could
> > > be made more fancy).
> >
> > Not really. That is something that's limited to root. Whereas the
> > problem is very much wider than that.
> >
> > If you set one bit wrong in the pretty large perf_event_attr you've got
> > a fair chance of getting -EINVAL on trying to create the event. Good
> > luck finding what you did wrong.
> >
> > Any user can create events (for their own tasks), this does not require
> > root.
> >
> > Allowing users to flip your @debugging flag would be an insta DoS.
> >
> > Furthermore, its very unfriendly in that you have to (manually) go
> > correlate random dmesg output with some program action.
>
> It depends on who the audience is. If it's developers who are writing userspace
> perf tooling then all the above won't be an issue. If it's aimed at end users
> of that tooling then yes.
>
> IOW, we're in the usual situation of discussing implementation before anyone has
> explained the requirements.

So the perf background was well understood by most people involved, it just didn't
survive into the 0/N description:

The problem is that we have a complex attribute structure with dozens of user
triggerable (and often hardware dependent) failure scenarios all returning one of
-EINVAL or -ENOTSUPP. Likewise there's a similarly complex scheduler attribute
structure handled by SyS_sched_setattr() with 10+ failure modes.

So since the kernel actually knows exactly what the failure was, and we lose that
information due to errno clustering, we thought it brilliant idea to try to be
helpful to human users of the tooling and to attempt to preserve this
information - to make Linux tooling a bit less passive-aggressive than it is
today. (Or at least those parts of tooling that we are writing!)

The other option would be to replicate all the failure analysis in user-space -
which sucks and which it cannot even do in some important cases.

The third option is to maintain the status quo: let Linux tooling continue to suck
wrt. failure analysis.

> Also... we're talking only of perf, so perhaps some perf-specific reporting
> scheme would be better, rather than a kernel-wide thing.

That was indeed the starting point, at which point scheduler syscalls came up, and
potentially other places in the kernel were mentioned, so we thought we'd try to
be more generally useful.

To address the inevitable "why did you code this up in a perf-specific way??!"
complaints and such ;-)

Thanks,

Ingo

2015-08-31 18:47:21

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/6] perf: Introduce extended syscall error reporting

On Mon, Aug 24, 2015 at 5:32 PM, Alexander Shishkin
<[email protected]> wrote:

One small comment below.

> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -49,6 +49,79 @@
>
> #include <asm/irq_regs.h>
>
> +static bool extended_reporting_enabled(struct perf_event_attr *attr)
> +{
> + if (attr->size >= PERF_ATTR_SIZE_VER6 &&
> + attr->perf_err_size > 0)
> + return true;
> +
> + return false;
> +}
> +
> +/*
> + * Provide a JSON formatted error report to the user if they asked for it.
> + */
> +static void perf_error_report_site(struct perf_event_attr *attr,
> + const struct perf_err_site *site)
> +{
> + unsigned long len;
> + char *buffer;
> +
> + if (!site || !extended_reporting_enabled(attr))
> + return;
> +
> + /* in case of nested perf_err()s, which you shouldn't really do */
> + while (site->code <= -PERF_ERRNO)
> + site = perf_errno_to_site(site->code);
> +
> + buffer = kasprintf(GFP_KERNEL,
> + "{\n"
> + "\t\"code\": %d,\n"
> + "\t\"module\": \"%s\",\n"
> + "\t\"message\": \"%s\"\n"
> + "}\n",
> + site->code, site->owner, site->message
> + );
> + if (!buffer)
> + return;
> +
> + /* trim the buffer to the supplied boundary */
> + len = strlen(buffer);
> + if (len >= attr->perf_err_size) {
> + len = attr->perf_err_size - 1;
> + buffer[len] = 0;
> + }

len = strnlen(buffer, attr->perf_err_size);
buffer[len] = 0;

And perhaps perf_err_size has to be length (perf_err_len) ?

> +
> + if (copy_to_user((void __user *)attr->perf_err, buffer, len + 1)) {
> + /* if we failed to copy once, don't bother later */
> + attr->perf_err_size = 0;

Kaboom next time on buffer[-1] = 0; since len >= 0?

> + }
> +
> + kfree(buffer);
> +}

--
With Best Regards,
Andy Shevchenko