2008-11-26 08:49:20

by Stephane Eranian

[permalink] [raw]
Subject: [patch 20/24] perfmon: system calls interface

This patch adds the top level perfmon system calls.

Signed-off-by: Stephane Eranian <[email protected]>
--

Index: o3/perfmon/perfmon_syscalls.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ o3/perfmon/perfmon_syscalls.c 2008-11-25 17:55:04.000000000 +0100
@@ -0,0 +1,741 @@
+/*
+ * perfmon_syscalls.c: perfmon2 system call interface
+ *
+ * This file implements the perfmon2 interface which
+ * provides access to the hardware performance counters
+ * of the host processor.
+ *
+ * The initial version of perfmon.c was written by
+ * Ganesh Venkitachalam, IBM Corp.
+ *
+ * Then it was modified for perfmon-1.x by Stephane Eranian and
+ * David Mosberger, Hewlett Packard Co.
+ *
+ * Version Perfmon-2.x is a complete rewrite of perfmon-1.x
+ * by Stephane Eranian, Hewlett Packard Co.
+ *
+ * Copyright (c) 1999-2006 Hewlett-Packard Development Company, L.P.
+ * Contributed by Stephane Eranian <[email protected]>
+ * David Mosberger-Tang <[email protected]>
+ *
+ * More information about perfmon available at:
+ * http://perfmon2.sf.net
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
+ * 02111-1307 USA
+ */
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/ptrace.h>
+#include <linux/perfmon_kern.h>
+#include <linux/uaccess.h>
+#include "perfmon_priv.h"
+
+/*
+ * Context locking rules:
+ * ---------------------
+ * - any thread with access to the file descriptor of a context can
+ * potentially issue perfmon calls
+ *
+ * - calls must be serialized to guarantee correctness
+ *
+ * - as soon as a context is attached to a thread or CPU, it may be
+ * actively monitoring. On some architectures, such as IA-64, this
+ * is true even though the pfm_start() call has not been made. This
+ * comes from the fact that on some architectures, it is possible to
+ * start/stop monitoring from userland.
+ *
+ * - If monitoring is active, then there can PMU interrupts. Because
+ * context accesses must be serialized, the perfmon system calls
+ * must mask interrupts as soon as the context is attached.
+ *
+ * - perfmon system calls that operate with the context unloaded cannot
+ * assume it is actually unloaded when they are called. They first need
+ * to check and for that they need interrupts masked. Then, if the
+ * context is actually unloaded, they can unmask interrupts.
+ *
+ * - interrupt masking holds true for other internal perfmon functions as
+ * well. Except for PMU interrupt handler because those interrupts
+ * cannot be nested.
+ *
+ * - we mask ALL interrupts instead of just the PMU interrupt because we
+ * also need to protect against timer interrupts which could trigger
+ * a set switch.
+ */
+
+struct pfm_syscall_cookie {
+ struct file *filp;
+ int fput_needed;
+};
+
+/*
+ * cannot attach if :
+ * - kernel task
+ * - task not owned by caller (checked by ptrace_may_attach())
+ * - task is dead or zombie
+ * - cannot use blocking notification when self-monitoring
+ */
+static int pfm_task_incompatible(struct pfm_context *ctx,
+ struct task_struct *task)
+{
+ /*
+ * cannot attach to a kernel thread
+ */
+ if (!task->mm) {
+ PFM_DBG("cannot attach to kernel thread [%d]", task->pid);
+ return -EPERM;
+ }
+
+ /*
+ * cannot attach to a zombie task
+ */
+ if (task->exit_state == EXIT_ZOMBIE || task->exit_state == EXIT_DEAD) {
+ PFM_DBG("cannot attach to zombie/dead task [%d]", task->pid);
+ return -EBUSY;
+ }
+ return 0;
+}
+
+/**
+ * pfm_get_task -- check permission and acquire task to monitor
+ * @ctx: perfmon context
+ * @pid: identification of the task to check
+ * @task: upon return, a pointer to the task to monitor
+ *
+ * This function is used in per-thread mode only AND when not
+ * self-monitoring. It finds the task to monitor and checks
+ * that the caller has permissions to attach. It also checks
+ * that the task is stopped via ptrace so that we can safely
+ * modify its state.
+ *
+ * task refcount is incremented when succesful.
+ */
+static int pfm_get_task(struct pfm_context *ctx, pid_t pid,
+ struct task_struct **task)
+{
+ struct task_struct *p;
+ int ret = 0, ret1 = 0;
+
+ /*
+ * When attaching to another thread we must ensure
+ * that the thread is actually stopped. Just like with
+ * perfmon system calls, we enforce that the thread
+ * be ptraced and STOPPED by using ptrace_check_attach().
+ *
+ * As a consequence, only the ptracing parent can actually
+ * attach a context to a thread. Obviously, this constraint
+ * does not exist for self-monitoring threads.
+ *
+ * We use ptrace_may_access() to check for permission.
+ */
+ read_lock(&tasklist_lock);
+
+ p = find_task_by_vpid(pid);
+ if (p)
+ get_task_struct(p);
+
+ read_unlock(&tasklist_lock);
+
+ if (!p) {
+ PFM_DBG("task not found %d", pid);
+ return -ESRCH;
+ }
+
+ ret = -EPERM;
+
+ /*
+ * returns 0 if cannot attach
+ */
+ ret1 = ptrace_may_access(p, PTRACE_MODE_ATTACH);
+ if (ret1)
+ ret = ptrace_check_attach(p, 0);
+
+ PFM_DBG("may_attach=%d check_attach=%d", ret1, ret);
+
+ if (ret || !ret1)
+ goto error;
+
+ ret = pfm_task_incompatible(ctx, p);
+ if (ret)
+ goto error;
+
+ *task = p;
+
+ return 0;
+error:
+ if (!(ret1 || ret))
+ ret = -EPERM;
+
+ put_task_struct(p);
+
+ return ret;
+}
+
+/*
+ * context must be locked when calling this function
+ */
+int __pfm_check_task_state(struct pfm_context *ctx, int check_mask,
+ unsigned long *flags)
+{
+ struct task_struct *task;
+ unsigned long local_flags, new_flags;
+ int state, ret;
+
+recheck:
+ /*
+ * task is NULL for system-wide context
+ */
+ task = ctx->task;
+ state = ctx->state;
+ local_flags = *flags;
+
+ PFM_DBG("state=%d check_mask=0x%x task=[%d]",
+ state, check_mask, task ? task->pid:-1);
+ /*
+ * if the context is detached, then we do not touch
+ * hardware, therefore there is not restriction on when we can
+ * access it.
+ */
+ if (state == PFM_CTX_UNLOADED)
+ return 0;
+ /*
+ * no command can operate on a zombie context.
+ * A context becomes zombie when the file that identifies
+ * it is closed while the context is still attached to the
+ * thread it monitors.
+ */
+ if (state == PFM_CTX_ZOMBIE)
+ return -EINVAL;
+
+ /*
+ * at this point, state is PFM_CTX_LOADED
+ */
+
+ /*
+ * some commands require the context to be unloaded to operate
+ */
+ if (check_mask & PFM_CMD_UNLOADED) {
+ PFM_DBG("state=%d, cmd needs context unloaded", state);
+ return -EBUSY;
+ }
+
+ /*
+ * self-monitoring always ok.
+ */
+ if (task == current)
+ return 0;
+
+ /*
+ * at this point, monitoring another thread
+ */
+
+ /*
+ * When we operate on another thread, we must wait for it to be
+ * stopped and completely off any CPU as we need to access the
+ * PMU state (or machine state).
+ *
+ * A thread can be put in the STOPPED state in various ways
+ * including PTRACE_ATTACH, or when it receives a SIGSTOP signal.
+ * We enforce that the thread must be ptraced, so it is stopped
+ * AND it CANNOT wake up while we operate on it because this
+ * would require an action from the ptracing parent which is the
+ * thread that is calling this function.
+ *
+ * The dependency on ptrace, imposes that only the ptracing
+ * parent can issue command on a thread. This is unfortunate
+ * but we do not know of a better way of doing this.
+ */
+ if (check_mask & PFM_CMD_STOPPED) {
+
+ spin_unlock_irqrestore(&ctx->lock, local_flags);
+
+ /*
+ * check that the thread is ptraced AND STOPPED
+ */
+ ret = ptrace_check_attach(task, 0);
+
+ spin_lock_irqsave(&ctx->lock, new_flags);
+
+ /*
+ * flags may be different than when we released the lock
+ */
+ *flags = new_flags;
+
+ if (ret)
+ return ret;
+ /*
+ * we must recheck to verify if state has changed
+ */
+ if (unlikely(ctx->state != state)) {
+ PFM_DBG("old_state=%d new_state=%d",
+ state,
+ ctx->state);
+ goto recheck;
+ }
+ }
+ return 0;
+}
+
+int pfm_check_task_state(struct pfm_context *ctx, int check_mask,
+ unsigned long *flags)
+{
+ int ret;
+ ret = __pfm_check_task_state(ctx, check_mask, flags);
+ PFM_DBG("ret=%d",ret);
+ return ret;
+}
+
+/**
+ * pfm_get_args - Function used to copy the syscall argument into kernel memory
+ * @ureq: user argument
+ * @sz: user argument size
+ * @lsz: size of stack buffer
+ * @laddr: stack buffer address
+ * @req: point to start of kernel copy of the argument
+ * @ptr_free: address of kernel copy to free
+ *
+ * There are two options:
+ * - use a stack buffer described by laddr (addresses) and lsz (size)
+ * - allocate memory
+ *
+ * return:
+ * < 0 : in case of error (ptr_free may not be updated)
+ * 0 : success
+ * - req: points to base of kernel copy of arguments
+ * - ptr_free: address of buffer to free by caller on exit.
+ * NULL if using the stack buffer
+ *
+ * when ptr_free is not NULL upon return, the caller must kfree()
+ */
+int pfm_get_args(void __user *ureq, size_t sz, size_t lsz, void *laddr,
+ void **req, void **ptr_free)
+{
+ void *addr;
+
+ /*
+ * check syadmin argument limit
+ */
+ if (unlikely(sz > pfm_controls.arg_mem_max)) {
+ PFM_DBG("argument too big %zu max=%zu",
+ sz,
+ pfm_controls.arg_mem_max);
+ return -E2BIG;
+ }
+
+ /*
+ * check if vector fits on stack buffer
+ */
+ if (sz > lsz) {
+ addr = kmalloc(sz, GFP_KERNEL);
+ if (unlikely(addr == NULL))
+ return -ENOMEM;
+ *ptr_free = addr;
+ } else {
+ addr = laddr;
+ *req = laddr;
+ *ptr_free = NULL;
+ }
+
+ /*
+ * bring the data in
+ */
+ if (unlikely(copy_from_user(addr, ureq, sz))) {
+ if (addr != laddr)
+ kfree(addr);
+ return -EFAULT;
+ }
+
+ /*
+ * base address of kernel buffer
+ */
+ *req = addr;
+
+ return 0;
+}
+
+/**
+ * pfm_acquire_ctx_from_fd -- get ctx from file descriptor
+ * @fd: file descriptor
+ * @ctx: pointer to pointer of context updated on return
+ * @cookie: opaque structure to use for release
+ *
+ * This helper function extracts the ctx from the file descriptor.
+ * It also increments the refcount of the file structure. Thus
+ * it updates the cookie so the refcount can be decreased when
+ * leaving the perfmon syscall via pfm_release_ctx_from_fd
+ */
+static int pfm_acquire_ctx_from_fd(int fd, struct pfm_context **ctx,
+ struct pfm_syscall_cookie *cookie)
+{
+ struct file *filp;
+ int fput_needed;
+
+ filp = fget_light(fd, &fput_needed);
+ if (unlikely(filp == NULL)) {
+ PFM_DBG("invalid fd %d", fd);
+ return -EBADF;
+ }
+
+ *ctx = filp->private_data;
+
+ if (unlikely(!*ctx || filp->f_op != &pfm_file_ops)) {
+ PFM_DBG("fd %d not related to perfmon", fd);
+ return -EBADF;
+ }
+ cookie->filp = filp;
+ cookie->fput_needed = fput_needed;
+
+ return 0;
+}
+
+/**
+ * pfm_release_ctx_from_fd -- decrease refcount of file associated with context
+ * @cookie: the cookie structure initialized by pfm_acquire_ctx_from_fd
+ */
+static inline void pfm_release_ctx_from_fd(struct pfm_syscall_cookie *cookie)
+{
+ fput_light(cookie->filp, cookie->fput_needed);
+}
+
+/**
+ * pfm_validate_type_sz -- validate sz based on type
+ * @type : PFM_RW_XX type passed to pfm_write or pfm_read
+ * @sz : vector size in bytes
+ *
+ * return:
+ * the number of elements in the vector, 0 if error
+ */
+static size_t pfm_validate_type_sz(int type, size_t sz)
+{
+ size_t count, sz_type;
+
+ switch(type) {
+ case PFM_RW_PMD:
+ case PFM_RW_PMC:
+ sz_type = sizeof(struct pfarg_pmr);
+ break;
+ default:
+ PFM_DBG("invalid type=%d", type);
+ return 0;
+ }
+
+ count = sz / sz_type;
+
+ if ((count * sz_type) != sz) {
+ PFM_DBG("invalid size=%zu for type=%d", sz, type);
+ return 0;
+ }
+
+ PFM_DBG("sz=%zu sz_type=%zu count=%zu",
+ sz,
+ sz_type,
+ count);
+
+ return count;
+}
+
+/*
+ * unlike the other perfmon system calls, this one returns a file descriptor
+ * or a value < 0 in case of error, very much like open() or socket()
+ */
+asmlinkage long sys_pfm_create(int flags, struct pfarg_sinfo __user *ureq)
+{
+ struct pfm_context *new_ctx;
+ struct pfarg_sinfo sif;
+ int ret;
+
+ PFM_DBG("flags=0x%x sif=%p", flags, ureq);
+
+ if (perfmon_disabled)
+ return -ENOSYS;
+
+ if (flags) {
+ PFM_DBG("no flags accepted yet");
+ return -EINVAL;
+ }
+ ret = __pfm_create_context(flags, &sif, &new_ctx);
+
+ /*
+ * copy sif to user level argument, if requested
+ */
+ if (ureq && copy_to_user(ureq, &sif, sizeof(sif))) {
+ pfm_undo_create(ret, new_ctx);
+ ret = -EFAULT;
+ }
+ return ret;
+}
+
+asmlinkage long sys_pfm_write(int fd, int uflags,
+ int type,
+ void __user *ureq,
+ size_t sz)
+{
+ u64 buf[PFM_STK_ARG];
+ struct pfm_context *ctx;
+ struct pfm_syscall_cookie cookie;
+ void *req, *fptr;
+ unsigned long flags;
+ size_t count;
+ int ret;
+
+ PFM_DBG("fd=%d flags=0x%x type=%d req=%p sz=%zu",
+ fd, uflags, type, ureq, sz);
+
+ if (uflags) {
+ PFM_DBG("no flags defined");
+ return -EINVAL;
+ }
+
+ count = pfm_validate_type_sz(type, sz);
+ if (!count)
+ return -EINVAL;
+
+ ret = pfm_acquire_ctx_from_fd(fd, &ctx, &cookie);
+ if (ret)
+ return ret;
+
+ ret = pfm_get_args(ureq, sz, sizeof(buf), buf, (void **)&req, &fptr);
+ if (ret)
+ goto error;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+ if (ret)
+ goto skip;
+ switch(type) {
+ case PFM_RW_PMC:
+ ret = __pfm_write_pmcs(ctx, req, count);
+ break;
+ case PFM_RW_PMD:
+ ret = __pfm_write_pmds(ctx, req, count);
+ break;
+ default:
+ PFM_DBG("invalid type=%d", type);
+ ret = -EINVAL;
+ }
+skip:
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ /*
+ * This function may be on the critical path.
+ * We want to avoid the branch if unecessary.
+ */
+ if (fptr)
+ kfree(fptr);
+error:
+ pfm_release_ctx_from_fd(&cookie);
+ return ret;
+}
+
+asmlinkage long sys_pfm_read(int fd, int uflags,
+ int type,
+ void __user *ureq,
+ size_t sz)
+{
+ u64 buf[PFM_STK_ARG];
+ struct pfm_context *ctx;
+ struct pfm_syscall_cookie cookie;
+ void *req, *fptr;
+ unsigned long flags;
+ size_t count;
+ int ret;
+
+ PFM_DBG("fd=%d flags=0x%x type=%d req=%p sz=%zu",
+ fd, uflags, type, ureq, sz);
+
+ if (uflags) {
+ PFM_DBG("no flags defined");
+ return -EINVAL;
+ }
+
+ count = pfm_validate_type_sz(type, sz);
+ if (!count)
+ return -EINVAL;
+
+ ret = pfm_acquire_ctx_from_fd(fd, &ctx, &cookie);
+ if (ret)
+ return ret;
+
+ ret = pfm_get_args(ureq, sz, sizeof(buf), buf, (void **)&req, &fptr);
+ if (ret)
+ goto error;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+ if (ret)
+ goto skip;
+
+ switch(type) {
+ case PFM_RW_PMD:
+ ret = __pfm_read_pmds(ctx, req, count);
+ break;
+ default:
+ PFM_DBG("invalid type=%d", type);
+ ret = -EINVAL;
+ }
+skip:
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ if (copy_to_user(ureq, req, sz))
+ ret = -EFAULT;
+
+ if (fptr)
+ kfree(fptr);
+error:
+ pfm_release_ctx_from_fd(&cookie);
+ return ret;
+}
+
+asmlinkage long sys_pfm_set_state(int fd, int uflags, int state)
+{
+ struct pfm_context *ctx;
+ struct pfm_syscall_cookie cookie;
+ unsigned long flags;
+ int ret;
+
+ PFM_DBG("fd=%d uflags=0x%x state=0x%x", fd, uflags, state);
+
+ if (uflags) {
+ PFM_DBG("no flags defined");
+ return -EINVAL;
+ }
+
+ switch(state) {
+ case PFM_ST_START:
+ case PFM_ST_STOP:
+ break;
+ default:
+ PFM_DBG("invalid state=0x%x", state);
+ return -EINVAL;
+ }
+
+ ret = pfm_acquire_ctx_from_fd(fd, &ctx, &cookie);
+ if (ret)
+ return ret;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+ if (!ret) {
+ if (state == PFM_ST_STOP)
+ ret = __pfm_stop(ctx);
+ else
+ ret = __pfm_start(ctx);
+ }
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ pfm_release_ctx_from_fd(&cookie);
+
+ return ret;
+}
+
+static long pfm_detach(int fd, int uflags)
+{
+ struct pfm_context *ctx;
+ struct pfm_syscall_cookie cookie;
+ unsigned long flags;
+ int ret;
+
+ ret = pfm_acquire_ctx_from_fd(fd, &ctx, &cookie);
+ if (ret)
+ return ret;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED|PFM_CMD_UNLOAD, &flags);
+ if (!ret)
+ ret = __pfm_unload_context(ctx);
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ /*
+ * if unload was successful, then release the session
+ * must be called with interrupts enabled, thus we need
+ * to defer until are out of __pfm_unload_context()
+ */
+ if (!ret)
+ pfm_session_release();
+
+ pfm_release_ctx_from_fd(&cookie);
+
+ return ret;
+}
+
+asmlinkage long sys_pfm_attach(int fd, int uflags, int target)
+{
+ struct pfm_context *ctx;
+ struct task_struct *task;
+ struct pfm_syscall_cookie cookie;
+ unsigned long flags;
+ int ret;
+
+ PFM_DBG("fd=%d uflags=0x%x target=%d", fd, uflags, target);
+
+ if (uflags) {
+ PFM_DBG("invalid flags");
+ return -EINVAL;
+ }
+
+ /*
+ * handle detach in a separate function
+ */
+ if (target == PFM_NO_TARGET)
+ return pfm_detach(fd, uflags);
+
+ ret = pfm_acquire_ctx_from_fd(fd, &ctx, &cookie);
+ if (ret)
+ return ret;
+
+ task = current;
+
+ /*
+ * in per-thread mode (not self-monitoring), get a reference
+ * on task to monitor. This must be done with interrupts enabled
+ * Upon succesful return, refcount on task has increased.
+ *
+ * fget_light() is protecting the context.
+ */
+ if (target != current->pid) {
+ ret = pfm_get_task(ctx, target, &task);
+ if (ret)
+ goto error;
+ }
+
+ /*
+ * irqsave is required to avoid race in case context is already
+ * loaded or with switch timeout in the case of self-monitoring
+ */
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, PFM_CMD_UNLOADED, &flags);
+ if (!ret)
+ ret = __pfm_load_context(ctx, task);
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ /*
+ * in per-thread mode (not self-monitoring), we need
+ * to decrease refcount on task to monitor:
+ * - attach successful: we have a reference in ctx->task
+ * - attach failed : undo the effect of pfm_get_task()
+ */
+ if (task != current)
+ put_task_struct(task);
+error:
+ pfm_release_ctx_from_fd(&cookie);
+ return ret;
+}
Index: o3/include/linux/perfmon.h
===================================================================
--- o3.orig/include/linux/perfmon.h 2008-11-25 17:54:59.000000000 +0100
+++ o3/include/linux/perfmon.h 2008-11-25 17:55:04.000000000 +0100
@@ -72,6 +72,17 @@
#define PFM_RW_PMC 0x02 /* accessing PMC registers */

/*
+ * pfm_set_state state:
+ */
+#define PFM_ST_START 0x01 /* start monitoring */
+#define PFM_ST_STOP 0x02 /* stop monitoring */
+
+/*
+ * pfm_attach special target to trigger detach
+ */
+#define PFM_NO_TARGET -1 /* detach session target */
+
+/*
* default value for the user and group security parameters in
* /proc/sys/kernel/perfmon/sys_group
* /proc/sys/kernel/perfmon/task_group
Index: o3/include/linux/syscalls.h
===================================================================
--- o3.orig/include/linux/syscalls.h 2008-11-25 18:10:58.000000000 +0100
+++ o3/include/linux/syscalls.h 2008-11-25 18:11:06.000000000 +0100
@@ -624,4 +624,15 @@

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

+#ifdef CONFIG_PERFMON
+struct pfarg_sinfo;
+asmlinkage long sys_pfm_create(int flags, struct pfarg_sinfo *s,
+ char __user *f, void __user *uarg, size_t uarg_size);
+
+asmlinkage long sys_pfm_write(int fd, int flags, int type, void __user *arg, size_t s);
+asmlinkage long sys_pfm_read(int fd, int flags, int type, void __user *arg, size_t s);
+asmlinkage long sys_pfm_attach(int fd, int flags, int target);
+asmlinkage long sys_pfm_set_state(int fd, int flags, int state);
+#endif /* CONFIG_PERFMON */
+
#endif

--


2008-11-26 11:52:00

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface

> +static int pfm_task_incompatible(struct pfm_context *ctx,
> + struct task_struct *task)
> +{
> + /*
> + * cannot attach to a kernel thread
> + */
> + if (!task->mm) {
> + PFM_DBG("cannot attach to kernel thread [%d]", task->pid);
> + return -EPERM;
> + }

Check for init_mm too?

> +
> + /*
> + * cannot attach to a zombie task
> + */
> + if (task->exit_state == EXIT_ZOMBIE || task->exit_state == EXIT_DEAD) {
> + PFM_DBG("cannot attach to zombie/dead task [%d]", task->pid);
> + return -EBUSY;
> + }

This happens unlocked doesn't it? Couldn't the state change in parallel?

-Andi
--
[email protected]

2008-11-26 13:45:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface


* [email protected] <[email protected]> wrote:

> +/*
> + * unlike the other perfmon system calls, this one returns a file descriptor
> + * or a value < 0 in case of error, very much like open() or socket()
> + */
> +asmlinkage long sys_pfm_create(int flags, struct pfarg_sinfo __user *ureq)
> +{
> + struct pfm_context *new_ctx;
> + struct pfarg_sinfo sif;
> + int ret;
> +
> + PFM_DBG("flags=0x%x sif=%p", flags, ureq);
> +
> + if (perfmon_disabled)
> + return -ENOSYS;
> +
> + if (flags) {
> + PFM_DBG("no flags accepted yet");
> + return -EINVAL;
> + }
> + ret = __pfm_create_context(flags, &sif, &new_ctx);
> +
> + /*
> + * copy sif to user level argument, if requested
> + */
> + if (ureq && copy_to_user(ureq, &sif, sizeof(sif))) {
> + pfm_undo_create(ret, new_ctx);
> + ret = -EFAULT;
> + }
> + return ret;
> +}

the error control flow of fd creation is sloppy here and has an
kernel-data information leak: if __pfm_create_context() fails:

- due to memory pressure
- or due to lack of CPU support
- or due to lack of permissions
- or due to a busy PMU

then &sif is not initialized, and sys_pfm_create() copies it to
user-space. This way attackers can probe portions of the kernel stack.

Worse than that, there's also a DoS hole here: in the same scenario
above (easily created by attackers), new_ctx is not initialized either
- and if a ureq is provided by (unprivileged) userspace with a
faulting address (say ureq == (void *)1), then sys_pfm_create() will
call pfm_undo_create() => kaboom.

It's even a root hole, because attacker can likely prime the kernel
stack with arbitrary values via prior syscalls and hence controls
new_ctx's value, and the freeing logic happily uses it => local root
hole.

Is this stuff in any distro kernel?

Ingo

2008-11-26 14:01:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface


* [email protected] <[email protected]> wrote:

> +static int pfm_task_incompatible(struct pfm_context *ctx,
> + struct task_struct *task)
> +{
> + /*
> + * cannot attach to a kernel thread
> + */
> + if (!task->mm) {
> + PFM_DBG("cannot attach to kernel thread [%d]", task->pid);
> + return -EPERM;
> + }
> +
> + /*
> + * cannot attach to a zombie task
> + */
> + if (task->exit_state == EXIT_ZOMBIE || task->exit_state == EXIT_DEAD) {
> + PFM_DBG("cannot attach to zombie/dead task [%d]", task->pid);
> + return -EBUSY;
> + }
> + return 0;
> +}

The ptrace coupling code seems broken.

It is used in the following context:

+ /*
+ * returns 0 if cannot attach
+ */
+ ret1 = ptrace_may_access(p, PTRACE_MODE_ATTACH);
+ if (ret1)
+ ret = ptrace_check_attach(p, 0);
+
+ PFM_DBG("may_attach=%d check_attach=%d", ret1, ret);
+
+ if (ret || !ret1)
+ goto error;
+
+ ret = pfm_task_incompatible(ctx, p);
+ if (ret)
+ goto error;

firstly, this code is critical to security, but the variable naming
and the control flow is shaped in a dangerous and error-prone way: two
opaque 'ret' and 'ret1' names, which have _inverted_ logical meaning:
for 'ret' a nonzero value means an error, for 'ret1' a zero value
means error.

This code _must_ be rewritten cleanly via a single 'err' variable.
There's absolutely no need to nest ret and ret1 here. If we may not
access via ptrace then we should error out pronto and not complicate
the flow.

Secondly, get rid of those PFM_DBG() calls there, they are not needed
in a production kernel and just obscure review.

Thirdly, the check for ->exit_state in pfm_task_incompatible() is not
needed: we've just passed ptrace_check_attach() so we know we just
transitioned the task to task->state == TASK_TRACED.

If you _ever_ see a task exit TASK_TRACED and go zombie or dead from
there without this code allowing it that means the whole state machine
with ptrace is borked up by perfmon. For example i dont see where the
perfmon-control task parents itself as the exclusive debugger (parent)
of the debuggee-task.

Without that being implemented properly, a parallel ptrace / perfmon
scenario (triggerable by unprivileged userspace) can go amok, crash
the kernel and likely open up various rootholes as well. The kludge in
pfm_task_incompatible() shows that this was probably seen in the field
and hacked around.

Ingo

2008-11-26 14:02:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface


* [email protected] <[email protected]> wrote:

> +
> +/*
> + * unlike the other perfmon system calls, this one returns a file descriptor
> + * or a value < 0 in case of error, very much like open() or socket()
> + */
> +asmlinkage long sys_pfm_create(int flags, struct pfarg_sinfo __user *ureq)
> +{
> + struct pfm_context *new_ctx;
> + struct pfarg_sinfo sif;
> + int ret;
> +
> + PFM_DBG("flags=0x%x sif=%p", flags, ureq);
> +
> + if (perfmon_disabled)
> + return -ENOSYS;
> +
> + if (flags) {
> + PFM_DBG("no flags accepted yet");
> + return -EINVAL;

the canonical return code for non-yet-supported flags is not -EINVAL
but -ENOSYS.

Ingo

2008-11-26 14:09:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface


* [email protected] <[email protected]> wrote:

> +/*
> + * unlike the other perfmon system calls, this one returns a file descriptor
> + * or a value < 0 in case of error, very much like open() or socket()
> + */
> +asmlinkage long sys_pfm_create(int flags, struct pfarg_sinfo __user *ureq)
> +{
> + struct pfm_context *new_ctx;
> + struct pfarg_sinfo sif;
> + int ret;
> +
> + PFM_DBG("flags=0x%x sif=%p", flags, ureq);
> +
> + if (perfmon_disabled)
> + return -ENOSYS;

uhm. So we have a dynamic 'we dont support perfmon' flag. Which is
global and defined as:

+int perfmon_disabled; /* >0 if perfmon is disabled */

(sidenote: that should be __read_mostly)

then we go:

> + if (flags) {
> + PFM_DBG("no flags accepted yet");
> + return -EINVAL;
> + }

that should be if (unlikely())

then:

> + ret = __pfm_create_context(flags, &sif, &new_ctx);

where we get:

+int __pfm_create_context(__u32 ctx_flags,
+ struct pfarg_sinfo *sif,
+ struct pfm_context **new_ctx)
+{
+ struct pfm_context *ctx;
+ struct file *filp = NULL;
+ int fd = 0, ret = -EINVAL;
+
+ if (!pfm_pmu_conf)
+ return -ENOSYS;
+

_ANOTHER_ global dynamic flag to tell us that ... in essence 'we dont
support perfmon'. Which flag is again:

+struct pfm_pmu_config *pfm_pmu_conf;

... which should be __read_mostly at minimum.

+EXPORT_SYMBOL(pfm_pmu_conf);

and _MUST_ be exported as EXPORT_SYMBOL_GPL. If exported at all. Why
are any symbols exported here? perfmom does core kernel system calls
and is non-modular:

+config PERFMON
+ bool "Perfmon2 performance monitoring interface"

it needs _zero_ exports.

Ingo

2008-11-26 14:12:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface


* [email protected] <[email protected]> wrote:

> +/*
> + * unlike the other perfmon system calls, this one returns a file descriptor
> + * or a value < 0 in case of error, very much like open() or socket()
> + */
> +asmlinkage long sys_pfm_create(int flags, struct pfarg_sinfo __user *ureq)
> +{
> + struct pfm_context *new_ctx;
> + struct pfarg_sinfo sif;
> + int ret;
> +
> + PFM_DBG("flags=0x%x sif=%p", flags, ureq);
> +
> + if (perfmon_disabled)
> + return -ENOSYS;

another gem. we check flags:

> + if (flags) {
> + PFM_DBG("no flags accepted yet");
> + return -EINVAL;
> + }

then we pass them into __pfm_create_context():

> + ret = __pfm_create_context(flags, &sif, &new_ctx);

where we check the flag _again_:

+ /* no context flags supported yet */
+ if (ctx_flags)
+ goto error_alloc;

btw., 'error_alloc' is misnamed: that label is not used for allocation
failure in this branch.

Ingo

2008-11-26 14:13:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface


another thing:

+static inline int pfm_ctx_permissions(u32 ctx_flags)
+{
+ if (pfm_controls.task_group != PFM_GROUP_PERM_ANY
+ && !in_group_p(pfm_controls.task_group)) {
+ PFM_DBG("user group not allowed to create a task context");
+ return -EPERM;
+ }
+ return 0;
+}

ctx_flags is unused here. If flags used used later on _then_ should it
be added to these functions. Right now it just obscures review and
slows down the code.

Ingo

2008-11-26 17:02:26

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface

On 11/26, Ingo Molnar wrote:
>
> * [email protected] <[email protected]> wrote:
>
> > +static int pfm_task_incompatible(struct pfm_context *ctx,
> > + struct task_struct *task)
> > +{
> > + /*
> > + * cannot attach to a kernel thread
> > + */
> > + if (!task->mm) {
> > + PFM_DBG("cannot attach to kernel thread [%d]", task->pid);
> > + return -EPERM;
> > + }
> > +
> > + /*
> > + * cannot attach to a zombie task
> > + */
> > + if (task->exit_state == EXIT_ZOMBIE || task->exit_state == EXIT_DEAD) {
> > + PFM_DBG("cannot attach to zombie/dead task [%d]", task->pid);
> > + return -EBUSY;
> > + }
> > + return 0;
> > +}

I agree with Ingo these checks are pointless. Without the locks
the ->mm or ->exit_state can be changed right after the check.

And, as Ingo pointed out, you don't need this function at all,
if ptrace_check_attach() succeeds the task must have ->mm and
its ->exit_state == 0.

But, please note that the task can be SIGKILL'ed right after
ptrace_check_attach(), it can drop ->mm, it can be released.
(i don't understand the patch, perhaps this doesn't matter for
you, just in case).

Oleg.

2008-11-27 12:26:18

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface

Hi,

Given that this thread discusses the syscall API, I would like to verify
something with you at this point.

The sys_pfm_create() syscall is used to create a perfmon session (context).
In the current patchset, where only counting is supported, the syscall is
defined as:

int pfm_create(int flags, pfarg_sinfo_t *sif);

The plan is to extend this syscall, instead of adding a new syscall,
to add support
for sampling. The syscall would be used to select which sampling
buffer format and for passing
potential arguments to that format. The syscall would then look like:

int pfm_create(int flags, pfarg_sinfo_t *sif, char *fmt_name, void
*fmt_arg, size_t arg_sz);

The kernel would look at the last 3 arguments ONLY if it saw the
PFM_FL_SMPL_FMT bit
set in flags. We follow the model used by open(2).

I want to double-check that you are fine with this approach.

2008-11-27 12:30:54

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface

> int pfm_create(int flags, pfarg_sinfo_t *sif, char *fmt_name, void
> *fmt_arg, size_t arg_sz);
>
> The kernel would look at the last 3 arguments ONLY if it saw the
> PFM_FL_SMPL_FMT bit
> set in flags. We follow the model used by open(2).

Normally it's a better model to enforce 0 (or whatever appropiate)
in unused argument. That can be done by checking and returning EINVAL.

>
> I want to double-check that you are fine with this approach.

My recommendation would be to go with the full format
from the beginning, because otherwise there is a versioning problem
with glibc.


-Andi

--
[email protected]

2008-11-27 14:02:55

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface

On Wed, 26 Nov 2008, [email protected] wrote:

> +asmlinkage long sys_pfm_write(int fd, int uflags,
> + int type,
> + void __user *ureq,
> + size_t sz)

> +asmlinkage long sys_pfm_read(int fd, int uflags,
> + int type,
> + void __user *ureq,
> + size_t sz)

After looking at both I did a diff of the two functions:

--- r.c 2008-11-27 14:27:54.000000000 +0100
+++ w.c 2008-11-27 14:27:52.000000000 +0100
@@ -36,10 +36,12 @@
ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
if (ret)
goto skip;
-
switch(type) {
+ case PFM_RW_PMC:
+ ret = __pfm_write_pmcs(ctx, req, count);
+ break;
case PFM_RW_PMD:
- ret = __pfm_read_pmds(ctx, req, count);
+ ret = __pfm_write_pmds(ctx, req, count);
break;
default:
PFM_DBG("invalid type=%d", type);
@@ -48,12 +50,13 @@
skip:
spin_unlock_irqrestore(&ctx->lock, flags);

- if (copy_to_user(ureq, req, sz))
- ret = -EFAULT;
-
+ /*
+ * This function may be on the critical path.
+ * We want to avoid the branch if unecessary.
+ */
if (fptr)
kfree(fptr);
error:
pfm_release_ctx_from_fd(&cookie);
return ret;
}

Both read and write are multiplexing syscalls already and 90% of the
code is the same.

case PFM_RD_PMC:
case PFM_RD_PMD:
case PFM_WR_PMC:
case PFM_WR_PMD:

would make them the same and safe a syscall and duplicated code.

Thanks,

tglx

2008-11-27 14:08:17

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface

Thomas,

On Thu, Nov 27, 2008 at 3:01 PM, Thomas Gleixner <[email protected]> wrote:
> On Wed, 26 Nov 2008, [email protected] wrote:
>
>> +asmlinkage long sys_pfm_write(int fd, int uflags,
>> + int type,
>> + void __user *ureq,
>> + size_t sz)
>
>> +asmlinkage long sys_pfm_read(int fd, int uflags,
>> + int type,
>> + void __user *ureq,
>> + size_t sz)
>
> After looking at both I did a diff of the two functions:
>
> --- r.c 2008-11-27 14:27:54.000000000 +0100
> +++ w.c 2008-11-27 14:27:52.000000000 +0100
> @@ -36,10 +36,12 @@
> ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
> if (ret)
> goto skip;
> -
> switch(type) {
> + case PFM_RW_PMC:
> + ret = __pfm_write_pmcs(ctx, req, count);
> + break;
> case PFM_RW_PMD:
> - ret = __pfm_read_pmds(ctx, req, count);
> + ret = __pfm_write_pmds(ctx, req, count);
> break;
> default:
> PFM_DBG("invalid type=%d", type);
> @@ -48,12 +50,13 @@
> skip:
> spin_unlock_irqrestore(&ctx->lock, flags);
>
> - if (copy_to_user(ureq, req, sz))
> - ret = -EFAULT;
> -
> + /*
> + * This function may be on the critical path.
> + * We want to avoid the branch if unecessary.
> + */
> if (fptr)
> kfree(fptr);
> error:
> pfm_release_ctx_from_fd(&cookie);
> return ret;
> }
>
> Both read and write are multiplexing syscalls already and 90% of the
> code is the same.
>
> case PFM_RD_PMC:
> case PFM_RD_PMD:
> case PFM_WR_PMC:
> case PFM_WR_PMD:
>
> would make them the same and safe a syscall and duplicated code.
>
I am fine with that (BTW, there is no PFM_RD_PMC).

What about we call it pfm_rw_regs() ?

2008-11-27 14:23:00

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface

Ingo,

On Wed, Nov 26, 2008 at 3:00 PM, Ingo Molnar <[email protected]> wrote:
>
> Thirdly, the check for ->exit_state in pfm_task_incompatible() is not
> needed: we've just passed ptrace_check_attach() so we know we just
> transitioned the task to task->state == TASK_TRACED.
>
> If you _ever_ see a task exit TASK_TRACED and go zombie or dead from
> there without this code allowing it that means the whole state machine
> with ptrace is borked up by perfmon. For example i dont see where the
> perfmon-control task parents itself as the exclusive debugger (parent)
> of the debuggee-task.
>

Perfmon requires ptrace ONLY to stop the thread you want to operate
on. For instance, to read the counters in a thread via pfm_read(), you
need to have that thread stopped, so perfmon can extract the machine
state safely. But when the monitored thread runs, it does not have to
remain under the control of ptrace. All that is needed is that the thread
is stopped while we are in the perfmon syscall. I think ptrace allows this
today. We will be able to drop ptrace() once we switch to utrace in which
case, the kernel will be able to easily stop the thread when entering the
perfmon syscalls. I guess I don't quite understand the meaning of your
last sentence.

2008-11-27 14:29:05

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface

Ingo,

On Wed, Nov 26, 2008 at 3:08 PM, Ingo Molnar <[email protected]> wrote:
>
> * [email protected] <[email protected]> wrote:
>
>> +/*
>> + * unlike the other perfmon system calls, this one returns a file descriptor
>> + * or a value < 0 in case of error, very much like open() or socket()
>> + */
>> +asmlinkage long sys_pfm_create(int flags, struct pfarg_sinfo __user *ureq)
>> +{
>> + struct pfm_context *new_ctx;
>> + struct pfarg_sinfo sif;
>> + int ret;
>> +
>> + PFM_DBG("flags=0x%x sif=%p", flags, ureq);
>> +
>> + if (perfmon_disabled)
>> + return -ENOSYS;
>
> uhm. So we have a dynamic 'we dont support perfmon' flag. Which is
> global and defined as:
>
> +int perfmon_disabled; /* >0 if perfmon is disabled */
>
This variable says, there is perfmon support in the kernel but
initialization of the
subsystem failed. Thus the subsystem is disabled.

> (sidenote: that should be __read_mostly)
>

>> + ret = __pfm_create_context(flags, &sif, &new_ctx);
>
> where we get:
>
> +int __pfm_create_context(__u32 ctx_flags,
> + struct pfarg_sinfo *sif,
> + struct pfm_context **new_ctx)
> +{
> + struct pfm_context *ctx;
> + struct file *filp = NULL;
> + int fd = 0, ret = -EINVAL;
> +
> + if (!pfm_pmu_conf)
> + return -ENOSYS;
> +
>
> _ANOTHER_ global dynamic flag to tell us that ... in essence 'we dont
> support perfmon'. Which flag is again:
>
That one says that we do not have support for that particular PMU model.
In the fully-featured version, there PMU description table is loaded
via a kernel
module. So you can be in a situation where the subsystem initialized properly
but no PMU description module is loaded or none exists for this processor.
That's what this second test is about. Note that if all PMU description tables
are compiled in, we can still get into a situation where none support the
host processor.

> +struct pfm_pmu_config *pfm_pmu_conf;
>
> ... which should be __read_mostly at minimum.
>
> +EXPORT_SYMBOL(pfm_pmu_conf);
>
> and _MUST_ be exported as EXPORT_SYMBOL_GPL. If exported at all. Why
> are any symbols exported here? perfmom does core kernel system calls
> and is non-modular:
>
In the patchset, yes. I can drop that for now.

2008-11-27 14:42:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface


* stephane eranian <[email protected]> wrote:

> Ingo,
>
> On Wed, Nov 26, 2008 at 3:00 PM, Ingo Molnar <[email protected]> wrote:
> >
> > Thirdly, the check for ->exit_state in pfm_task_incompatible() is
> > not needed: we've just passed ptrace_check_attach() so we know we
> > just transitioned the task to task->state == TASK_TRACED.
> >
> > If you _ever_ see a task exit TASK_TRACED and go zombie or dead
> > from there without this code allowing it that means the whole
> > state machine with ptrace is borked up by perfmon. For example i
> > dont see where the perfmon-control task parents itself as the
> > exclusive debugger (parent) of the debuggee-task.
> >
>
> Perfmon requires ptrace ONLY to stop the thread you want to operate
> on. For instance, to read the counters in a thread via pfm_read(),
> you need to have that thread stopped, so perfmon can extract the
> machine state safely. But when the monitored thread runs, it does
> not have to remain under the control of ptrace. All that is needed
> is that the thread is stopped while we are in the perfmon syscall. I
> think ptrace allows this today. We will be able to drop ptrace()
> once we switch to utrace in which case, the kernel will be able to
> easily stop the thread when entering the perfmon syscalls. I guess I
> don't quite understand the meaning of your last sentence.

The meaning of my last sentence is the jist of my argument: you cannot
do it like this! You are using a bit of the ptrace infrastructure but
unsafely, as pointed out here.

and the thing is, i fail to understand the whole justification of the
new sys_pfm_attach()/PFM_NO_TARGET system calls.

Firstly, there's a taste issue: why didnt you add sys_pfm_detach
instead of adding a butt-ugly PFM_NO_TARGET special case into
sys_pfm_attach() that maps to pfm_detach??

But more importantly, and very fundamentally: why did you implement it
as a special system call? Why didnt you extend ptrace to read/write
the PMU context? It is _trivial_ and needs no new syscalls at all:
just a new ptrace parameter to arch_ptrace(). And ptrace will drive
the TASK_TRACED state machine safely - it already stops/starts tasks
to read/write hardware context safely.

And as a bonus, if this is implemented via a ptrace extension it will
be trivial to add support for these new context types to all sorts of
user-space debuggers as well. With new syscalls it will take ages for
this to trickle through to all parties involved.

Ingo

2008-11-27 15:16:20

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface

Ingo,

On Thu, Nov 27, 2008 at 3:42 PM, Ingo Molnar <[email protected]> wrote:
>
> * stephane eranian <[email protected]> wrote:
>
>> Ingo,
>>
>> On Wed, Nov 26, 2008 at 3:00 PM, Ingo Molnar <[email protected]> wrote:
>> >
>> > Thirdly, the check for ->exit_state in pfm_task_incompatible() is
>> > not needed: we've just passed ptrace_check_attach() so we know we
>> > just transitioned the task to task->state == TASK_TRACED.
>> >
>> > If you _ever_ see a task exit TASK_TRACED and go zombie or dead
>> > from there without this code allowing it that means the whole
>> > state machine with ptrace is borked up by perfmon. For example i
>> > dont see where the perfmon-control task parents itself as the
>> > exclusive debugger (parent) of the debuggee-task.
>> >
>>
>> Perfmon requires ptrace ONLY to stop the thread you want to operate
>> on. For instance, to read the counters in a thread via pfm_read(),
>> you need to have that thread stopped, so perfmon can extract the
>> machine state safely. But when the monitored thread runs, it does
>> not have to remain under the control of ptrace. All that is needed
>> is that the thread is stopped while we are in the perfmon syscall. I
>> think ptrace allows this today. We will be able to drop ptrace()
>> once we switch to utrace in which case, the kernel will be able to
>> easily stop the thread when entering the perfmon syscalls. I guess I
>> don't quite understand the meaning of your last sentence.
>
> The meaning of my last sentence is the jist of my argument: you cannot
> do it like this! You are using a bit of the ptrace infrastructure but
> unsafely, as pointed out here.
>
Perfmon requires the application to use ptrace. The function you looked
at is just there to verify that the application actually did attach before
making a perfmon syscall. If it was not for the stop/resume feature, I would
not require using ptrace. That's why I like the utrace functionality.

Ptracing and performance monitoring are two different things. One
is for debugging the other is for performance analysis. The requirements
are different. There are all sorts of nasty side effects of ptrace which are
pointless with performance monitoring, e.g., signal redirections.



> and the thing is, i fail to understand the whole justification of the
> new sys_pfm_attach()/PFM_NO_TARGET system calls.
>
> Firstly, there's a taste issue: why didnt you add sys_pfm_detach
> instead of adding a butt-ugly PFM_NO_TARGET special case into
> sys_pfm_attach() that maps to pfm_detach??
>
To tell you the truth, I had it exactly like that initially. Some
people suggested
I could combine attach and detach and thereby reduce the number of syscalls.
If you have no issue with adding dedicated syscalls for attach and detach, then
I can add this back.

> But more importantly, and very fundamentally: why did you implement it
> as a special system call? Why didnt you extend ptrace to read/write
> the PMU context? It is _trivial_ and needs no new syscalls at all:
> just a new ptrace parameter to arch_ptrace(). And ptrace will drive
> the TASK_TRACED state machine safely - it already stops/starts tasks
> to read/write hardware context safely.
>

I have not looked at those ptrace extensions. But one thing to keep in mind
is that perfmon can operate at the per-thread level but ALSO at the per-cpu
level (system-wide). I want to the use the same syscall sequence in either case.
You can attach to a thread for a per-thread context or a CPU for a system-wide
context. In system-wide mode, the target designate the CPU, in per-thread the
tid.

2008-12-01 00:50:24

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface

stephane eranian writes:

> Perfmon requires ptrace ONLY to stop the thread you want to operate
> on. For instance, to read the counters in a thread via pfm_read(), you
> need to have that thread stopped, so perfmon can extract the machine
> state safely.

What would happen if the thread wasn't stopped? Is it just that the
numbers would be inaccurate, or is there some kind of security
exposure?

If it's just that the numbers would be inaccurate, then I don't think
the kernel needs to enforce it. The monitoring program *should*
ensure that the thread is stopped or blocked, one way or another, but
it could do that simply by sending a SIGSTOP to the thread. I don't
see that it would necessarily have to use ptrace.

Paul.

2008-12-01 06:05:53

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface

Paul,

On Mon, Dec 1, 2008 at 1:49 AM, Paul Mackerras <[email protected]> wrote:
> stephane eranian writes:
>
>> Perfmon requires ptrace ONLY to stop the thread you want to operate
>> on. For instance, to read the counters in a thread via pfm_read(), you
>> need to have that thread stopped, so perfmon can extract the machine
>> state safely.
>
> What would happen if the thread wasn't stopped? Is it just that the
> numbers would be inaccurate, or is there some kind of security
> exposure?
>
There are certain operations which cannot be performed, for instance
attaching/detaching.

As for read and write, sure you could never touch the PMU directly if you were
not self-monitoring. But then there is question as what does the
interface guarantee
in terms of execution of the actions. With read, you'd have to say the
interface does
not guarantee the value returned is up-to-date. For applications which
never context
switch, for instance, there the values read from the software state
maybe totally stale.

> If it's just that the numbers would be inaccurate, then I don't think
> the kernel needs to enforce it. The monitoring program *should*
> ensure that the thread is stopped or blocked, one way or another, but
> it could do that simply by sending a SIGSTOP to the thread. I don't
> see that it would necessarily have to use ptrace.
>
Because ptrace provides the additional guarantees, for instance,
nobody can do a SIGCONT
while you are operating on the ptraced thread, nobody else can do the
PTRACE_DETATCH.
Now, I don't think bad things could actually happen if a SIGCONT were
to happen, because
the context is locked during all operations and the context switch in
routine tries to grab that
lock. There may be difficulties if you hold the lock and then you need
to release it for one operation
and then you grab it again. There may be a couple of places where we do that.

Also sending SIGSTOP is not enough to guarantee the thread is off of
the CPU. You need to wait
until it is actually off, i.e., all the state has been saved. I don't
think you can test that from userland.


Note that I am not arguing that we will have to use ptrace forever. In
fact, utrace provides the mechanisms
and the guarantees to avoid using ptrace alltogether. With utrace,
applications would directly call into the
kernel and then, if needed, the kernel would use utrace to stop the
other thread. I have played with that several
months ago and it worked fine. The problem is that utrace is still not in.

2008-12-01 06:10:49

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface

Thomas,

On Thu, Nov 27, 2008 at 3:01 PM, Thomas Gleixner <[email protected]> wrote:
> On Wed, 26 Nov 2008, [email protected] wrote:
>
>> +asmlinkage long sys_pfm_write(int fd, int uflags,
>> + int type,
>> + void __user *ureq,
>> + size_t sz)
>
>> +asmlinkage long sys_pfm_read(int fd, int uflags,
>> + int type,
>> + void __user *ureq,
>> + size_t sz)
>
> After looking at both I did a diff of the two functions:
>
> --- r.c 2008-11-27 14:27:54.000000000 +0100
> +++ w.c 2008-11-27 14:27:52.000000000 +0100
> @@ -36,10 +36,12 @@
> ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
> if (ret)
> goto skip;
> -
> switch(type) {
> + case PFM_RW_PMC:
> + ret = __pfm_write_pmcs(ctx, req, count);
> + break;
> case PFM_RW_PMD:
> - ret = __pfm_read_pmds(ctx, req, count);
> + ret = __pfm_write_pmds(ctx, req, count);
> break;
> default:
> PFM_DBG("invalid type=%d", type);
> @@ -48,12 +50,13 @@
> skip:
> spin_unlock_irqrestore(&ctx->lock, flags);
>
> - if (copy_to_user(ureq, req, sz))
> - ret = -EFAULT;
> -
> + /*
> + * This function may be on the critical path.
> + * We want to avoid the branch if unecessary.
> + */
> if (fptr)
> kfree(fptr);
> error:
> pfm_release_ctx_from_fd(&cookie);
> return ret;
> }
>
> Both read and write are multiplexing syscalls already and 90% of the
> code is the same.
>
> case PFM_RD_PMC:
> case PFM_RD_PMD:
> case PFM_WR_PMC:
> case PFM_WR_PMD:
>
> would make them the same and safe a syscall and duplicated code.
>
I have some issues with this approach because in the API I have tried to follow
the model where the syscall name describes the action taken (only one
per call):
- create
- read
- write
- set_state
- attach/detach I admit this one is not so clean. But Ingo
indicated he would prefer them as separate which I am
happy to do (it was like this originally)

2008-12-04 00:55:34

by Roland McGrath

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface

> > + /*
> > + * cannot attach to a kernel thread
> > + */
> > + if (!task->mm) {

This should test (task->flags & PF_KTHREAD). (But it's superfluous anyway.)

> > + /*
> > + * cannot attach to a zombie task
> > + */
> > + if (task->exit_state == EXIT_ZOMBIE || task->exit_state == EXIT_DEAD) {

The usual test to write here is just "if (task->exit_state)".
(But it's superfluous anyway.)

> Thirdly, the check for ->exit_state in pfm_task_incompatible() is not
> needed: we've just passed ptrace_check_attach() so we know we just
> transitioned the task to task->state == TASK_TRACED.

Correct.

> If you _ever_ see a task exit TASK_TRACED and go zombie or dead from
> there without this code allowing it that means the whole state machine
> with ptrace is borked up by perfmon.

Is always possible for a TASK_TRACED task to suddenly die via SIGKILL, and
it's even possible it will be reaped (EXIT_DEAD) despite ptrace, in the
de_thread() (MT exec) case. This race is tolerable for everything ptrace
does (it holds a task ref, and the arch code copes). It must also be
tolerated by the perfmon code.

> For example i dont see where the perfmon-control task parents itself as
> the exclusive debugger (parent) of the debuggee-task.

ptrace_check_attach() ensures that this is so.


Thanks,
Roland

2008-12-04 01:06:18

by Roland McGrath

[permalink] [raw]
Subject: Re: [patch 20/24] perfmon: system calls interface

> > + /*
> > + * cannot attach to a kernel thread
> > + */
> > + if (!task->mm) {

This should test (task->flags & PF_KTHREAD). (But it's superfluous anyway.)

> > + /*
> > + * cannot attach to a zombie task
> > + */
> > + if (task->exit_state == EXIT_ZOMBIE || task->exit_state == EXIT_DEAD) {

The usual test to write here is just "if (task->exit_state)".
(But it's superfluous anyway.)

> Thirdly, the check for ->exit_state in pfm_task_incompatible() is not
> needed: we've just passed ptrace_check_attach() so we know we just
> transitioned the task to task->state == TASK_TRACED.

Correct.

> If you _ever_ see a task exit TASK_TRACED and go zombie or dead from
> there without this code allowing it that means the whole state machine
> with ptrace is borked up by perfmon.

Is always possible for a TASK_TRACED task to suddenly die via SIGKILL, and
it's even possible it will be reaped (EXIT_DEAD) despite ptrace, in the
de_thread() (MT exec) case. This race is tolerable for everything ptrace
does (it holds a task ref, and the arch code copes). It must also be
tolerated by the perfmon code.

> For example i dont see where the perfmon-control task parents itself as
> the exclusive debugger (parent) of the debuggee-task.

ptrace_check_attach() ensures that this is so.


Thanks,
Roland