2006-08-23 08:21:40

by Stephane Eranian

Subject: [PATCH 4/18] 2.6.17.9 perfmon2 patch for review: new system calls support

This patch contains the perfmon2 system call interface.

The interface consists of 12 new system calls. The front-end
of each system call is implemented in perfmon_syscalls.c.
The front-end takes care of copying the parameters into
kernel structures and also verifies that the perfmon state
is appropriate for each command. The back-end of each syscall
is implemented either in the core (perfmon.c) or in a feature-
specific file (e.g. perfmon_sets.c).

The system calls are defined as follows:

sys_pfm_create_context():
- create a new perfmon2 context and return a file descriptor in
the pfarg_ctx_t parameter. This is the first call an application
must make to do monitoring

sys_pfm_write_pmcs():
- program the PMU configuration registers. Accepts a vector of arguments
of type pfarg_pmc_t

sys_pfm_write_pmds():
- program the PMU data registers. Accepts a vector of arguments of type
pfarg_pmd_t

sys_pfm_read_pmds():
- read the PMU data registers. Accepts a vector of arguments of type
pfarg_pmd_t

sys_pfm_restart():
- indicate that the application is done processing an overflow notification

sys_pfm_start():
- start monitoring

sys_pfm_stop():
- stop monitoring

sys_pfm_load_context():
- attach a perfmon2 context to a task or the current processor.

sys_pfm_unload_context():
- detach the perfmon2 context

sys_pfm_create_evtsets():
- create or change event sets. By default, a context is created with only one
set

sys_pfm_delete_evtsets():
- delete any explicitly created event set

sys_pfm_getinfo_evtsets():
- get information about event sets, such as the number of activations. Accepts
a vector of arguments of type pfarg_setinfo_t

There are other, more indirect system calls related to the fact that a context uses a file
descriptor. Those system calls are implemented in perfmon_file.c and are part of another patch.
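
To make the call sequence concrete, here is a minimal self-monitoring sketch.
It is illustrative only: it assumes libpfm-style stubs named after the system
calls, and all register numbers, values, and field names other than ctx_fd are
placeholders, since the actual PMC/PMD layout is PMU-specific.

#include <string.h>
#include <unistd.h>
#include <perfmon/perfmon.h>	/* assumed userland header: pfarg_* types and stubs */

int monitor_self(void)
{
	struct pfarg_ctx ctx;
	struct pfarg_load load;
	struct pfarg_pmc pmc;
	struct pfarg_pmd pmd;
	int fd;

	memset(&ctx, 0, sizeof(ctx));
	if (pfm_create_context(&ctx, NULL, 0) < 0)
		return -1;
	fd = ctx.ctx_fd;		/* descriptor comes back in the argument */

	memset(&pmc, 0, sizeof(pmc));
	pmc.reg_num = 0;		/* placeholder: PMU-specific config register */
	pmc.reg_value = 0;		/* placeholder: event encoding */
	pfm_write_pmcs(fd, &pmc, 1);

	memset(&pmd, 0, sizeof(pmd));
	pmd.reg_num = 0;		/* placeholder: matching data register */
	pfm_write_pmds(fd, &pmd, 1);

	memset(&load, 0, sizeof(load));
	load.load_pid = getpid();	/* attach to the calling thread */
	pfm_load_context(fd, &load);

	pfm_start(fd, NULL);
	/* run the workload being measured here */
	pfm_stop(fd);

	pfm_read_pmds(fd, &pmd, 1);	/* collect the final count */
	pfm_unload_context(fd);
	close(fd);
	return 0;
}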




--- linux-2.6.17.9.base/perfmon/perfmon_syscalls.c 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.17.9/perfmon/perfmon_syscalls.c 2006-08-21 03:37:46.000000000 -0700
@@ -0,0 +1,712 @@
+/*
+ * perfmon_syscalls.c: perfmon2 system call interface
+ *
+ * This file implements the perfmon2 interface which
+ * provides access to the hardware performance counters
+ * of the host processor.
+ *
+ * The initial version of perfmon.c was written by
+ * Ganesh Venkitachalam, IBM Corp.
+ *
+ * Then it was modified for perfmon-1.x by Stephane Eranian and
+ * David Mosberger, Hewlett Packard Co.
+ *
+ * Version Perfmon-2.x is a complete rewrite of perfmon-1.x
+ * by Stephane Eranian, Hewlett Packard Co.
+ *
+ * Copyright (c) 1999-2006 Hewlett-Packard Development Company, L.P.
+ * Contributed by Stephane Eranian <[email protected]>
+ * David Mosberger-Tang <[email protected]>
+ *
+ * More information about perfmon available at:
+ * http://www.hpl.hp.com/research/linux/perfmon
+ */
+#include <linux/kernel.h>
+#include <linux/perfmon.h>
+#include <linux/fs.h>
+#include <asm/uaccess.h>
+
+struct pfm_context *pfm_get_ctx(int fd)
+{
+ struct file *filp;
+ struct pfm_context *ctx;
+
+ filp = fget(fd);
+ if (unlikely(filp == NULL)) {
+ PFM_DBG("invalid fd %d", fd);
+ return NULL;
+ }
+
+ if (unlikely(filp->f_op != &pfm_file_ops)) {
+ PFM_DBG("fd %d not related to perfmon", fd);
+ fput(filp);
+ return NULL;
+ }
+ ctx = filp->private_data;
+
+ /*
+ * sanity check
+ */
+ if (filp != ctx->filp && ctx->filp) {
+ PFM_DBG("filp is different");
+ }
+
+ /*
+ * update filp
+ */
+ ctx->filp = filp;
+ return ctx;
+}
+
+int pfm_check_task_state(struct pfm_context *ctx, int check_mask,
+ unsigned long *flags)
+{
+ struct task_struct *task;
+ unsigned long local_flags, new_flags;
+ int state, old_state;
+
+recheck:
+ /*
+ * task is NULL for system-wide context
+ */
+ task = ctx->task;
+ state = ctx->state;
+ local_flags = *flags;
+
+ PFM_DBG("state=%d [%d] task_state=%ld check_mask=0x%x",
+ state,
+ task ? task->pid : -1,
+ task ? task->state : -1, check_mask);
+
+ if (state == PFM_CTX_UNLOADED)
+ return 0;
+ /*
+ * no command can operate on a zombie context
+ */
+ if (state == PFM_CTX_ZOMBIE)
+ return -EINVAL;
+
+ /*
+ * at this point, state is PFM_CTX_LOADED or PFM_CTX_MASKED
+ */
+
+ /*
+ * some commands require the context to be unloaded to operate
+ */
+ if (check_mask & PFM_CMD_UNLOADED) {
+ PFM_DBG("state=%d, cmd needs unloaded", state);
+ return -EBUSY;
+ }
+
+ /*
+ * self-monitoring always ok.
+ */
+ if (task == current)
+ return 0;
+
+ /*
+ * for syswide, we accept if running on the cpu the context is bound
+ * to. When monitoring another thread, must wait until stopped.
+ */
+ if (ctx->flags.system) {
+ if (ctx->cpu != smp_processor_id())
+ return -EBUSY;
+ return 0;
+ }
+
+ /*
+ * at this point, monitoring another thread
+ */
+
+ /*
+ * when the context is MASKED, the task need not be stopped;
+ * only pfm_unload_context() falls through to the stopped-state
+ * check below
+ */
+ if (state == PFM_CTX_MASKED && !(check_mask & PFM_CMD_UNLOAD))
+ return 0;
+
+ /*
+ * We could lift this restriction for UP but it would mean that
+ * the user has no guarantee the task would not run between
+ * two successive calls to perfmonctl(). That's probably OK.
+ * If this user wants to ensure the task does not run, then
+ * the task must be stopped.
+ */
+ if (check_mask & PFM_CMD_STOPPED) {
+ if ((task->state != TASK_STOPPED)
+ && (task->state != TASK_TRACED)) {
+ PFM_DBG("[%d] task not in stopped state", task->pid);
+ return -EBUSY;
+ }
+ /*
+ * task is now stopped, wait for ctxsw out
+ *
+ * This is an interesting point in the code.
+ * We need to unprotect the context because
+ * the pfm_ctxswout_thread() routine needs to grab
+ * the same lock. There is a danger in doing
+ * this because it leaves a window open for
+ * another task to get access to the context
+ * and possibly change its state. The one thing
+ * that is not possible is for the context to disappear
+ * because we are protected by the VFS layer, i.e.,
+ * fget()/fput().
+ */
+ old_state = state;
+
+ PFM_DBG("going wait_inactive for [%d] state=%ld flags=0x%lx",
+ task->pid,
+ task->state,
+ local_flags);
+
+ spin_unlock_irqrestore(&ctx->lock, local_flags);
+
+ wait_task_inactive(task);
+
+ spin_lock_irqsave(&ctx->lock, new_flags);
+
+ /*
+ * flags may be different than when we released the lock
+ */
+ *flags = new_flags;
+
+ /*
+ * we must recheck to verify if state has changed
+ */
+ if (ctx->state != old_state) {
+ PFM_DBG("old_state=%d new_state=%d",
+ old_state,
+ ctx->state);
+ goto recheck;
+ }
+ }
+ return 0;
+}
+
+int pfm_get_args(void __user *ureq, size_t sz, size_t max_sz, void *laddr,
+ void **req)
+{
+ void *addr;
+
+ if (sz <= max_sz) {
+ *req = laddr;
+ return copy_from_user(laddr, ureq, sz);
+ }
+
+ if (unlikely(sz > pfm_controls.arg_size_max)) {
+ PFM_DBG("argument too big %zu max=%zu",
+ sz,
+ pfm_controls.arg_size_max);
+ return -E2BIG;
+ }
+
+ addr = kmalloc(sz, GFP_KERNEL);
+ if (unlikely(addr == NULL))
+ return -ENOMEM;
+
+ if (copy_from_user(addr, ureq, sz)) {
+ kfree(addr);
+ return -EFAULT;
+ }
+ *req = addr;
+
+ return 0;
+}
+
+int pfm_get_smpl_arg(pfm_uuid_t uuid, void __user *uaddr, size_t usize, void **arg,
+ struct pfm_smpl_fmt **fmt)
+{
+ struct pfm_smpl_fmt *f;
+ void *addr = NULL;
+ size_t sz;
+ int ret;
+
+ if (!pfm_use_smpl_fmt(uuid))
+ return 0;
+
+ /*
+ * find fmt and increase refcount
+ */
+ f = pfm_smpl_fmt_get(uuid);
+ if (f == NULL) {
+ PFM_DBG("buffer format not found");
+ return -EINVAL;
+ }
+
+ sz = f->fmt_arg_size;
+
+ /*
+ * usize = -1 is for IA-64 backward compatibility
+ */
+ ret = -EINVAL;
+ if (sz != usize && usize != -1) {
+ PFM_DBG("invalid arg size %zu, format expects %zu",
+ usize, sz);
+ goto error;
+ }
+
+ ret = -ENOMEM;
+ addr = kmalloc(sz, GFP_KERNEL);
+ if (addr == NULL)
+ goto error;
+
+ ret = -EFAULT;
+ if (copy_from_user(addr, uaddr, sz))
+ goto error;
+
+ *arg = addr;
+ *fmt = f;
+ return 0;
+
+error:
+ kfree(addr);
+ pfm_smpl_fmt_put(f);
+ return ret;
+}
+
+/*
+ * function invoked in case pfm_context_create() fails
+ * at the last operation, copy_to_user(). It needs to
+ * undo memory allocations and free the file descriptor
+ */
+#ifndef CONFIG_IA64_PERFMON_COMPAT
+static
+#endif
+void pfm_undo_create_context_fd(int fd, struct pfm_context *ctx)
+{
+ struct files_struct *files = current->files;
+ struct file *file;
+
+ file = fget(fd);
+ /*
+ * there is no fd_uninstall(), so we do it
+ * here. put_unused_fd() does not remove the
+ * effect of fd_install().
+ */
+
+ spin_lock(&files->file_lock);
+ files->fd_array[fd] = NULL;
+ spin_unlock(&files->file_lock);
+
+ /*
+ * undo the fget()
+ */
+ fput(file);
+
+ /*
+ * decrement ref count and kill file
+ */
+ put_filp(file);
+
+ put_unused_fd(fd);
+
+ pfm_context_free(ctx);
+}
+
+asmlinkage long sys_pfm_create_context(struct pfarg_ctx __user *ureq,
+ void __user *uarg, size_t smpl_size)
+{
+ struct pfarg_ctx req;
+ struct pfm_context *new_ctx;
+ struct pfm_smpl_fmt *fmt = NULL;
+ void *smpl_arg = NULL;
+ int ret;
+
+ if (copy_from_user(&req, ureq, sizeof(req)))
+ return -EFAULT;
+
+ ret = pfm_get_smpl_arg(req.ctx_smpl_buf_id, uarg, smpl_size,
+ &smpl_arg, &fmt);
+ if (ret)
+ goto abort;
+
+ ret = __pfm_create_context(&req, fmt, smpl_arg, PFM_NORMAL, NULL, &new_ctx);
+
+ /*
+ * copy_user return value overrides command return value
+ */
+ if (!ret) {
+ if (copy_to_user(ureq, &req, sizeof(req))) {
+ pfm_undo_create_context_fd(req.ctx_fd, new_ctx);
+ ret = -EFAULT;
+ }
+ }
+ kfree(smpl_arg);
+abort:
+ return ret;
+}
+
+asmlinkage long sys_pfm_write_pmcs(int fd, struct pfarg_pmc __user *ureq, int count)
+{
+ struct pfm_context *ctx;
+ struct pfarg_pmc pmcs[PFM_PMC_STK_ARG];
+ struct pfarg_pmc *req;
+ unsigned long flags;
+ size_t sz;
+ int ret;
+
+ if (count < 0)
+ return -EINVAL;
+
+ ctx = pfm_get_ctx(fd);
+ if (unlikely(ctx == NULL))
+ return -EBADF;
+
+ sz = count*sizeof(*ureq);
+
+ ret = pfm_get_args(ureq, sz, sizeof(pmcs), pmcs, (void **)&req);
+ if (ret)
+ goto error;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+ if (!ret)
+ ret = __pfm_write_pmcs(ctx, req, count);
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ if (copy_to_user(ureq, req, sz))
+ ret = -EFAULT;
+
+ if (count > PFM_PMC_STK_ARG)
+ kfree(req);
+error:
+ pfm_put_ctx(ctx);
+
+ return ret;
+}
+
+asmlinkage long sys_pfm_write_pmds(int fd, struct pfarg_pmd __user *ureq, int count)
+{
+ struct pfm_context *ctx;
+ struct pfarg_pmd pmds[PFM_PMD_STK_ARG];
+ struct pfarg_pmd *req;
+ unsigned long flags;
+ size_t sz;
+ int ret;
+
+ if (count < 0)
+ return -EINVAL;
+
+ ctx = pfm_get_ctx(fd);
+ if (unlikely(ctx == NULL))
+ return -EBADF;
+
+ sz = count*sizeof(*ureq);
+
+ ret = pfm_get_args(ureq, sz, sizeof(pmds), pmds, (void **)&req);
+ if (ret)
+ goto error;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+ if (!ret)
+ ret = __pfm_write_pmds(ctx, req, count, 0);
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ if (copy_to_user(ureq, req, sz))
+ ret = -EFAULT;
+
+ if (count > PFM_PMD_STK_ARG)
+ kfree(req);
+error:
+ pfm_put_ctx(ctx);
+
+ return ret;
+}
+
+asmlinkage long sys_pfm_read_pmds(int fd, struct pfarg_pmd __user *ureq, int count)
+{
+ struct pfm_context *ctx;
+ struct pfarg_pmd pmds[PFM_PMD_STK_ARG];
+ struct pfarg_pmd *req;
+ unsigned long flags;
+ size_t sz;
+ int ret;
+
+ if (count < 0)
+ return -EINVAL;
+
+ ctx = pfm_get_ctx(fd);
+ if (unlikely(ctx == NULL))
+ return -EBADF;
+
+ sz = count*sizeof(*ureq);
+
+ ret = pfm_get_args(ureq, sz, sizeof(pmds), pmds, (void **)&req);
+ if (ret)
+ goto error;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+ if (!ret)
+ ret = __pfm_read_pmds(ctx, req, count);
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ if (copy_to_user(ureq, req, sz))
+ ret = -EFAULT;
+
+ if (count > PFM_PMD_STK_ARG)
+ kfree(req);
+error:
+ pfm_put_ctx(ctx);
+
+ return ret;
+}
+
+asmlinkage long sys_pfm_restart(int fd)
+{
+ struct pfm_context *ctx;
+ unsigned long flags;
+ int ret = 0;
+
+ ctx = pfm_get_ctx(fd);
+ if (unlikely(ctx == NULL))
+ return -EBADF;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, 0, &flags);
+ if (!ret)
+ ret = __pfm_restart(ctx);
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ pfm_put_ctx(ctx);
+
+ return ret;
+}
+
+
+asmlinkage long sys_pfm_stop(int fd)
+{
+ struct pfm_context *ctx;
+ unsigned long flags;
+ int ret;
+
+ ctx = pfm_get_ctx(fd);
+ if (unlikely(ctx == NULL))
+ return -EBADF;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+ if (!ret)
+ ret = __pfm_stop(ctx);
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ pfm_put_ctx(ctx);
+
+ return ret;
+}
+
+asmlinkage long sys_pfm_start(int fd, struct pfarg_start __user *ureq)
+{
+ struct pfm_context *ctx;
+ struct pfarg_start req;
+ unsigned long flags;
+ int ret = 0;
+
+ ctx = pfm_get_ctx(fd);
+ if (ctx == NULL)
+ return -EBADF;
+
+ /*
+ * the one argument is actually optional
+ */
+ if (ureq && copy_from_user(&req, ureq, sizeof(req)))
+ return -EFAULT;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+ if (!ret)
+ ret = __pfm_start(ctx, ureq ? &req : NULL);
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ pfm_put_ctx(ctx);
+
+ return ret;
+}
+
+
+
+asmlinkage long sys_pfm_load_context(int fd, struct pfarg_load __user *ureq)
+{
+ struct pfm_context *ctx;
+ unsigned long flags;
+ struct pfarg_load req;
+ int ret;
+
+ ctx = pfm_get_ctx(fd);
+ if (ctx == NULL)
+ return -EBADF;
+
+ if (copy_from_user(&req, ureq, sizeof(req)))
+ return -EFAULT;
+
+ /*
+ * irqsave is required to avoid race in case context is already
+ * loaded or with switch timeout in the case of self-monitoring
+ */
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
+ if (!ret)
+ ret = __pfm_load_context(ctx, &req);
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ pfm_put_ctx(ctx);
+
+ return ret;
+}
+
+
+asmlinkage long sys_pfm_unload_context(int fd)
+{
+ struct pfm_context *ctx;
+ unsigned long flags;
+ int ret = 0;
+
+ ctx = pfm_get_ctx(fd);
+ if (ctx == NULL)
+ return -EBADF;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED|PFM_CMD_UNLOAD, &flags);
+ if (!ret)
+ ret = __pfm_unload_context(ctx, 0);
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ pfm_put_ctx(ctx);
+
+ return ret;
+}
+
+asmlinkage long sys_pfm_delete_evtsets(int fd, struct pfarg_setinfo __user *ureq, int count)
+{
+ struct pfm_context *ctx;
+ struct pfarg_setinfo *req;
+ unsigned long flags;
+ size_t sz;
+ int ret;
+
+ if (count < 0)
+ return -EINVAL;
+
+ ctx = pfm_get_ctx(fd);
+ if (ctx == NULL)
+ return -EBADF;
+
+ sz = count*sizeof(*ureq);
+
+ ret = pfm_get_args(ureq, sz, 0, NULL, (void **)&req);
+ if (ret)
+ goto error;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, PFM_CMD_UNLOADED, &flags);
+ if (!ret)
+ ret = __pfm_delete_evtsets(ctx, req, count);
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ if (copy_to_user(ureq, req, sz))
+ ret = -EFAULT;
+
+ kfree(req);
+
+error:
+ pfm_put_ctx(ctx);
+
+ return ret;
+}
+
+asmlinkage long sys_pfm_create_evtsets(int fd, struct pfarg_setdesc __user *ureq, int count)
+{
+ struct pfm_context *ctx;
+ struct pfarg_setdesc *req;
+ unsigned long flags;
+ size_t sz;
+ int ret;
+
+ if (count < 0)
+ return -EINVAL;
+
+ ctx = pfm_get_ctx(fd);
+ if (ctx == NULL)
+ return -EBADF;
+
+ sz = count*sizeof(*ureq);
+
+ ret = pfm_get_args(ureq, sz, 0, NULL, (void **)&req);
+ if (ret)
+ goto error;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, PFM_CMD_UNLOADED, &flags);
+ if (!ret)
+ ret = __pfm_create_evtsets(ctx, req, count);
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ if (copy_to_user(ureq, req, sz))
+ ret = -EFAULT;
+
+ kfree(req);
+
+error:
+ pfm_put_ctx(ctx);
+
+ return ret;
+}
+
+asmlinkage long sys_pfm_getinfo_evtsets(int fd, struct pfarg_setinfo __user *ureq, int count)
+{
+ struct pfm_context *ctx;
+ struct pfarg_setinfo *req;
+ unsigned long flags;
+ size_t sz;
+ int ret;
+
+ if (count < 0)
+ return -EINVAL;
+
+ ctx = pfm_get_ctx(fd);
+ if (ctx == NULL)
+ return -EBADF;
+
+ sz = count*sizeof(*ureq);
+
+ ret = pfm_get_args(ureq, sz, 0, NULL, (void **)&req);
+ if (ret)
+ goto error;
+
+ spin_lock_irqsave(&ctx->lock, flags);
+
+ ret = pfm_check_task_state(ctx, 0, &flags);
+ if (!ret)
+ ret = __pfm_getinfo_evtsets(ctx, req, count);
+
+ spin_unlock_irqrestore(&ctx->lock, flags);
+
+ if (copy_to_user(ureq, req, sz))
+ ret = -EFAULT;
+
+ kfree(req);
+error:
+ pfm_put_ctx(ctx);
+
+ return ret;
+}


2006-08-23 22:24:30

by Andrew Morton

Subject: Re: [PATCH 4/18] 2.6.17.9 perfmon2 patch for review: new system calls support

On Wed, 23 Aug 2006 01:05:55 -0700
Stephane Eranian <[email protected]> wrote:

> This patch contains the perfmon2 system call interface.
>
> The interface consists of 12 new system calls. The front-end
> of each system call is implemented in perfmon_syscalls.c.
> The front-end takes care of copying the parameters into
> kernel structures and also verifies that the perfmon state
> is appropriate for each command. The back-end of each syscall
> is implemented either in the core (perfmon.c) or in a feature-
> specific file (e.g. perfmon_sets.c).
>
> The system calls are defined as follows:
>
> sys_pfm_create_context():
> - create a new perfmon2 context and return a file descriptor in
> the pfarg_ctx_t parameter. This is the first call an application
> must make to do monitoring
>
> sys_pfm_write_pmcs():
> - program the PMU configuration registers. Accepts a vector of arguments
> of type pfarg_pmc_t
>
> sys_pfm_write_pmds():
> - program the PMU data registers. Accepts a vector of arguments of type
> pfarg_pmd_t
>
> sys_pfm_read_pmds():
> - read the PMU data registers. Accepts a vector of arguments of type
> pfarg_pmd_t
>
> sys_pfm_restart():
> - indicate that the application is done processing an overflow notification
>
> sys_pfm_start():
> - start monitoring
>
> sys_pfm_stop():
> - stop monitoring
>
> sys_pfm_load_context():
> - attach a perfmon2 context to a task or the current processor.
>
> sys_pfm_unload_context():
> - detach the perfmon2 context
>
> sys_pfm_create_evtsets():
> - create or change event sets. By default, a context is created with only one
> set
>
> sys_pfm_delete_evtsets():
> - delete any explicitly created event set
>
> sys_pfm_getinfo_evtsets():
> - get information about event sets, such as the number of activations. Accepts
> a vector of arguments of type pfarg_setinfo_t
>
> There are other, more indirect system calls related to the fact that a context uses a file
> descriptor. Those system calls are implemented in perfmon_file.c and are part of another patch.
>

This code does quite a lot of worrisome poking around inside task lifetime
internals.

Perhaps you could describe what problems are being solved here so we can
take a closer look at whether this is the best way in which to solve them?

> + /*
> + * for syswide, we accept if running on the cpu the context is bound
> + * to. When monitoring another thread, must wait until stopped.
> + */
> + if (ctx->flags.system) {
> + if (ctx->cpu != smp_processor_id())
> + return -EBUSY;
> + return 0;
> + }

Hopefully we're running atomically here. Inside preempt_disable().

> +
> + PFM_DBG("going wait_inactive for [%d] state=%ld flags=0x%lx",
> + task->pid,
> + task->state,
> + local_flags);
> +
> + spin_unlock_irqrestore(&ctx->lock, local_flags);
> +
> + wait_task_inactive(task);
> +
> + spin_lock_irqsave(&ctx->lock, new_flags);

This sort of thing..

> +
> +int pfm_get_args(void __user *ureq, size_t sz, size_t max_sz, void *laddr,
> + void **req)
> +{
> + void *addr;
> +
> + if (sz <= max_sz) {
> + *req = laddr;
> + return copy_from_user(laddr, ureq, sz);
> + }
> +
> + if (unlikely(sz > pfm_controls.arg_size_max)) {
> + PFM_DBG("argument too big %zu max=%zu",
> + sz,
> + pfm_controls.arg_size_max);
> + return -E2BIG;
> + }
> +
> + addr = kmalloc(sz, GFP_KERNEL);
> + if (unlikely(addr == NULL))
> + return -ENOMEM;
> +
> + if (copy_from_user(addr, ureq, sz)) {
> + kfree(addr);
> + return -EFAULT;
> + }
> + *req = addr;
> +
> + return 0;
> +}

Did you really want to return the copy_from_user() return value here? Or
should it be returning

copy_from_user(laddr, ureq, sz) ? -EFAULT : 0;

> +/*
> + * function invoked in case, pfm_context_create fails
> + * at the last operation, copy_to_user. It needs to
> + * undo memory allocations and free the file descriptor
> + */
> +#ifndef CONFIG_IA64_PERFMON_COMPAT
> +static
> +#endif
> +void pfm_undo_create_context_fd(int fd, struct pfm_context *ctx)
> +{
> + struct files_struct *files = current->files;
> + struct file *file;
> +
> + file = fget(fd);
> + /*
> + * there is no fd_uninstall(), so we do it
> + * here. put_unused_fd() does not remove the
> + * effect of fd_install().
> + */
> +
> + spin_lock(&files->file_lock);
> + files->fd_array[fd] = NULL;
> + spin_unlock(&files->file_lock);
> +
> + /*
> + * undo the fget()
> + */
> + fput(file);
> +
> + /*
> + * decrement ref count and kill file
> + */
> + put_filp(file);
> +
> + put_unused_fd(fd);
> +
> + pfm_context_free(ctx);
> +}

There are a few places in here which could use fget_light().

> +asmlinkage long sys_pfm_write_pmcs(int fd, struct pfarg_pmc __user *ureq, int count)
> +{
> + struct pfm_context *ctx;
> + struct pfarg_pmc pmcs[PFM_PMC_STK_ARG];
> + struct pfarg_pmc *req;
> + unsigned long flags;
> + size_t sz;
> + int ret;
> +
> + if (count < 0)
> + return -EINVAL;
> +
> + ctx = pfm_get_ctx(fd);
> + if (unlikely(ctx == NULL))
> + return -EBADF;
> +
> + sz = count*sizeof(*ureq);

I'm worried about multiplication overflow here. A large value of `count'
can cause `count*sizeof(*ureq)' to yield a small positive result. It
appears that very bad things might happen.
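
One conventional guard (an illustration, not necessarily how the patch should
fix it) bounds the count against the largest representable size before
multiplying:

	/* sketch: reject any count whose byte size would overflow size_t */
	if (count < 0 || (size_t)count > ((size_t)-1) / sizeof(*ureq))
		return -EINVAL;
	sz = count * sizeof(*ureq);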

> + ret = pfm_get_args(ureq, sz, sizeof(pmcs), pmcs, (void **)&req);
> + if (ret)
> + goto error;
> +
> + spin_lock_irqsave(&ctx->lock, flags);
> +
> + ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
> + if (!ret)
> + ret = __pfm_write_pmcs(ctx, req, count);
> +
> + spin_unlock_irqrestore(&ctx->lock, flags);
> +
> + if (copy_to_user(ureq, req, sz))
> + ret = -EFAULT;
> +
> + if (count > PFM_PMC_STK_ARG)
> + kfree(req);

That's rather obscure. This code is basically paralleling the arithmetic in
pfm_get_args(). I'd suggest that it would be cleaner to pass another arg
to pfm_get_args(): void **pointer_to_free, and fill that in if
pfm_get_args() did kmalloc:

ret = pfm_get_args(ureq, sz, sizeof(pmcs), pmcs, (void **)&req,
&pointer_to_free);
...
kfree(pointer_to_free);
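
Spelled out, that suggestion might look like this (a sketch only; it also
folds in the copy_from_user() return-value fix noted earlier):

int pfm_get_args(void __user *ureq, size_t sz, size_t max_sz, void *laddr,
		 void **req, void **ptr_to_free)
{
	void *addr;

	*ptr_to_free = NULL;	/* caller may kfree() this unconditionally */

	if (sz <= max_sz) {
		*req = laddr;
		return copy_from_user(laddr, ureq, sz) ? -EFAULT : 0;
	}

	if (sz > pfm_controls.arg_size_max)
		return -E2BIG;

	addr = kmalloc(sz, GFP_KERNEL);
	if (addr == NULL)
		return -ENOMEM;

	if (copy_from_user(addr, ureq, sz)) {
		kfree(addr);
		return -EFAULT;
	}
	*req = *ptr_to_free = addr;
	return 0;
}

Every caller then ends with an unconditional kfree(pointer_to_free), which is
safe because kfree(NULL) is a no-op.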

> +asmlinkage long sys_pfm_write_pmds(int fd, struct pfarg_pmd __user *ureq, int count)
> +{
> + struct pfm_context *ctx;
> + struct pfarg_pmd pmds[PFM_PMD_STK_ARG];
> + struct pfarg_pmd *req;
> + unsigned long flags;
> + size_t sz;
> + int ret;
> +
> + if (count < 0)
> + return -EINVAL;
> +
> + ctx = pfm_get_ctx(fd);
> + if (unlikely(ctx == NULL))
> + return -EBADF;
> +
> + sz = count*sizeof(*ureq);

Please check all the syscalls for multiplication overflow.

> +asmlinkage long sys_pfm_read_pmds(int fd, struct pfarg_pmd __user *ureq, int count)
> +{
> + struct pfm_context *ctx;
> + struct pfarg_pmd pmds[PFM_PMD_STK_ARG];
> + struct pfarg_pmd *req;
> + unsigned long flags;
> + size_t sz;
> + int ret;
> +
> + if (count < 0)
> + return -EINVAL;
> +
> + ctx = pfm_get_ctx(fd);
> + if (unlikely(ctx == NULL))
> + return -EBADF;
> +
> + sz = count*sizeof(*ureq);
> +
> + ret = pfm_get_args(ureq, sz, sizeof(pmds), pmds, (void **)&req);
> + if (ret)
> + goto error;
> +
> + spin_lock_irqsave(&ctx->lock, flags);
> +
> + ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
> + if (!ret)
> + ret = __pfm_read_pmds(ctx, req, count);
> +
> + spin_unlock_irqrestore(&ctx->lock, flags);
> +
> + if (copy_to_user(ureq, req, sz))
> + ret = -EFAULT;

There's a risk here that if pfm_check_task_state() returned false, we just
copied a hunk of uninitialised kernel memory out to userspace.

AFAICT that won't happen because the memory at *req was also copied _in_
from userspace. But this idiom is all over the place in this patch and I'd
like you to say that this is all expected, designed-for and will be forever
safe.

> + if (count > PFM_PMD_STK_ARG)
> + kfree(req);
> +error:
> + pfm_put_ctx(ctx);
> +
> + return ret;
> +}
> +
>
> ...
>
> +asmlinkage long sys_pfm_stop(int fd)

None of these syscalls are documented. Where does one go to find the API
description?

> +{
> + struct pfm_context *ctx;
> + unsigned long flags;
> + int ret;
> +
> + ctx = pfm_get_ctx(fd);
> + if (unlikely(ctx == NULL))
> + return -EBADF;
> +
> + spin_lock_irqsave(&ctx->lock, flags);
> +
> + ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
> + if (!ret)
> + ret = __pfm_stop(ctx);
> +
> + spin_unlock_irqrestore(&ctx->lock, flags);
> +
> + pfm_put_ctx(ctx);
> +
> + return ret;
> +}
> +
>
> ...
>
> +asmlinkage long sys_pfm_create_evtsets(int fd, struct pfarg_setdesc __user *ureq, int count)
> +{
> + struct pfm_context *ctx;
> + struct pfarg_setdesc *req;
> + unsigned long flags;
> + size_t sz;
> + int ret;
> +
> + if (count < 0)
> + return -EINVAL;
> +
> + ctx = pfm_get_ctx(fd);
> + if (ctx == NULL)
> + return -EBADF;
> +
> + sz = count*sizeof(*ureq);
> +
> + ret = pfm_get_args(ureq, sz, 0, NULL, (void **)&req);
> + if (ret)
> + goto error;
> +
> + spin_lock_irqsave(&ctx->lock, flags);
> +
> + ret = pfm_check_task_state(ctx, PFM_CMD_UNLOADED, &flags);
> + if (!ret)
> + ret = __pfm_create_evtsets(ctx, req, count);
> +
> + spin_unlock_irqrestore(&ctx->lock, flags);
> +
> + if (copy_to_user(ureq, req, sz))
> + ret = -EFAULT;
> +
> + kfree(req);
> +
> +error:
> + pfm_put_ctx(ctx);
> +
> + return ret;
> +}

When copying a struct from kernel to userspace we must beware of
compiler-inserted padding. Because that can cause the kernel to leak
a few bytes of uninitialised kernel memory.

Unless `struct pfarg_setdesc' is very carefully designed and very simply
laid out it might be best just to zero out the kernel memory in
pfm_get_args().

2006-08-24 20:23:39

by Stephane Eranian

Subject: Re: [PATCH 4/18] 2.6.17.9 perfmon2 patch for review: new system calls support

Andrew,

On Wed, Aug 23, 2006 at 03:14:39PM -0700, Andrew Morton wrote:
> > sys_pfm_load_context():
> > - attach a perfmon2 context to a task or the current processor.
> >
> > sys_pfm_unload_context():
> > - detach the perfmon2 context
> >
> > sys_pfm_create_evtsets():
> > - create or change event sets. By default, a context is created with only one
> > set
> >
> > sys_pfm_delete_evtsets():
> > - delete any explicitly created event set
> >
> > sys_pfm_getinfo_evtsets():
> > - get information about event sets, such as the number of activations. Accepts
> > a vector of arguments of type pfarg_setinfo_t
> >
> > There are other, more indirect system calls related to the fact that a context uses a file
> > descriptor. Those system calls are implemented in perfmon_file.c and are part of another patch.
> >
>
> This code does quite a lot of worrisome poking around inside task lifetime
> internals.
>
> Perhaps you could describe what problems are being solved here so we can
> take a closer look at whether this is the best way in which to solve them?
>
Sure, let me try to explain why we have to do this.

Each system call represents an action to perform on the perfmon context, e.g.,
read/write the registers, start/stop monitoring. A context is either system-wide, i.e.,
bound to a CPU, or per-thread, i.e., attached to a task_struct. A context is
never attached automatically on creation; it must be attached via pfm_load_context().

A context can be in different states. Depending on the state, certain actions may not
be allowed; for instance, you cannot start monitoring if the context is not
attached.

For a system-wide context, the caller must be running on the CPU being monitored for
certain actions, such as reading and writing the registers, i.e., we do not use IPIs.

For a per-thread context, and when you are not monitoring yourself, the thread you
want to operate on MUST be stopped in order to access its machine state.

The pfm_check_task_state() function performs all those nasty tests. It runs with
the context locked and interrupts masked.

The tricky part is when you want to operate on another task. As I said, it must
be stopped and OFF the CPU to guarantee its PMU state has been saved. So we
first check the task state; if it is STOPPED, we have to wait until it is actually
off the CPU using wait_task_inactive(), which needs to run with interrupts unmasked.
So we unlock the context and unmask interrupts in order to wait. The context cannot
disappear from under us because the file descriptor reference count protects us.
We take care of propagating the flags value from spin_lock_irqsave() to our caller.

This is not very pretty, and I am open to any better suggestion you may have to perform
this check.



More on the rest of your comments on this patch later...

Thanks for your feedback.

> > + /*
> > + * for syswide, we accept if running on the cpu the context is bound
> > + * to. When monitoring another thread, must wait until stopped.
> > + */
> > + if (ctx->flags.system) {
> > + if (ctx->cpu != smp_processor_id())
> > + return -EBUSY;
> > + return 0;
> > + }
>
> Hopefully we're running atomically here. Inside preempt_disable().
>
> > +
> > + PFM_DBG("going wait_inactive for [%d] state=%ld flags=0x%lx",
> > + task->pid,
> > + task->state,
> > + local_flags);
> > +
> > + spin_unlock_irqrestore(&ctx->lock, local_flags);
> > +
> > + wait_task_inactive(task);
> > +
> > + spin_lock_irqsave(&ctx->lock, new_flags);
>
> This sort of thing..
>

2006-08-29 17:10:27

by Stephane Eranian

Subject: Re: [PATCH 4/18] 2.6.17.9 perfmon2 patch for review: new system calls support

Andrew,

On Wed, Aug 23, 2006 at 03:14:39PM -0700, Andrew Morton wrote:
> >
> > sys_pfm_getinfo_evtsets():
> > - get information about event sets, such as the number of activations. Accepts
> > a vector of arguments of type pfarg_setinfo_t
> >
> > There are other, more indirect system calls related to the fact that a context uses a file
> > descriptor. Those system calls are implemented in perfmon_file.c and are part of another patch.
> >
>
> This code does quite a lot of worrisome poking around inside task lifetime
> internals.
>
The only thing that needs to be checked for certain commands is task->state.
As I explained in an earlier reply, we can only operate on a task, to modify
its PMU state (effectively its machine state), when the task is not running.
For that we need to check its state. We enforce that the task must be STOPPED
(which you can do via ptrace, for instance). Being runnable and off any CPU
is not enough, as it may be scheduled again while we operate on it. Also, this
would not set any bound on when it would be scheduled out if it is running
by the time we come to check its state.


> Perhaps you could describe what problems are being solved here so we can
> take a closer look at whether this is the best way in which to solve them?
>
> > + /*
> > + * for syswide, we accept if running on the cpu the context is bound
> > + * to. When monitoring another thread, must wait until stopped.
> > + */
> > + if (ctx->flags.system) {
> > + if (ctx->cpu != smp_processor_id())
> > + return -EBUSY;
> > + return 0;
> > + }
>
> Hopefully we're running atomically here. Inside preempt_disable().
>
We are running inside spin_lock_irqsave(). I would assume this is sufficient
to prevent preemption.

> > +
> > + PFM_DBG("going wait_inactive for [%d] state=%ld flags=0x%lx",
> > + task->pid,
> > + task->state,
> > + local_flags);
> > +
> > + spin_unlock_irqrestore(&ctx->lock, local_flags);
> > +
> > + wait_task_inactive(task);
> > +
> > + spin_lock_irqsave(&ctx->lock, new_flags);
>
> This sort of thing..

We need to wait until the task is effectively off the CPU, i.e., with its
machine state (incl. PMU) saved. When we come back, we re-run the series of tests.
This applies only to per-thread contexts, therefore it is not affected by smp_processor_id().


> > +asmlinkage long sys_pfm_write_pmcs(int fd, struct pfarg_pmc __user *ureq, int count)
> > +{
> > + struct pfm_context *ctx;
> > + struct pfarg_pmc pmcs[PFM_PMC_STK_ARG];
> > + struct pfarg_pmc *req;
> > + unsigned long flags;
> > + size_t sz;
> > + int ret;
> > +
> > + if (count < 0)
> > + return -EINVAL;
> > +
> > + ctx = pfm_get_ctx(fd);
> > + if (unlikely(ctx == NULL))
> > + return -EBADF;
> > +
> > + sz = count*sizeof(*ureq);
>
> I'm worried about multiplication overflow here. A large value of `count'
> can cause `count*sizeof(*ureq)' to yield a small positive result. It
> appears that very bad things might happen.
>
I have fixed that now.

> > +asmlinkage long sys_pfm_write_pmds(int fd, struct pfarg_pmd __user *ureq, int count)
> > +{
> > + struct pfm_context *ctx;
> > + struct pfarg_pmd pmds[PFM_PMD_STK_ARG];
> > + struct pfarg_pmd *req;
> > + unsigned long flags;
> > + size_t sz;
> > + int ret;
> > +
> > + if (count < 0)
> > + return -EINVAL;
> > +
> > + ctx = pfm_get_ctx(fd);
> > + if (unlikely(ctx == NULL))
> > + return -EBADF;
> > +
> > + sz = count*sizeof(*ureq);
>
> Please check all the syscalls for multiplication overflow.
>
I fixed all of them now.



> > +asmlinkage long sys_pfm_read_pmds(int fd, struct pfarg_pmd __user *ureq, int count)
> > +{
> > + struct pfm_context *ctx;
> > + struct pfarg_pmd pmds[PFM_PMD_STK_ARG];
> > + struct pfarg_pmd *req;
> > + unsigned long flags;
> > + size_t sz;
> > + int ret;
> > +
> > + if (count < 0)
> > + return -EINVAL;
> > +
> > + ctx = pfm_get_ctx(fd);
> > + if (unlikely(ctx == NULL))
> > + return -EBADF;
> > +
> > + sz = count*sizeof(*ureq);
> > +
> > + ret = pfm_get_args(ureq, sz, sizeof(pmds), pmds, (void **)&req);
> > + if (ret)
> > + goto error;
> > +
> > + spin_lock_irqsave(&ctx->lock, flags);
> > +
> > + ret = pfm_check_task_state(ctx, PFM_CMD_STOPPED, &flags);
> > + if (!ret)
> > + ret = __pfm_read_pmds(ctx, req, count);
> > +
> > + spin_unlock_irqrestore(&ctx->lock, flags);
> > +
> > + if (copy_to_user(ureq, req, sz))
> > + ret = -EFAULT;
>
> There's a risk here that if pfm_check_task_state() returned false, we just
> copied a hunk of uninitialised kernel memory out to userspace.
>
> AFAICT that won't happen because the memory at *req was also copied _in_
> from userspace. But this idiom is all over the place in this patch and I'd
> like you to say that this is all expected, designed-for and will be forever
> safe.

Yes, that will not happen, for the exact reason you are presenting. This is
the expected behavior. In this case, you get back what you passed in. Notice
also that the copy_to_user() error code takes over the return value of
any preceding calls. That is simply because if you cannot communicate through
the buffer, then you have a bigger problem than, for instance, trying to operate
on a task that is running.

> > + if (count > PFM_PMD_STK_ARG)
> > + kfree(req);
> > +error:
> > + pfm_put_ctx(ctx);
> > +
> > + return ret;
> > +}
> > +
> >
> > ...
> >
> > +asmlinkage long sys_pfm_stop(int fd)
>
> None of these syscalls are documented. Where does one go to find the API
> description?
>
There exists an older description of the interface, but it does not cover sets and multiplexing.
I am certainly planning on publishing a full specification.

>
> When copying a struct from kernel to userspace we must beware of
> compiler-inserted padding. Because that can cause the kernel to leak
> a few bytes of uninitialised kernel memory.

We are copying out exactly the same amount of data that was passed in.

Are you suggesting that copy_from/copy_to may copy more data?

> Unless `struct pfarg_setdesc' is very carefully designed and very simply
> laid out it might be best just to zero out the kernel memory in
> pfm_get_args().

I have tried to have every field aligned properly in both 32-bit and 64-bit modes.
Keep in mind that we are ABI-compatible between 32-bit and 64-bit modes on x86
for all perfmon2 system calls.

Thanks.

--
-Stephane

2006-08-29 17:32:54

by Andrew Morton

Subject: Re: [PATCH 4/18] 2.6.17.9 perfmon2 patch for review: new system calls support

On Tue, 29 Aug 2006 09:59:57 -0700
Stephane Eranian <[email protected]> wrote:

> > > +
> > > + PFM_DBG("going wait_inactive for [%d] state=%ld flags=0x%lx",
> > > + task->pid,
> > > + task->state,
> > > + local_flags);
> > > +
> > > + spin_unlock_irqrestore(&ctx->lock, local_flags);
> > > +
> > > + wait_task_inactive(task);
> > > +
> > > + spin_lock_irqsave(&ctx->lock, new_flags);
> >
> > This sort of thing..
>
> We need to wait until the task is effectively off the CPU, i.e., with its
> machine state (incl PMU) saved. When we come back we re-run the series of tests.
> This applies only to per-thread, therefore it is not affected by smp_processor_id().
>

Generally, if a reviewer asks a question and the developer answers that
question, this is a sign that a code comment is needed. Because others are
likely to ask themselves the same question ;)

>
> ..
>
> >
> > When copying a struct from kernel to userspace we must beware of
> > compiler-inserted padding. Because that can cause the kernel to leak
> > a few bytes of uninitialised kernel memory.
>
> We are copying out exactly the same amount of data that was passed in.
>
> Are you suggesting that copy_from/copy_to may copy more data?

No, that's OK. I was pointing out the problem in situations like this:

struct foo {
	u32 a;
	u64 b;
};

{
	struct foo f;

	f.a = 0;
	f.b = 0;
	copy_to_user(p, &f, sizeof(f));
}

which exposes kernel memory. As you appear to be confident that the
perfmon code can't do this, all is OK.
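
For the record, the usual defensive idiom (shown here for illustration) is to
clear the whole structure, padding included, before filling in the fields:

	struct foo f;

	memset(&f, 0, sizeof(f));	/* zeroes the inter-field padding too */
	f.a = 0;
	f.b = 0;
	copy_to_user(p, &f, sizeof(f));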


2006-08-29 22:25:38

by Stephane Eranian

Subject: Re: [PATCH 4/18] 2.6.17.9 perfmon2 patch for review: new system calls support

On Tue, Aug 29, 2006 at 10:26:27AM -0700, Andrew Morton wrote:
> On Tue, 29 Aug 2006 09:59:57 -0700
> Stephane Eranian <[email protected]> wrote:
>
> > > > +
> > > > + PFM_DBG("going wait_inactive for [%d] state=%ld flags=0x%lx",
> > > > + task->pid,
> > > > + task->state,
> > > > + local_flags);
> > > > +
> > > > + spin_unlock_irqrestore(&ctx->lock, local_flags);
> > > > +
> > > > + wait_task_inactive(task);
> > > > +
> > > > + spin_lock_irqsave(&ctx->lock, new_flags);
> > >
> > > This sort of thing..
> >
> > We need to wait until the task is effectively off the CPU, i.e., with its
> > machine state (incl PMU) saved. When we come back we re-run the series of tests.
> > This applies only to per-thread, therefore it is not affected by smp_processor_id().
> >
>
> Generally, if a reviewer asks a question and the developer answers that
> question, this is a sign that a code comment is needed. Because others are
> likely to ask themselves the same question ;)

Understood. I have added a comment about what is going on there.

Now, I don't particularly like what we have to do here. You may have a better
solution to my problem:
- how, inside the kernel, can I make sure a thread is stopped and off any CPU?

- if the thread is not stopped, how do I go about forcing it to stop, and how
can I get notified when it is eventually off any CPU?

- how do I make sure it cannot wake up until I am done with it?

> > >
> > > When copying a struct from kernel to userspace we must beware of
> > > compiler-inserted padding. Because that can cause the kernel to leak
> > > a few bytes of uninitialised kernel memory.
> >
> > We are copying out exactly the same amount of data that was passed in.
> >
> > Are you suggesting that copy_from/copy_to may copy more data?
>
> No, that's OK. I was pointing out the problem in situations like this:
>
> struct foo {
> 	u32 a;
> 	u64 b;
> };
>
> {
> 	struct foo f;
>
> 	f.a = 0;
> 	f.b = 0;
> 	copy_to_user(p, &f, sizeof(f));
> }
>
> which exposes kernel memory. As you appear to be confident that the
> perfmon code can't do this, all is OK.

Ok, now I understand. I don't think I have that, but I will write a test program
and run it on all architectures to ensure this is indeed the case.
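
Such a test can be as simple as printing structure sizes and field offsets
under each ABI and diffing the output. A sketch, using struct pfarg_setdesc as
a stand-in (the field names here are assumed, not taken from this patch):

#include <stdio.h>
#include <stddef.h>
#include <perfmon/perfmon.h>	/* assumed userland header defining the pfarg_* types */

int main(void)
{
	/* build once with -m32 and once with -m64; any difference in
	 * the output reveals padding or ABI drift between the modes */
	printf("pfarg_setdesc: size=%zu\n", sizeof(struct pfarg_setdesc));
	printf("  set_id    @ %zu\n", offsetof(struct pfarg_setdesc, set_id));
	printf("  set_flags @ %zu\n", offsetof(struct pfarg_setdesc, set_flags));
	return 0;
}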

>

--

-Stephane