2008-11-26 08:44:04

by Stephane Eranian

Subject: [patch 05/24] perfmon: X86 generic code (x86)

This patch adds the X86 generic perfmon code. It is in charge of
implementing certain key functionalities required by the generic
code, such as reading and writing the PMU registers and low-level
interrupt handling.

Signed-off-by: Stephane Eranian <[email protected]>
--

Index: o3/arch/x86/perfmon/Kconfig
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ o3/arch/x86/perfmon/Kconfig 2008-11-25 18:22:52.000000000 +0100
@@ -0,0 +1,18 @@
+menu "Hardware Performance Monitoring support"
+config PERFMON
+ bool "Perfmon2 performance monitoring interface"
+ select X86_LOCAL_APIC
+ default n
+ help
+ Enables the perfmon2 interface to access the hardware
+ performance counters. See <http://perfmon2.sf.net/> for
+ more details.
+
+config PERFMON_DEBUG
+ bool "Perfmon debugging"
+ default n
+ depends on PERFMON
+ help
+ Enables perfmon debugging support
+
+endmenu
Index: o3/arch/x86/perfmon/Makefile
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ o3/arch/x86/perfmon/Makefile 2008-11-25 18:22:52.000000000 +0100
@@ -0,0 +1,5 @@
+#
+# Copyright (c) 2005-2007 Hewlett-Packard Development Company, L.P.
+# Contributed by Stephane Eranian <[email protected]>
+#
+obj-$(CONFIG_PERFMON) += perfmon.o
Index: o3/arch/x86/perfmon/perfmon.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ o3/arch/x86/perfmon/perfmon.c 2008-11-25 18:29:03.000000000 +0100
@@ -0,0 +1,619 @@
+/*
+ * This file implements the X86 specific support for the perfmon2 interface
+ *
+ * Copyright (c) 2005-2007 Hewlett-Packard Development Company, L.P.
+ * Contributed by Stephane Eranian <[email protected]>
+ *
+ * Copyright (c) 2007 Advanced Micro Devices, Inc.
+ * Contributed by Robert Richter <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
+ * 02111-1307 USA
+ */
+#include <linux/interrupt.h>
+#include <linux/perfmon_kern.h>
+#include <linux/kprobes.h>
+#include <linux/kdebug.h>
+#include <linux/nmi.h>
+
+#include <asm/apic.h>
+
+DEFINE_PER_CPU(unsigned long, real_iip);
+DEFINE_PER_CPU(int, pfm_using_nmi);
+
+/**
+ * pfm_arch_ctxswin_thread - thread context switch in
+ * @task: task switched in
+ * @ctx: context for the task
+ * @set: active event set
+ *
+ * Called from pfm_ctxsw(). Task is guaranteed to be current.
+ * set cannot be NULL. Context is locked. Interrupts are masked.
+ *
+ * Caller has already restored all PMD and PMC registers, if
+ * necessary (i.e., lazy restore scheme).
+ *
+ * On x86, the common code only needs to unsecure RDPMC if necessary.
+ *
+ * Model-specific features, e.g., PEBS and IBS, are taken care of in the
+ * corresponding PMU description module.
+ */
+void pfm_arch_ctxswin_thread(struct task_struct *task, struct pfm_context *ctx)
+{
+ struct pfm_arch_context *ctx_arch;
+
+ ctx_arch = pfm_ctx_arch(ctx);
+
+ /*
+ * restore saved real iip
+ */
+ if (ctx->active_set->npend_ovfls)
+ __get_cpu_var(real_iip) = ctx_arch->saved_real_iip;
+
+ /*
+ * enable RDPMC on this CPU
+ */
+ if (ctx_arch->flags.insecure)
+ set_in_cr4(X86_CR4_PCE);
+}
+
+/**
+ * pfm_arch_ctxswout_thread - context switch out thread
+ * @task: task switched out
+ * @ctx : context switched out
+ *
+ * Called from pfm_ctxsw(). Task is guaranteed to be current.
+ * Context is locked. Interrupts are masked. Monitoring may be active.
+ * PMU access is guaranteed. PMC and PMD registers are live in PMU.
+ *
+ * Return:
+ * non-zero : did not save PMDs (as part of stopping the PMU)
+ * 0 : saved PMDs (no need to save them in caller)
+ */
+int pfm_arch_ctxswout_thread(struct task_struct *task, struct pfm_context *ctx)
+{
+ struct pfm_arch_context *ctx_arch;
+ struct pfm_arch_pmu_info *pmu_info;
+
+ ctx_arch = pfm_ctx_arch(ctx);
+ pmu_info = pfm_pmu_info();
+
+ /*
+ * disable lazy restore of PMCs on ctxswin because
+ * we modify some of them.
+ */
+ ctx->active_set->priv_flags |= PFM_SETFL_PRIV_MOD_PMCS;
+
+ if (ctx->active_set->npend_ovfls)
+ ctx_arch->saved_real_iip = __get_cpu_var(real_iip);
+
+ /*
+ * disable RDPMC on this CPU
+ */
+ if (ctx_arch->flags.insecure)
+ clear_in_cr4(X86_CR4_PCE);
+
+ return pmu_info->stop_save(ctx, ctx->active_set);
+}
+
+/**
+ * pfm_arch_stop - deactivate monitoring
+ * @task: task to stop
+ * @ctx: context to stop
+ *
+ * Called from pfm_stop()
+ * Interrupts are masked. Context is locked. Set is the active set.
+ *
+ * For per-thread:
+ * task is not necessarily current. If not the current task, then the
+ * task is guaranteed stopped and off any CPU. Access to the PMU
+ * is not guaranteed.
+ *
+ * must disable active monitoring. ctx cannot be NULL
+ */
+void pfm_arch_stop(struct task_struct *task, struct pfm_context *ctx)
+{
+ struct pfm_arch_pmu_info *pmu_info;
+
+ pmu_info = pfm_pmu_info();
+
+ /*
+ * no need to go through stop_save()
+ * if we are already stopped
+ */
+ if (!ctx->flags.started)
+ return;
+
+ if (task != current)
+ return;
+
+ pmu_info->stop_save(ctx, ctx->active_set);
+}
+
+
+/**
+ * pfm_arch_start - activate monitoring
+ * @task: task to start
+ * @ctx: context to start
+ *
+ * Interrupts are masked. Context is locked.
+ *
+ * For per-thread:
+ * Task is not necessarily current. If not the current task, then the task
+ * is guaranteed stopped and off any CPU. No access to the PMU if the task
+ * is not current.
+ */
+void pfm_arch_start(struct task_struct *task, struct pfm_context *ctx)
+{
+ /*
+ * cannot restore PMC if no access to PMU. Will be done
+ * when the thread is switched back in
+ */
+ if (task != current)
+ return;
+
+ pfm_arch_restore_pmcs(ctx, ctx->active_set);
+}
+
+/**
+ * pfm_arch_restore_pmds - reload PMD registers
+ * @ctx: context to restore from
+ * @set: current event set
+ *
+ * function called from pfm_context_load(), pfm_ctxsw()
+ *
+ * Context is locked. Interrupts are masked. Set cannot be NULL.
+ * Access to the PMU is guaranteed.
+ */
+void pfm_arch_restore_pmds(struct pfm_context *ctx, struct pfm_event_set *set)
+{
+ u16 i, num;
+
+ num = set->nused_pmds;
+
+ /*
+ * we can restore only the PMD we use because:
+ *
+ * - can only read with pfm_read_pmds() the registers
+ * declared used via pfm_write_pmds()
+ *
+ * - if cr4.pce=1, only counters are exposed to user. RDPMC
+ * does not work with other types of PMU registers. Thus, no
+ * address is ever exposed by counters
+ *
+ * - there is never a dependency between one pmd register and
+ * another
+ */
+ for (i = 0; num; i++) {
+ if (likely(pfm_arch_bv_test_bit(i, set->used_pmds))) {
+ pfm_write_pmd(ctx, i, set->pmds[i]);
+ num--;
+ }
+ }
+}
+
+/**
+ * pfm_arch_restore_pmcs - reload PMC registers
+ * @ctx: context to restore from
+ * @set: current event set
+ *
+ * function called from pfm_context_load(), pfm_ctxsw().
+ *
+ * Context is locked. Interrupts are masked. set cannot be NULL.
+ * Access to the PMU is guaranteed.
+ *
+ * function must restore all PMC registers from set
+ */
+void pfm_arch_restore_pmcs(struct pfm_context *ctx, struct pfm_event_set *set)
+{
+ u16 i, num;
+
+ /*
+ * we need to restore PMCs only when:
+ * - context is not masked
+ * - monitoring activated
+ *
+ * Masking monitoring after an overflow does not change the
+ * value of flags.started
+ */
+ if (!ctx->flags.started)
+ return;
+
+ /*
+ * restore all pmcs
+ *
+ * It is not possible to restore only the pmcs we used because
+ * certain PMU models (e.g. Pentium 4) have dependencies. Thus
+ * we do not want one application using stale PMCs coming from
+ * another one.
+ *
+ * On PMU models where there are no dependencies between PMCs,
+ * it is possible to optimize by only restoring the registers that
+ * are used, but this has to be done by model-specific code.
+ */
+ num = ctx->regs.num_pmcs;
+ for (i = 0; num; i++) {
+ if (pfm_arch_bv_test_bit(i, ctx->regs.pmcs)) {
+ pfm_arch_write_pmc(ctx, i, set->pmcs[i]);
+ num--;
+ }
+ }
+}
+
+/**
+ * smp_pmu_interrupt - lowest level PMU interrupt handler for X86
+ * @regs: machine state
+ *
+ * The PMU interrupt is handled through an interrupt gate, therefore
+ * the CPU automatically clears the EFLAGS.IF, i.e., masking interrupts.
+ *
+ * The perfmon interrupt handler MUST run with interrupts disabled due
+ * to possible race with other, higher priority interrupts, such as timer
+ * or IPI function calls.
+ *
+ * See description in IA-32 architecture manual, Vol 3 section 5.8.1
+ */
+void smp_pmu_interrupt(struct pt_regs *regs)
+{
+ unsigned long iip;
+ int using_nmi;
+
+ using_nmi = __get_cpu_var(pfm_using_nmi);
+
+ ack_APIC_irq();
+
+ irq_enter();
+
+ /*
+ * when using NMI, pfm_handle_nmi() gets called
+ * first. It stops monitoring and records the
+ * iip into real_iip, then it reposts the interrupt
+ * using the lower-priority vector LOCAL_PERFMON_VECTOR.
+ *
+ * On some processors, e.g., P4, it may be that some
+ * state is already recorded from pfm_handle_nmi()
+ * and it only needs to be copied back into the normal
+ * fields so it can be used transparently by higher level
+ * code.
+ */
+ if (using_nmi)
+ iip = __get_cpu_var(real_iip);
+ else
+ iip = instruction_pointer(regs);
+
+ pfm_interrupt_handler(iip, regs);
+
+ /*
+ * On Intel processors:
+ * - it is necessary to clear the MASK field for the LVTPC
+ * vector. Otherwise interrupts remain masked. See
+ * section 8.5.1
+ * AMD X86-64:
+ * - the documentation does not stipulate the behavior but
+ * it seems to work without the write, so we skip it
+ */
+ if (!using_nmi && current_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+ apic_write(APIC_LVTPC, LOCAL_PERFMON_VECTOR);
+
+ irq_exit();
+}
+
+/**
+ * pfm_handle_nmi - PMU NMI handler notifier callback
+ * @nb: notifier block
+ * @val: type of die notifier
+ * @data: die notifier-specific data
+ *
+ * called from the notify_die() notifier from a trap handler path. We only
+ * care about NMI-related callbacks, and ignore everything else.
+ *
+ * Cannot grab any locks, including the perfmon context lock.
+ *
+ * Must detect if the NMI interrupt comes from perfmon, and if so it must
+ * stop the PMU and repost a lower-priority interrupt. The perfmon interrupt
+ * handler needs to grab the context lock, thus it cannot be run directly
+ * from the NMI interrupt call path.
+ */
+static int __kprobes pfm_handle_nmi(struct notifier_block *nb,
+ unsigned long val,
+ void *data)
+{
+ struct die_args *args = data;
+ struct pfm_context *ctx;
+ struct pfm_arch_pmu_info *pmu_info;
+
+ /*
+ * only NMI related calls
+ */
+ if (val != DIE_NMI_IPI)
+ return NOTIFY_DONE;
+
+ /*
+ * perfmon not using NMI
+ */
+ if (!__get_cpu_var(pfm_using_nmi))
+ return NOTIFY_DONE;
+
+ /*
+ * No context
+ */
+ ctx = __get_cpu_var(pmu_ctx);
+ if (!ctx) {
+ PFM_DBG_ovfl("no ctx");
+ return NOTIFY_DONE;
+ }
+
+ /*
+ * Detect if we have overflows, i.e., NMI interrupt
+ * caused by PMU
+ */
+ pmu_info = pfm_pmu_info();
+ if (!pmu_info->has_ovfls(ctx)) {
+ PFM_DBG_ovfl("no ovfl");
+ return NOTIFY_DONE;
+ }
+
+ /*
+ * we stop the PMU to avoid further overflows before this
+ * one is treated by the lower-priority interrupt handler
+ */
+ pmu_info->quiesce();
+
+ /*
+ * record actual instruction pointer
+ */
+ __get_cpu_var(real_iip) = instruction_pointer(args->regs);
+
+ /*
+ * post lower priority interrupt (LOCAL_PERFMON_VECTOR)
+ */
+ pfm_arch_resend_irq(ctx);
+
+ /*
+ * we need to rewrite the APIC vector on Intel
+ */
+ if (current_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+ apic_write(APIC_LVTPC, APIC_DM_NMI);
+
+ /*
+ * the notification was for us
+ */
+ return NOTIFY_STOP;
+}
+
+static struct notifier_block pfm_nmi_nb = {
+ .notifier_call = pfm_handle_nmi
+};
+
+/**
+ * pfm_arch_resend_irq - post perfmon interrupt on regular vector
+ *
+ * called from pfm_ctxswin_thread() and pfm_handle_nmi()
+ */
+void pfm_arch_resend_irq(struct pfm_context *ctx)
+{
+ unsigned long val, dest;
+ /*
+ * we cannot use hw_resend_irq() because it goes to
+ * the I/O APIC. We need to go to the Local APIC.
+ *
+ * The "int vec" is not the right solution either
+ * because it triggers a software intr. We need
+ * to regenerate the interrupt and have it pending
+ * until we unmask interrupts.
+ *
+ * Instead we send ourselves an IPI on the perfmon
+ * vector.
+ */
+ val = APIC_DEST_SELF|APIC_INT_ASSERT|
+ APIC_DM_FIXED|LOCAL_PERFMON_VECTOR;
+
+ dest = apic_read(APIC_ID);
+ apic_write(APIC_ICR2, dest);
+ apic_write(APIC_ICR, val);
+}
+
+/**
+ * pfm_arch_pmu_acquire_percpu - setup APIC per CPU
+ * @data: contains pmu flags
+ */
+static void pfm_arch_pmu_acquire_percpu(void *data)
+{
+ struct pfm_arch_pmu_info *pmu_info;
+ unsigned int tmp, vec;
+ unsigned long flags = (unsigned long)data;
+ unsigned long lvtpc;
+
+ pmu_info = pfm_pmu_conf->pmu_info;
+ /*
+ * we only reprogram the LVTPC vector if we have detected
+ * no sharing, otherwise it means the APIC is already programmed
+ * and we use whatever vector (likely NMI) is there
+ */
+ if (!(flags & PFM_X86_FL_SHARING)) {
+ vec = LOCAL_PERFMON_VECTOR;
+
+ tmp = apic_read(APIC_LVTERR);
+ apic_write(APIC_LVTERR, tmp | APIC_LVT_MASKED);
+ apic_write(APIC_LVTPC, vec);
+ apic_write(APIC_LVTERR, tmp);
+ }
+ lvtpc = (unsigned long)apic_read(APIC_LVTPC);
+
+ __get_cpu_var(pfm_using_nmi) = lvtpc == APIC_DM_NMI;
+
+ PFM_DBG("LTVPC=0x%lx using_nmi=%d",
+ lvtpc, __get_cpu_var(pfm_using_nmi));
+ /*
+ * invoke model specific acquire routine.
+ */
+ if (pmu_info->acquire_pmu_percpu)
+ pmu_info->acquire_pmu_percpu();
+}
+
+/**
+ * pfm_arch_pmu_acquire - acquire PMU resource from system
+ * @unavail_pmcs : bitmask to use to set unavailable pmcs
+ * @unavail_pmds : bitmask to use to set unavailable pmds
+ *
+ * interrupts are not masked
+ *
+ * Grab PMU registers from lower level MSR allocator
+ *
+ * Program the APIC according the possible interrupt vector
+ * either LOCAL_PERFMON_VECTOR or NMI
+ */
+int pfm_arch_pmu_acquire(u64 *unavail_pmcs, u64 *unavail_pmds)
+{
+ struct pfm_arch_pmu_info *pmu_info;
+ struct pfm_regmap_desc *d;
+ u16 i, nlost;
+
+ pmu_info = pfm_pmu_conf->pmu_info;
+ pmu_info->flags &= ~PFM_X86_FL_SHARING;
+
+ nlost = 0;
+
+ d = pfm_pmu_conf->pmc_desc;
+ for (i = 0; i < pfm_pmu_conf->num_pmc_entries; i++, d++) {
+ if (!(d->type & PFM_REG_I))
+ continue;
+
+ /*
+ * reserve register with lower-level allocator
+ */
+ if (!reserve_evntsel_nmi(d->hw_addr)) {
+ PFM_DBG("pmc%d(%s) already used", i, d->desc);
+ pfm_arch_bv_set_bit(i, unavail_pmcs);
+ nlost++;
+ continue;
+ }
+ }
+ PFM_DBG("nlost=%d info_flags=0x%x\n", nlost, pmu_info->flags);
+ /*
+ * some PMU models (e.g., P6) do not support sharing
+ * so check if we found fewer than the expected number of PMC registers
+ */
+ if (nlost) {
+ if (pmu_info->flags & PFM_X86_FL_NO_SHARING) {
+ PFM_INFO("PMU already used by another subsystem, "
+ "PMU does not support sharing, "
+ "try disabling Oprofile or "
+ "reboot with nmi_watchdog=0");
+ goto undo;
+ }
+ pmu_info->flags |= PFM_X86_FL_SHARING;
+ }
+
+ d = pfm_pmu_conf->pmd_desc;
+ for (i = 0; i < pfm_pmu_conf->num_pmd_entries; i++, d++) {
+ if (!(d->type & PFM_REG_I))
+ continue;
+
+ if (!reserve_perfctr_nmi(d->hw_addr)) {
+ PFM_DBG("pmd%d(%s) already used", i, d->desc);
+ pfm_arch_bv_set_bit(i, unavail_pmds);
+ }
+ }
+ /*
+ * program APIC on each CPU
+ */
+ on_each_cpu(pfm_arch_pmu_acquire_percpu,
+ (void *)(unsigned long)pmu_info->flags, 1);
+
+ return 0;
+undo:
+ /*
+ * must undo reservation of pmcs in case of error
+ */
+ d = pfm_pmu_conf->pmc_desc;
+ for (i = 0; i < pfm_pmu_conf->num_pmc_entries; i++, d++) {
+ if (!(d->type & PFM_REG_I))
+ continue;
+ if (!pfm_arch_bv_test_bit(i, unavail_pmcs))
+ release_evntsel_nmi(d->hw_addr);
+ }
+ return -EBUSY;
+}
+
+/**
+ * pfm_arch_pmu_release_percpu - clear NMI state for one CPU
+ *
+ */
+static void pfm_arch_pmu_release_percpu(void *data)
+{
+ struct pfm_arch_pmu_info *pmu_info;
+
+ pmu_info = pfm_pmu_conf->pmu_info;
+
+ __get_cpu_var(pfm_using_nmi) = 0;
+ /*
+ * invoke model specific release routine.
+ */
+ if (pmu_info->release_pmu_percpu)
+ pmu_info->release_pmu_percpu();
+}
+
+/**
+ * pfm_arch_pmu_release - release PMU resource to system
+ *
+ * called from pfm_pmu_release()
+ * interrupts are not masked
+ *
+ * On x86, we return the PMU registers to the MSR allocator
+ */
+void pfm_arch_pmu_release(void)
+{
+ struct pfm_regmap_desc *d;
+ u16 i, n;
+
+ d = pfm_pmu_conf->pmc_desc;
+ n = pfm_pmu_conf->regs_all.num_pmcs;
+ for (i = 0; n; i++, d++) {
+ if (!pfm_arch_bv_test_bit(i, pfm_pmu_conf->regs_all.pmcs))
+ continue;
+ release_evntsel_nmi(d->hw_addr);
+ n--;
+ PFM_DBG("pmc%u released", i);
+ }
+ d = pfm_pmu_conf->pmd_desc;
+ n = pfm_pmu_conf->regs_all.num_pmds;
+ for (i = 0; n; i++, d++) {
+ if (!pfm_arch_bv_test_bit(i, pfm_pmu_conf->regs_all.pmds))
+ continue;
+ release_perfctr_nmi(d->hw_addr);
+ n--;
+ PFM_DBG("pmd%u released", i);
+ }
+
+ /* clear NMI variable if used */
+ if (__get_cpu_var(pfm_using_nmi))
+ on_each_cpu(pfm_arch_pmu_release_percpu, NULL, 1);
+}
+
+/**
+ * pfm_arch_init - one time global arch-specific initialization
+ *
+ * called from pfm_init()
+ */
+int __init pfm_arch_init(void)
+{
+ /*
+ * we need to register our NMI handler when the kernel boots
+ * to avoid a deadlock condition with the NMI watchdog or Oprofile
+ * if we were to try and register/unregister on-demand.
+ */
+ register_die_notifier(&pfm_nmi_nb);
+ return 0;
+}
Index: o3/arch/x86/include/asm/Kbuild
===================================================================
--- o3.orig/arch/x86/include/asm/Kbuild 2008-11-25 17:50:09.000000000 +0100
+++ o3/arch/x86/include/asm/Kbuild 2008-11-25 17:54:45.000000000 +0100
@@ -10,6 +10,7 @@
header-y += sigcontext32.h
header-y += ucontext.h
header-y += processor-flags.h
+header-y += perfmon.h

unifdef-y += e820.h
unifdef-y += ist.h
Index: o3/arch/x86/include/asm/perfmon.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ o3/arch/x86/include/asm/perfmon.h 2008-11-25 17:54:45.000000000 +0100
@@ -0,0 +1,34 @@
+/*
+ * Copyright (c) 2007 Hewlett-Packard Development Company, L.P.
+ * Contributed by Stephane Eranian <[email protected]>
+ *
+ * This file contains i386/x86_64 specific definitions for the perfmon
+ * interface.
+ *
+ * This file MUST never be included directly. Use linux/perfmon.h.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
+ * 02111-1307 USA
+ */
+#ifndef _ASM_X86_PERFMON__H_
+#define _ASM_X86_PERFMON__H_
+
+/*
+ * arch-specific user visible interface definitions
+ */
+
+#define PFM_ARCH_MAX_PMCS (256+64) /* 256 HW 64 SW */
+#define PFM_ARCH_MAX_PMDS (256+64) /* 256 HW 64 SW */
+
+#endif /* _ASM_X86_PERFMON__H_ */
Index: o3/arch/x86/include/asm/perfmon_kern.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ o3/arch/x86/include/asm/perfmon_kern.h 2008-11-25 18:29:03.000000000 +0100
@@ -0,0 +1,438 @@
+/*
+ * Copyright (c) 2005-2006 Hewlett-Packard Development Company, L.P.
+ * Contributed by Stephane Eranian <[email protected]>
+ *
+ * Copyright (c) 2007 Advanced Micro Devices, Inc.
+ * Contributed by Robert Richter <[email protected]>
+ *
+ * This file contains X86 Processor Family specific definitions
+ * for the perfmon interface. This covers P6, Pentium M, P4/Xeon
+ * (32-bit and 64-bit, i.e., EM64T) and AMD X86-64.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
+ * 02111-1307 USA
+ */
+#ifndef _ASM_X86_PERFMON_KERN_H_
+#define _ASM_X86_PERFMON_KERN_H_
+
+#ifdef CONFIG_PERFMON
+#include <linux/unistd.h>
+#ifdef CONFIG_4KSTACKS
+#define PFM_ARCH_STK_ARG 8
+#else
+#define PFM_ARCH_STK_ARG 16
+#endif
+
+struct pfm_arch_pmu_info {
+ u32 flags; /* PMU feature flags */
+ /*
+ * mandatory model-specific callbacks
+ */
+ int (*stop_save)(struct pfm_context *ctx, struct pfm_event_set *set);
+ int (*has_ovfls)(struct pfm_context *ctx);
+ void (*quiesce)(void);
+
+ /*
+ * optional model-specific callbacks
+ */
+ void (*acquire_pmu_percpu)(void);
+ void (*release_pmu_percpu)(void);
+ int (*load_context)(struct pfm_context *ctx);
+ void (*unload_context)(struct pfm_context *ctx);
+};
+
+/*
+ * PMU feature flags
+ */
+#define PFM_X86_FL_NO_SHARING 0x02 /* no sharing with other subsystems */
+#define PFM_X86_FL_SHARING 0x04 /* PMU is being shared */
+
+struct pfm_x86_ctx_flags {
+ unsigned int insecure:1; /* rdpmc per-thread self-monitoring */
+ unsigned int reserved:31; /* for future use */
+};
+
+struct pfm_arch_context {
+ u64 saved_real_iip; /* instr pointer of last NMI intr */
+ struct pfm_x86_ctx_flags flags; /* flags */
+ int saved_started;
+};
+
+/*
+ * functions implemented as inline on x86
+ */
+
+/**
+ * pfm_arch_write_pmc - write a single PMC register
+ * @ctx: context to work on
+ * @cnum: PMC index
+ * @value: PMC 64-bit value
+ *
+ * in certain situations, ctx may be NULL
+ */
+static inline void pfm_arch_write_pmc(struct pfm_context *ctx,
+ unsigned int cnum, u64 value)
+{
+ /*
+ * we only write to the actual register when monitoring is
+ * active (pfm_start was issued)
+ */
+ if (ctx && ctx->flags.started == 0)
+ return;
+
+ PFM_DBG_ovfl("pfm_arch_write_pmc(0x%lx, 0x%Lx)",
+ pfm_pmu_conf->pmc_desc[cnum].hw_addr,
+ (unsigned long long) value);
+
+ wrmsrl(pfm_pmu_conf->pmc_desc[cnum].hw_addr, value);
+}
+
+/**
+ * pfm_arch_write_pmd - write a single PMD register
+ * @ctx: context to work on
+ * @cnum: PMD index
+ * @value: PMD 64-bit value
+ */
+static inline void pfm_arch_write_pmd(struct pfm_context *ctx,
+ unsigned int cnum, u64 value)
+{
+ /*
+ * to make sure the counter overflows, we set the
+ * upper bits. We also clear any other unimplemented
+ * bits, as setting them may cause a crash on some processors.
+ */
+ if (pfm_pmu_conf->pmd_desc[cnum].type & PFM_REG_C64)
+ value = (value | ~pfm_pmu_conf->ovfl_mask)
+ & ~pfm_pmu_conf->pmd_desc[cnum].rsvd_msk;
+
+ PFM_DBG_ovfl("pfm_arch_write_pmd(0x%lx, 0x%Lx)",
+ pfm_pmu_conf->pmd_desc[cnum].hw_addr,
+ (unsigned long long) value);
+
+ wrmsrl(pfm_pmu_conf->pmd_desc[cnum].hw_addr, value);
+}
+
+/**
+ * pfm_arch_read_pmd - read a single PMD register
+ * @ctx: context to work on
+ * @cnum: PMD index
+ *
+ * return value is register 64-bit value
+ */
+static inline u64 pfm_arch_read_pmd(struct pfm_context *ctx, unsigned int cnum)
+{
+ u64 tmp;
+
+ rdmsrl(pfm_pmu_conf->pmd_desc[cnum].hw_addr, tmp);
+
+ PFM_DBG_ovfl("pfm_arch_read_pmd(0x%lx) = 0x%Lx",
+ pfm_pmu_conf->pmd_desc[cnum].hw_addr,
+ (unsigned long long) tmp);
+ return tmp;
+}
+
+/**
+ * pfm_arch_read_pmc - read a single PMC register
+ * @ctx: context to work on
+ * @cnum: PMC index
+ *
+ * return value is register 64-bit value
+ */
+static inline u64 pfm_arch_read_pmc(struct pfm_context *ctx, unsigned int cnum)
+{
+ u64 tmp;
+
+ rdmsrl(pfm_pmu_conf->pmc_desc[cnum].hw_addr, tmp);
+
+ PFM_DBG_ovfl("pfm_arch_read_pmc(0x%lx) = 0x%016Lx",
+ pfm_pmu_conf->pmc_desc[cnum].hw_addr,
+ (unsigned long long) tmp);
+ return tmp;
+}
+
+/**
+ * pfm_arch_is_active - return non-zero if monitoring has been started
+ * @ctx: context to check
+ *
+ * At certain points, perfmon needs to know if monitoring has been
+ * explicitly started.
+ *
+ * On x86, there is no other way but to use pfm_start/pfm_stop
+ * to activate monitoring, thus we can simply check flags.started
+ */
+static inline int pfm_arch_is_active(struct pfm_context *ctx)
+{
+ return ctx->flags.started;
+}
+
+
+/**
+ * pfm_arch_unload_context - detach context from thread or CPU
+ * @ctx: context to detach
+ *
+ * in system-wide mode ctx->task is NULL, otherwise it points to the
+ * attached thread
+ */
+static inline void pfm_arch_unload_context(struct pfm_context *ctx)
+{
+ struct pfm_arch_pmu_info *pmu_info;
+ struct pfm_arch_context *ctx_arch;
+
+ ctx_arch = pfm_ctx_arch(ctx);
+ pmu_info = pfm_pmu_info();
+
+ if (ctx_arch->flags.insecure) {
+ PFM_DBG("clear cr4.pce");
+ clear_in_cr4(X86_CR4_PCE);
+ }
+
+ if (pmu_info->unload_context)
+ pmu_info->unload_context(ctx);
+}
+
+/**
+ * pfm_arch_load_context - attach context to thread or CPU
+ * @ctx: context to attach
+ */
+static inline int pfm_arch_load_context(struct pfm_context *ctx)
+{
+ struct pfm_arch_pmu_info *pmu_info;
+ struct pfm_arch_context *ctx_arch;
+ int ret = 0;
+
+ ctx_arch = pfm_ctx_arch(ctx);
+ pmu_info = pfm_pmu_info();
+
+ /*
+ * RDPMC authorized in system-wide and
+ * per-thread self-monitoring.
+ *
+ * RDPMC only gives access to counts.
+ *
+ * The context-switch routine code does not restore
+ * all the PMD registers (optimization), thus there
+ * is a possible leak of counts there in per-thread
+ * mode.
+ */
+ if (ctx->task == current) {
+ PFM_DBG("set cr4.pce");
+ set_in_cr4(X86_CR4_PCE);
+ ctx_arch->flags.insecure = 1;
+ }
+
+ if (pmu_info->load_context)
+ ret = pmu_info->load_context(ctx);
+
+ return ret;
+}
+
+void pfm_arch_restore_pmcs(struct pfm_context *ctx, struct pfm_event_set *set);
+void pfm_arch_start(struct task_struct *task, struct pfm_context *ctx);
+void pfm_arch_stop(struct task_struct *task, struct pfm_context *ctx);
+
+/**
+ * pfm_arch_intr_freeze_pmu - stop monitoring when handling PMU interrupt
+ * @ctx: current context
+ * @set: current event set
+ *
+ * called from __pfm_interrupt_handler().
+ * ctx is not NULL. ctx is locked. interrupts are masked
+ *
+ * The following actions must take place:
+ * - stop all monitoring to ensure handler has consistent view.
+ * - collect overflowed PMDs bitmask into povfls_pmds and
+ * npend_ovfls. If no interrupt detected then npend_ovfls
+ * must be set to zero.
+ */
+static inline void pfm_arch_intr_freeze_pmu(struct pfm_context *ctx,
+ struct pfm_event_set *set)
+{
+ struct pfm_arch_context *ctx_arch;
+ ctx_arch = pfm_ctx_arch(ctx);
+ /*
+ * on X86, freezing is equivalent to stopping
+ */
+ pfm_arch_stop(current, ctx);
+
+ /*
+ * we mark monitoring as stopped to avoid
+ * certain side effects especially in
+ * pfm_arch_restore_pmcs()
+ */
+ ctx_arch->saved_started = ctx->flags.started;
+ ctx->flags.started = 0;
+}
+
+/**
+ * pfm_arch_intr_unfreeze_pmu - conditionally reactivate monitoring
+ * @ctx: current context
+ *
+ * current context may be NULL when dealing with spurious interrupts
+ *
+ * Must re-activate monitoring if context is not MASKED.
+ * interrupts are masked.
+ */
+static inline void pfm_arch_intr_unfreeze_pmu(struct pfm_context *ctx)
+{
+ struct pfm_arch_context *ctx_arch;
+
+ if (ctx == NULL)
+ return;
+
+ ctx_arch = pfm_ctx_arch(ctx);
+
+ PFM_DBG_ovfl("state=%d", ctx->state);
+
+ /*
+ * restore flags.started which is cleared in
+ * pfm_arch_intr_freeze_pmu()
+ */
+ ctx->flags.started = ctx_arch->saved_started;
+
+ pfm_arch_restore_pmcs(ctx, ctx->active_set);
+}
+
+/**
+ * pfm_arch_ovfl_reset_pmd - reset pmd on overflow
+ * @ctx: current context
+ * @cnum: PMD index
+ *
+ * On some CPUs, the upper bits of a counter must be set in order for the
+ * overflow interrupt to happen. On overflow, the counter has wrapped around,
+ * and the upper bits are cleared. This function may be used to set them back.
+ *
+ * For x86, simply rewriting the upper bits would lose whatever remains in
+ * the counter, which usually is a small count. In order not to lose this count,
+ * we do a read-modify-write to set the upper bits while preserving the
+ * low-order bits. This is slow but works.
+ */
+static inline void pfm_arch_ovfl_reset_pmd(struct pfm_context *ctx, unsigned int cnum)
+{
+ u64 val;
+ val = pfm_arch_read_pmd(ctx, cnum);
+ pfm_arch_write_pmd(ctx, cnum, val);
+}
+
+/**
+ * pfm_arch_context_create - create context
+ * @ctx: newly created context
+ * @ctx_flags: context flags as passed by user
+ *
+ * called from __pfm_create_context()
+ */
+static inline int pfm_arch_context_create(struct pfm_context *ctx, u32 ctx_flags)
+{
+ return 0;
+}
+
+/**
+ * pfm_arch_context_free - free context
+ * @ctx: context to free
+ */
+static inline void pfm_arch_context_free(struct pfm_context *ctx)
+{}
+
+/*
+ * functions implemented in arch/x86/perfmon/perfmon.c
+ */
+int pfm_arch_init(void);
+void pfm_arch_resend_irq(struct pfm_context *ctx);
+
+int pfm_arch_ctxswout_thread(struct task_struct *task, struct pfm_context *ctx);
+void pfm_arch_ctxswin_thread(struct task_struct *task, struct pfm_context *ctx);
+
+void pfm_arch_restore_pmds(struct pfm_context *ctx, struct pfm_event_set *set);
+int pfm_arch_pmu_config_init(struct pfm_pmu_config *cfg);
+void pfm_arch_pmu_config_remove(void);
+char *pfm_arch_get_pmu_module_name(void);
+int pfm_arch_pmu_acquire(u64 *unavail_pmcs, u64 *unavail_pmds);
+void pfm_arch_pmu_release(void);
+
+static inline void pfm_arch_serialize(void)
+{}
+
+static inline void pfm_arch_arm_handle_work(struct task_struct *task)
+{}
+
+static inline void pfm_arch_disarm_handle_work(struct task_struct *task)
+{}
+
+#define PFM_ARCH_CTX_SIZE (sizeof(struct pfm_arch_context))
+/*
+ * x86 does not need extra alignment requirements for the sampling buffer
+ */
+#define PFM_ARCH_SMPL_ALIGN_SIZE 0
+
+asmlinkage void pmu_interrupt(void);
+
+static inline void pfm_arch_bv_copy(u64 *a, u64 *b, int nbits)
+{
+ bitmap_copy((unsigned long *)a,
+ (unsigned long *)b,
+ nbits);
+}
+
+static inline void pfm_arch_bv_or(u64 *a, u64 *b, u64 *c, int nbits)
+{
+ bitmap_or((unsigned long *)a,
+ (unsigned long *)b,
+ (unsigned long *)c,
+ nbits);
+}
+
+static inline void pfm_arch_bv_and(u64 *a, u64 *b, u64 *c, int nbits)
+{
+ bitmap_and((unsigned long *)a,
+ (unsigned long *)b,
+ (unsigned long *)c,
+ nbits);
+}
+
+
+static inline void pfm_arch_bv_zero(u64 *a, int nbits)
+{
+ bitmap_zero((unsigned long *)a, nbits);
+}
+
+static inline int pfm_arch_bv_weight(u64 *a, int nbits)
+{
+ return bitmap_weight((unsigned long *)a, nbits);
+}
+
+static inline void pfm_arch_bv_set_bit(int b, u64 *a)
+{
+ __set_bit(b, (unsigned long *)a);
+}
+
+static inline void pfm_arch_bv_clear_bit(int b, u64 *a)
+{
+ __clear_bit(b, (unsigned long *)a);
+}
+
+static inline int pfm_arch_bv_test_bit(int b, u64 *a)
+{
+ return test_bit(b, (unsigned long *)a);
+}
+
+static inline unsigned long pfm_arch_bv_find_next_bit(const u64 *addr,
+ unsigned long size,
+ unsigned long offset)
+{
+ return find_next_bit((unsigned long *)addr,
+ size,
+ offset);
+}
+#endif /* CONFIG_PERFMON */
+
+#endif /* _ASM_X86_PERFMON_KERN_H_ */

--


2008-11-26 11:22:53

by Andi Kleen

Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Wed, Nov 26, 2008 at 12:42:09AM -0800, [email protected] wrote:
> + * set cannot be NULL. Context is locked. Interrupts are masked.
> + *
> + * Caller has already restored all PMD and PMC registers, if
> + * necessary (i.e., lazy restore scheme).
> + *
> + * On x86, the only common code just needs to unsecure RDPMC if necessary

What is insecure about RDPMC? (except perhaps when secure
computing mode is on)

I think it should be enabled by default BTW because on Core2+ you
can always read the fixed counters with it.

> + */
> + if (using_nmi)
> + iip = __get_cpu_var(real_iip);

Call it real_rip perhaps?

> + /*
> + * only NMI related calls
> + */
> + if (val != DIE_NMI_IPI)
> + return NOTIFY_DONE;
> +
> + /*
> + * perfmon not using NMI
> + */
> + if (!__get_cpu_var(pfm_using_nmi))
> + return NOTIFY_DONE;

It should not register in this case. die notifiers are costly
because they make a lot of exceptions slower.

> + /*
> + * we need to register our NMI handler when the kernels boots
> + * to avoid a deadlock condition with the NMI watchdog or Oprofile

What deadlock?

> + * if we were to try and register/unregister on-demand.
> + */
> + register_die_notifier(&pfm_nmi_nb);
> + return 0;
> +
> +/*
> + * arch-specific user visible interface definitions
> + */
> +
> +#define PFM_ARCH_MAX_PMCS (256+64) /* 256 HW 64 SW */
> +#define PFM_ARCH_MAX_PMDS (256+64) /* 256 HW 64 SW */

A little excessive for current x86s?

> +#define _ASM_X86_PERFMON_KERN_H_
> +
> +#ifdef CONFIG_PERFMON
> +#include <linux/unistd.h>
> +#ifdef CONFIG_4KSTACKS
> +#define PFM_ARCH_STK_ARG 8
> +#else
> +#define PFM_ARCH_STK_ARG 16
> +#endif

Very fancy. But is it really worth it?

> + * bits as this may cause crash on some processors.
> + */
> + if (pfm_pmu_conf->pmd_desc[cnum].type & PFM_REG_C64)
> + value = (value | ~pfm_pmu_conf->ovfl_mask)
> + & ~pfm_pmu_conf->pmd_desc[cnum].rsvd_msk;
> +
> + PFM_DBG_ovfl("pfm_arch_write_pmd(0x%lx, 0x%Lx)",
> + pfm_pmu_conf->pmd_desc[cnum].hw_addr,
> + (unsigned long long) value);
> +
> + wrmsrl(pfm_pmu_conf->pmd_desc[cnum].hw_addr, value);

Not sure how well error handling would fit in here, but it's
normally a good idea to make at least the first wrmsrl to
these counters a checking_wrmsrl because sometimes simulators
or hypervisors don't implement them.
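
For illustration, a minimal (untested) sketch of that idea against the
structures in this patch; the helper name pfm_arch_probe_pmd_msr() is
made up:

static int pfm_arch_probe_pmd_msr(struct pfm_context *ctx, unsigned int cnum)
{
	unsigned long addr = pfm_pmu_conf->pmd_desc[cnum].hw_addr;

	/*
	 * checking_wrmsrl() returns non-zero if the write faults, e.g.,
	 * when a simulator or hypervisor does not implement the MSR.
	 */
	if (checking_wrmsrl(addr, 0ULL)) {
		PFM_DBG("pmd%u MSR 0x%lx not writable", cnum, addr);
		return -EINVAL;
	}
	return 0;
}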

> + */
> +static inline void pfm_arch_unload_context(struct pfm_context *ctx)

In general a lot of these inlines seem rather large. If they are
called more than once, consider moving them out of line for better code size.

> + * x86 does not need extra alignment requirements for the sampling buffer
> + */
> +#define PFM_ARCH_SMPL_ALIGN_SIZE 0
> +
> +asmlinkage void pmu_interrupt(void);
> +
> +static inline void pfm_arch_bv_copy(u64 *a, u64 *b, int nbits)

All these bitmap wrappers just seem like unnecessary obfuscation.
Could you just drop them and call the standard functions directly?


-Andi

--
[email protected]

2008-11-26 12:05:58

by Stephen Rothwell

Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

Hi Andi,

On Wed, 26 Nov 2008 12:33:09 +0100 Andi Kleen <[email protected]> wrote:
>
> > + * x86 does not need extra alignment requirements for the sampling buffer
> > + */
> > +#define PFM_ARCH_SMPL_ALIGN_SIZE 0
> > +
> > +asmlinkage void pmu_interrupt(void);
> > +
> > +static inline void pfm_arch_bv_copy(u64 *a, u64 *b, int nbits)
>
> All these bitmap wrappers just seem like unnecessary obfuscation.
> Could you just drop them and call the standard functions directly?

These were added after comments from the PowerPC maintainer since how the
bitmaps are accessed needs to be arch specific.

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/



2008-11-26 12:12:05

by Andi Kleen

Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Wed, Nov 26, 2008 at 11:05:29PM +1100, Stephen Rothwell wrote:
> Hi Andi,
>
> On Wed, 26 Nov 2008 12:33:09 +0100 Andi Kleen <[email protected]> wrote:
> >
> > > + * x86 does not need extra alignment requirements for the sampling buffer
> > > + */
> > > +#define PFM_ARCH_SMPL_ALIGN_SIZE 0
> > > +
> > > +asmlinkage void pmu_interrupt(void);
> > > +
> > > +static inline void pfm_arch_bv_copy(u64 *a, u64 *b, int nbits)
> >
> > All these bitmap wrappers just seem like unnecessary obfuscation.
> > Could you just drop them and call the standard functions directly?
>
> These were added after comments from the PowerPC maintainer since how the
> bitmaps are accessed needs to be arch specific.

Why? They should not be exported outside, so endian shouldn't matter, should
it?

-Andi

2008-11-26 12:49:10

by Stephen Rothwell

Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

Hi Andi,

On Wed, 26 Nov 2008 13:22:17 +0100 Andi Kleen <[email protected]> wrote:
>
> On Wed, Nov 26, 2008 at 11:05:29PM +1100, Stephen Rothwell wrote:
> >
> > On Wed, 26 Nov 2008 12:33:09 +0100 Andi Kleen <[email protected]> wrote:
> > >
> > > > + * x86 does not need extra alignment requirements for the sampling buffer
> > > > + */
> > > > +#define PFM_ARCH_SMPL_ALIGN_SIZE 0
> > > > +
> > > > +asmlinkage void pmu_interrupt(void);
> > > > +
> > > > +static inline void pfm_arch_bv_copy(u64 *a, u64 *b, int nbits)
> > >
> > > All these bitmap wrappers just seem like unnecessary obfuscation.
> > > Could you just drop them and call the standard functions directly?
> >
> > These were added after comments from the PowerPC maintainer since how the
> > bitmaps are accessed needs to be arch specific.
>
> Why? They should not be exported outside, so endian shouldn't matter, should
> it?

See http://lkml.org/lkml/2008/11/9/281 and the followups. Otherwise this
should be answered by Paul and Stephane.
--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/



2008-11-26 13:33:04

by Thomas Gleixner

Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Wed, 26 Nov 2008, Stephen Rothwell wrote:
> Hi Andi,
>
> On Wed, 26 Nov 2008 12:33:09 +0100 Andi Kleen <[email protected]> wrote:
> >
> > > + * x86 does not need extra alignment requirements for the sampling buffer
> > > + */
> > > +#define PFM_ARCH_SMPL_ALIGN_SIZE 0
> > > +
> > > +asmlinkage void pmu_interrupt(void);
> > > +
> > > +static inline void pfm_arch_bv_copy(u64 *a, u64 *b, int nbits)
> >
> > All these bitmap wrappers just seem like unnecessary obfuscation.
> > Could you just drop them and call the standard functions directly?
>
> These were added after comments from the PowerPC maintainer since how the
> bitmaps are accessed needs to be arch specific.

Just because perfmon uses u64 for bitfields instead of unsigned
long, therefore it breaks on BE32 machines.

I have not yet found a good reason why it needs to use u64 instead of
using what's there already.

Thanks,

tglx

2008-11-26 13:36:32

by Thomas Gleixner

Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

Stephane,

On Wed, 26 Nov 2008, [email protected] wrote:

> This patch adds the X86 generic perfmon code. It is in charge of
> implementing certain key functionalities required by the generic
> code such as read/write of the PMU registers, low-level interrupt
> handling.
>
> Signed-off-by: Stephane Eranian <[email protected]>
> --
>
> Index: o3/arch/x86/perfmon/Kconfig
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ o3/arch/x86/perfmon/Kconfig 2008-11-25 18:22:52.000000000 +0100
> @@ -0,0 +1,18 @@
> +menu "Hardware Performance Monitoring support"
> +config PERFMON
> + bool "Perfmon2 performance monitoring interface"
> + select X86_LOCAL_APIC

depends on X86_LOCAL_APIC

> +config PERFMON_DEBUG
> + bool "Perfmon debugging"
> + default n
> + depends on PERFMON
> + help
> + Enables perfmon debugging support
> +
> +endmenu

Please move this config options into perfmon/Kconfig

config PERFMON
bool "Perfmon2 performance monitoring interface"
depends on ARCH_HAS_PERFMON_SUPPORT

So we avoid duplicating that all over the tree

> Index: o3/arch/x86/perfmon/perfmon.c
> +
> +DEFINE_PER_CPU(unsigned long, real_iip);
> +DEFINE_PER_CPU(int, pfm_using_nmi);

static

> +/**
> + * pfm_arch_ctxswin_thread - thread context switch in
> + * @task: task switched in
> + * @ctx: context for the task
> + * @set: active event set
> + *
> + * Called from pfm_ctxsw(). Task is guaranteed to be current.
> + * set cannot be NULL. Context is locked. Interrupts are masked.
> + *
> + * Caller has already restored all PMD and PMC registers, if
> + * necessary (i.e., lazy restore scheme).
> + *
> + * On x86, the only common code just needs to unsecure RDPMC if necessary
> + *
> + * On model-specific features, e.g., PEBS, IBS, are taken care of in the
> + * corresponding PMU description module
> + */

Can we please move these docbook comments to include/linux/perfmon.h ?

Otherwise we have them in ten different places and probably out of
sync sooner than later. kgdb and other infrastructures which are cross
arch do the same.

> +void pfm_arch_ctxswin_thread(struct task_struct *task, struct pfm_context *ctx)
> +{
> + struct pfm_arch_context *ctx_arch;
> +
> + ctx_arch = pfm_ctx_arch(ctx);

Can we please make this

struct pfm_arch_context *ctx_arch = pfm_ctx_arch(ctx);

Same on various places below.

> +void pfm_arch_restore_pmds(struct pfm_context *ctx, struct pfm_event_set *set)
> +{
> + u16 i, num;

why u16 ? int is fine for loop counters

> + num = set->nused_pmds;
> +
> + /*
> + * we can restore only the PMD we use because:
> + *
> + * - can only read with pfm_read_pmds() the registers
> + * declared used via pfm_write_pmds()
> + *
> + * - if cr4.pce=1, only counters are exposed to user. RDPMC
> + * does not work with other types of PMU registers.Thus, no
> + * address is ever exposed by counters
> + *
> + * - there is never a dependency between one pmd register and
> + * another
> + */
> + for (i = 0; num; i++) {
> + if (likely(pfm_arch_bv_test_bit(i, set->used_pmds))) {
> + pfm_write_pmd(ctx, i, set->pmds[i]);
> + num--;
> + }
> + }

This loop construct looks scary. It relies on set->nused_pmds >=
bits set in set->used_pmds. I had to look more than once to
understand that. It's used all over the code in variations.

> +/**
> + * smp_pmu_interrupt - lowest level PMU interrupt handler for X86
> + * @regs: machine state
> + *
> + * The PMU interrupt is handled through an interrupt gate, therefore
> + * the CPU automatically clears the EFLAGS.IF, i.e., masking interrupts.
> + *
> + * The perfmon interrupt handler MUST run with interrupts disabled due
> + * to possible race with other, higher priority interrupts, such as timer
> + * or IPI function calls.
> + *
> + * See description in IA-32 architecture manual, Vol 3 section 5.8.1
> + */
> +void smp_pmu_interrupt(struct pt_regs *regs)

Why is this only hooked up to x86_64 ?

> +/**
> + * pfm_handle_nmi - PMU NMI handler notifier callback

> + * Cannot grab any locks, include the perfmon context lock
> + /*
> + * we stop the PMU to avoid further overflow before this
> + * one is treated by lower priority interrupt handler
> + */
> + pmu_info->quiesce();

This calls into the intel/amd implementations and iterates over
pfm_pmu_conf->regs_all.pmcs and clears the event select registers.

How is that NMI safe against pfm_arch_write_pmc() in other places ?

> + /*
> + * record actual instruction pointer
> + */
> + __get_cpu_var(real_iip) = instruction_pointer(args->regs);
> +
> + /*
> + * post lower priority interrupt (LOCAL_PERFMON_VECTOR)
> + */
> + pfm_arch_resend_irq(ctx);

Do we really need this whole NMI business ?

> + /*
> + * we need to rewrite the APIC vector on Intel
> + */
> + if (current_cpu_data.x86_vendor == X86_VENDOR_INTEL)
> + apic_write(APIC_LVTPC, APIC_DM_NMI);

Why reenable _before_ we handled the current set ?

> +/**
> + * pfm_arch_resend_irq - post perfmon interrupt on regular vector
> + *
> + * called from pfm_ctxswin_thread() and pfm_handle_nmi()
> + */
> +void pfm_arch_resend_irq(struct pfm_context *ctx)
> +{

> + * Instead we send ourself an IPI on the perfmon
> + * vector.
> + */
> + val = APIC_DEST_SELF|APIC_INT_ASSERT|
> + APIC_DM_FIXED|LOCAL_PERFMON_VECTOR;

Please use

send_IPI_self(LOCAL_PERFMON_VECTOR);

instead of open coding it.

Your code is also missing apic_wait_icr_idle() so it would kill a
queued, but not yet delivered IPI.
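
Untested sketch of what that would look like if the open-coded version
is kept (send_IPI_self() would hide all of this):

void pfm_arch_resend_irq(struct pfm_context *ctx)
{
	unsigned long val;

	val = APIC_DEST_SELF | APIC_INT_ASSERT |
	      APIC_DM_FIXED | LOCAL_PERFMON_VECTOR;

	/* do not clobber a queued, not yet delivered IPI */
	apic_wait_icr_idle();
	/* with APIC_DEST_SELF the ICR2 destination field is ignored */
	apic_write(APIC_ICR, val);
}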

> + PFM_DBG("pmc%d(%s) already used", i, d->desc);

can't we simply use pr_debug ?

> + if (pmu_info->flags & PFM_X86_FL_NO_SHARING) {
> + PFM_INFO("PMU already used by another subsystem, "

printk please

> +/**
> + * pfm-arch_pmu_release_percpu - clear NMI state for one CPU
> + *
> + */
> +static void pfm_arch_pmu_release_percpu(void *data)
> +{
> + struct pfm_arch_pmu_info *pmu_info;
> +
> + pmu_info = pfm_pmu_conf->pmu_info;
> +
> + __get_cpu_var(pfm_using_nmi) = 0;
> + /*
> + * invoke model specific release routine.
> + */
> + if (pmu_info->release_pmu_percpu)
> + pmu_info->release_pmu_percpu();
> +}
> +
> +/**
> + * pfm_arch_pmu_release - release PMU resource to system
> + *
> + * called from pfm_pmu_release()
> + * interrupts are not masked
> + *
> + * On x86, we return the PMU registers to the MSR allocator
> + */
> +void pfm_arch_pmu_release(void)
> +{
> + struct pfm_regmap_desc *d;
> + u16 i, n;
> +
> + d = pfm_pmu_conf->pmc_desc;
> + n = pfm_pmu_conf->regs_all.num_pmcs;
> + for (i = 0; n; i++, d++) {
> + if (!pfm_arch_bv_test_bit(i, pfm_pmu_conf->regs_all.pmcs))
> + continue;
> + release_evntsel_nmi(d->hw_addr);
> + n--;
> + PFM_DBG("pmc%u released", i);
> + }
> + d = pfm_pmu_conf->pmd_desc;
> + n = pfm_pmu_conf->regs_all.num_pmds;
> + for (i = 0; n; i++, d++) {
> + if (!pfm_arch_bv_test_bit(i, pfm_pmu_conf->regs_all.pmds))
> + continue;
> + release_perfctr_nmi(d->hw_addr);
> + n--;
> + PFM_DBG("pmd%u released", i);
> + }

What is masking the vectors ?

> + /* clear NMI variable if used */
> + if (__get_cpu_var(pfm_using_nmi))
> + on_each_cpu(pfm_arch_pmu_release_percpu, NULL , 1);
> +}
> +

> Index: o3/arch/x86/include/asm/perfmon_kern.h

> +static inline void pfm_arch_unload_context(struct pfm_context *ctx)
> +static inline int pfm_arch_load_context(struct pfm_context *ctx)

> +void pfm_arch_restore_pmcs(struct pfm_context *ctx, struct pfm_event_set *set);
> +void pfm_arch_start(struct task_struct *task, struct pfm_context *ctx);
> +void pfm_arch_stop(struct task_struct *task, struct pfm_context *ctx);

> +static inline void pfm_arch_intr_freeze_pmu(struct pfm_context *ctx,
> +static inline void pfm_arch_intr_unfreeze_pmu(struct pfm_context *ctx)
> +static inline void pfm_arch_ovfl_reset_pmd(struct pfm_context *ctx, unsigned int cnum)

80 chars.

> +static inline int pfm_arch_context_create(struct pfm_context *ctx, u32 ctx_flags)

Ditto.

> +static inline void pfm_arch_context_free(struct pfm_context *ctx)

> +/*
> + * functions implemented in arch/x86/perfmon/perfmon.c
> + */

The above non inlines are in perfmon.c as well.

> +int pfm_arch_init(void);
> +void pfm_arch_resend_irq(struct pfm_context *ctx);
> +
> +int pfm_arch_ctxswout_thread(struct task_struct *task, struct pfm_context *ctx);
> +void pfm_arch_ctxswin_thread(struct task_struct *task, struct pfm_context *ctx);
> +
> +void pfm_arch_restore_pmds(struct pfm_context *ctx, struct pfm_event_set *set);
> +int pfm_arch_pmu_config_init(struct pfm_pmu_config *cfg);
> +void pfm_arch_pmu_config_remove(void);
> +char *pfm_arch_get_pmu_module_name(void);
> +int pfm_arch_pmu_acquire(u64 *unavail_pmcs, u64 *unavail_pmds);
> +void pfm_arch_pmu_release(void);
> +
> +static inline void pfm_arch_serialize(void)
> +{}
> +
> +static inline void pfm_arch_arm_handle_work(struct task_struct *task)
> +{}
> +
> +static inline void pfm_arch_disarm_handle_work(struct task_struct *task)
> +{}
> +
> +#define PFM_ARCH_CTX_SIZE (sizeof(struct pfm_arch_context))
> +/*
> + * x86 does not need extra alignment requirements for the sampling buffer
> + */
> +#define PFM_ARCH_SMPL_ALIGN_SIZE 0

Defines in one place please

> +asmlinkage void pmu_interrupt(void);

> +static inline void pfm_arch_bv_copy(u64 *a, u64 *b, int nbits)
> +{
> + bitmap_copy((unsigned long *)a,
> + (unsigned long *)b,
> + nbits);

bitmap_copy((unsigned long *) a, (unsigned long *) b, nbits);

Please.

> +static inline void pfm_arch_bv_or(u64 *a, u64 *b, u64 *c, int nbits)
> +static inline void pfm_arch_bv_and(u64 *a, u64 *b, u64 *c, int nbits)
> +static inline void pfm_arch_bv_zero(u64 *a, int nbits)
> +static inline int pfm_arch_bv_weight(u64 *a, int nbits)
> +static inline void pfm_arch_bv_set_bit(int b, u64 *a)
> +static inline void pfm_arch_bv_clear_bit(int b, u64 *a)
> +static inline int pfm_arch_bv_test_bit(int b, u64 *a)
> +static inline unsigned long pfm_arch_bv_find_next_bit(const u64 *addr,

9 simple wrappers around generic bitops. The only reason you need
those is because you use 64-bit variables and that does not work on
32-bit BE machines.

I do not understand in the first place why you can't use simple
unsigned longs for the bitfields, but if this is necessary for
whatever non-obvious reason, then it's not an excuse to make this
arch-dependent code at all. You need a LE/BE64 and a BE32 version. So you
need a generic and a special BE32 version. That's not arch specific.

Thanks,

tglx

2008-11-26 13:50:42

by Andi Kleen

Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Wed, Nov 26, 2008 at 02:35:18PM +0100, Thomas Gleixner wrote:
> > + * does not work with other types of PMU registers.Thus, no
> > + * address is ever exposed by counters
> > + *
> > + * - there is never a dependency between one pmd register and
> > + * another
> > + */
> > + for (i = 0; num; i++) {
> > + if (likely(pfm_arch_bv_test_bit(i, set->used_pmds))) {
> > + pfm_write_pmd(ctx, i, set->pmds[i]);
> > + num--;
> > + }
> > + }
>
> This loop construct looks scary. It relies on set->nused_pmds >=
> bits set in set->used_pmds. I had to look more than once to
> understand that. It's used all over the code in variations.

FWIW this loop style tripped me up during review too.

> > + */
> > + pfm_arch_resend_irq(ctx);
>
> Do we really need this whole NMI business ?

Without it you cannot profile interrupts-off regions well.

>
> 9 simple wrappers around generic bitops. The only reason you need
> those is because you use 64bit variables and that does not work on
> 32bit BE machines.
>
> I do not understand in the first place why you cant use simple
> unsigned longs for the bitfields, but if this is necessary for
> whatever non obvious reason, then its not an excuse to make this arch
> dependent code at all. You need a LE/BE64 and a BE32 version. So you
> need a generic and a special be32 version. That's not arch specific.

Or an unsigned long x[VALUE_DEPENDS_ON_WORD_SIZE]
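
I.e., something along these lines (sketch only, the size is illustrative);
the storage stays fixed at 256 bits while the element type is unsigned
long, so the generic bitops work unchanged on LE, BE64 and BE32:

#define PFM_MAX_PMDS	256

struct pfm_event_set {
	/* expands to: unsigned long used_pmds[256 / BITS_PER_LONG] */
	DECLARE_BITMAP(used_pmds, PFM_MAX_PMDS);
	/* ... */
};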

-Andi

2008-11-26 13:56:57

by Stephane Eranian

Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

Thomas,

On Wed, Nov 26, 2008 at 2:32 PM, Thomas Gleixner <[email protected]> wrote:
> On Wed, 26 Nov 2008, Stephen Rothwell wrote:
>> Hi Andi,
>>
>> On Wed, 26 Nov 2008 12:33:09 +0100 Andi Kleen <[email protected]> wrote:
>> >
>> > > + * x86 does not need extra alignment requirements for the sampling buffer
>> > > + */
>> > > +#define PFM_ARCH_SMPL_ALIGN_SIZE 0
>> > > +
>> > > +asmlinkage void pmu_interrupt(void);
>> > > +
>> > > +static inline void pfm_arch_bv_copy(u64 *a, u64 *b, int nbits)
>> >
>> > All these bitmap wrappers just seem like unnecessary obfuscation.
>> > Could you just drop them and call the standard functions directly?
>>
>> These were added after comments from the PowerPC maintainer since how the
>> bitmaps are accessed needs to be arch specific.
>
> Just because perfmon uses u64 for bitfields instead of unsigned
> long, therefor it breaks on BE32 machines.
>
> I have not yet found a good reason why it needs to use u64 instead of
> using what's there already.
>
There is a good reason why we cannot use unsigned long. We must make sure
all data structures exchanged between user mode and the kernel have fixed size.
This way, we can have a 32-bit tool run unmodified on top of a 64-bit kernel AND
we do not need trampoline code to marshall/unmarshall the parameters.

And yes, the abstraction for bitmask ops was introduced because of issues
casting u64 -> unsigned long on Big-Endian32-bit machines such as PPC32.

2008-11-26 16:39:51

by Thomas Gleixner

Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

Stephane,

On Wed, 26 Nov 2008, stephane eranian wrote:
> > I have not yet found a good reason why it needs to use u64 instead of
> > using what's there already.
> >
> There is a good reason why we cannot use unsigned long. We must make sure
> all data structures exchanged between user mode and the kernel have fixed size.
> This way, we can have a 32-bit tool run unmodified on top of a 64-bit kernel AND
> we do not need trampoline code to marshall/unmarshall the parameters.

That's not a good reason at all. We have in-kernel interfaces and
kernel-userspace interfaces. Making them the same is nice if it works,
but horrible if it imposes crappy hackery like the bitops wrappers.

> And yes, the abstraction for bitmask ops was introduced because of issues
> casting u64 -> unsigned long on Big-Endian32-bit machines such as PPC32.

Sorry, I think it is simply stupid.

You can keep the userspace interface u64 and use unsigned long for the
bitmasks in the kernel and take care of it in the user space interface
code and do the BE32 conversion when you copy stuff from and to user.

That's a single well-defined place and does not add extra crappola
all over the kernel, especially not into hot paths like the interrupt.

Why do you want to do u64 -> u32 BE32 magic on every interrupt,
context switch etc., if you can do it once in the userspace interface ?
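
Untested sketch of that conversion, done once when copying the request
in from user space (names are made up):

/* ABI side stays a fixed-size u64; kernel side is a normal bitmap */
static void pfm_uapi_mask_to_bitmap(unsigned long *dst, u64 src)
{
	/* dst is DECLARE_BITMAP(..., 64); bit N of src becomes bit N of dst */
#if BITS_PER_LONG == 64
	dst[0] = (unsigned long)src;
#else
	dst[0] = (unsigned long)(src & 0xffffffffULL);
	dst[1] = (unsigned long)(src >> 32);
#endif
}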

Thanks,

tglx

2008-11-26 21:19:29

by Thomas Gleixner

Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Wed, 26 Nov 2008, Andi Kleen wrote:

> On Wed, Nov 26, 2008 at 02:35:18PM +0100, Thomas Gleixner wrote:
> > > + * does not work with other types of PMU registers.Thus, no
> > > + * address is ever exposed by counters
> > > + *
> > > + * - there is never a dependency between one pmd register and
> > > + * another
> > > + */
> > > + for (i = 0; num; i++) {
> > > + if (likely(pfm_arch_bv_test_bit(i, set->used_pmds))) {
> > > + pfm_write_pmd(ctx, i, set->pmds[i]);
> > > + num--;
> > > + }
> > > + }
> >
> > This loop construct looks scary. It relies on set->nused_pmds >=
> > bits set in set->used_pmds. I had to look more than once to
> > understand that. It's used all over the code in variations.
>
> FWIW this loop style tripped me up during review too.

It's even worse than I thought when looking at it a second time:

All the loops iterate over an array which means in the worst case we
do full array_size iterations. In each iteration we check whether the
corresponding bit in the bitmask is set or not.

What a nonsense. We have a bitmask already. Why not iterate over the
bitmask and be done ?

Thanks,

tglx

2008-11-26 21:37:26

by Stephane Eranian

Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

Thomas,

On Wed, Nov 26, 2008 at 10:18 PM, Thomas Gleixner <[email protected]> wrote:
> On Wed, 26 Nov 2008, Andi Kleen wrote:
>
>> On Wed, Nov 26, 2008 at 02:35:18PM +0100, Thomas Gleixner wrote:
>> > > + * does not work with other types of PMU registers.Thus, no
>> > > + * address is ever exposed by counters
>> > > + *
>> > > + * - there is never a dependency between one pmd register and
>> > > + * another
>> > > + */
>> > > + for (i = 0; num; i++) {
>> > > + if (likely(pfm_arch_bv_test_bit(i, set->used_pmds))) {
>> > > + pfm_write_pmd(ctx, i, set->pmds[i]);
>> > > + num--;
>> > > + }
>> > > + }
>> >
>> > This loop construct looks scary. It relies on set->nused_pmds >=
>> > bits set in set->used_pmds. I had to look more than once to
>> > understand that. It's used all over the code in variations.
>>
>> FWIW this loop style tripped me up during review too.
>
> It's even worse than I thought when looking at it a second time:
>
> All the loops iterate over an array which means in the worst case we
> do full array_size iterations. In each iteration we check whether the
> corresponding bit in the bitmask is set or not.
>
> What a nonsense. We have a bitmask already. Why not iterate over the
> bitmask and be done ?
>

The bitmask can be sparse. Num represents the number of bits we have to find.
The idea is that we don't need to scan the entire bitmask, we stop as soon as
we have found all the bits we care about (i.e., all the bits that are set).

Example:
num = 3
bitmask=0000000010001001
                ^ we will iterate until we are
                  done with that bit.
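
In loop form, the pattern being described is roughly the following
(a standalone sketch of the idea, not the patch code; the callback
name is made up):

#include <stdint.h>

/*
 * Visit the set bits of 'mask', but stop as soon as 'num' of them
 * (the precomputed count, nused_pmds in the patch) have been handled.
 * 'num' must not exceed the number of set bits in 'mask'.
 * Worst case: one iteration per bit position up to the highest set bit.
 */
static void visit_used_pmds(uint64_t mask, unsigned int num,
			    void (*handle)(unsigned int bit))
{
	for (unsigned int i = 0; num; i++) {
		if (mask & (1ULL << i)) {
			handle(i);
			num--;
		}
	}
}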

2008-11-26 22:55:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Wed, 26 Nov 2008, Andi Kleen wrote:
> On Wed, Nov 26, 2008 at 02:35:18PM +0100, Thomas Gleixner wrote:
> > > + */
> > > + pfm_arch_resend_irq(ctx);
> >
> > Do we really need this whole NMI business ?
>
> Without it you cannot profile interrupts off regions well.

Fair enough, but I doubt that this is a real solution.

There is not even an attempt to avoid the obvious wrmsrl races, while
there are several comments which explain how expensive wrmsrl is. In
the NMI handler we enable the NMI right away. This might cause
multiple NMIs for nothing when the NMIs hit between the manipulations
of the counters. Not likely, but it can happen depending on the counter
settings.

Sending a self-IPI from NMI simply sucks: for every NMI we get an
extra local interrupt and an extra 2 * NR_ACTIVE_COUNTERS accesses
to MSRs.

Designing that code to use lockless buffers instead is not really
rocket science.
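
A per-CPU, single-writer/single-reader ring is all that would be needed
here. A rough sketch, with C11 atomics standing in for the kernel's
per-CPU data and memory barriers (all names made up):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 256			/* power of two, one ring per CPU */

struct sample_ring {
	_Atomic unsigned long head;	/* written only by the NMI/interrupt side */
	_Atomic unsigned long tail;	/* written only by the reader side */
	uint64_t data[RING_SIZE];
};

/* Producer: called from the PMU interrupt/NMI on this CPU, never blocks. */
static bool ring_push(struct sample_ring *r, uint64_t sample)
{
	unsigned long head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned long tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false;			/* full: drop the sample */
	r->data[head & (RING_SIZE - 1)] = sample;
	/* publish the data before making the new head visible */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer: the (single) reader draining this CPU's ring. */
static bool ring_pop(struct sample_ring *r, uint64_t *sample)
{
	unsigned long tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned long head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (head == tail)
		return false;			/* empty */
	*sample = r->data[tail & (RING_SIZE - 1)];
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}

Multiple userspace readers would still coordinate among themselves with an
ordinary lock; only the NMI-side writer has to stay lock-free.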

This code sucks performance-wise in all aspects I have looked at so
far.

Thanks,

tglx

2008-11-26 23:17:54

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Wed, 26 Nov 2008, stephane eranian wrote:
> > What a nonsense. We have a bitmask already. Why not iterate over the
> > bitmask and be done ?
> >
>
> Bitmask can be sparsed. Num represents the number of bits we have to find.
> The idea is that we don't need to scan the entire bitmask, we stop as soon as
> we have found all the bits we care about (i.e., all the bits that are set).
>
> Example:
> num = 3
> bitmask=0000000010001001
> ^ we will iterate until we are
> done with that bit.

Errm.

#define for_each_bit(bit, addr, size) \
	for ((bit) = find_first_bit((addr), (size)); \
	     (bit) < (size); \
	     (bit) = find_next_bit((addr), (size), (bit) + 1))

find_first_bit() and find_next_bit() are single instructions on most
architectures. "size" is known upfront at setup time of the
context/set and can be cached.

This takes exactly 3 iterations, while your method needs 8. And it
gets worse with the following example:

Example:
num = 1
bitmask=1000 0000 0000 0000 0000 0000 0000 0000
        ^ you will iterate until we are done with that bit (32 times)

for_each_bit() will iterate exactly _once_.
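
As a standalone illustration of the same idea in plain C (GCC's
__builtin_ctzll standing in for find_first_bit()/find_next_bit(); a sketch
only, the callback name is made up):

#include <stdint.h>

/* Visit every set bit of 'mask': exactly one iteration per set bit. */
static void visit_used_pmds_sparse(uint64_t mask,
				   void (*handle)(unsigned int bit))
{
	while (mask) {
		unsigned int bit = (unsigned int)__builtin_ctzll(mask);	/* lowest set bit */
		handle(bit);
		mask &= mask - 1;					/* clear that bit */
	}
}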

Thanks,

tglx

2008-11-27 09:38:20

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

Thomas,

On Thu, Nov 27, 2008 at 12:16 AM, Thomas Gleixner <[email protected]> wrote:
> On Wed, 26 Nov 2008, stephane eranian wrote:
>> > What a nonsense. We have a bitmask already. Why not iterate over the
>> > bitmask and be done ?
>> >
>>
>> Bitmask can be sparsed. Num represents the number of bits we have to find.
>> The idea is that we don't need to scan the entire bitmask, we stop as soon as
>> we have found all the bits we care about (i.e., all the bits that are set).
>>
>> Example:
>> num = 3
>> bitmask=0000000010001001
>> ^ we will iterate until we are
>> done with that bit.
>
> Errm.
>
> #define for_each_bit(bit, addr, size) \
>	for ((bit) = find_first_bit((addr), (size)); \
>	     (bit) < (size); \
>	     (bit) = find_next_bit((addr), (size), (bit) + 1))
>
> find_first_bit() and find_next_bit() are single instructions on most
> architectures. "size" is known upfront at setup time of the
> context/set and can be cached.
>
> This takes exactly 3 iterations, while your method needs 8. And it
> gets worse with the following example:
>
> Example:
> num = 1
> bitmask=1000 0000 0000 0000 0000 0000 0000 00000
>
> ^ you will iterate until we are done with that bit (32 times)
>
> for_each_bit() will iterate exactly _once_.
>
Ok, you've convinced me. I will make the change.
Thanks.

2008-11-27 09:51:49

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

Thomas,

On Wed, Nov 26, 2008 at 5:38 PM, Thomas Gleixner <[email protected]> wrote:
> Stephane,
>
> On Wed, 26 Nov 2008, stephane eranian wrote:
>> > I have not yet found a good reason why it needs to use u64 instead of
>> > using what's there already.
>> >
>> There is a good reason why we cannot use unsigned long. We must make sure
>> all data structures exchanged between user mode and the kernel have fixed size.
>> This way, we can have a 32-bit tool run unmodified on top of a 64-bit kernel AND
>> we do not need trampoline code to marshall/unmarshall the parameters.
>
> That's not a good reason at all. We have in kernel interfaces and
> kernel-userspace interfaces. Making them the same is nice if it works,
> but horrible if it imposes crappy hackery like the bitops wrappers.
>
>> And yes, the abstraction for bitmask ops was introduced because of issues
>> casting u64 -> unsigned long on Big-Endian32-bit machines such as PPC32.
>
> Sorry, I think it is simply stupid.
>
> You can keep the userspace interface u64 and use unsigned long for the
> bitmasks in the kernel and take care of it in the user space interface
> code and do the BE32 conversion when you copy stuff from and to user.
>
> That's a single well defined place and does not add extra crappola
> over the kernel especially not into hot pathes like the interrupt.
>
> Why do you want to do u64 -> u32 BE32 magic on every interrupt,
> context switch etc., if you can do it once in the userspace interface ?
>
I think we agree as to why the user visible structures have to have fixed size.

Some structures have bitfields in them (not in the current patchset yet).
Here is an example:

#define PFM_PMD_BV (256/sizeof(__u64))

struct pfarg_pmd_attr {
	__u16 reg_num;		/* which register */
	__u16 reg_set;		/* which event set */
	__u32 reg_flags;	/* REGFL flags */
	__u64 reg_value;	/* 64-bit value */
	__u64 reg_long_reset;	/* write: value to reload after notification */
	__u64 reg_short_reset;	/* write: reset after counter overflow */
	__u64 reg_random_mask;	/* write: bitmask used to limit random value */
	__u64 reg_smpl_pmds[PFM_PMD_BV];	/* write: record in sample */
	__u64 reg_reset_pmds[PFM_PMD_BV];	/* write: reset on overflow */
	__u64 reg_ovfl_swcnt;	/* write: # of overflows before switch */
	__u64 reg_smpl_eventid;	/* write: opaque event identifier */
	__u64 reg_last_value;	/* read : return: PMD last reset value */
	__u64 reg_reserved[8];	/* for future use */
};

So you are advocating keeping that layout for the user level code,
i.e., the user level perfmon.h. But then, in the kernel, you'd have a
different version of the same structure, with the same name and size,
but with the bitmasks defined as unsigned long instead. All internal-only
bitmasks would also be unsigned long. So the structure would look as follows:

#define PFM_PMD_BV (256/sizeof(unsigned long))
struct pfarg_pmd_attr {
	__u16 reg_num;		/* which register */
	__u16 reg_set;		/* which event set */
	__u32 reg_flags;	/* REGFL flags */
	__u64 reg_value;	/* 64-bit value */
	__u64 reg_long_reset;	/* write: value to reload after notification */
	__u64 reg_short_reset;	/* write: reset after counter overflow */
	__u64 reg_random_mask;	/* write: bitmask used to limit random value */
	unsigned long reg_smpl_pmds[PFM_PMD_BV];	/* write: record in sample */
	unsigned long reg_reset_pmds[PFM_PMD_BV];	/* write: reset on overflow */
	__u64 reg_ovfl_swcnt;	/* write: # of overflows before switch */
	__u64 reg_smpl_eventid;	/* write: opaque event identifier */
	__u64 reg_last_value;	/* read : return: PMD last reset value */
	__u64 reg_reserved[8];	/* for future use */
};

Then we could not directly export include/linux/perfmon.h to user space
via Kbuild.

2008-11-27 09:55:43

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Wed, Nov 26, 2008 at 11:54:30PM +0100, Thomas Gleixner wrote:
> On Wed, 26 Nov 2008, Andi Kleen wrote:
> > On Wed, Nov 26, 2008 at 02:35:18PM +0100, Thomas Gleixner wrote:
> > > > + */
> > > > + pfm_arch_resend_irq(ctx);
> > >
> > > Do we really need this whole NMI business ?
> >
> > Without it you cannot profile interrupts off regions well.
>
> Fair enough, but I doubt that this is a real solution.
>
> There is not even an attempt to avoid the obvious wrmrsl races, while
> there are several comments which explain how expensive wrmrsl is. In
> the NMI handler we enable the NMI right away. This might cause
> multiple NMIs for nothing when the NMIs hit between the manipulations
> of the counters. Not likely but can happen depending on the counter
> settings.
>
> Sending an self-IPI from NMI simply sucks: For every NMI we get an
> extra local interrupt and we have an extra of 2 * NR_ACTIVE_COUNTERS
> accesses to MSRs.

On newer Intel CPUs the counters can be reset/rearmed by accessing
only a few global control MSRs. But it's probably still a problem
on other PMUs.

On the other hand it also has PEBS which allows at least some
profiling of irq-off regions without using NMIs.
>
> Designing that code to use lockless buffers instead is not really
> rocket science.

Lockless buffers are nasty, but it works in oprofile at least.

Taking out NMIs in the first version at least seems like a reasonable
solution. After all you can still use standard oprofile where they work
just fine.

-Andi

--
[email protected]

2008-11-27 10:09:49

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

Andi, Thomas,

On Thu, Nov 27, 2008 at 11:06 AM, Andi Kleen <[email protected]> wrote:
> On Wed, Nov 26, 2008 at 11:54:30PM +0100, Thomas Gleixner wrote:
>> On Wed, 26 Nov 2008, Andi Kleen wrote:
>> > On Wed, Nov 26, 2008 at 02:35:18PM +0100, Thomas Gleixner wrote:
>> > > > + */
>> > > > + pfm_arch_resend_irq(ctx);
>> > >
>> > > Do we really need this whole NMI business ?
>> >
>> > Without it you cannot profile interrupts off regions well.
>>
>> Fair enough, but I doubt that this is a real solution.
>>
>> There is not even an attempt to avoid the obvious wrmrsl races, while
>> there are several comments which explain how expensive wrmrsl is. In
>> the NMI handler we enable the NMI right away. This might cause
>> multiple NMIs for nothing when the NMIs hit between the manipulations
>> of the counters. Not likely but can happen depending on the counter
>> settings.
>>
>> Sending an self-IPI from NMI simply sucks: For every NMI we get an
>> extra local interrupt and we have an extra of 2 * NR_ACTIVE_COUNTERS
>> accesses to MSRs.
>
> In newer Intel the counters can be reset/rearmed by accessing
> only a few global control msrs. But it's probably still a problem
> on other PMUs.
>
> On the other hand it also has PEBS which allows at least some
> profiling of irq-off regions without using NMIs.
>>
>> Designing that code to use lockless buffers instead is not really
>> rocket science.
>
> Lockless buffers are nasty, but it works in oprofile at least.
>
> Taking out NMis in the first version at least seems like a reasonable
> solution. After all you can still use standard oprofile where they work
> just fine.
>
The only reason why I have to deal with NMI is not so much to allow
for profiling irq-off regions but because I have to share the PMU with
the NMI watchdog. Otherwise I'd have to fail or disable the NMI watchdog
on the fly.

2008-11-27 10:30:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, 27 Nov 2008, stephane eranian wrote:
> > Lockless buffers are nasty, but it works in oprofile at least.
> >
> > Taking out NMis in the first version at least seems like a reasonable
> > solution. After all you can still use standard oprofile where they work
> > just fine.
> >
> The only reason why I have to deal with NMI is not so much to allow
> for profiling irq-off regions but because I have to share the PMU with
> the NMI watchdog. Otherwise I'd have to fail or disable the NMI watchdog
> on the fly.

We can limit the functionality here. Fail or disable the NMI watchdog
are both valid solutions.

Thanks,

tglx

2008-11-27 10:59:03

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

Stephane,

On Thu, 27 Nov 2008, stephane eranian wrote:
> > Why do you want to do u64 -> u32 BE32 magic on every interrupt,
> > context switch etc., if you can do it once in the userspace interface ?
> >
> I think we agree as to why the user visible structures have to have fixed size.
>
> Some structures have bitfields in them (no in the current patchset yet).
> Here is an example:
>
> #define PFM_PMD_BV (256/sizeof(__u64))
>
> struct pfarg_pmd_attr {
> __u16 reg_num; /* which register */
> __u16 reg_set; /* which event set */
> __u32 reg_flags; /* REGFL flags */
> __u64 reg_value; /* 64-bit value */
> __u64 reg_long_reset; /* write: value to reload after notification */
> __u64 reg_short_reset; /* write: reset after counter overflow */
> __u64 reg_random_mask; /* write: bitmask used to limit random value */
> __u64 reg_smpl_pmds[PFM_PMD_BV]; /* write: record in sample */
> __u64 reg_reset_pmds[PFM_PMD_BV]; /* write: reset on overflow */
> __u64 reg_ovfl_swcnt; /* write: # of overflows before switch */
> __u64 reg_smpl_eventid; /* write: opaque event identifier */
> __u64 reg_last_value; /* read : return: PMD last reset value */
> __u64 reg_reserved[8]; /* for future use */
> };

__attribute__ ((packed)); ???

> So you are advocating keeping that layout for the user level code,
> i.e., the user level perfmon.h.
> But then, in the kernel, you'd have a different version of the same
> structure with the same name
> and the size but with the bitmask defined as unsigned long instead.
> All internal only bitmask
> would also be unsigned long. So the structure would like as follows:
>
> #define PFM_PMD_BV (256/sizeof(unsigned long))
> struct pfarg_pmd_attr {
> __u16 reg_num; /* which register */
> __u16 reg_set; /* which event set */
> __u32 reg_flags; /* REGFL flags */
> __u64 reg_value; /* 64-bit value */
> __u64 reg_long_reset; /* write: value to reload after notification */
> __u64 reg_short_reset; /* write: reset after counter overflow */
> __u64 reg_random_mask; /* write: bitmask used to limit random value */
> unsigned long reg_smpl_pmds[PFM_PMD_BV]; /* write: record in sample */
> unsigned long reg_reset_pmds[PFM_PMD_BV]; /* write: reset on overflow */
> __u64 reg_ovfl_swcnt; /* write: # of overflows before switch */
> __u64 reg_smpl_eventid; /* write: opaque event identifier */
> __u64 reg_last_value; /* read : return: PMD last reset value */
> __u64 reg_reserved[8]; /* for future use */
> };
>
> Then we could not directly export include/linux/perfmon.h to user
> via Kbuild.

I really can not see the problem.

1) perfmon.h can have a section for kernel and user space. We have
tons of examples of this in the kernel already.

The user space data structures are separate, simply because they are
a user space ABI and cannot be changed.

Kernel data structures can be completely different from the user ABI
and can be changed at any given time as the code evolves. The design
you are proposing ties the in-kernel data structures to the user
ABI forever.

I can see that it might be good performance-wise if you don't have to
touch stuff twice, but reshuffling the bitmasks should not be that
expensive.

2) You can have an union of

union {
	__u64 bla[N];
	unsigned long blub[M];
};

Where M = N for 64 bit and M = 2 * N for 32 bit.
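
Concretely, sized for a 256-bit PMD mask, such a union could look like
this (kernel-side sketch only, names made up):

/* One mask, two views: u64 for the user ABI, unsigned long for bitops. */
#define PFM_PMD_MAX_BITS	256

union pfm_pmd_mask {
	__u64		abi[PFM_PMD_MAX_BITS / 64];		/* fixed-size user ABI view */
	unsigned long	bits[PFM_PMD_MAX_BITS / BITS_PER_LONG];	/* kernel bitops view */
};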

FYI, I talked to Paulus about that and he thinks that doing the PPC
BE32 bitops for u64 is pretty cheap, but we agreed that this is
neither a perfmon nor an arch feature and has to be done as generically
as possible, similar to the atomic64 ops. That way we have just two
implementations (generic and BE32) and not 10 copies of the same
wrappers around ulong bitops all over the place.

Also test_bit_64() is definitely more acceptable and more useful than
pfm_arch_bv_test_bit().
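
A generic helper of that kind is trivial as long as it operates on u64
words directly; only a variant that maps onto native 32-bit bitops would
need the BE32 word-index swap. A sketch, not proposed kernel code:

#include <linux/types.h>

/*
 * Sketch of a generic test_bit_64(): shifting within a u64 word is
 * endian-neutral, so this version works unchanged on BE32 as long as
 * the mask is only ever accessed as an array of u64.
 */
static inline int test_bit_64(int nr, const u64 *addr)
{
	return (int)((addr[nr >> 6] >> (nr & 63)) & 1);
}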

Thanks,

tglx

2008-11-27 11:20:58

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

> The only reason why I have to deal with NMI is not so much to allow
> for profiling irq-off regions but because I have to share the PMU with
> the NMI watchdog. Otherwise I'd have to fail or disable the NMI watchdog
> on the fly.

The NMI watchdog is now off by default so failing with it enabled
is fine.

Longer term having NMI profiling is still a useful feature I think,
but of course it needs to be implemented cleanly.

-Andi

--
[email protected]

2008-11-27 11:36:14

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, Nov 27, 2008 at 12:31 PM, Andi Kleen <[email protected]> wrote:
>> The only reason why I have to deal with NMI is not so much to allow
>> for profiling irq-off regions but because I have to share the PMU with
>> the NMI watchdog. Otherwise I'd have to fail or disable the NMI watchdog
>> on the fly.
>
> The NMI watchdog is now off by default so failing with it enabled
> is fine.

Yes, but most likely it is on in distro kernels.

>
> Longer term having NMI profiling is still a useful feature I think,
> but of course it needs to be implemented cleanly.
>

The difficulty with NMI is locking. Unlike Oprofile, perfmon code needs
locking. Recall our discussion yesterday about passing the file descriptor
around, or even using it with multiple threads inside the same process:
you have to handle the case where the NMI fires while you are holding
a perfmon lock. What you have in the patch (and in the fully-featured version)
is that we get the NMI and we stop the PMU WITHOUT grabbing any perfmon
lock, and then we repost the interrupt with the regular vector. We also make sure
we grab the RIP at NMI time, as that is the one we want to see reported in the
sampling buffer.

I am still wondering how Oprofile handles the case where multiple processes or
threads access the same file descriptor.

2008-11-27 11:37:52

by David Miller

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

From: Thomas Gleixner <[email protected]>
Date: Thu, 27 Nov 2008 11:56:24 +0100 (CET)

> Stephane,
>
> On Thu, 27 Nov 2008, stephane eranian wrote:
> > > Why do you want to do u64 -> u32 BE32 magic on every interrupt,
> > > context switch etc., if you can do it once in the userspace interface ?
> > >
> > I think we agree as to why the user visible structures have to have fixed size.
> >
> > Some structures have bitfields in them (no in the current patchset yet).
> > Here is an example:
> >
> > #define PFM_PMD_BV (256/sizeof(__u64))
> >
> > struct pfarg_pmd_attr {
> > __u16 reg_num; /* which register */
> > __u16 reg_set; /* which event set */
> > __u32 reg_flags; /* REGFL flags */
> > __u64 reg_value; /* 64-bit value */
> > __u64 reg_long_reset; /* write: value to reload after notification */
> > __u64 reg_short_reset; /* write: reset after counter overflow */
> > __u64 reg_random_mask; /* write: bitmask used to limit random value */
> > __u64 reg_smpl_pmds[PFM_PMD_BV]; /* write: record in sample */
> > __u64 reg_reset_pmds[PFM_PMD_BV]; /* write: reset on overflow */
> > __u64 reg_ovfl_swcnt; /* write: # of overflows before switch */
> > __u64 reg_smpl_eventid; /* write: opaque event identifier */
> > __u64 reg_last_value; /* read : return: PMD last reset value */
> > __u64 reg_reserved[8]; /* for future use */
> > };
>
> __attribute__ ((packed)); ???

Shouldn't be needed, and this would kill performance on RISC (every
struct member is accessed with byte-sized loads and shifts/masking).

We did this for the pfkey.h structures by accident, and I truly regret
it.

2008-11-27 11:42:56

by David Miller

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

From: "stephane eranian" <[email protected]>
Date: Thu, 27 Nov 2008 12:35:54 +0100

> I am still wondering how Oprofile handles the case where multiple
> processes or threads access the same file descriptor.

There's only one profiling buffer active on a given cpu,
so it's pure per-cpu value insertion.

In any event I think that NMI profiling is a must, especially
for the kernel. You get total unusable crap otherwise. I
just learned this the hard way having gotten an NMI'ish scheme
working on sparc64 just the other day.

2008-11-27 11:50:33

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, 27 Nov 2008, David Miller wrote:

> From: "stephane eranian" <[email protected]>
> Date: Thu, 27 Nov 2008 12:35:54 +0100
>
> > I am still wondering how Oprofile handles the case where multiple
> > processes or threads access the same file descriptor.
>
> There's only one profiling buffer active on a given cpu,
> so it's pure per-cpu value insertion.
>
> In any event I think that NMI profiling is a must, especially
> for the kernel. You get total unusable crap otherwise. I
> just learned this the hard way having gotten an NMI'ish scheme
> working on sparc64 just the other day.

Not arguing about that, I'm just not agreeing with the implementation.

So for the moment we can go w/o the NMI and implement it cleanly after
we got the initial lot in.

Thanks,

tglx

2008-11-27 11:52:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, 2008-11-27 at 12:35 +0100, stephane eranian wrote:
> On Thu, Nov 27, 2008 at 12:31 PM, Andi Kleen <[email protected]> wrote:
> >> The only reason why I have to deal with NMI is not so much to allow
> >> for profiling irq-off regions but because I have to share the PMU with
> >> the NMI watchdog. Otherwise I'd have to fail or disable the NMI watchdog
> >> on the fly.
> >
> > The NMI watchdog is now off by default so failing with it enabled
> > is fine.
>
> Yes, but most likely it is on in distro kernels.

So? You can disable it on the fly when there is a perfmon user.

2008-11-27 12:04:52

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

Peter,

On Thu, Nov 27, 2008 at 12:52 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2008-11-27 at 12:35 +0100, stephane eranian wrote:
>> On Thu, Nov 27, 2008 at 12:31 PM, Andi Kleen <[email protected]> wrote:
>> >> The only reason why I have to deal with NMI is not so much to allow
>> >> for profiling irq-off regions but because I have to share the PMU with
>> >> the NMI watchdog. Otherwise I'd have to fail or disable the NMI watchdog
>> >> on the fly.
>> >
>> > The NMI watchdog is now off by default so failing with it enabled
>> > is fine.
>>
>> Yes, but most likely it is on in distro kernels.
>
> So? You can disable it on the fly when there is a perfmon user.
>
Yes, you can. There is clearly an interface to do this, and I think this is the
best solution. I know it can work because I experimented with this approach
as recently as last month. But I ran into a bug which I reported on LKML. I did
not provide a patch because I did not fully understand the connection to
suspend/resume.

The bug has to do with some obscure suspend/resume sequence in:

void setup_apic_nmi_watchdog(void *unused)
{
	if (__get_cpu_var(wd_enabled))
		return;

	/* cheap hack to support suspend/resume */
	/* if cpu0 is not active neither should the other cpus */
	if (smp_processor_id() != 0 && atomic_read(&nmi_active) <= 0)
		return;

Basically, when you re-enable the NMI watchdog, it is not always re-enabled
correctly on all CPUs; it depends on the order in which they process the IPI.

2008-11-27 12:17:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, 2008-11-27 at 13:04 +0100, stephane eranian wrote:
> Peter,
>
> On Thu, Nov 27, 2008 at 12:52 PM, Peter Zijlstra <[email protected]> wrote:
> > On Thu, 2008-11-27 at 12:35 +0100, stephane eranian wrote:
> >> On Thu, Nov 27, 2008 at 12:31 PM, Andi Kleen <[email protected]> wrote:
> >> >> The only reason why I have to deal with NMI is not so much to allow
> >> >> for profiling irq-off regions but because I have to share the PMU with
> >> >> the NMI watchdog. Otherwise I'd have to fail or disable the NMI watchdog
> >> >> on the fly.
> >> >
> >> > The NMI watchdog is now off by default so failing with it enabled
> >> > is fine.
> >>
> >> Yes, but most likely it is on in distro kernels.
> >
> > So? You can disable it on the fly when there is a perfmon user.
> >
> Yes, you can. There is clearly an interface to do this. I think this is the
> best solution. I know it can work because it experimented with this approach
> no later than last month. But I ran into a bug which I reported on LKML. I did
> not provide a patch because I did not fully understand the connection to
> suspend/resume.
>
> The bug has to do with some obscure suspend/resume sequence in:
>
> void setup_apic_nmi_watchdog(void *unused)
> {
> if (__get_cpu_var(wd_enabled))
> return;
>
> /* cheap hack to support suspend/resume */
> /* if cpu0 is not active neither should the other cpus */
> if (smp_processor_id() != 0 && atomic_read(&nmi_active) <= 0)
> return;
>
> Basically, when you re-enable the NMI watchdog, it is not always re-enabled
> correctly on all CPUs, it depends on the order if which they process the IPI.

Hmm, either we lose that hack and fix the suspend/resume path properly,
or we can send the IPIs one by one in the correct order ;-)

Dunno, CC'ed all the folks who touched it last.

2008-11-27 12:22:37

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, Nov 27, 2008 at 12:35:54PM +0100, stephane eranian wrote:
> On Thu, Nov 27, 2008 at 12:31 PM, Andi Kleen <[email protected]> wrote:
> >> The only reason why I have to deal with NMI is not so much to allow
> >> for profiling irq-off regions but because I have to share the PMU with
> >> the NMI watchdog. Otherwise I'd have to fail or disable the NMI watchdog
> >> on the fly.
> >
> > The NMI watchdog is now off by default so failing with it enabled
> > is fine.
>
> Yes, but most likely it is on in distro kernels.

Really? Why?

Old distros of course do it but only because they run old
kernels.

> You have to handle the case where the NMI fires while you are holding
> a perfmon lock. What you have in the patch (and the the fully-featured version)
> is that we get the NMI and we stop the PMU WITHOUT grabbing any perfmon
> lock, and the we repost the interrupt with the regular vector. We also make sure
> we grab the RIP at NMI. That is the one we want to see reported in the sampling
> buffer.
>
> I am still wondering how Oprofile handles the case where multiple processes or
> threads access the same file descriptor.

It uses per CPU buffers (so no races on the writer) and readers can
of course use a lock to coordinate between themselves.

I wrote a similar scheme for mce_log() (although the current version
in tree has some issues too)

-Andi

2008-11-27 12:27:49

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, Nov 27, 2008 at 12:49:37PM +0100, Thomas Gleixner wrote:
> On Thu, 27 Nov 2008, David Miller wrote:
>
> > From: "stephane eranian" <[email protected]>
> > Date: Thu, 27 Nov 2008 12:35:54 +0100
> >
> > > I am still wondering how Oprofile handles the case where multiple
> > > processes or threads access the same file descriptor.
> >
> > There's only one profiling buffer active on a given cpu,
> > so it's pure per-cpu value insertion.
> >
> > In any event I think that NMI profiling is a must, especially
> > for the kernel. You get total unusable crap otherwise. I
> > just learned this the hard way having gotten an NMI'ish scheme
> > working on sparc64 just the other day.
>
> Not arguing about that, I'm just not agreeing with the implementation.
>
> So for the moment we can go w/o the NMI and implement it cleanly after
> we got the initial lot in.

Note that once Stephane re-adds PEBS support (it is currently stripped out)
you'll also be able to get somewhat reasonable results, at least on modern
Intel x86, without NMI profiling. But longer term NMI profiling is still
very useful because PEBS has some drawbacks too.

-Andi
--
[email protected]

2008-11-27 12:28:42

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, Nov 27, 2008 at 1:32 PM, Andi Kleen <[email protected]> wrote:
> On Thu, Nov 27, 2008 at 12:35:54PM +0100, stephane eranian wrote:
>> On Thu, Nov 27, 2008 at 12:31 PM, Andi Kleen <[email protected]> wrote:
>> >> The only reason why I have to deal with NMI is not so much to allow
>> >> for profiling irq-off regions but because I have to share the PMU with
>> >> the NMI watchdog. Otherwise I'd have to fail or disable the NMI watchdog
>> >> on the fly.
>> >
>> > The NMI watchdog is now off by default so failing with it enabled
>> > is fine.
>>
>> Yes, but most likely it is on in distro kernels.
>
> Really? Why?
>
To make sure they can get a crash dump off of production systems.

> Old distros of course do it but only because they run old
> kernels.
>
>> You have to handle the case where the NMI fires while you are holding
>> a perfmon lock. What you have in the patch (and the the fully-featured version)
>> is that we get the NMI and we stop the PMU WITHOUT grabbing any perfmon
>> lock, and the we repost the interrupt with the regular vector. We also make sure
>> we grab the RIP at NMI. That is the one we want to see reported in the sampling
>> buffer.
>>
>> I am still wondering how Oprofile handles the case where multiple processes or
>> threads access the same file descriptor.
>
> It uses per CPU buffers (so no races on the writer) and readers can
> of course use a lock to coordinate between themselves.
>
What if a thread reprograms the counters while another is reading them?
How is the buffer reset?

2008-11-27 12:31:25

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, Nov 27, 2008 at 1:38 PM, Andi Kleen <[email protected]> wrote:
> On Thu, Nov 27, 2008 at 12:49:37PM +0100, Thomas Gleixner wrote:
>> On Thu, 27 Nov 2008, David Miller wrote:
>>
>> > From: "stephane eranian" <[email protected]>
>> > Date: Thu, 27 Nov 2008 12:35:54 +0100
>> >
>> > > I am still wondering how Oprofile handles the case where multiple
>> > > processes or threads access the same file descriptor.
>> >
>> > There's only one profiling buffer active on a given cpu,
>> > so it's pure per-cpu value insertion.
>> >
>> > In any event I think that NMI profiling is a must, especially
>> > for the kernel. You get total unusable crap otherwise. I
>> > just learned this the hard way having gotten an NMI'ish scheme
>> > working on sparc64 just the other day.
>>
>> Not arguing about that, I'm just not agreeing with the implementation.
>>
>> So for the moment we can go w/o the NMI and implement it cleanly after
>> we got the initial lot in.
>
> Note once Stephane readds PEBS support (it is currently stripped out)
> you'll be also able to get somewhat reasonable results at least on modern
> Intel x86 without NMI profiling. But longer term it is still
> very useful because PEBS has some drawbacks too

That's true for Intel.
On AMD64, I think you can get at most one sample out of an irq-off
region using IBS.

2008-11-27 12:34:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

> What if a threads reprograms the counters while another is reading them?

In the worst case you get an invalid event, which is then discarded.
I think. I've never tried to understand it in all details, but
at least it seems to work.

> How is the buffer reset?

drivers/oprofile/cpu_buffer.c:

/* Resets the cpu buffer to a sane state. */
void cpu_buffer_reset(struct oprofile_cpu_buffer *cpu_buf)
{
	/* reset these to invalid values; the next sample
	 * collected will populate the buffer with proper
	 * values to initialize the buffer
	 */
	cpu_buf->last_is_kernel = -1;
	cpu_buf->last_task = NULL;
}

-Andi

--
[email protected]

2008-11-27 12:36:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

> That's true for Intel.
> On AMD64, I think you can get at most one sample out of a irq-off
> region using IBS.

Assuming that the irq-off regions are not too long (and that's
usually the case), that should be enough, shouldn't it?

Sure, there are still CPUs with neither IBS nor PEBS.

-Andi

--
[email protected]

2008-11-27 13:30:45

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, Nov 27, 2008 at 1:45 PM, Andi Kleen <[email protected]> wrote:
>> What if a threads reprograms the counters while another is reading them?
>
> In the worst case you get an invalid event, which is then discarded.
> I think. I've never tried to understand it in all details, but
> at least it seems to work.
>
>> How is the buffer reset?
>
> drivers/oprofile/cpu_buffer.c:
>
> /* Resets the cpu buffer to a sane state. */
> void cpu_buffer_reset(struct oprofile_cpu_buffer *cpu_buf)
> {
> /* reset these to invalid values; the next sample
> * collected will populate the buffer with proper
> * values to initialize the buffer
> */
> cpu_buf->last_is_kernel = -1;
> cpu_buf->last_task = NULL;
> }
>
What about a thread doing this and another one in the middle of reading the buffer?

Or what about a thread trying to reset the buffer while you're processing a PMU
interrupt on another CPU? I know each buffer is per-CPU, but that does not
prevent two threads from trying to operate on it at the same time from
different CPUs.

2008-11-27 13:34:00

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, 27 Nov 2008, stephane eranian wrote:
> That's true for Intel.
> On AMD64, I think you can get at most one sample out of a irq-off
> region using IBS.

Exactly the same amount which you get with the NMI/self_IPI out of an
irq-off region.

Thanks,

tglx

2008-11-27 13:37:59

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, Nov 27, 2008 at 2:32 PM, Thomas Gleixner <[email protected]> wrote:
> On Thu, 27 Nov 2008, stephane eranian wrote:
>> That's true for Intel.
>> On AMD64, I think you can get at most one sample out of a irq-off
>> region using IBS.
>
> Exactly the same amount which you get with the NMI/self_IPI out of an
> irq-off region.
>
Why would you get only one sample per irq-off region if you use NMI?
I would think you could get multiple, depending on your sampling period
and the size of the region.

2008-11-27 13:39:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

> What about a thread doing this and another one in the middle of read the buffer?

I assume that uses locks. Locking between threads is easy.

>
> Or what about a thread trying to reset the buffer while you're processing an PMU
> interrupt on another CPU. I know each buffer is per-CPU, but that does not
> prevent two threads for trying to operate on it at the same time from
> different CPUs.

That happens per CPU.

-Andi

--
[email protected]

2008-11-27 13:40:05

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, Nov 27, 2008 at 02:32:33PM +0100, Thomas Gleixner wrote:
> On Thu, 27 Nov 2008, stephane eranian wrote:
> > That's true for Intel.
> > On AMD64, I think you can get at most one sample out of a irq-off
> > region using IBS.
>
> Exactly the same amount which you get with the NMI/self_IPI out of an
> irq-off region.

True. It would really need an NMI-safe queue.

-Andi

--
[email protected]

2008-11-27 13:41:08

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, Nov 27, 2008 at 02:37:47PM +0100, stephane eranian wrote:
> On Thu, Nov 27, 2008 at 2:32 PM, Thomas Gleixner <[email protected]> wrote:
> > On Thu, 27 Nov 2008, stephane eranian wrote:
> >> That's true for Intel.
> >> On AMD64, I think you can get at most one sample out of a irq-off
> >> region using IBS.
> >
> > Exactly the same amount which you get with the NMI/self_IPI out of an
> > irq-off region.
> >
> Why would you get only one sample per irq-off region if you use NMI?
> I would think you could get multiple, depending on your sampling period
> and the size of the region.

Because of the self-IPI. You would need an NMI-safe queue between the
NMI and the self-IPI, but then you could just as well have the same
queue between the NMI and the process readers.

-Andi

--
[email protected]

2008-11-27 13:47:33

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, Nov 27, 2008 at 2:49 PM, Andi Kleen <[email protected]> wrote:
>> What about a thread doing this and another one in the middle of read the buffer?
>
> I assume that uses locks. Locking between threads is easy.
>
I think that is fine as long as the interrupt handler does not need to
grab locks.
But that, in turn, means that the state needed/modified by the handler cannot
be altered in any other way that could be harmful.

>>
>> Or what about a thread trying to reset the buffer while you're processing an PMU
>> interrupt on another CPU. I know each buffer is per-CPU, but that does not
>> prevent two threads for trying to operate on it at the same time from
>> different CPUs.
>
> That happens per CPU.
>
So you're saying a thread can only access the buffer from the CPU it
is running on?

2008-11-27 14:41:43

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 05/24] perfmon: X86 generic code (x86)

On Thu, 27 Nov 2008, David Miller wrote:
> From: Thomas Gleixner <[email protected]>
> > __attribute__ ((packed)); ???
>
> Shouldn't be needed, and this would kill performance on RISC (every
> struct member is accessed with byte-sized loads and shifts/masking).
>
> We did this for the pfkey.h structures by accident, and I truly regret
> it.

Fair enough.

Thanks,

tglx