These patches add basic pmus for 2 powerpc hypervisor interfaces to obtain
performance counters: gpci ("get performance counter info") and 24x7.
The counters supplied by these interfaces are continually counting and never
need to be (and cannot be) disabled or enabled. They additionally do not
generate any interrupts. This makes them in some regards similar to software
counters, and as a result their implimentation shares some common code (which
an initial patch exposes) with the sw counters.
There is ongoing work to support transactions for each of these pmus.
These 2 PMUs end up providing access to some cpu, core, and chip level counters
not exposed via other interfaces, and additionally allow monitoring the
performance of other lpars (guests) on the same host system. Because it
provides access to core and chip level counters, this pair of PMUs could be
thought of as powerpc's counterpart to x86's uncore events.
As an example, "processor_bus_utilization_abc" and
"processor_bus_utilization_wxyz" (in hv_gpci.h) allow retreval of total cycles
and idle cycles for various inter-chip buses.
GPCI is an interface that already exists on some power6 and power7 machines
(depending on the fw version), but is rather in-flexible and code intensive to
add additional counters to. The 24x7 interfaces currently are designed to
co-exist with the gpci interface while replacing most of gpci's functionality
on newer systems. Right now, the 24x7 code I've submitted uses the gpci calls
to check if it has permission to access certain classes of counters.
Example perf usage:
perf stat -e 'hv_gpci/counter_info_version=3,offset=0,length=8,secondary_index=0,starting_index=0xffffffff,request=0x10/' -r 0 -C 0 -x ' ' sleep 0.1
perf stat -e 'hv_24x7/domain=2,offset=8,starting_index=0,lpar=0xffffffff/' -r 0 -C 0 -x ' ' sleep 0.1
--
Changes since v1:
- add a few attributes to hv_gpci and hv_24x7 that expose some info about the interfaces
- so the attributes show up in the right place, fix bin_attr creation in sysfs groups.
- move hv_gpci.h and hv_24x7.h interface headers into arch/powerpc/perf
- fix bit ordering in hv_gpci.h
- split out hv_perf_caps_get() and use it to probe for the interface before registering
- ensure proper alignment of hypervisor args
- add a few missing counter requests to hv_gpci.h
- s/CIR_xxx/CIR_XXX/ in hv_gpci.h
- s/modules_init/device_initcall/
- Don't set event->cpu, use the user provided one
- remove the union of gpci events, just give the user 1024 bytes to play with
- clarify some comments (the list of fw versions is now labeled)
- provide and event_24x7_request() that wraps single_24x7_request()
- probably some other small fixes I'm forgetting.
Cody P Schafer (11):
perf: add PMU_RANGE_ATTR() helper for use by sw-like pmus
perf core: export swevent hrtimer helpers
sysfs: create bin_attributes under the requested group
powerpc: add hvcalls for 24x7 and gpci (get performance counter info)
powerpc: add hv_gpci interface header
powerpc: add 24x7 interface header
powerpc: add a shared interface to get gpci version and capabilities
powerpc/perf: add support for the hv gpci (get performance counter
info) interface
powerpc/perf: add support for the hv 24x7 interface
powerpc/perf: add kconfig option for hypervisor provided counters
powerpc/perf/hv_{gpci,24x7}: add documentation of device attributes
.../testing/sysfs-bus-event_source-devices-hv_24x7 | 22 +
.../testing/sysfs-bus-event_source-devices-hv_gpci | 43 ++
arch/powerpc/include/asm/hvcall.h | 5 +
arch/powerpc/perf/Makefile | 2 +
arch/powerpc/perf/hv-24x7.c | 491 +++++++++++++++++++
arch/powerpc/perf/hv-24x7.h | 239 ++++++++++
arch/powerpc/perf/hv-common.c | 39 ++
arch/powerpc/perf/hv-common.h | 17 +
arch/powerpc/perf/hv-gpci.c | 290 ++++++++++++
arch/powerpc/perf/hv-gpci.h | 521 +++++++++++++++++++++
arch/powerpc/platforms/Kconfig.cputype | 6 +
fs/sysfs/group.c | 7 +-
include/linux/perf_event.h | 22 +-
kernel/events/core.c | 8 +-
14 files changed, 1705 insertions(+), 7 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
create mode 100644 arch/powerpc/perf/hv-24x7.c
create mode 100644 arch/powerpc/perf/hv-24x7.h
create mode 100644 arch/powerpc/perf/hv-common.c
create mode 100644 arch/powerpc/perf/hv-common.h
create mode 100644 arch/powerpc/perf/hv-gpci.c
create mode 100644 arch/powerpc/perf/hv-gpci.h
--
1.8.5.4
Export the swevent hrtimer helpers currently only used in events/core.c
to allow the addition of architecture specific sw-like pmus.
Signed-off-by: Cody P Schafer <[email protected]>
---
include/linux/perf_event.h | 5 ++++-
kernel/events/core.c | 8 ++++----
2 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2702e91..24378a9 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -559,7 +559,10 @@ extern void perf_pmu_migrate_context(struct pmu *pmu,
int src_cpu, int dst_cpu);
extern u64 perf_event_read_value(struct perf_event *event,
u64 *enabled, u64 *running);
-
+extern void perf_swevent_init_hrtimer(struct perf_event *event);
+extern void perf_swevent_start_hrtimer(struct perf_event *event);
+extern void perf_swevent_cancel_hrtimer(struct perf_event *event);
+extern int perf_swevent_event_idx(struct perf_event *event);
struct perf_sample_data {
u64 type;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 56003c6..feb0347 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5816,7 +5816,7 @@ static int perf_swevent_init(struct perf_event *event)
return 0;
}
-static int perf_swevent_event_idx(struct perf_event *event)
+int perf_swevent_event_idx(struct perf_event *event)
{
return 0;
}
@@ -6045,7 +6045,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
return ret;
}
-static void perf_swevent_start_hrtimer(struct perf_event *event)
+void perf_swevent_start_hrtimer(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
s64 period;
@@ -6067,7 +6067,7 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
HRTIMER_MODE_REL_PINNED, 0);
}
-static void perf_swevent_cancel_hrtimer(struct perf_event *event)
+void perf_swevent_cancel_hrtimer(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
@@ -6079,7 +6079,7 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
}
}
-static void perf_swevent_init_hrtimer(struct perf_event *event)
+void perf_swevent_init_hrtimer(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
--
1.8.5.4
"H_GetPerformanceCounterInfo" (refered to as hv_gpci or just gpci from
here on) is an interface to retrieve specific performance counters and
other data from the hypervisor. All outputs have a fixed format (and
are represented as structs in this patch).
Signed-off-by: Cody P Schafer <[email protected]>
---
arch/powerpc/perf/hv-gpci.h | 521 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 521 insertions(+)
create mode 100644 arch/powerpc/perf/hv-gpci.h
diff --git a/arch/powerpc/perf/hv-gpci.h b/arch/powerpc/perf/hv-gpci.h
new file mode 100644
index 0000000..d602809
--- /dev/null
+++ b/arch/powerpc/perf/hv-gpci.h
@@ -0,0 +1,521 @@
+#ifndef LINUX_POWERPC_PERF_HV_GPCI_H_
+#define LINUX_POWERPC_PERF_HV_GPCI_H_
+
+#include <linux/types.h>
+
+/* From the document "H_GetPerformanceCounterInfo Interface" v1.07 */
+
+/* H_GET_PERF_COUNTER_INFO argument */
+struct hv_get_perf_counter_info_params {
+ __be32 counter_request; /* I */
+ __be32 starting_index; /* IO */
+ __be16 secondary_index; /* IO */
+ __be16 returned_values; /* O */
+ __be32 detail_rc; /* O, only needed when called via *_norets() */
+
+ /*
+ * O, size each of counter_value element in bytes, only set for version
+ * >= 0x3
+ */
+ __be16 cv_element_size;
+
+ /* I, 0 (zero) for versions < 0x3 */
+ __u8 counter_info_version_in;
+
+ /* O, 0 (zero) if version < 0x3. Must be set to 0 when making hcall */
+ __u8 counter_info_version_out;
+ __u8 reserved[0xC];
+ __u8 counter_value[];
+} __packed;
+
+/*
+ * counter info version => fw version/reference (spec version)
+ *
+ * 8 => power8 (1.07)
+ * [7 is skipped by spec 1.07]
+ * 6 => TLBIE (1.07)
+ * 5 => v7r7m0.phyp (1.05)
+ * [4 skipped]
+ * 3 => v7r6m0.phyp (?)
+ * [1,2 skipped]
+ * 0 => v7r{2,3,4}m0.phyp (?)
+ */
+#define COUNTER_INFO_VERSION_CURRENT 0x8
+
+/*
+ * These determine the counter_value[] layout and the meaning of starting_index
+ * and secondary_index.
+ *
+ * Unless otherwise noted, @secondary_index is unused and ignored.
+ */
+enum counter_info_requests {
+
+ /* GENERAL */
+
+ /* @starting_index: "starting" physical processor index or -1 for
+ * current physical processor. Data is only collected
+ * for the processors' "primary" thread.
+ */
+ CIR_DISPATCH_TIMEBASE_BY_PROCESSOR = 0x10,
+
+ /* @starting_index: starting partition id or -1 for the current logical
+ * partition (virtual machine).
+ */
+ CIR_ENTITLED_CAPPED_UNCAPPED_DONATED_IDLE_TIMEBASE_BY_PARTITION = 0x20,
+
+ /* @starting_index: starting partition id or -1 for the current logical
+ * partition (virtual machine).
+ */
+ CIR_RUN_INSTRUCTIONS_RUN_CYCLES_BY_PARTITION = 0X30,
+
+ /* @starting_index: must be -1 (to refer to the current partition)
+ */
+ CIR_SYSTEM_PERFORMANCE_CAPABILITIES = 0X40,
+
+
+ /* Data from this should only be considered valid if
+ * counter_info_version >= 0x3
+ * @starting_index: starting hardware chip id or -1 for the current hw
+ * chip id
+ */
+ CIR_PROCESSOR_BUS_UTILIZATION_ABC_LINKS = 0X50,
+
+ /* Data from this should only be considered valid if
+ * counter_info_version >= 0x3
+ * @starting_index: starting hardware chip id or -1 for the current hw
+ * chip id
+ */
+ CIR_PROCESSOR_BUS_UTILIZATION_WXYZ_LINKS = 0X60,
+
+ /*
+ * EXPANDED - the following are only avaliable if the CV_CM_EXPANDED
+ * bit is set from system_performace_capabilities. Enforcement is left
+ * to the hypervisor.
+ */
+
+ /* Available if counter_info_version >= 0x3
+ * @starting_index: starting hardware chip id or -1 for the current hw
+ * chip id
+ */
+ CIR_PROCESSOR_BUS_UTILIZATION_GX_LINKS = 0X70,
+
+ /* Available if counter_info_version >= 0x3
+ * @starting_index: starting hardware chip id or -1 for the current hw
+ * chip id
+ */
+ CIR_PROCESSOR_BUS_UTILIZATION_MC_LINKS = 0X80,
+
+ /* Available if counter_info_version >= 0x3
+ * @starting_index: starting physical processor or -1 for the current
+ * physical processor
+ */
+ CIR_PROCESSOR_CONFIG = 0X90,
+
+ /* Available if counter_info_version >= 0x3
+ * @starting_index: starting physical processor or -1 for the current
+ * physical processor
+ */
+ CIR_CURRENT_PROCESSOR_FREQUENCY = 0X91,
+
+ /* Available if counter_info_version >= 0x3 and <= 0x7
+ * @starting_index: starting physical processor or -1 for the current
+ * physical processor
+ */
+ CIR_PROCESSOR_CORE_UTILIZATION = 0X94,
+
+ /* Available if counter_info_version >= 0x3
+ * @starting_index: starting partition id or -1 for the current logical
+ * partition
+ */
+ CIR_PROCESSOR_CORE_POWER_MODE = 0X95,
+
+ /* Available if counter_info_version >= 0x3
+ * @starting_index: starting partition id or -1 for the current
+ * partition
+ * @secondary_index: starting virtual processor index or -1 for the
+ * current virtual processor
+ *
+ * Note: -1 for @starting_index and @secondary_index is only allowed if
+ * both are -1
+ */
+ CIR_AFFINITY_DOMAIN_INFORMATION_BY_VIRUTAL_PROCESSOR = 0XA0,
+
+ /* Available if counter_info_version >= 0x3
+ * @starting_index: primary affinity domain index or -1 for current
+ * primary affinity domain (as determined by the
+ * physical processor).
+ */
+ CIR_AFFINITY_DOMAIN_INFO_BY_DOMAIN = 0XB0,
+
+ /* Available if counter_info_version >= 0x3
+ * @starting_index: partition id or -1 for current partition
+ */
+ CIR_AFFINITY_DOMAIN_INFO_BY_PARTITION = 0XB1,
+
+ /* Available if counter_info_version >= 0x3
+ * @starting_index: unused
+ */
+ CIR_PHYSICAL_MEMORY_INFO = 0XC0,
+
+ /* Available if counter_info_version >= 0x3
+ * @starting_index: starting hw chip id or -1 for current hw chip
+ */
+ CIR_PROCESSOR_BUS_TOPOLOGY = 0XD0,
+
+ /* Available if counter_info_version >= 0x5
+ * @starting_index: starting partition id or -1 for current
+ */
+ CIR_PARTITION_HYPERVISOR_QUEUING_TIMES = 0XE0,
+
+ /* Available if counter_info_version >= 0x5
+ * @starting_index: unused
+ */
+ CIR_SYSTEM_HYPERVISOR_TIMES = 0XF0,
+
+ /* Available if counter_info_version >= 0x6
+ * @starting_index: unused
+ */
+ CIR_SYSTEM_TLBIE_COUNT_AND_TIME = 0xF4,
+
+ /* Available if counter_info_version >= 0x8
+ * @starting_index: starting partition id or -1 for current
+ */
+ CIR_PARTITION_INSTURCTION_COUNT_AND_TIME = 0X100,
+
+ /* LAB */
+
+ /* Available if counter_info_version < 0x8
+ * @starting_index: unused
+ */
+ CIR_SET_MMCRH = 0X80001000,
+
+ /* Available if counter_info_version < 0x8
+ * @starting_index: starting physical processor id or -1 for current
+ */
+ CIR_RETRIEVE_HPMCX = 0X80002000,
+};
+
+/* counter value layout */
+struct cv_dispatch_timebase_by_processor {
+ __be64 processor_time_in_timebase_cycles;
+ __be32 hw_processor_id;
+ __be16 owning_part_id; /* 0xffff if shared or unowned */
+ __u8 processor_state;
+ __u8 version; /* unused unless counter_info_version == 0 */
+ __be32 hw_chip_id; /* -1 for "Not Installed" processors */
+ __be32 phys_module_id; /* -1 for "Not Installed" processors */
+ __be32 primary_affinity_domain_idx;
+ __be32 secondary_affinity_domain_idx;
+ __be32 processor_version;
+ __be16 logical_processor_idx;
+ __u8 reserved[0x2];
+
+ /* counter_info_version >= 0x3 || version >= 0x1 */
+ __be32 processor_id_register;
+ __be32 phys_processor_idx; /* counter_info_version >= 0x3 */
+} __packed;
+
+struct cv_entitled_capped_uncapped_donated_idle_timebase_by_partition {
+ __be64 partition_id;
+ __be64 entitled_cycles;
+ __be64 consumed_capped_cycles;
+ __be64 consumed_uncapped_cycles;
+ __be64 cycles_donated;
+ __be64 purr_idle_cycles;
+} __packed;
+
+struct cv_run_instructions_run_cycles_by_partition {
+ __be64 partition_id;
+ __be64 instructions_completed; /* 0 if collection is unsupported */
+ __be64 cycles; /* 0 if collection is unsupported */
+} __packed;
+
+struct cv_system_performance_capabilities {
+ /* If != 0, allowed to collect data from other partitions */
+ __u8 perf_collect_privileged;
+
+ /* These following are only valid if counter_info_version >= 0x3 */
+#define CV_CM_GA (1 << 7)
+#define CV_CM_EXPANDED (1 << 6)
+#define CV_CM_LAB (1 << 5)
+ /* remaining bits are reserved */
+ __u8 capability_mask;
+ __u8 reserved[0xE];
+} __packed;
+
+struct cv_processor_bus_utilization_abc_links {
+ __be32 hw_chip_id;
+ __u8 reserved1[0xC];
+ __be64 total_link_cycles;
+ __be64 idle_cycles_a;
+ __be64 idle_cycles_b;
+ __be64 idle_cycles_c;
+ __u8 reserved2[0x20];
+} __packed;
+
+struct cv_processor_bus_utilization_wxyz_links {
+ __be32 hw_chip_id;
+ __u8 reserved1[0xC];
+ __be64 total_link_cycles;
+
+ /* Inactive links (all cycles idle) give -1 */
+ __be64 idle_cycles_w;
+ __be64 idle_cycles_x;
+ __be64 idle_cycles_y;
+ __be64 idle_cycles_z;
+ __u8 reserved2[0x28];
+} __packed;
+
+/* EXPANDED */
+
+struct cv_gx_cycles {
+ __be64 address_cycles;
+ __be64 data_cycles;
+ __be64 retries;
+ __be64 bus_cycles;
+ __be64 total_cycles;
+} __packed;
+
+struct cv_gx_cycles_io {
+ struct cv_gx_cycles in, out;
+} __packed;
+
+struct cv_processor_bus_utilization_gx {
+ __be32 hw_chip_id;
+ __u8 reserved1[0xC];
+ struct cv_gx_cycles_io gx[2];
+} __packed;
+
+struct cv_mc_counts {
+ __be64 frames;
+ __be64 reads;
+ __be64 writes;
+ __be64 total_cycles;
+} __packed;
+
+/* inactive links return 0 for all utilization data */
+struct cv_processor_bus_utilization_mc_links {
+ __be32 hw_chip_id;
+ __u8 reserved1[0xC];
+ struct cv_mc_counts mc[2];
+} __packed;
+
+struct cv_processor_config {
+ __be32 phys_processor_idx;
+ __be32 hw_node_id;
+ __be32 hw_card_id;
+ __be32 phys_module_id;
+ __be32 hw_chip_id;
+ __be32 hw_processor_id;
+ __be32 processor_id_register;
+
+#define CV_PS_NOT_INSTALLED 0x1
+#define CV_PS_GAURDED_OFF 0x2
+#define CV_PS_UNLICENSED 0x3
+#define CV_PS_SHARED 0x4
+#define CV_PS_BORROWED 0x5
+#define CV_PS_DEDICATED 0x6
+ __u8 processor_state;
+
+ __u8 reserved1[0x1];
+ __be16 owning_part_id;
+ __be32 processor_version;
+ __u8 reserved2[0x4];
+} __packed;
+
+struct cv_current_processor_frequency {
+ __be32 phys_processor_idx;
+ __be32 hw_processor_id;
+ __u8 reserved1[0x8];
+ __be32 nominal_freq_mhz;
+ __be32 current_freq_mhz;
+ __u8 reserved2[0x8];
+} __packed;
+
+struct cv_processor_core_utilization {
+ __be32 phys_processor_idx;
+ __be32 hw_processor_id;
+ __be64 cycles;
+ __be64 timebase_at_collection;
+ __be64 purr_cycles;
+ __be64 sum_of_cycles_across_threads;
+ __be64 instructions_completed;
+} __packed;
+
+struct cv_processor_core_power_mode {
+ __be16 partition_id;
+ __u8 reserved1[0x6];
+
+#define CV_PM_NONE 0x0
+#define CV_PM_NOMINAL 0x1
+#define CV_PM_DYNAMIC_MAX_PERF 0x2
+#define CV_PM_DYNAMIC_POWER_SAVE 0x3
+#define CV_PM_UNKNOWN 0xF
+ __be16 power_mode;
+
+ __u8 reserved2[0x6];
+} __packed;
+
+struct cv_affinity_domain_information_by_virutal_processor {
+ __be16 partition_id;
+ __be16 virtual_processor_idx;
+ __u8 reserved1[0xC];
+ __be16 phys_processor_idx;
+ __be16 primary_affinity_domain_idx;
+ __be16 secondary_affinity_domain_idx;
+ __u8 reserved2[0x2];
+ __u8 reserved3[0x8];
+} __packed;
+
+struct cv_affinity_domain_info_by_domain {
+ __be16 primary_affinity_domain_idx;
+ __be16 secondary_affinity_domain_idx;
+ __be32 total_processor_units;
+ __be32 free_dedicated_processor_units;
+ __be32 free_shared_processor_units;
+ __be32 total_memory_lmbs;
+ __be32 free_memory_lmbs;
+ __be32 num_partitions_in_domain;
+ __u8 reserved1[0x14];
+} __packed;
+
+struct cv_affinity_domain_info_by_partition {
+ __be16 partition_id;
+ __u8 reserved1[0x6];
+ __be16 assignment_order;
+
+#define CV_PPS_UNKNOWN 0x00
+#define CV_PPS_CONTAIN_IN_PRIMARY_DOMAIN 0x01
+#define CV_PPS_CONTAIN_IN_SECONDARY_DOMAIN 0x02
+#define CV_PPS_SPREAD_ACROSS_SECONDAY_DOMAINS 0x03
+#define CV_PPS_WHEREEVER 0x04
+#define CV_PPS_SCRAMBLE 0x05
+ __u8 partition_placement_spread;
+
+ __u8 parition_affinity_score;
+ __be16 num_affinity_domain_elements;
+ __be16 affinity_domain_element_size;
+ __u8 domain_elements[];
+} __packed;
+
+struct cv_affinity_domain_elem {
+ __be16 primary_affinity_domain_idx;
+ __be16 secondary_affinity_domain_idx;
+ __be32 dedicated_processor_units_allocated;
+ __be32 dedicated_memory_allocated_reserved_1;
+ __be32 dedicated_memory_allocated_reserved_2;
+ __be32 dedicated_memory_allocated_16Gb_pages;
+ __u8 reserved[0x8];
+} __packed;
+
+/* Also available via `of_get_flat_dt_prop(node, "ibm,lmb-size", &l)` */
+struct cv_physical_memory_info {
+ __be64 lmb_size_in_bytes;
+ __u8 reserved1[0x18];
+} __packed;
+
+struct cv_processor_bus_topology {
+ __be32 hw_chip_id;
+ __be32 hw_node_id;
+ __be32 fabric_chip_id;
+ __u8 reserved1[0x4];
+
+#define CV_IM_A_LINK_ACTIVE (1 << 31)
+#define CV_IM_B_LINK_ACTIVE (1 << 30)
+#define CV_IM_C_LINK_ACTIVE (1 << 29)
+/* Bits 3-5 (28,27,26) are reserved */
+#define CV_IM_ABC_LINK_WIDTH_MASK ((1 << 25) | (1 << 24))
+#define CV_IM_ABC_LINK_WIDTH_SHIFT 24
+#define CV_IM_ABC_LINK_WIDTH_8B 0x0
+#define CV_IM_ABC_LINK_WIDTH_4B 0x1
+
+#define CV_IM_W_LINK_ACTIVE (1 << 23)
+#define CV_IM_X_LINK_ACTIVE (1 << 22)
+#define CV_IM_Y_LINK_ACTIVE (1 << 21)
+#define CV_IM_Z_LINK_ACTIVE (1 << 20)
+/* Bits 12-13 (19,18) are reserved */
+
+#define CV_IM_WXYZ_LINK_WIDTH_MASK ((1 << 17) | (1 << 16))
+#define CV_IM_WXYZ_LINK_WIDTH_SHIFT 16
+#define CV_IM_WXYZ_LINK_WIDTH_8B 0x0
+#define CV_IM_WXYZ_LINK_WIDTH_4B 0x1
+
+#define CV_IM_GX0_CONFIGURED (1 << 15)
+#define CV_IM_GX1_CONFIGURED (1 << 14)
+/* Bits 18-23 (13,12,11,10,9,8) are reserved */
+#define CV_IM_MC0_CONFIGURED (1 << 7)
+#define CV_IM_MC1_CONFIGURED (1 << 6)
+/* Bits 26-31 (5,4,3,2,1,0) are reserved */
+
+ __be32 info_mask;
+
+ __u8 hw_node_id_connected_to_a_link;
+ __u8 hw_node_id_connected_to_b_link;
+
+ __u8 reserved2[0x2];
+
+ __u8 fabric_chip_id_connected_to_w_link;
+ __u8 fabric_chip_id_connected_to_x_link;
+ __u8 fabric_chip_id_connected_to_y_link;
+ __u8 fabric_chip_id_connected_to_z_link;
+
+ __u8 reserved3[0x4];
+} __packed;
+
+struct cv_partition_hypervisor_queuing_times {
+ __be16 partition_id;
+ __u8 reserved1[0x6];
+ __be64 time_waiting_for_entitlement; /* in timebase cycles */
+ __be64 times_waited_for_entitlement;
+ __be64 time_waiting_for_phys_processor; /* in timebase cycles */
+ __be64 times_waited_for_phys_processor;
+ __be64 dispatches_on_home_processor_core;
+ __be64 dispatches_on_home_primary_affinity_domain;
+ __be64 dispatches_on_home_secondary_affinity_domain;
+ __be64 dispatches_off_home_secondary_affinity_domain;
+ __be64 dispatches_on_dedicated_processor_donating_cycles;
+} __packed;
+
+struct cv_system_hypervisor_times {
+ __be64 phyp_time_spent_to_dispatch_virtual_processors;
+ __be64 phyp_time_spent_processing_virtual_processor_timers;
+ __be64 phyp_time_spent_managing_partitions_over_entitlement;
+ __be64 time_spent_on_system_managment;
+} __packed;
+
+struct cv_system_tlbie_count_and_time {
+ __be64 tlbie_instructions_issued;
+ __be64 time_spent_issuing;
+} __packed;
+
+struct cv_partition_instruction_count_and_time {
+ __be16 partition_id;
+ __u8 reserved1[0x6];
+ __be64 instructions_performed;
+ __be64 time_collected;
+} __packed;
+
+/* LAB */
+
+struct cv_set_mmcrh {
+ /*
+ * Only HPMC bits (40:46, 48:54) used, all others ignored
+ * -1 = default (0x00000000_003C1200)
+ */
+ __be64 mmcrh_value_to_set;
+} __packed;
+
+struct cv_retrieve_hpmcx {
+ __be32 hw_processor_id;
+ __u8 reserved1[0x4];
+ __be64 mmcrh_current;
+ __be64 time_since_mmcrh_was_set;
+ __be64 hpmc1_since_current_mmcrh;
+ __be64 hpmc2_since_current_mmcrh;
+ __be64 hpmc3_since_current_mmcrh;
+ __be64 hpmc3_current;
+ __be64 hpmc4_since_current_mmcrh;
+ __be64 hpmc4_current;
+} __packed;
+
+#endif
--
1.8.5.4
24x7 (also called hv_24x7 or H_24X7) is an interface to obtain
performance counters from the hypervisor. These counters do not have a
fixed format/possition and are instead documented in a "24x7 Catalog",
which is provided by the hypervisor (that interface is also documented
in this header).
This method of obtaining performance counters from the hypervisor is
intended to paritialy replace the gpci interface.
Signed-off-by: Cody P Schafer <[email protected]>
---
arch/powerpc/perf/hv-24x7.h | 239 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 239 insertions(+)
create mode 100644 arch/powerpc/perf/hv-24x7.h
diff --git a/arch/powerpc/perf/hv-24x7.h b/arch/powerpc/perf/hv-24x7.h
new file mode 100644
index 0000000..bf079da
--- /dev/null
+++ b/arch/powerpc/perf/hv-24x7.h
@@ -0,0 +1,239 @@
+#ifndef LINUX_POWERPC_PERF_HV_24X7_H_
+#define LINUX_POWERPC_PERF_HV_24X7_H_
+
+#include <linux/types.h>
+
+struct hv_24x7_request {
+ /* PHYSICAL domains require enabling via phyp/hmc. */
+#define HV_24X7_PERF_DOMAIN_PHYSICAL_CHIP 0x01
+#define HV_24X7_PERF_DOMAIN_PHYSICAL_CORE 0x02
+#define HV_24X7_PERF_DOMAIN_VIRTUAL_PROCESSOR_HOME_CORE 0x03
+#define HV_24X7_PERF_DOMAIN_VIRTUAL_PROCESSOR_HOME_CHIP 0x04
+#define HV_24X7_PERF_DOMAIN_VIRTUAL_PROCESSOR_HOME_NODE 0x05
+#define HV_24X7_PERF_DOMAIN_VIRTUAL_PROCESSOR_REMOTE_NODE 0x06
+ __u8 performance_domain;
+ __u8 reserved[0x1];
+
+ /* bytes to read starting at @data_offset. must be a multiple of 8 */
+ __be16 data_size;
+
+ /*
+ * byte offset within the perf domain to read from. must be 8 byte
+ * aligned
+ */
+ __be32 data_offset;
+
+ /*
+ * only valid for VIRTUAL_PROCESSOR domains, ignored for others.
+ * -1 means "current partition only"
+ * Enabling via phyp/hmc required for non-"-1" values. 0 forbidden
+ * unless requestor is 0.
+ */
+ __be16 starting_lpar_ix;
+
+ /*
+ * Ignored when @starting_lpar_ix == -1
+ * Ignored when @performance_domain is not VIRTUAL_PROCESSOR_*
+ * -1 means "infinite" or all
+ */
+ __be16 max_num_lpars;
+
+ /* chip, core, or virtual processor based on @performance_domain */
+ __be16 starting_ix;
+ __be16 max_ix;
+} __packed;
+
+struct hv_24x7_request_buffer {
+ /* 0 - ? */
+ /* 1 - ? */
+#define HV_24X7_IF_VERSION_CURRENT 0x01
+ __u8 interface_version;
+ __u8 num_requests;
+ __u8 reserved[0xE];
+ struct hv_24x7_request requests[];
+} __packed;
+
+struct hv_24x7_result_element {
+ __be16 lpar_ix;
+
+ /*
+ * represents the core, chip, or virtual processor based on the
+ * request's @performance_domain
+ */
+ __be16 domain_ix;
+
+ /* -1 if @performance_domain does not refer to a virtual processor */
+ __be32 lpar_cfg_instance_id;
+
+ /* size = @result_element_data_size of cointaining result. */
+ __u8 element_data[];
+} __packed;
+
+struct hv_24x7_result {
+ __u8 result_ix;
+
+ /*
+ * 0 = not all result elements fit into the buffer, additional requests
+ * required
+ * 1 = all result elements were returned
+ */
+ __u8 results_complete;
+ __be16 num_elements_returned;
+
+ /* This is a copy of @data_size from the coresponding hv_24x7_request */
+ __be16 result_element_data_size;
+ __u8 reserved[0x2];
+
+ /* WARNING: only valid for first result element due to variable sizes
+ * of result elements */
+ /* struct hv_24x7_result_element[@num_elements_returned] */
+ struct hv_24x7_result_element elements[];
+} __packed;
+
+struct hv_24x7_data_result_buffer {
+ /* See versioning for request buffer */
+ __u8 interface_version;
+
+ __u8 num_results;
+ __u8 reserved[0x1];
+ __u8 failing_request_ix;
+ __be32 detailed_rc;
+ __be64 cec_cfg_instance_id;
+ __be64 catalog_version_num;
+ __u8 reserved2[0x8];
+ /* WARNING: only valid for the first result due to variable sizes of
+ * results */
+ struct hv_24x7_result results[]; /* [@num_results] */
+} __packed;
+
+/* From document "24x7 Event and Group Catalog Formats Proposal" v0.14 */
+struct hv_24x7_catalog_page_0 {
+#define HV_24X7_CATALOG_MAGIC 0x32347837 /* "24x7" in ASCII */
+ __be32 magic;
+ __be32 length; /* In 4096 byte pages */
+ __u8 reserved1[4];
+ __be32 version;
+ __u8 build_time_stamp[16]; /* "YYYYMMDDHHMMSS\0\0" */
+ __u8 reserved2[32];
+ __be16 schema_data_offs; /* in 4096 byte pages */
+ __be16 schema_data_len; /* in 4096 byte pages */
+ __be16 schema_entry_count;
+ __u8 reserved3[2];
+ __be16 group_data_offs; /* in 4096 byte pages */
+ __be16 group_data_len; /* in 4096 byte pages */
+ __be16 group_entry_count;
+ __u8 reserved4[2];
+ __be16 formula_data_offs; /* in 4096 byte pages */
+ __be16 formula_data_len; /* in 4096 byte pages */
+ __be16 formula_entry_count;
+ __u8 reserved5[2];
+} __packed;
+
+struct hv_24x7_event_data {
+ __be16 length; /* in bytes, must be a multiple of 16 */
+ __u8 reserved1[2];
+ __u8 domain; /* Chip = 1, Core = 2 */
+ __u8 reserved2[1];
+ __be16 event_group_record_offs; /* in bytes, must be 8 byte aligned */
+ __be16 event_group_record_len; /* in bytes */
+
+ /* in bytes, offset from event_group_record */
+ __be16 event_counter_offs;
+
+ /* verified_state, unverified_state, caveat_state, broken_state, ... */
+ __be32 flags;
+
+ __be16 primary_group_ix;
+ __be16 group_count;
+ __be16 event_name_len;
+ __u8 remainder[];
+ /* __u8 event_name[event_name_len - 2]; */
+ /* __be16 event_description_len; */
+ /* __u8 event_desc[event_description_len - 2]; */
+ /* __be16 detailed_desc_len; */
+ /* __u8 detailed_desc[detailed_desc_len - 2]; */
+} __packed;
+
+struct hv_24x7_group_data {
+ __be16 length; /* in bytes, must be multiple of 16 */
+ __u8 reserved1[2];
+ __be32 flags; /* undefined contents */
+ __u8 domain; /* Chip = 1, Core = 2 */
+ __u8 reserved2[1];
+ __be16 event_group_record_offs;
+ __be16 event_group_record_len;
+ __u8 group_schema_ix;
+ __u8 event_count; /* 1 to 16 */
+ __be16 event_ixs;
+ __be16 group_name_len;
+ __u8 remainder[];
+ /* __u8 group_name[group_name_len]; */
+ /* __be16 group_desc_len; */
+ /* __u8 group_desc[group_desc_len]; */
+} __packed;
+
+/* TODO: Schema Data */
+/* TODO: Event Counter Group Record (see the PORE/SLW workbook) */
+
+/* "Get Event Counter Group Record Schema hypervisor interface" */
+
+enum hv_24x7_grs_field_enums {
+ /* GRS_COUNTER_1 = 1
+ * GRS_COUNTER_2 = 2
+ * ...
+ * GRS_COUNTER_31 = 32 // FIXME: Doc issue.
+ */
+ GRS_COUNTER_BASE = 1,
+ GRS_COUNTER_LAST = 32,
+ GRS_TIMEBASE_UPDATE = 48,
+ GRS_TIMEBASE_FENCE = 49,
+ GRS_UPDATE_COUNT = 50,
+ GRS_MEASUREMENT_PERIOD = 51,
+ GRS_ACCUMULATED_MEASUREMENT_PERIOD = 52,
+ GRS_LAST_UPDATE_PERIOD = 53,
+ GRS_STATUS_FLAGS = 54,
+};
+
+enum hv_24x7_grs_enums {
+ GRS_CORE_SCHEMA_INDEX = 0,
+};
+
+struct hv_24x7_grs_field {
+ __be16 field_enum;
+ __be16 offs; /* in bytes, within Event Counter group record */
+ __be16 length; /* in bytes */
+ __be16 flags; /* presently unused */
+} __packed;
+
+struct hv_24x7_grs {
+ __be16 length;
+ __u8 reserved1[2];
+ __be16 descriptor;
+ __be16 version_id;
+ __u8 reserved2[6];
+ __be16 field_entry_count;
+ __u8 field_entrys[];
+} __packed;
+
+struct hv_24x7_formula_data {
+ __be32 length; /* in bytes, must be multiple of 16 */
+ __u8 reserved1[2];
+ __be32 flags; /* not yet defined */
+ __be16 group;
+ __u8 reserved2[6];
+ __be16 name_len;
+ __u8 remainder[];
+ /* __u8 name[name_len]; */
+ /* __be16 desc_len; */
+ /* __u8 desc[name_len]; */
+ /* __be16 formula_len */
+ /* __u8 formula[formula_len]; */
+} __packed;
+
+/* Formula Syntax: ie, impliment a forth interpereter. */
+/* need fast lookup of the formula names, event names, "delta-timebase",
+ * "delta-cycles", "delta-instructions", "delta-seconds" */
+/* operators: '+', '-', '*', '/', 'mod', 'rem', 'sqr', 'x^y' (XXX: pow? xor?),
+ * 'rot', 'dup' */
+
+#endif
--
1.8.5.4
Signed-off-by: Cody P Schafer <[email protected]>
---
arch/powerpc/perf/hv-common.c | 39 +++++++++++++++++++++++++++++++++++++++
arch/powerpc/perf/hv-common.h | 17 +++++++++++++++++
2 files changed, 56 insertions(+)
create mode 100644 arch/powerpc/perf/hv-common.c
create mode 100644 arch/powerpc/perf/hv-common.h
diff --git a/arch/powerpc/perf/hv-common.c b/arch/powerpc/perf/hv-common.c
new file mode 100644
index 0000000..47e02b3
--- /dev/null
+++ b/arch/powerpc/perf/hv-common.c
@@ -0,0 +1,39 @@
+#include <asm/io.h>
+#include <asm/hvcall.h>
+
+#include "hv-gpci.h"
+#include "hv-common.h"
+
+unsigned long hv_perf_caps_get(struct hv_perf_caps *caps)
+{
+ unsigned long r;
+ struct p {
+ struct hv_get_perf_counter_info_params params;
+ struct cv_system_performance_capabilities caps;
+ } __packed __aligned(sizeof(uint64_t));
+
+ struct p arg = {
+ .params = {
+ .counter_request = cpu_to_be32(
+ CIR_SYSTEM_PERFORMANCE_CAPABILITIES),
+ .starting_index = cpu_to_be32(-1),
+ .counter_info_version_in = 0,
+ }
+ };
+
+ r = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO,
+ virt_to_phys(&arg), sizeof(arg));
+
+ if (r)
+ return r;
+
+ pr_devel("capability_mask: 0x%x\n", arg.caps.capability_mask);
+
+ caps->version = arg.params.counter_info_version_out;
+ caps->collect_privileged = !!arg.caps.perf_collect_privileged;
+ caps->ga = !!(arg.caps.capability_mask & CV_CM_GA);
+ caps->expanded = !!(arg.caps.capability_mask & CV_CM_EXPANDED);
+ caps->lab = !!(arg.caps.capability_mask & CV_CM_LAB);
+
+ return r;
+}
diff --git a/arch/powerpc/perf/hv-common.h b/arch/powerpc/perf/hv-common.h
new file mode 100644
index 0000000..7e615bd
--- /dev/null
+++ b/arch/powerpc/perf/hv-common.h
@@ -0,0 +1,17 @@
+#ifndef LINUX_POWERPC_PERF_HV_COMMON_H_
+#define LINUX_POWERPC_PERF_HV_COMMON_H_
+
+#include <linux/types.h>
+
+struct hv_perf_caps {
+ u16 version;
+ u16 collect_privileged:1,
+ ga:1,
+ expanded:1,
+ lab:1,
+ unused:12;
+};
+
+unsigned long hv_perf_caps_get(struct hv_perf_caps *caps);
+
+#endif
--
1.8.5.4
This provides a basic link between perf and hv_gpci. Notably, it does
not yet support transactions and does not list any events (they can
still be manually composed).
Signed-off-by: Cody P Schafer <[email protected]>
---
arch/powerpc/perf/hv-gpci.c | 290 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 290 insertions(+)
create mode 100644 arch/powerpc/perf/hv-gpci.c
diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
new file mode 100644
index 0000000..1f5d96d
--- /dev/null
+++ b/arch/powerpc/perf/hv-gpci.c
@@ -0,0 +1,290 @@
+/*
+ * Hypervisor supplied "gpci" ("get performance counter info") performance
+ * counter support
+ *
+ * Author: Cody P Schafer <[email protected]>
+ * Copyright 2014 IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+#define pr_fmt(fmt) "hv-gpci: " fmt
+
+#include <linux/init.h>
+#include <linux/perf_event.h>
+#include <asm/firmware.h>
+#include <asm/hvcall.h>
+#include <asm/io.h>
+
+#include "hv-gpci.h"
+#include "hv-common.h"
+
+PMU_RANGE_ATTR(request, config, 0, 31); /* u32 */
+PMU_RANGE_ATTR(starting_index, config, 32, 63); /* u32 */
+PMU_RANGE_ATTR(secondary_index, config1, 0, 15); /* u16 */
+PMU_RANGE_ATTR(counter_info_version, config1, 16, 23); /* u8 */
+PMU_RANGE_ATTR(length, config1, 24, 31); /* u8, bytes of data (1-8) */
+PMU_RANGE_ATTR(offset, config1, 32, 63); /* u32, byte offset */
+
+static struct attribute *format_attrs[] = {
+ &format_attr_request.attr,
+ &format_attr_starting_index.attr,
+ &format_attr_secondary_index.attr,
+ &format_attr_counter_info_version.attr,
+
+ &format_attr_offset.attr,
+ &format_attr_length.attr,
+ NULL,
+};
+
+static struct attribute_group format_group = {
+ .name = "format",
+ .attrs = format_attrs,
+};
+
+#define HV_CAPS_ATTR(_name, _format) \
+static ssize_t _name##_show(struct device *dev, \
+ struct device_attribute *attr, \
+ char *page) \
+{ \
+ struct hv_perf_caps caps; \
+ unsigned long hret = hv_perf_caps_get(&caps); \
+ if (hret) \
+ return -EIO; \
+ \
+ return sprintf(page, _format, caps._name); \
+} \
+static struct device_attribute hv_caps_attr_##_name = __ATTR_RO(_name)
+
+static ssize_t kernel_version_show(struct device *dev,
+ struct device_attribute *attr,
+ char *page)
+{
+ return sprintf(page, "0x%x\n", COUNTER_INFO_VERSION_CURRENT);
+}
+
+DEVICE_ATTR_RO(kernel_version);
+HV_CAPS_ATTR(version, "0x%x\n");
+HV_CAPS_ATTR(ga, "%d\n");
+HV_CAPS_ATTR(expanded, "%d\n");
+HV_CAPS_ATTR(lab, "%d\n");
+HV_CAPS_ATTR(collect_privileged, "%d\n");
+
+static struct attribute *interface_attrs[] = {
+ &dev_attr_kernel_version.attr,
+ &hv_caps_attr_version.attr,
+ &hv_caps_attr_ga.attr,
+ &hv_caps_attr_expanded.attr,
+ &hv_caps_attr_lab.attr,
+ &hv_caps_attr_collect_privileged.attr,
+ NULL,
+};
+
+static struct attribute_group interface_group = {
+ .name = "interface",
+ .attrs = interface_attrs,
+};
+
+static const struct attribute_group *attr_groups[] = {
+ &format_group,
+ &interface_group,
+ NULL,
+};
+
+#define GPCI_MAX_DATA_BYTES \
+ (1024 - sizeof(struct hv_get_perf_counter_info_params))
+
+static unsigned long single_gpci_request(u32 req, u32 starting_index,
+ u16 secondary_index, u8 version_in, u32 offset, u8 length,
+ u64 *value)
+{
+ unsigned long ret;
+ size_t i;
+ u64 count;
+
+ struct {
+ struct hv_get_perf_counter_info_params params;
+ uint8_t bytes[GPCI_MAX_DATA_BYTES];
+ } __packed __aligned(sizeof(uint64_t)) arg = {
+ .params = {
+ .counter_request = cpu_to_be32(req),
+ .starting_index = cpu_to_be32(starting_index),
+ .secondary_index = cpu_to_be16(secondary_index),
+ .counter_info_version_in = version_in,
+ }
+ };
+
+ ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO,
+ virt_to_phys(&arg), sizeof(arg));
+ if (ret) {
+ pr_devel("hcall failed: 0x%lx\n", ret);
+ return ret;
+ }
+
+ /*
+ * we verify offset and length are within the zeroed buffer at event
+ * init.
+ */
+ count = 0;
+ for (i = offset; i < offset + length; i++)
+ count |= arg.bytes[i] << (i - offset);
+
+ *value = count;
+ return ret;
+}
+
+static u64 h_gpci_get_value(struct perf_event *event)
+{
+ u64 count;
+ unsigned long ret = single_gpci_request(event_get_request(event),
+ event_get_starting_index(event),
+ event_get_secondary_index(event),
+ event_get_counter_info_version(event),
+ event_get_offset(event),
+ event_get_length(event),
+ &count);
+ if (ret)
+ return 0;
+ return count;
+}
+
+static void h_gpci_event_update(struct perf_event *event)
+{
+ s64 prev;
+ u64 now = h_gpci_get_value(event);
+ prev = local64_xchg(&event->hw.prev_count, now);
+ local64_add(now - prev, &event->count);
+}
+
+static void h_gpci_event_start(struct perf_event *event, int flags)
+{
+ local64_set(&event->hw.prev_count, h_gpci_get_value(event));
+ perf_swevent_start_hrtimer(event);
+}
+
+static void h_gpci_event_stop(struct perf_event *event, int flags)
+{
+ perf_swevent_cancel_hrtimer(event);
+ h_gpci_event_update(event);
+}
+
+static int h_gpci_event_add(struct perf_event *event, int flags)
+{
+ if (flags & PERF_EF_START)
+ h_gpci_event_start(event, flags);
+
+ return 0;
+}
+
+static void h_gpci_event_del(struct perf_event *event, int flags)
+{
+ h_gpci_event_stop(event, flags);
+}
+
+static void h_gpci_event_read(struct perf_event *event)
+{
+ h_gpci_event_update(event);
+}
+
+static int h_gpci_event_init(struct perf_event *event)
+{
+ u64 count;
+ u8 length;
+
+ /* Not our event */
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ /* config2 is unused */
+ if (event->attr.config2) {
+ pr_devel("config2 set when reserved\n");
+ return -EINVAL;
+ }
+
+ /* unsupported modes and filters */
+ if (event->attr.exclude_user ||
+ event->attr.exclude_kernel ||
+ event->attr.exclude_hv ||
+ event->attr.exclude_idle ||
+ event->attr.exclude_host ||
+ event->attr.exclude_guest ||
+ is_sampling_event(event)) /* no sampling */
+ return -EINVAL;
+
+ /* no branch sampling */
+ if (has_branch_stack(event))
+ return -EOPNOTSUPP;
+
+ length = event_get_length(event);
+ if (length < 1 || length > 8) {
+ pr_devel("length invalid\n");
+ return -EINVAL;
+ }
+
+ /* last byte within the buffer? */
+ if ((event_get_offset(event) + length) > GPCI_MAX_DATA_BYTES) {
+ pr_devel("request outside of buffer: %zu > %zu\n",
+ (size_t)event_get_offset(event) + length,
+ GPCI_MAX_DATA_BYTES);
+ return -EINVAL;
+ }
+
+ /* check if the request works... */
+ if (single_gpci_request(event_get_request(event),
+ event_get_starting_index(event),
+ event_get_secondary_index(event),
+ event_get_counter_info_version(event),
+ event_get_offset(event),
+ length,
+ &count)) {
+ pr_devel("gpci hcall failed\n");
+ return -EINVAL;
+ }
+
+ perf_swevent_init_hrtimer(event);
+ return 0;
+}
+
+static struct pmu h_gpci_pmu = {
+ .task_ctx_nr = perf_invalid_context,
+
+ .name = "hv_gpci",
+ .attr_groups = attr_groups,
+ .event_init = h_gpci_event_init,
+ .add = h_gpci_event_add,
+ .del = h_gpci_event_del,
+ .start = h_gpci_event_start,
+ .stop = h_gpci_event_stop,
+ .read = h_gpci_event_read,
+
+ .event_idx = perf_swevent_event_idx,
+};
+
+static int hv_gpci_init(void)
+{
+ int r;
+ unsigned long hret;
+ struct hv_perf_caps caps;
+
+ if (!firmware_has_feature(FW_FEATURE_LPAR)) {
+ pr_info("not a virtualized system, not enabling\n");
+ return -ENODEV;
+ }
+
+ hret = hv_perf_caps_get(&caps);
+ if (hret) {
+ pr_info("could not obtain capabilities, error 0x%80lx, not enabling\n",
+ hret);
+ return -ENODEV;
+ }
+
+ r = perf_pmu_register(&h_gpci_pmu, h_gpci_pmu.name, -1);
+ if (r)
+ return r;
+
+ return 0;
+}
+
+device_initcall(hv_gpci_init);
--
1.8.5.4
This provides a basic interface between hv_24x7 and perf. Similar to
the one provided for gpci, it lacks transaction support and does not
list any events.
Signed-off-by: Cody P Schafer <[email protected]>
---
arch/powerpc/perf/hv-24x7.c | 491 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 491 insertions(+)
create mode 100644 arch/powerpc/perf/hv-24x7.c
diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
new file mode 100644
index 0000000..13de140
--- /dev/null
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -0,0 +1,491 @@
+/*
+ * Hypervisor supplied "24x7" performance counter support
+ *
+ * Author: Cody P Schafer <[email protected]>
+ * Copyright 2014 IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define pr_fmt(fmt) "hv-24x7: " fmt
+
+#include <linux/perf_event.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <asm/firmware.h>
+#include <asm/hvcall.h>
+#include <asm/io.h>
+
+#include "hv-24x7.h"
+#include "hv-common.h"
+
+/*
+ * TODO: Merging events:
+ * - Think of the hcall as an interface to a 4d array of counters:
+ * - x = domains
+ * - y = indexes in the domain (core, chip, vcpu, node, etc)
+ * - z = offset into the counter space
+ * - w = lpars (guest vms, "logical partitions")
+ * - A single request is: x,y,y_last,z,z_last,w,w_last
+ * - this means we can retrieve a rectangle of counters in y,z for a single x.
+ *
+ * - Things to consider (ignoring w):
+ * - input cost_per_request = 16
+ * - output cost_per_result(ys,zs) = 8 + 8 * ys + ys * zs
+ * - limited number of requests per hcall (must fit into 4K bytes)
+ * - 4k = 16 [buffer header] - 16 [request size] * request_count
+ * - 255 requests per hcall
+ * - sometimes it will be more efficient to read extra data and discard
+ */
+
+PMU_RANGE_ATTR(domain, config, 0, 3); /* u3 0-6, one of HV_24X7_PERF_DOMAIN */
+PMU_RANGE_ATTR(starting_index, config, 16, 31); /* u16 */
+PMU_RANGE_ATTR(offset, config, 32, 63); /* u32, see "data_offset" */
+PMU_RANGE_ATTR(lpar, config1, 0, 15); /* u16 */
+
+PMU_RANGE_RESV(reserved1, config, 4, 15);
+PMU_RANGE_RESV(reserved2, config1, 16, 63);
+PMU_RANGE_RESV(reserved3, config2, 0, 63);
+
+static struct attribute *format_attrs[] = {
+ &format_attr_domain.attr,
+ &format_attr_offset.attr,
+ &format_attr_starting_index.attr,
+ &format_attr_lpar.attr,
+ NULL,
+};
+
+static struct attribute_group format_group = {
+ .name = "format",
+ .attrs = format_attrs,
+};
+
+/*
+ * read_offset_data - copy data from one buffer to another while treating the
+ * source buffer as a small view on the total avaliable
+ * source data.
+ *
+ * @dest: buffer to copy into
+ * @dest_len: length of @dest in bytes
+ * @requested_offset: the offset within the source data we want. Must be > 0
+ * @src: buffer to copy data from
+ * @src_len: length of @src in bytes
+ * @source_offset: the offset in the sorce data that (src,src_len) refers to.
+ * Must be > 0
+ *
+ * returns the number of bytes copied.
+ *
+ * '.' areas in d are written to.
+ *
+ * u
+ * x w v z
+ * d |.........|
+ * s |----------------------|
+ *
+ * u
+ * x w z v
+ * d |........------|
+ * s |------------------|
+ *
+ * x w u,z,v
+ * d |........|
+ * s |------------------|
+ *
+ * x,w u,v,z
+ * d |------------------|
+ * s |------------------|
+ *
+ * x u
+ * w v z
+ * d |........|
+ * s |------------------|
+ *
+ * x z w v
+ * d |------|
+ * s |------|
+ *
+ * x = source_offset
+ * w = requested_offset
+ * z = source_offset + src_len
+ * v = requested_offset + dest_len
+ *
+ * w_offset_in_s = w - x = requested_offset - source_offset
+ * z_offset_in_s = z - x = src_len
+ * v_offset_in_s = v - x = request_offset + dest_len - src_len
+ * u_offset_in_s = min(z_offset_in_s, v_offset_in_s)
+ *
+ * copy_len = u_offset_in_s - w_offset_in_s = min(z_offset_in_s, v_offset_in_s)
+ * - w_offset_in_s
+ */
+static ssize_t read_offset_data(void *dest, size_t dest_len,
+ loff_t requested_offset, void *src,
+ size_t src_len, loff_t source_offset)
+{
+ size_t w_offset_in_s = requested_offset - source_offset;
+ size_t z_offset_in_s = src_len;
+ size_t v_offset_in_s = requested_offset + dest_len - src_len;
+ size_t u_offset_in_s = min(z_offset_in_s, v_offset_in_s);
+ size_t copy_len = u_offset_in_s - w_offset_in_s;
+
+ if (requested_offset < 0 || source_offset < 0)
+ return -EINVAL;
+
+ if (z_offset_in_s <= w_offset_in_s)
+ return 0;
+
+ memcpy(dest, src + w_offset_in_s, copy_len);
+ return copy_len;
+}
+
+static unsigned long h_get_24x7_catalog_page(char page[static 4096],
+ u32 version, u32 index)
+{
+ WARN_ON(!IS_ALIGNED((unsigned long)page, 4096));
+ return plpar_hcall_norets(H_GET_24X7_CATALOG_PAGE,
+ virt_to_phys(page),
+ version,
+ index);
+}
+
+static ssize_t catalog_read(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *bin_attr, char *buf,
+ loff_t offset, size_t count)
+{
+ unsigned long hret;
+ ssize_t ret = 0;
+ size_t catalog_len = 0, catalog_page_len = 0, page_count = 0;
+ loff_t page_offset = 0;
+ uint32_t catalog_version_num = 0;
+ void *page = kmalloc(4096, GFP_USER);
+ struct hv_24x7_catalog_page_0 *page_0 = page;
+ if (!page)
+ return -ENOMEM;
+
+
+ hret = h_get_24x7_catalog_page(page, 0, 0);
+ if (hret) {
+ ret = -EIO;
+ goto e_free;
+ }
+
+ catalog_version_num = be32_to_cpu(page_0->version);
+ catalog_page_len = be32_to_cpu(page_0->length);
+ catalog_len = catalog_page_len * 4096;
+
+ page_offset = offset / 4096;
+ page_count = count / 4096;
+
+ if (page_offset >= catalog_page_len)
+ goto e_free;
+
+ if (page_offset != 0) {
+ hret = h_get_24x7_catalog_page(page, catalog_version_num,
+ page_offset);
+ if (hret) {
+ ret = -EIO;
+ goto e_free;
+ }
+ }
+
+ ret = read_offset_data(buf, count, offset,
+ page, 4096, page_offset * 4096);
+e_free:
+ if (hret)
+ pr_err("h_get_24x7_catalog_page(ver=%d, page=%lld) failed: rc=%ld\n",
+ catalog_version_num, page_offset, hret);
+ kfree(page);
+
+ pr_devel("catalog_read: offset=%lld(%lld) count=%zu(%zu) catalog_len=%zu(%zu) => %zd\n",
+ offset, page_offset, count, page_count, catalog_len,
+ catalog_page_len, ret);
+
+ return ret;
+}
+
+#define PAGE_0_ATTR(_name, _fmt, _expr) \
+static ssize_t _name##_show(struct device *dev, \
+ struct device_attribute *dev_attr, \
+ char *buf) \
+{ \
+ unsigned long hret; \
+ ssize_t ret = 0; \
+ void *page = kmalloc(4096, GFP_USER); \
+ struct hv_24x7_catalog_page_0 *page_0 = page; \
+ if (!page) \
+ return -ENOMEM; \
+ hret = h_get_24x7_catalog_page(page, 0, 0); \
+ if (hret) { \
+ ret = -EIO; \
+ goto e_free; \
+ } \
+ ret = sprintf(buf, _fmt, _expr); \
+e_free: \
+ kfree(page); \
+ return ret; \
+} \
+static DEVICE_ATTR_RO(_name)
+
+PAGE_0_ATTR(catalog_version, "%lld\n",
+ (unsigned long long)be32_to_cpu(page_0->version));
+PAGE_0_ATTR(catalog_len, "%lld\n",
+ (unsigned long long)be32_to_cpu(page_0->length) * 4096);
+static BIN_ATTR_RO(catalog, 0/* real length varies */);
+
+static struct bin_attribute *if_bin_attrs[] = {
+ &bin_attr_catalog,
+ NULL,
+};
+
+static struct attribute *if_attrs[] = {
+ &dev_attr_catalog_len.attr,
+ &dev_attr_catalog_version.attr,
+ NULL,
+};
+
+static struct attribute_group if_group = {
+ .name = "interface",
+ .bin_attrs = if_bin_attrs,
+ .attrs = if_attrs,
+};
+
+static const struct attribute_group *attr_groups[] = {
+ &format_group,
+ &if_group,
+ NULL,
+};
+
+static bool is_physical_domain(int domain)
+{
+ return domain == HV_24X7_PERF_DOMAIN_PHYSICAL_CHIP ||
+ domain == HV_24X7_PERF_DOMAIN_PHYSICAL_CORE;
+}
+
+static unsigned long single_24x7_request(u8 domain, u32 offset, u16 ix,
+ u16 lpar, u64 *res,
+ bool success_expected)
+{
+ unsigned long ret;
+
+ /*
+ * request_buffer and result_buffer are not required to be 4k aligned,
+ * but are not allowed to cross any 4k boundary. Aligning them to 4k is
+ * the simplest way to ensure that.
+ */
+ struct reqb {
+ struct hv_24x7_request_buffer buf;
+ struct hv_24x7_request req;
+ } __packed __aligned(4096) request_buffer = {
+ .buf = {
+ .interface_version = HV_24X7_IF_VERSION_CURRENT,
+ .num_requests = 1,
+ },
+ .req = {
+ .performance_domain = domain,
+ .data_size = cpu_to_be16(8),
+ .data_offset = cpu_to_be32(offset),
+ .starting_lpar_ix = cpu_to_be16(lpar),
+ .max_num_lpars = cpu_to_be16(1),
+ .starting_ix = cpu_to_be16(ix),
+ .max_ix = cpu_to_be16(1),
+ }
+ };
+
+ struct resb {
+ struct hv_24x7_data_result_buffer buf;
+ struct hv_24x7_result res;
+ struct hv_24x7_result_element elem;
+ __be64 result;
+ } __packed __aligned(4096) result_buffer = {};
+
+ ret = plpar_hcall_norets(H_GET_24X7_DATA,
+ virt_to_phys(&request_buffer), sizeof(request_buffer),
+ virt_to_phys(&result_buffer), sizeof(result_buffer));
+
+ if (ret) {
+ if (success_expected)
+ pr_err_ratelimited("hcall failed: %d %#x %#x %d => 0x%lx (%ld) detail=0x%x failing ix=%x\n",
+ domain, offset, ix, lpar,
+ ret, ret,
+ result_buffer.buf.detailed_rc,
+ result_buffer.buf.failing_request_ix);
+ return ret;
+ }
+
+ *res = be64_to_cpu(result_buffer.result);
+ return ret;
+}
+
+static unsigned long event_24x7_request(struct perf_event *event, u64 *res,
+ bool success_expected)
+{
+ return single_24x7_request(event_get_domain(event),
+ event_get_offset(event),
+ event_get_starting_index(event),
+ event_get_lpar(event),
+ res,
+ success_expected);
+}
+
+static int h_24x7_event_init(struct perf_event *event)
+{
+ struct hv_perf_caps caps;
+ unsigned domain;
+ unsigned long hret;
+ u64 ct;
+
+ /* Not our event */
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ /* Unused areas must be 0 */
+ if (event_get_reserved1(event) ||
+ event_get_reserved2(event) ||
+ event_get_reserved3(event)) {
+ pr_devel("reserved set when forbidden 0x%llx(0x%llx) 0x%llx(0x%llx) 0x%llx(0x%llx)\n",
+ event->attr.config,
+ event_get_reserved1(event),
+ event->attr.config1,
+ event_get_reserved2(event),
+ event->attr.config2,
+ event_get_reserved3(event));
+ return -EINVAL;
+ }
+
+ /* unsupported modes and filters */
+ if (event->attr.exclude_user ||
+ event->attr.exclude_kernel ||
+ event->attr.exclude_hv ||
+ event->attr.exclude_idle ||
+ event->attr.exclude_host ||
+ event->attr.exclude_guest ||
+ is_sampling_event(event)) /* no sampling */
+ return -EINVAL;
+
+ /* no branch sampling */
+ if (has_branch_stack(event))
+ return -EOPNOTSUPP;
+
+ /* offset must be 8 byte aligned */
+ if (event_get_offset(event) % 8) {
+ pr_devel("bad alignment\n");
+ return -EINVAL;
+ }
+
+ /* Domains above 6 are invalid */
+ domain = event_get_domain(event);
+ if (domain > 6) {
+ pr_devel("invalid domain %d\n", domain);
+ return -EINVAL;
+ }
+
+ hret = hv_perf_caps_get(&caps);
+ if (hret) {
+ pr_devel("could not get capabilities: rc=%ld\n", hret);
+ return -EIO;
+ }
+
+ /* PHYSICAL domains & other lpars require extra capabilities */
+ if (!caps.collect_privileged && (is_physical_domain(domain) ||
+ (event_get_lpar(event) != event_get_lpar_max()))) {
+ pr_devel("hv permisions disallow: is_physical_domain:%d, lpar=0x%llx\n",
+ is_physical_domain(domain),
+ event_get_lpar(event));
+ return -EACCES;
+ }
+
+ /* see if the event complains */
+ if (event_24x7_request(event, &ct, false)) {
+ pr_devel("test hcall failed\n");
+ return -EIO;
+ }
+
+ perf_swevent_init_hrtimer(event);
+ return 0;
+}
+
+static u64 h_24x7_get_value(struct perf_event *event)
+{
+ unsigned long ret;
+ u64 ct;
+ ret = event_24x7_request(event, &ct, true);
+ if (ret)
+ /* We checked this in event init, shouldn't fail here... */
+ return 0;
+
+ return ct;
+}
+
+static void h_24x7_event_update(struct perf_event *event)
+{
+ s64 prev;
+ u64 now;
+ now = h_24x7_get_value(event);
+ prev = local64_xchg(&event->hw.prev_count, now);
+ local64_add(now - prev, &event->count);
+}
+
+static void h_24x7_event_start(struct perf_event *event, int flags)
+{
+ if (flags & PERF_EF_RELOAD)
+ local64_set(&event->hw.prev_count, h_24x7_get_value(event));
+ perf_swevent_start_hrtimer(event);
+}
+
+static void h_24x7_event_stop(struct perf_event *event, int flags)
+{
+ perf_swevent_cancel_hrtimer(event);
+ h_24x7_event_update(event);
+}
+
+static int h_24x7_event_add(struct perf_event *event, int flags)
+{
+ if (flags & PERF_EF_START)
+ h_24x7_event_start(event, flags);
+
+ return 0;
+}
+
+static struct pmu h_24x7_pmu = {
+ .task_ctx_nr = perf_invalid_context,
+
+ .name = "hv_24x7",
+ .attr_groups = attr_groups,
+ .event_init = h_24x7_event_init,
+ .add = h_24x7_event_add,
+ .del = h_24x7_event_stop,
+ .start = h_24x7_event_start,
+ .stop = h_24x7_event_stop,
+ .read = h_24x7_event_update,
+
+ .event_idx = perf_swevent_event_idx,
+};
+
+static int hv_24x7_init(void)
+{
+ int r;
+ unsigned long hret;
+ struct hv_perf_caps caps;
+
+ if (!firmware_has_feature(FW_FEATURE_LPAR)) {
+ pr_info("not a virtualized system, not enabling\n");
+ return -ENODEV;
+ }
+
+ hret = hv_perf_caps_get(&caps);
+ if (hret) {
+ pr_info("could not obtain capabilities, error 0x%80lx, not enabling\n",
+ hret);
+ return -ENODEV;
+ }
+
+ r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1);
+ if (r)
+ return r;
+
+ return 0;
+}
+
+device_initcall(hv_24x7_init);
--
1.8.5.4
Signed-off-by: Cody P Schafer <[email protected]>
---
arch/powerpc/perf/Makefile | 2 ++
arch/powerpc/platforms/Kconfig.cputype | 6 ++++++
2 files changed, 8 insertions(+)
diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 60d71ee..f9c083a 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -11,5 +11,7 @@ obj32-$(CONFIG_PPC_PERF_CTRS) += mpc7450-pmu.o
obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
obj-$(CONFIG_FSL_EMB_PERF_EVENT_E500) += e500-pmu.o e6500-pmu.o
+obj-$(CONFIG_HV_PERF_CTRS) += hv-24x7.o hv-gpci.o hv-common.o
+
obj-$(CONFIG_PPC64) += $(obj64-y)
obj-$(CONFIG_PPC32) += $(obj32-y)
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 434fda3..dcc67cd 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -364,6 +364,12 @@ config PPC_PERF_CTRS
help
This enables the powerpc-specific perf_event back-end.
+config HV_PERF_CTRS
+ def_bool y
+ depends on PERF_EVENTS && PPC_HAVE_PMU_SUPPORT
+ help
+ Enable access to perf counters provided by the hypervisor
+
config SMP
depends on PPC_BOOK3S || PPC_BOOK3E || FSL_BOOKE || PPC_47x
bool "Symmetric multi-processing support"
--
1.8.5.4
gpci and 24x7 expose some device specific attributes. Add some
documentation for them.
Signed-off-by: Cody P Schafer <[email protected]>
---
.../testing/sysfs-bus-event_source-devices-hv_24x7 | 22 +++++++++++
.../testing/sysfs-bus-event_source-devices-hv_gpci | 43 ++++++++++++++++++++++
2 files changed, 65 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
new file mode 100644
index 0000000..13474d3
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
@@ -0,0 +1,22 @@
+What: /sys/bus/event_source/devices/hv_24x7/interface/catalog
+Date: February 2014
+Contact: Cody P Schafer <[email protected]>
+Description:
+ Provides access to the binary "24x7 catalog" provided by the
+ hypervisor on POWER7 and 8 systems. This catalog lists events
+ avaliable from the powerpc "hv_24x7" pmu. Its format is
+ documented in arch/powerpc/perf/hv_24x7.h.
+
+What: /sys/bus/event_source/devices/hv_24x7/interface/catalog_length
+Date: February 2014
+Contact: Cody P Schafer <[email protected]>
+Description:
+ A number equal to the length in bytes of the catalog. This is
+ also extractable from the provided binary "catalog" sysfs entry.
+
+What: /sys/bus/event_source/devices/hv_24x7/interface/catalog_version
+Date: February 2014
+Contact: Cody P Schafer <[email protected]>
+Description:
+ Exposes the "version" field of the 24x7 catalog. This is also
+ extractable from the provided binary "catalog" sysfs entry.
diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
new file mode 100644
index 0000000..3fa58c2
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_gpci
@@ -0,0 +1,43 @@
+What: /sys/bus/event_source/devices/hv_gpci/interface/collect_privileged
+Date: February 2014
+Contact: Cody P Schafer <[email protected]>
+Description:
+ '0' if the hypervisor is configured to forbid access to event
+ counters being accumulated by other guests and to physical
+ domain event counters.
+ '1' if that access is allowed.
+
+What: /sys/bus/event_source/devices/hv_gpci/interface/ga
+Date: February 2014
+Contact: Cody P Schafer <[email protected]>
+Description:
+ 0 or 1. Indicates whether we have access to "GA" events (listed
+ in arch/powerpc/perf/hv-gpci.h).
+
+What: /sys/bus/event_source/devices/hv_gpci/interface/expanded
+Date: February 2014
+Contact: Cody P Schafer <[email protected]>
+Description:
+ 0 or 1. Indicates whether we have access to "EXPANDED" events (listed
+ in arch/powerpc/perf/hv-gpci.h).
+
+What: /sys/bus/event_source/devices/hv_gpci/interface/lab
+Date: February 2014
+Contact: Cody P Schafer <[email protected]>
+Description:
+ 0 or 1. Indicates whether we have access to "LAB" events (listed
+ in arch/powerpc/perf/hv-gpci.h).
+
+What: /sys/bus/event_source/devices/hv_gpci/interface/version
+Date: February 2014
+Contact: Cody P Schafer <[email protected]>
+Description:
+ A number indicating the version of the gpci interface that the
+ hypervisor reports supporting.
+
+What: /sys/bus/event_source/devices/hv_gpci/interface/kernel_version
+Date: February 2014
+Contact: Cody P Schafer <[email protected]>
+Description:
+ A number indicating the latest version of the gpci interface
+ that the kernel is aware of.
--
1.8.5.4
Signed-off-by: Cody P Schafer <[email protected]>
---
arch/powerpc/include/asm/hvcall.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index d8b600b..652f7e4 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -274,6 +274,11 @@
/* Platform specific hcalls, used by KVM */
#define H_RTAS 0xf000
+/* "Platform specific hcalls", provided by PHYP */
+#define H_GET_24X7_CATALOG_PAGE 0xF078
+#define H_GET_24X7_DATA 0xF07C
+#define H_GET_PERF_COUNTER_INFO 0xF080
+
#ifndef __ASSEMBLY__
/**
--
1.8.5.4
bin_attributes created/updated in create_files() (such as those listed
via (struct device).attribute_groups) were not placed under the
specified group, and instead appeared in the base kobj directory.
Fix this by making bin_attributes use creating code similar to normal
attributes.
A quick grep shows that no one is using bin_attrs in a named attribute
group yet, so we can do this without breaking anything in usespace.
Note that I do not add is_visible() support to
bin_attributes, though that could be done as well.
Signed-off-by: Cody P Schafer <[email protected]>
---
fs/sysfs/group.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index 6b57938..aa04068 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -70,8 +70,11 @@ static int create_files(struct kernfs_node *parent, struct kobject *kobj,
if (grp->bin_attrs) {
for (bin_attr = grp->bin_attrs; *bin_attr; bin_attr++) {
if (update)
- sysfs_remove_bin_file(kobj, *bin_attr);
- error = sysfs_create_bin_file(kobj, *bin_attr);
+ kernfs_remove_by_name(parent,
+ (*bin_attr)->attr.name);
+ error = sysfs_add_file_mode_ns(parent,
+ &(*bin_attr)->attr, true,
+ (*bin_attr)->attr.mode, NULL);
if (error)
break;
}
--
1.8.5.4
Add PMU_RANGE_ATTR() and PMU_RANGE_RESV() (for reserved areas) which
generate functions to extract the relevent bits from
event->attr.config{,1,2} for use by sw-like pmus where the
'config{,1,2}' values don't map directly to hardware registers.
Signed-off-by: Cody P Schafer <[email protected]>
---
include/linux/perf_event.h | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e56b07f..2702e91 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -871,4 +871,21 @@ _name##_show(struct device *dev, \
\
static struct device_attribute format_attr_##_name = __ATTR_RO(_name)
+#define PMU_RANGE_ATTR(name, attr_var, bit_start, bit_end) \
+PMU_FORMAT_ATTR(name, #attr_var ":" #bit_start "-" #bit_end); \
+PMU_RANGE_RESV(name, attr_var, bit_start, bit_end)
+
+#define PMU_RANGE_RESV(name, attr_var, bit_start, bit_end) \
+static u64 event_get_##name##_max(void) \
+{ \
+ int bits = (bit_end) - (bit_start) + 1; \
+ return ((0x1ULL << (bits - 1ULL)) - 1ULL) | \
+ (0xFULL << (bits - 4ULL)); \
+} \
+static u64 event_get_##name(struct perf_event *event) \
+{ \
+ return (event->attr.attr_var >> (bit_start)) & \
+ event_get_##name##_max(); \
+}
+
#endif /* _LINUX_PERF_EVENT_H */
--
1.8.5.4
On Fri, Feb 14, 2014 at 04:32:13PM -0600, Scott Wood wrote:
> On Fri, 2014-02-14 at 14:02 -0800, Cody P Schafer wrote:
> > Signed-off-by: Cody P Schafer <[email protected]>
> > ---
> > arch/powerpc/perf/Makefile | 2 ++
> > arch/powerpc/platforms/Kconfig.cputype | 6 ++++++
> > 2 files changed, 8 insertions(+)
> >
> > diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
> > index 60d71ee..f9c083a 100644
> > --- a/arch/powerpc/perf/Makefile
> > +++ b/arch/powerpc/perf/Makefile
> > @@ -11,5 +11,7 @@ obj32-$(CONFIG_PPC_PERF_CTRS) += mpc7450-pmu.o
> > obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
> > obj-$(CONFIG_FSL_EMB_PERF_EVENT_E500) += e500-pmu.o e6500-pmu.o
> >
> > +obj-$(CONFIG_HV_PERF_CTRS) += hv-24x7.o hv-gpci.o hv-common.o
> > +
> > obj-$(CONFIG_PPC64) += $(obj64-y)
> > obj-$(CONFIG_PPC32) += $(obj32-y)
> > diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> > index 434fda3..dcc67cd 100644
> > --- a/arch/powerpc/platforms/Kconfig.cputype
> > +++ b/arch/powerpc/platforms/Kconfig.cputype
> > @@ -364,6 +364,12 @@ config PPC_PERF_CTRS
> > help
> > This enables the powerpc-specific perf_event back-end.
> >
> > +config HV_PERF_CTRS
> > + def_bool y
> > + depends on PERF_EVENTS && PPC_HAVE_PMU_SUPPORT
> > + help
> > + Enable access to perf counters provided by the hypervisor
>
> Please don't add default-y stuff that is platform-specific, and
> definitely point out that platform dependency in the config description
> -- I have to look elsewhere in the patchset to determine that this is
> for "Power Hypervisor". PPC_HAVE_PMU_SUPPORT is enabled by all 6xx
> builds, even for hardware like e300 that doesn't have PMU at all (it has
> the FSL embedded perfmon instead), much less this hv interface.
>
> And yes, PPC_PERF_CTRS has the same problem and should be fixed. :-)
Yep, I just based this one on what PPC_PERF_CTRS was doing.
How about the following:
+config HV_PERF_CTRS
+ bool "Perf Hypervisor supplied counters"
+ default y
+ depends on PERF_EVENTS && PPC_HAVE_PMU_SUPPORT && PPC_PSERIES
+ help
+ Enable access to hypervisor supplied counters in perf. Currently,
+ this enables code that uses the hcall GetPerfCounterInfo and 24x7
+ interfaces to retrieve counters. GPCI exists on Power 6 and later
+ systems. 24x7 is available on Power 8 systems.
+
+ If unsure, select Y.
And relocated to arch/powerpc/platforms/Kconfig (as this isn't really
strictly "cputype" related).
On Fri, Feb 14, 2014 at 02:02:07PM -0800, Cody P Schafer wrote:
> bin_attributes created/updated in create_files() (such as those listed
> via (struct device).attribute_groups) were not placed under the
> specified group, and instead appeared in the base kobj directory.
>
> Fix this by making bin_attributes use creating code similar to normal
> attributes.
>
> A quick grep shows that no one is using bin_attrs in a named attribute
> group yet, so we can do this without breaking anything in usespace.
>
> Note that I do not add is_visible() support to
> bin_attributes, though that could be done as well.
is_visible() support would be nice to add if you get a chance.
thanks,
greg k-h
On Fri, 2014-02-14 at 16:25 -0800, Cody P Schafer wrote:
> On Fri, Feb 14, 2014 at 04:32:13PM -0600, Scott Wood wrote:
> > On Fri, 2014-02-14 at 14:02 -0800, Cody P Schafer wrote:
> > > diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> > > index 434fda3..dcc67cd 100644
> > > --- a/arch/powerpc/platforms/Kconfig.cputype
> > > +++ b/arch/powerpc/platforms/Kconfig.cputype
> > > @@ -364,6 +364,12 @@ config PPC_PERF_CTRS
> > > help
> > > This enables the powerpc-specific perf_event back-end.
> > >
> > > +config HV_PERF_CTRS
> > > + def_bool y
> > > + depends on PERF_EVENTS && PPC_HAVE_PMU_SUPPORT
> > > + help
> > > + Enable access to perf counters provided by the hypervisor
> >
> > Please don't add default-y stuff that is platform-specific, and
> > definitely point out that platform dependency in the config description
> > -- I have to look elsewhere in the patchset to determine that this is
> > for "Power Hypervisor". PPC_HAVE_PMU_SUPPORT is enabled by all 6xx
> > builds, even for hardware like e300 that doesn't have PMU at all (it has
> > the FSL embedded perfmon instead), much less this hv interface.
> >
> > And yes, PPC_PERF_CTRS has the same problem and should be fixed. :-)
>
> Yep, I just based this one on what PPC_PERF_CTRS was doing.
>
> How about the following:
>
> +config HV_PERF_CTRS
> + bool "Perf Hypervisor supplied counters"
"Support for Hypervisor supplied PMU events (24x7 & GPCI)" ?
> + default y
> + depends on PERF_EVENTS && PPC_HAVE_PMU_SUPPORT && PPC_PSERIES
I think you just want:
depends on PERF_EVENTS && PPC_PSERIES
Because you're adding two completely new PMUs, they're not a "struct power_pmu"
backend for the existing powerpc PMU implementation.
cheers
On 02/16/2014 11:11 PM, Michael Ellerman wrote:
> On Fri, 2014-02-14 at 16:25 -0800, Cody P Schafer wrote:
>> On Fri, Feb 14, 2014 at 04:32:13PM -0600, Scott Wood wrote:
>>> On Fri, 2014-02-14 at 14:02 -0800, Cody P Schafer wrote:
>>>> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
>>>> index 434fda3..dcc67cd 100644
>>>> --- a/arch/powerpc/platforms/Kconfig.cputype
>>>> +++ b/arch/powerpc/platforms/Kconfig.cputype
>>>> @@ -364,6 +364,12 @@ config PPC_PERF_CTRS
>>>> help
>>>> This enables the powerpc-specific perf_event back-end.
>>>>
>>>> +config HV_PERF_CTRS
>>>> + def_bool y
>>>> + depends on PERF_EVENTS && PPC_HAVE_PMU_SUPPORT
>>>> + help
>>>> + Enable access to perf counters provided by the hypervisor
>>>
>>> Please don't add default-y stuff that is platform-specific, and
>>> definitely point out that platform dependency in the config description
>>> -- I have to look elsewhere in the patchset to determine that this is
>>> for "Power Hypervisor". PPC_HAVE_PMU_SUPPORT is enabled by all 6xx
>>> builds, even for hardware like e300 that doesn't have PMU at all (it has
>>> the FSL embedded perfmon instead), much less this hv interface.
>>>
>>> And yes, PPC_PERF_CTRS has the same problem and should be fixed. :-)
>>
>> Yep, I just based this one on what PPC_PERF_CTRS was doing.
>>
>> How about the following:
>>
>> +config HV_PERF_CTRS
>> + bool "Perf Hypervisor supplied counters"
>
> "Support for Hypervisor supplied PMU events (24x7 & GPCI)" ?
Sounds good to me.
>
>> + default y
>> + depends on PERF_EVENTS && PPC_HAVE_PMU_SUPPORT && PPC_PSERIES
>
> I think you just want:
>
> depends on PERF_EVENTS && PPC_PSERIES
>
>
> Because you're adding two completely new PMUs, they're not a "struct power_pmu"
> backend for the existing powerpc PMU implementation.
>
Ack. I'll fix this up in v3
Signed-off-by: Cody P Schafer <[email protected]>
---
arch/powerpc/perf/Makefile | 2 ++
arch/powerpc/platforms/pseries/Kconfig | 12 ++++++++++++
2 files changed, 14 insertions(+)
diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 60d71ee..f9c083a 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -11,5 +11,7 @@ obj32-$(CONFIG_PPC_PERF_CTRS) += mpc7450-pmu.o
obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
obj-$(CONFIG_FSL_EMB_PERF_EVENT_E500) += e500-pmu.o e6500-pmu.o
+obj-$(CONFIG_HV_PERF_CTRS) += hv-24x7.o hv-gpci.o hv-common.o
+
obj-$(CONFIG_PPC64) += $(obj64-y)
obj-$(CONFIG_PPC32) += $(obj32-y)
diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index 80b1d57..2cb8b77 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -111,6 +111,18 @@ config CMM
will be reused for other LPARs. The interface allows firmware to
balance memory across many LPARs.
+config HV_PERF_CTRS
+ bool "Hypervisor supplied PMU events (24x7 & GPCI)"
+ default y
+ depends on PERF_EVENTS && PPC_PSERIES
+ help
+ Enable access to hypervisor supplied counters in perf. Currently,
+ this enables code that uses the hcall GetPerfCounterInfo and 24x7
+ interfaces to retrieve counters. GPCI exists on Power 6 and later
+ systems. 24x7 is available on Power 8 systems.
+
+ If unsure, select Y.
+
config DTL
bool "Dispatch Trace Log"
depends on PPC_SPLPAR && DEBUG_FS
--
1.9.0
Whoops, should be "[Patch v2.1 10/11]"
On Fri, 2014-14-02 at 22:02:13 UTC, Cody P Schafer wrote:
> This provides a basic interface between hv_24x7 and perf. Similar to
> the one provided for gpci, it lacks transaction support and does not
> list any events.
>
> Signed-off-by: Cody P Schafer <[email protected]>
> ---
> arch/powerpc/perf/hv-24x7.c | 491 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 491 insertions(+)
> create mode 100644 arch/powerpc/perf/hv-24x7.c
>
> diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
> new file mode 100644
> index 0000000..13de140
> --- /dev/null
> +++ b/arch/powerpc/perf/hv-24x7.c
...
> +
> +/*
> + * read_offset_data - copy data from one buffer to another while treating the
> + * source buffer as a small view on the total avaliable
> + * source data.
> + *
> + * @dest: buffer to copy into
> + * @dest_len: length of @dest in bytes
> + * @requested_offset: the offset within the source data we want. Must be > 0
> + * @src: buffer to copy data from
> + * @src_len: length of @src in bytes
> + * @source_offset: the offset in the sorce data that (src,src_len) refers to.
> + * Must be > 0
> + *
> + * returns the number of bytes copied.
> + *
> + * '.' areas in d are written to.
> + *
> + * u
> + * x w v z
> + * d |.........|
> + * s |----------------------|
> + *
> + * u
> + * x w z v
> + * d |........------|
> + * s |------------------|
> + *
> + * x w u,z,v
> + * d |........|
> + * s |------------------|
> + *
> + * x,w u,v,z
> + * d |------------------|
> + * s |------------------|
> + *
> + * x u
> + * w v z
> + * d |........|
> + * s |------------------|
> + *
> + * x z w v
> + * d |------|
> + * s |------|
> + *
> + * x = source_offset
> + * w = requested_offset
> + * z = source_offset + src_len
> + * v = requested_offset + dest_len
> + *
> + * w_offset_in_s = w - x = requested_offset - source_offset
> + * z_offset_in_s = z - x = src_len
> + * v_offset_in_s = v - x = request_offset + dest_len - src_len
> + * u_offset_in_s = min(z_offset_in_s, v_offset_in_s)
> + *
> + * copy_len = u_offset_in_s - w_offset_in_s = min(z_offset_in_s, v_offset_in_s)
> + * - w_offset_in_s
Comments are great, especially for complicated code like this. But at a glance
I don't actually understand what this comment is trying to tell me.
> + */
> +static ssize_t read_offset_data(void *dest, size_t dest_len,
> + loff_t requested_offset, void *src,
> + size_t src_len, loff_t source_offset)
> +{
> + size_t w_offset_in_s = requested_offset - source_offset;
> + size_t z_offset_in_s = src_len;
> + size_t v_offset_in_s = requested_offset + dest_len - src_len;
> + size_t u_offset_in_s = min(z_offset_in_s, v_offset_in_s);
> + size_t copy_len = u_offset_in_s - w_offset_in_s;
> +
> + if (requested_offset < 0 || source_offset < 0)
> + return -EINVAL;
> +
> + if (z_offset_in_s <= w_offset_in_s)
> + return 0;
> +
> + memcpy(dest, src + w_offset_in_s, copy_len);
> + return copy_len;
> +}
> +
> +static unsigned long h_get_24x7_catalog_page(char page[static 4096],
> + u32 version, u32 index)
> +{
> + WARN_ON(!IS_ALIGNED((unsigned long)page, 4096));
> + return plpar_hcall_norets(H_GET_24X7_CATALOG_PAGE,
> + virt_to_phys(page),
> + version,
> + index);
> +}
> +
> +static ssize_t catalog_read(struct file *filp, struct kobject *kobj,
> + struct bin_attribute *bin_attr, char *buf,
> + loff_t offset, size_t count)
> +{
> + unsigned long hret;
> + ssize_t ret = 0;
> + size_t catalog_len = 0, catalog_page_len = 0, page_count = 0;
> + loff_t page_offset = 0;
> + uint32_t catalog_version_num = 0;
> + void *page = kmalloc(4096, GFP_USER);
> + struct hv_24x7_catalog_page_0 *page_0 = page;
> + if (!page)
> + return -ENOMEM;
> +
> +
> + hret = h_get_24x7_catalog_page(page, 0, 0);
> + if (hret) {
> + ret = -EIO;
> + goto e_free;
> + }
> +
> + catalog_version_num = be32_to_cpu(page_0->version);
> + catalog_page_len = be32_to_cpu(page_0->length);
> + catalog_len = catalog_page_len * 4096;
> +
> + page_offset = offset / 4096;
> + page_count = count / 4096;
> +
> + if (page_offset >= catalog_page_len)
> + goto e_free;
> +
> + if (page_offset != 0) {
> + hret = h_get_24x7_catalog_page(page, catalog_version_num,
> + page_offset);
> + if (hret) {
> + ret = -EIO;
> + goto e_free;
> + }
> + }
> +
> + ret = read_offset_data(buf, count, offset,
> + page, 4096, page_offset * 4096);
> +e_free:
> + if (hret)
> + pr_err("h_get_24x7_catalog_page(ver=%d, page=%lld) failed: rc=%ld\n",
> + catalog_version_num, page_offset, hret);
> + kfree(page);
> +
> + pr_devel("catalog_read: offset=%lld(%lld) count=%zu(%zu) catalog_len=%zu(%zu) => %zd\n",
> + offset, page_offset, count, page_count, catalog_len,
> + catalog_page_len, ret);
> +
> + return ret;
> +}
> +
> +#define PAGE_0_ATTR(_name, _fmt, _expr) \
> +static ssize_t _name##_show(struct device *dev, \
> + struct device_attribute *dev_attr, \
> + char *buf) \
> +{ \
> + unsigned long hret; \
> + ssize_t ret = 0; \
> + void *page = kmalloc(4096, GFP_USER); \
> + struct hv_24x7_catalog_page_0 *page_0 = page; \
> + if (!page) \
> + return -ENOMEM; \
> + hret = h_get_24x7_catalog_page(page, 0, 0); \
> + if (hret) { \
> + ret = -EIO; \
> + goto e_free; \
> + } \
> + ret = sprintf(buf, _fmt, _expr); \
> +e_free: \
> + kfree(page); \
> + return ret; \
> +} \
> +static DEVICE_ATTR_RO(_name)
> +
> +PAGE_0_ATTR(catalog_version, "%lld\n",
> + (unsigned long long)be32_to_cpu(page_0->version));
> +PAGE_0_ATTR(catalog_len, "%lld\n",
> + (unsigned long long)be32_to_cpu(page_0->length) * 4096);
> +static BIN_ATTR_RO(catalog, 0/* real length varies */);
So we're dumping the catalog out as a binary blob.
Why do we want to do that?
It clearly violates the sysfs rule-of-sorts of ASCII and one value per file.
Obviously there can be exceptions, but what's our justification?
> +static struct bin_attribute *if_bin_attrs[] = {
> + &bin_attr_catalog,
> + NULL,
> +};
> +
> +static struct attribute *if_attrs[] = {
> + &dev_attr_catalog_len.attr,
> + &dev_attr_catalog_version.attr,
> + NULL,
> +};
> +
> +static struct attribute_group if_group = {
> + .name = "interface",
> + .bin_attrs = if_bin_attrs,
> + .attrs = if_attrs,
> +};
Both pmus have an "interface" directory, but they don't seem to have anything
in common? Its feels a little ad-hoc.
> +static const struct attribute_group *attr_groups[] = {
> + &format_group,
> + &if_group,
> + NULL,
> +};
cheers
On Fri, 2014-14-02 at 22:02:07 UTC, Cody P Schafer wrote:
> bin_attributes created/updated in create_files() (such as those listed
> via (struct device).attribute_groups) were not placed under the
> specified group, and instead appeared in the base kobj directory.
>
> Fix this by making bin_attributes use creating code similar to normal
> attributes.
>
> A quick grep shows that no one is using bin_attrs in a named attribute
> group yet, so we can do this without breaking anything in usespace.
>
> Note that I do not add is_visible() support to
> bin_attributes, though that could be done as well.
>
> Signed-off-by: Cody P Schafer <[email protected]>
Greg has already taken this, so we'll consider that as good as an ack from him,
unless he wants to give us one.
cheers
On Fri, 2014-14-02 at 22:02:14 UTC, Cody P Schafer wrote:
> Signed-off-by: Cody P Schafer <[email protected]>
> ---
> arch/powerpc/perf/Makefile | 2 ++
> arch/powerpc/platforms/Kconfig.cputype | 6 ++++++
> 2 files changed, 8 insertions(+)
>
> diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
> index 60d71ee..f9c083a 100644
> --- a/arch/powerpc/perf/Makefile
> +++ b/arch/powerpc/perf/Makefile
> @@ -11,5 +11,7 @@ obj32-$(CONFIG_PPC_PERF_CTRS) += mpc7450-pmu.o
> obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
> obj-$(CONFIG_FSL_EMB_PERF_EVENT_E500) += e500-pmu.o e6500-pmu.o
>
> +obj-$(CONFIG_HV_PERF_CTRS) += hv-24x7.o hv-gpci.o hv-common.o
> +
> obj-$(CONFIG_PPC64) += $(obj64-y)
> obj-$(CONFIG_PPC32) += $(obj32-y)
> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> index 434fda3..dcc67cd 100644
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -364,6 +364,12 @@ config PPC_PERF_CTRS
> help
> This enables the powerpc-specific perf_event back-end.
>
> +config HV_PERF_CTRS
> + def_bool y
This was bool, why did you change it?
> + depends on PERF_EVENTS && PPC_HAVE_PMU_SUPPORT
Should be:
depends on PERF_EVENTS && PPC_PSERIES
> + help
> + Enable access to perf counters provided by the hypervisor
> +
cheers
On Fri, 2014-14-02 at 22:02:08 UTC, Cody P Schafer wrote:
> Signed-off-by: Cody P Schafer <[email protected]>
> ---
> arch/powerpc/include/asm/hvcall.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
> index d8b600b..652f7e4 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -274,6 +274,11 @@
> /* Platform specific hcalls, used by KVM */
> #define H_RTAS 0xf000
>
> +/* "Platform specific hcalls", provided by PHYP */
> +#define H_GET_24X7_CATALOG_PAGE 0xF078
> +#define H_GET_24X7_DATA 0xF07C
> +#define H_GET_PERF_COUNTER_INFO 0xF080
Some tabs some spaces, use tabs.
cheers
On Fri, 2014-14-02 at 22:02:12 UTC, Cody P Schafer wrote:
> This provides a basic link between perf and hv_gpci. Notably, it does
> not yet support transactions and does not list any events (they can
> still be manually composed).
Can you explain how the HV_CAPS stuff ends up looking.
I'm not against adding it, but I'd like to understand how we expect it to be
used a bit better.
cheers
> diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
> new file mode 100644
> index 0000000..1f5d96d
> --- /dev/null
> +++ b/arch/powerpc/perf/hv-gpci.c
> +
> +static struct pmu h_gpci_pmu = {
> + .task_ctx_nr = perf_invalid_context,
> +
> + .name = "hv_gpci",
> + .attr_groups = attr_groups,
> + .event_init = h_gpci_event_init,
> + .add = h_gpci_event_add,
> + .del = h_gpci_event_del,
= h_gpci_event_stop,
> + .start = h_gpci_event_start,
> + .stop = h_gpci_event_stop,
> + .read = h_gpci_event_read,
= h_gpci_event_update
> + .event_idx = perf_swevent_event_idx,
> +};
[PATCH v2 07/11] powerpc: add a shared interface to get gpci version and capabilities
All the patches that touch perf should be "powerpc/perf: foo"
On Fri, 2014-14-02 at 22:02:11 UTC, Cody P Schafer wrote:
> ...
I realise this is a fairly small patch but a changelog is still nice. You could
for example mention that we don't currently use .ga, .expanded or .lab but
we're adding the logic anyway because ...
> Signed-off-by: Cody P Schafer <[email protected]>
> ---
> arch/powerpc/perf/hv-common.c | 39 +++++++++++++++++++++++++++++++++++++++
> arch/powerpc/perf/hv-common.h | 17 +++++++++++++++++
> 2 files changed, 56 insertions(+)
> create mode 100644 arch/powerpc/perf/hv-common.c
> create mode 100644 arch/powerpc/perf/hv-common.h
>
> diff --git a/arch/powerpc/perf/hv-common.c b/arch/powerpc/perf/hv-common.c
> new file mode 100644
> index 0000000..47e02b3
> --- /dev/null
> +++ b/arch/powerpc/perf/hv-common.c
> @@ -0,0 +1,39 @@
> +#include <asm/io.h>
> +#include <asm/hvcall.h>
> +
> +#include "hv-gpci.h"
> +#include "hv-common.h"
> +
> +unsigned long hv_perf_caps_get(struct hv_perf_caps *caps)
> +{
> + unsigned long r;
> + struct p {
> + struct hv_get_perf_counter_info_params params;
> + struct cv_system_performance_capabilities caps;
> + } __packed __aligned(sizeof(uint64_t));
> +
> + struct p arg = {
> + .params = {
> + .counter_request = cpu_to_be32(
> + CIR_SYSTEM_PERFORMANCE_CAPABILITIES),
> + .starting_index = cpu_to_be32(-1),
> + .counter_info_version_in = 0,
> + }
> + };
> +
> + r = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO,
> + virt_to_phys(&arg), sizeof(arg));
> +
> + if (r)
> + return r;
> +
> + pr_devel("capability_mask: 0x%x\n", arg.caps.capability_mask);
> +
> + caps->version = arg.params.counter_info_version_out;
> + caps->collect_privileged = !!arg.caps.perf_collect_privileged;
> + caps->ga = !!(arg.caps.capability_mask & CV_CM_GA);
> + caps->expanded = !!(arg.caps.capability_mask & CV_CM_EXPANDED);
> + caps->lab = !!(arg.caps.capability_mask & CV_CM_LAB);
> +
> + return r;
> +}
> diff --git a/arch/powerpc/perf/hv-common.h b/arch/powerpc/perf/hv-common.h
> new file mode 100644
> index 0000000..7e615bd
> --- /dev/null
> +++ b/arch/powerpc/perf/hv-common.h
> @@ -0,0 +1,17 @@
> +#ifndef LINUX_POWERPC_PERF_HV_COMMON_H_
> +#define LINUX_POWERPC_PERF_HV_COMMON_H_
> +
> +#include <linux/types.h>
> +
> +struct hv_perf_caps {
> + u16 version;
> + u16 collect_privileged:1,
> + ga:1,
> + expanded:1,
> + lab:1,
> + unused:12;
> +};
> +
> +unsigned long hv_perf_caps_get(struct hv_perf_caps *caps);
> +
> +#endif
> --
> 1.8.5.4
>
>
On Fri, 2014-14-02 at 22:02:05 UTC, Cody P Schafer wrote:
> Add PMU_RANGE_ATTR() and PMU_RANGE_RESV() (for reserved areas) which
> generate functions to extract the relevent bits from
> event->attr.config{,1,2} for use by sw-like pmus where the
> 'config{,1,2}' values don't map directly to hardware registers.
>
> Signed-off-by: Cody P Schafer <[email protected]>
> ---
> include/linux/perf_event.h | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index e56b07f..2702e91 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -871,4 +871,21 @@ _name##_show(struct device *dev, \
> \
> static struct device_attribute format_attr_##_name = __ATTR_RO(_name)
>
> +#define PMU_RANGE_ATTR(name, attr_var, bit_start, bit_end) \
> +PMU_FORMAT_ATTR(name, #attr_var ":" #bit_start "-" #bit_end); \
> +PMU_RANGE_RESV(name, attr_var, bit_start, bit_end)
> +
> +#define PMU_RANGE_RESV(name, attr_var, bit_start, bit_end) \
> +static u64 event_get_##name##_max(void) \
> +{ \
> + int bits = (bit_end) - (bit_start) + 1; \
> + return ((0x1ULL << (bits - 1ULL)) - 1ULL) | \
> + (0xFULL << (bits - 4ULL)); \
> +} \
> +static u64 event_get_##name(struct perf_event *event) \
> +{ \
> + return (event->attr.attr_var >> (bit_start)) & \
> + event_get_##name##_max(); \
> +}
I still don't like the names.
EVENT_GETTER_AND_FORMAT()
EVENT_RESERVED()
?
It's not clear to me the max routine is useful in general. Can't we just do:
> +#define EVENT_RESERVED(name, attr_var, bit_start, bit_end) \
> +static u64 event_get_##name(struct perf_event *event) \
> +{ \
> + return (event->attr.attr_var >> (bit_start)) & \
> + ((0x1ULL << ((bit_end) - (bit_start) + 1)) - 1ULL); \
> +}
cheers
On Fri, 2014-14-02 at 22:02:09 UTC, Cody P Schafer wrote:
> "H_GetPerformanceCounterInfo" (refered to as hv_gpci or just gpci from
> here on) is an interface to retrieve specific performance counters and
> other data from the hypervisor. All outputs have a fixed format (and
> are represented as structs in this patch).
I still see unused stuff in here, can you strip it back to just what we need.
Same goes for the next patch.
cheers
On Fri, 2014-14-02 at 22:02:06 UTC, Cody P Schafer wrote:
> Export the swevent hrtimer helpers currently only used in events/core.c
> to allow the addition of architecture specific sw-like pmus.
Peter, Ingo, can we get your ACK on this please?
cheers
> Signed-off-by: Cody P Schafer <[email protected]>
> ---
> include/linux/perf_event.h | 5 ++++-
> kernel/events/core.c | 8 ++++----
> 2 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 2702e91..24378a9 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -559,7 +559,10 @@ extern void perf_pmu_migrate_context(struct pmu *pmu,
> int src_cpu, int dst_cpu);
> extern u64 perf_event_read_value(struct perf_event *event,
> u64 *enabled, u64 *running);
> -
> +extern void perf_swevent_init_hrtimer(struct perf_event *event);
> +extern void perf_swevent_start_hrtimer(struct perf_event *event);
> +extern void perf_swevent_cancel_hrtimer(struct perf_event *event);
> +extern int perf_swevent_event_idx(struct perf_event *event);
>
> struct perf_sample_data {
> u64 type;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 56003c6..feb0347 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5816,7 +5816,7 @@ static int perf_swevent_init(struct perf_event *event)
> return 0;
> }
>
> -static int perf_swevent_event_idx(struct perf_event *event)
> +int perf_swevent_event_idx(struct perf_event *event)
> {
> return 0;
> }
> @@ -6045,7 +6045,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
> return ret;
> }
>
> -static void perf_swevent_start_hrtimer(struct perf_event *event)
> +void perf_swevent_start_hrtimer(struct perf_event *event)
> {
> struct hw_perf_event *hwc = &event->hw;
> s64 period;
> @@ -6067,7 +6067,7 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
> HRTIMER_MODE_REL_PINNED, 0);
> }
>
> -static void perf_swevent_cancel_hrtimer(struct perf_event *event)
> +void perf_swevent_cancel_hrtimer(struct perf_event *event)
> {
> struct hw_perf_event *hwc = &event->hw;
>
> @@ -6079,7 +6079,7 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
> }
> }
>
> -static void perf_swevent_init_hrtimer(struct perf_event *event)
> +void perf_swevent_init_hrtimer(struct perf_event *event)
> {
> struct hw_perf_event *hwc = &event->hw;
>
> --
> 1.8.5.4
>
> _______________________________________________
> Linuxppc-dev mailing list
> [email protected]
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>
>
On Tue, Feb 25, 2014 at 02:33:26PM +1100, Michael Ellerman wrote:
> On Fri, 2014-14-02 at 22:02:06 UTC, Cody P Schafer wrote:
> > Export the swevent hrtimer helpers currently only used in events/core.c
> > to allow the addition of architecture specific sw-like pmus.
>
> Peter, Ingo, can we get your ACK on this please?
How are they used? I saw some usage in patch 9 or so; but its not
explained anywhere. All patches have non-existent Changelogs and the few
comments that are there are pretty hardware specific.
So please do tell; what do you need this for?
On 02/24/2014 07:33 PM, Michael Ellerman wrote:
> On Fri, 2014-14-02 at 22:02:05 UTC, Cody P Schafer wrote:
>> Add PMU_RANGE_ATTR() and PMU_RANGE_RESV() (for reserved areas) which
>> generate functions to extract the relevent bits from
>> event->attr.config{,1,2} for use by sw-like pmus where the
>> 'config{,1,2}' values don't map directly to hardware registers.
>>
>> Signed-off-by: Cody P Schafer <[email protected]>
>> ---
>> include/linux/perf_event.h | 17 +++++++++++++++++
>> 1 file changed, 17 insertions(+)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index e56b07f..2702e91 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -871,4 +871,21 @@ _name##_show(struct device *dev, \
>> \
>> static struct device_attribute format_attr_##_name = __ATTR_RO(_name)
>>
>> +#define PMU_RANGE_ATTR(name, attr_var, bit_start, bit_end) \
>> +PMU_FORMAT_ATTR(name, #attr_var ":" #bit_start "-" #bit_end); \
>> +PMU_RANGE_RESV(name, attr_var, bit_start, bit_end)
>> +
>> +#define PMU_RANGE_RESV(name, attr_var, bit_start, bit_end) \
>> +static u64 event_get_##name##_max(void) \
>> +{ \
>> + int bits = (bit_end) - (bit_start) + 1; \
>> + return ((0x1ULL << (bits - 1ULL)) - 1ULL) | \
>> + (0xFULL << (bits - 4ULL)); \
>> +} \
>> +static u64 event_get_##name(struct perf_event *event) \
>> +{ \
>> + return (event->attr.attr_var >> (bit_start)) & \
>> + event_get_##name##_max(); \
>> +}
>
> I still don't like the names.
>
> EVENT_GETTER_AND_FORMAT()
EVENT_RANGE()
I'd prefer to describe the intended usage rather than what is generated
both in case we change some of the specifics later, and to provide
additional information to the developers beyond what a simple code
reading gives.
> EVENT_RESERVED()
Sure. The PMU_* naming was just based on the PMU_FORMAT_ATTR() naming,
so I kept it for continuity with the existing API. Maybe
EVENT_RANGE_RESERVED() would be more appropriate?
> ?
>
> It's not clear to me the max routine is useful in general. Can't we just do:
>
>> +#define EVENT_RESERVED(name, attr_var, bit_start, bit_end) \
>> +static u64 event_get_##name(struct perf_event *event) \
>> +{ \
>> + return (event->attr.attr_var >> (bit_start)) & \
>> + ((0x1ULL << ((bit_end) - (bit_start) + 1)) - 1ULL); \
>> +}
I use event_get_*_max() for some checking of parameters in event_init().
Having it lets me avoid specifying the maximum explicitly (0x7ffff =
0-19, for example). Specifying it explicitly would mean we'd have the
bit width of the field in question encoded in two places instead of one,
and I'd prefer to avoid unneeded duplication.
On 02/24/2014 07:33 PM, Michael Ellerman wrote:
> On Fri, 2014-14-02 at 22:02:09 UTC, Cody P Schafer wrote:
>> "H_GetPerformanceCounterInfo" (refered to as hv_gpci or just gpci from
>> here on) is an interface to retrieve specific performance counters and
>> other data from the hypervisor. All outputs have a fixed format (and
>> are represented as structs in this patch).
>
> I still see unused stuff in here, can you strip it back to just what we need.
> Same goes for the next patch.
>
Sure, I can remove the unused structures and enum entries (hadn't
realized you wanted that in the last review).
On 02/24/2014 07:33 PM, Michael Ellerman wrote:
> On Fri, 2014-14-02 at 22:02:13 UTC, Cody P Schafer wrote:
>> This provides a basic interface between hv_24x7 and perf. Similar to
>> the one provided for gpci, it lacks transaction support and does not
>> list any events.
>>
>> Signed-off-by: Cody P Schafer <[email protected]>
>> ---
>> arch/powerpc/perf/hv-24x7.c | 491 ++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 491 insertions(+)
>> create mode 100644 arch/powerpc/perf/hv-24x7.c
>>
>> diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
>> new file mode 100644
>> index 0000000..13de140
>> --- /dev/null
>> +++ b/arch/powerpc/perf/hv-24x7.c
> ...
>> +
>> +/*
>> + * read_offset_data - copy data from one buffer to another while treating the
>> + * source buffer as a small view on the total avaliable
>> + * source data.
>> + *
>> + * @dest: buffer to copy into
>> + * @dest_len: length of @dest in bytes
>> + * @requested_offset: the offset within the source data we want. Must be > 0
>> + * @src: buffer to copy data from
>> + * @src_len: length of @src in bytes
>> + * @source_offset: the offset in the sorce data that (src,src_len) refers to.
>> + * Must be > 0
>> + *
>> + * returns the number of bytes copied.
>> + *
>> + * '.' areas in d are written to.
>> + *
>> + * u
>> + * x w v z
>> + * d |.........|
>> + * s |----------------------|
>> + *
>> + * u
>> + * x w z v
>> + * d |........------|
>> + * s |------------------|
>> + *
>> + * x w u,z,v
>> + * d |........|
>> + * s |------------------|
>> + *
>> + * x,w u,v,z
>> + * d |------------------|
>> + * s |------------------|
>> + *
>> + * x u
>> + * w v z
>> + * d |........|
>> + * s |------------------|
>> + *
>> + * x z w v
>> + * d |------|
>> + * s |------|
>> + *
>> + * x = source_offset
>> + * w = requested_offset
>> + * z = source_offset + src_len
>> + * v = requested_offset + dest_len
>> + *
>> + * w_offset_in_s = w - x = requested_offset - source_offset
>> + * z_offset_in_s = z - x = src_len
>> + * v_offset_in_s = v - x = request_offset + dest_len - src_len
>> + * u_offset_in_s = min(z_offset_in_s, v_offset_in_s)
>> + *
>> + * copy_len = u_offset_in_s - w_offset_in_s = min(z_offset_in_s, v_offset_in_s)
>> + * - w_offset_in_s
>
> Comments are great, especially for complicated code like this. But at a glance
> I don't actually understand what this comment is trying to tell me.
The function was composed via some number line logic. The comment tries
to explain what that logic is. The ascii art is various overlapping
buffers that we're copying between (the '+'s from the patch are messing
with the indenting some of the labels). The only major omission I'm
seeing is I failed to note that d=dest and s=src (though this could be
inferred from the comment about '.' indicating a write).
Is there anything specific That doesn't make sense in the comment? (it
may not be a comment that really can be read at a glance).
>
>> + */
>> +static ssize_t read_offset_data(void *dest, size_t dest_len,
>> + loff_t requested_offset, void *src,
>> + size_t src_len, loff_t source_offset)
>> +{
>> + size_t w_offset_in_s = requested_offset - source_offset;
>> + size_t z_offset_in_s = src_len;
>> + size_t v_offset_in_s = requested_offset + dest_len - src_len;
>> + size_t u_offset_in_s = min(z_offset_in_s, v_offset_in_s);
>> + size_t copy_len = u_offset_in_s - w_offset_in_s;
>> +
>> + if (requested_offset < 0 || source_offset < 0)
>> + return -EINVAL;
>> +
>> + if (z_offset_in_s <= w_offset_in_s)
>> + return 0;
>> +
>> + memcpy(dest, src + w_offset_in_s, copy_len);
>> + return copy_len;
>> +}
>> +
>> +static unsigned long h_get_24x7_catalog_page(char page[static 4096],
>> + u32 version, u32 index)
>> +{
>> + WARN_ON(!IS_ALIGNED((unsigned long)page, 4096));
>> + return plpar_hcall_norets(H_GET_24X7_CATALOG_PAGE,
>> + virt_to_phys(page),
>> + version,
>> + index);
>> +}
>> +
>> +static ssize_t catalog_read(struct file *filp, struct kobject *kobj,
>> + struct bin_attribute *bin_attr, char *buf,
>> + loff_t offset, size_t count)
>> +{
>> + unsigned long hret;
>> + ssize_t ret = 0;
>> + size_t catalog_len = 0, catalog_page_len = 0, page_count = 0;
>> + loff_t page_offset = 0;
>> + uint32_t catalog_version_num = 0;
>> + void *page = kmalloc(4096, GFP_USER);
>> + struct hv_24x7_catalog_page_0 *page_0 = page;
>> + if (!page)
>> + return -ENOMEM;
>> +
>> +
>> + hret = h_get_24x7_catalog_page(page, 0, 0);
>> + if (hret) {
>> + ret = -EIO;
>> + goto e_free;
>> + }
>> +
>> + catalog_version_num = be32_to_cpu(page_0->version);
>> + catalog_page_len = be32_to_cpu(page_0->length);
>> + catalog_len = catalog_page_len * 4096;
>> +
>> + page_offset = offset / 4096;
>> + page_count = count / 4096;
>> +
>> + if (page_offset >= catalog_page_len)
>> + goto e_free;
>> +
>> + if (page_offset != 0) {
>> + hret = h_get_24x7_catalog_page(page, catalog_version_num,
>> + page_offset);
>> + if (hret) {
>> + ret = -EIO;
>> + goto e_free;
>> + }
>> + }
>> +
>> + ret = read_offset_data(buf, count, offset,
>> + page, 4096, page_offset * 4096);
>> +e_free:
>> + if (hret)
>> + pr_err("h_get_24x7_catalog_page(ver=%d, page=%lld) failed: rc=%ld\n",
>> + catalog_version_num, page_offset, hret);
>> + kfree(page);
>> +
>> + pr_devel("catalog_read: offset=%lld(%lld) count=%zu(%zu) catalog_len=%zu(%zu) => %zd\n",
>> + offset, page_offset, count, page_count, catalog_len,
>> + catalog_page_len, ret);
>> +
>> + return ret;
>> +}
>> +
>> +#define PAGE_0_ATTR(_name, _fmt, _expr) \
>> +static ssize_t _name##_show(struct device *dev, \
>> + struct device_attribute *dev_attr, \
>> + char *buf) \
>> +{ \
>> + unsigned long hret; \
>> + ssize_t ret = 0; \
>> + void *page = kmalloc(4096, GFP_USER); \
>> + struct hv_24x7_catalog_page_0 *page_0 = page; \
>> + if (!page) \
>> + return -ENOMEM; \
>> + hret = h_get_24x7_catalog_page(page, 0, 0); \
>> + if (hret) { \
>> + ret = -EIO; \
>> + goto e_free; \
>> + } \
>> + ret = sprintf(buf, _fmt, _expr); \
>> +e_free: \
>> + kfree(page); \
>> + return ret; \
>> +} \
>> +static DEVICE_ATTR_RO(_name)
>> +
>> +PAGE_0_ATTR(catalog_version, "%lld\n",
>> + (unsigned long long)be32_to_cpu(page_0->version));
>> +PAGE_0_ATTR(catalog_len, "%lld\n",
>> + (unsigned long long)be32_to_cpu(page_0->length) * 4096);
>> +static BIN_ATTR_RO(catalog, 0/* real length varies */);
>
> So we're dumping the catalog out as a binary blob.
Yep
> Why do we want to do that?
Right now it's the only way to know what events are available.
Additionally, even when the kernel starts parsing events out (and
exposing them via sysfs), there is some additional powerpc specific
structuring ("groups" and "schemas" that some userspace applications may
want to take advantage of.
> It clearly violates the sysfs rule-of-sorts of ASCII and one value per file.
> Obviously there can be exceptions, but what's our justification?
Actual justification is above, but additionally:
I actually was looking at the acpi code that provides (among other
binary tables) the dsdt as a binary blob in sysfs when I was putting
this code together. The 24x7 catalog is, in the same manner, a binary
blob provided by firmware.
>> +static struct bin_attribute *if_bin_attrs[] = {
>> + &bin_attr_catalog,
>> + NULL,
>> +};
>> +
>> +static struct attribute *if_attrs[] = {
>> + &dev_attr_catalog_len.attr,
>> + &dev_attr_catalog_version.attr,
>> + NULL,
>> +};
>> +
>> +static struct attribute_group if_group = {
>> + .name = "interface",
>> + .bin_attrs = if_bin_attrs,
>> + .attrs = if_attrs,
>> +};
>
> Both pmus have an "interface" directory, but they don't seem to have anything
> in common? Its feels a little ad-hoc.
It is absolutely ad-hoc. The only similarity is that both groups named
"interface" provide some additional details about the firmware interface
they're using to provide the perf data. We could easily call them both
"misc", "details", put all the attributes in the device root, or call
them some other generic name. I ended up choosing "interface" because
we're provided details on the firmware interface, and it feels just a
bit less generic. Having device specific names for the attribute group
("24x7" and "gpci", for example) doesn't get us anything because the
devices themselves already have those names ("hv_24x7" and "hv_gpci"). I
don't see any reason to make them different.
On 02/24/2014 07:33 PM, Michael Ellerman wrote:
> On Fri, 2014-14-02 at 22:02:08 UTC, Cody P Schafer wrote:
>> Signed-off-by: Cody P Schafer <[email protected]>
>> ---
>> arch/powerpc/include/asm/hvcall.h | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
>> index d8b600b..652f7e4 100644
>> --- a/arch/powerpc/include/asm/hvcall.h
>> +++ b/arch/powerpc/include/asm/hvcall.h
>> @@ -274,6 +274,11 @@
>> /* Platform specific hcalls, used by KVM */
>> #define H_RTAS 0xf000
>>
>> +/* "Platform specific hcalls", provided by PHYP */
>> +#define H_GET_24X7_CATALOG_PAGE 0xF078
>> +#define H_GET_24X7_DATA 0xF07C
>> +#define H_GET_PERF_COUNTER_INFO 0xF080
>
> Some tabs some spaces, use tabs.
Ack.
On 02/24/2014 07:33 PM, Michael Ellerman wrote:
> [PATCH v2 07/11] powerpc: add a shared interface to get gpci version and capabilities
>
> All the patches that touch perf should be "powerpc/perf: foo"
Ok.
> On Fri, 2014-14-02 at 22:02:11 UTC, Cody P Schafer wrote:
>> ...
>
> I realise this is a fairly small patch but a changelog is still nice. You could
> for example mention that we don't currently use .ga, .expanded or .lab but
> we're adding the logic anyway because ...
>
Well, we do use them to expose some more information to the user (via
sysfs attributes). Always nice to know what capabilities are enabled.
But sure, I can explain why each bit in that structure is a good idea.
>
>> Signed-off-by: Cody P Schafer <[email protected]>
>> ---
>> arch/powerpc/perf/hv-common.c | 39 +++++++++++++++++++++++++++++++++++++++
>> arch/powerpc/perf/hv-common.h | 17 +++++++++++++++++
>> 2 files changed, 56 insertions(+)
>> create mode 100644 arch/powerpc/perf/hv-common.c
>> create mode 100644 arch/powerpc/perf/hv-common.h
>>
>> diff --git a/arch/powerpc/perf/hv-common.c b/arch/powerpc/perf/hv-common.c
>> new file mode 100644
>> index 0000000..47e02b3
>> --- /dev/null
>> +++ b/arch/powerpc/perf/hv-common.c
>> @@ -0,0 +1,39 @@
>> +#include <asm/io.h>
>> +#include <asm/hvcall.h>
>> +
>> +#include "hv-gpci.h"
>> +#include "hv-common.h"
>> +
>> +unsigned long hv_perf_caps_get(struct hv_perf_caps *caps)
>> +{
>> + unsigned long r;
>> + struct p {
>> + struct hv_get_perf_counter_info_params params;
>> + struct cv_system_performance_capabilities caps;
>> + } __packed __aligned(sizeof(uint64_t));
>> +
>> + struct p arg = {
>> + .params = {
>> + .counter_request = cpu_to_be32(
>> + CIR_SYSTEM_PERFORMANCE_CAPABILITIES),
>> + .starting_index = cpu_to_be32(-1),
>> + .counter_info_version_in = 0,
>> + }
>> + };
>> +
>> + r = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO,
>> + virt_to_phys(&arg), sizeof(arg));
>> +
>> + if (r)
>> + return r;
>> +
>> + pr_devel("capability_mask: 0x%x\n", arg.caps.capability_mask);
>> +
>> + caps->version = arg.params.counter_info_version_out;
>> + caps->collect_privileged = !!arg.caps.perf_collect_privileged;
>> + caps->ga = !!(arg.caps.capability_mask & CV_CM_GA);
>> + caps->expanded = !!(arg.caps.capability_mask & CV_CM_EXPANDED);
>> + caps->lab = !!(arg.caps.capability_mask & CV_CM_LAB);
>> +
>> + return r;
>> +}
>> diff --git a/arch/powerpc/perf/hv-common.h b/arch/powerpc/perf/hv-common.h
>> new file mode 100644
>> index 0000000..7e615bd
>> --- /dev/null
>> +++ b/arch/powerpc/perf/hv-common.h
>> @@ -0,0 +1,17 @@
>> +#ifndef LINUX_POWERPC_PERF_HV_COMMON_H_
>> +#define LINUX_POWERPC_PERF_HV_COMMON_H_
>> +
>> +#include <linux/types.h>
>> +
>> +struct hv_perf_caps {
>> + u16 version;
>> + u16 collect_privileged:1,
>> + ga:1,
>> + expanded:1,
>> + lab:1,
>> + unused:12;
>> +};
>> +
>> +unsigned long hv_perf_caps_get(struct hv_perf_caps *caps);
>> +
>> +#endif
>> --
>> 1.8.5.4
>>
>>
>
On 02/24/2014 07:33 PM, Michael Ellerman wrote:
> On Fri, 2014-14-02 at 22:02:12 UTC, Cody P Schafer wrote:
>> This provides a basic link between perf and hv_gpci. Notably, it does
>> not yet support transactions and does not list any events (they can
>> still be manually composed).
>
> Can you explain how the HV_CAPS stuff ends up looking.
>
> I'm not against adding it, but I'd like to understand how we expect it to be
> used a bit better.
It's just a quick mechanism for me to expose some relevant information
to userspace via sysfs using the hv_perf_caps_get() function's returned
data. Documentation for this sysfs interface (and the rest) is in a
later patch.
I don't expect any more uses to show up unless the firmware decides to
add another capability bit (in which case I'll want to expose it as well).
>> diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
>> new file mode 100644
>> index 0000000..1f5d96d
>> --- /dev/null
>> +++ b/arch/powerpc/perf/hv-gpci.c
>> +
>> +static struct pmu h_gpci_pmu = {
>> + .task_ctx_nr = perf_invalid_context,
>> +
>> + .name = "hv_gpci",
>> + .attr_groups = attr_groups,
>> + .event_init = h_gpci_event_init,
>> + .add = h_gpci_event_add,
>> + .del = h_gpci_event_del,
> = h_gpci_event_stop,
>
>> + .start = h_gpci_event_start,
>> + .stop = h_gpci_event_stop,
>> + .read = h_gpci_event_read,
> = h_gpci_event_update
>
>> + .event_idx = perf_swevent_event_idx,
>> +};
whoops, thought I had fixed those 2 already.
On 02/24/2014 07:33 PM, Michael Ellerman wrote:
> On Fri, 2014-14-02 at 22:02:14 UTC, Cody P Schafer wrote:
>> Signed-off-by: Cody P Schafer <[email protected]>
>> ---
>> arch/powerpc/perf/Makefile | 2 ++
>> arch/powerpc/platforms/Kconfig.cputype | 6 ++++++
>> 2 files changed, 8 insertions(+)
>>
>> diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
>> index 60d71ee..f9c083a 100644
>> --- a/arch/powerpc/perf/Makefile
>> +++ b/arch/powerpc/perf/Makefile
>> @@ -11,5 +11,7 @@ obj32-$(CONFIG_PPC_PERF_CTRS) += mpc7450-pmu.o
>> obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
>> obj-$(CONFIG_FSL_EMB_PERF_EVENT_E500) += e500-pmu.o e6500-pmu.o
>>
>> +obj-$(CONFIG_HV_PERF_CTRS) += hv-24x7.o hv-gpci.o hv-common.o
>> +
>> obj-$(CONFIG_PPC64) += $(obj64-y)
>> obj-$(CONFIG_PPC32) += $(obj32-y)
>> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
>> index 434fda3..dcc67cd 100644
>> --- a/arch/powerpc/platforms/Kconfig.cputype
>> +++ b/arch/powerpc/platforms/Kconfig.cputype
>> @@ -364,6 +364,12 @@ config PPC_PERF_CTRS
>> help
>> This enables the powerpc-specific perf_event back-end.
>>
>> +config HV_PERF_CTRS
>> + def_bool y
>
> This was bool, why did you change it?
No, it wasn't. v1 also had def_bool. https://lkml.org/lkml/2014/1/16/518
Maybe you're confusing v2.1 and v2 of this patch?
>
>> + depends on PERF_EVENTS && PPC_HAVE_PMU_SUPPORT
>
> Should be:
>
> depends on PERF_EVENTS && PPC_PSERIES
>
>> + help
>> + Enable access to perf counters provided by the hypervisor
>> +
Yep, the v2.1 patch (which I bungled and labeled as 9/11) already
changes both of these.
It'll end up rolled into v3.
On 02/25/2014 02:20 AM, Peter Zijlstra wrote:
> On Tue, Feb 25, 2014 at 02:33:26PM +1100, Michael Ellerman wrote:
>> On Fri, 2014-14-02 at 22:02:06 UTC, Cody P Schafer wrote:
>>> Export the swevent hrtimer helpers currently only used in events/core.c
>>> to allow the addition of architecture specific sw-like pmus.
>>
>> Peter, Ingo, can we get your ACK on this please?
>
> How are they used? I saw some usage in patch 9 or so; but its not
> explained anywhere. All patches have non-existent Changelogs and the few
> comments that are there are pretty hardware specific.
>
> So please do tell; what do you need this for?
From this patch's change log:
> Export the swevent hrtimer helpers currently only used in events/core.c to allow the addition of architecture specific sw-like pmus.
The key part here is "architecture specific sw-like pmus", where the
announcement explains why these pmus are sw-like:
> The counters supplied by these interfaces are continually counting and never
> need to be (and cannot be) disabled or enabled. They additionally do not
> generate any interrupts. This makes them in some regards similar to software
> counters, and as a result their implimentation shares some common code (which
> an initial patch exposes) with the sw counters.
Essentially, these pmus just provide access to a big array of counters
which don't generate interrupts, and are all 64bit (and assumed to never
overflow). Rather than duplicate the code that we already have for
managing timing when reading from counters that don't have interrupts
(the functions that are exposed by this patch), I've reused it.
On 02/25/2014 12:33 PM, Cody P Schafer wrote:
> On 02/24/2014 07:33 PM, Michael Ellerman wrote:
>> On Fri, 2014-14-02 at 22:02:05 UTC, Cody P Schafer wrote:
>>> Add PMU_RANGE_ATTR() and PMU_RANGE_RESV() (for reserved areas) which
>>> generate functions to extract the relevent bits from
>>> event->attr.config{,1,2} for use by sw-like pmus where the
>>> 'config{,1,2}' values don't map directly to hardware registers.
>>>
>>> Signed-off-by: Cody P Schafer <[email protected]>
>>> ---
>>> include/linux/perf_event.h | 17 +++++++++++++++++
>>> 1 file changed, 17 insertions(+)
>>>
>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>> index e56b07f..2702e91 100644
>>> --- a/include/linux/perf_event.h
>>> +++ b/include/linux/perf_event.h
>>> @@ -871,4 +871,21 @@ _name##_show(struct device
>>> *dev, \
>>> \
>>> static struct device_attribute format_attr_##_name = __ATTR_RO(_name)
>>>
>>> +#define PMU_RANGE_ATTR(name, attr_var, bit_start, bit_end) \
>>> +PMU_FORMAT_ATTR(name, #attr_var ":" #bit_start "-" #bit_end); \
>>> +PMU_RANGE_RESV(name, attr_var, bit_start, bit_end)
>>> +
>>> +#define PMU_RANGE_RESV(name, attr_var, bit_start, bit_end) \
>>> +static u64 event_get_##name##_max(void) \
>>> +{ \
>>> + int bits = (bit_end) - (bit_start) + 1; \
>>> + return ((0x1ULL << (bits - 1ULL)) - 1ULL) | \
>>> + (0xFULL << (bits - 4ULL)); \
>>> +} \
>>> +static u64 event_get_##name(struct perf_event *event) \
>>> +{ \
>>> + return (event->attr.attr_var >> (bit_start)) & \
>>> + event_get_##name##_max(); \
>>> +}
>>
>> I still don't like the names.
>>
>> EVENT_GETTER_AND_FORMAT()
>
> EVENT_RANGE()
>
> I'd prefer to describe the intended usage rather than what is generated
> both in case we change some of the specifics later, and to provide
> additional information to the developers beyond what a simple code
> reading gives.
>
>> EVENT_RESERVED()
>
> Sure. The PMU_* naming was just based on the PMU_FORMAT_ATTR() naming,
> so I kept it for continuity with the existing API. Maybe
> EVENT_RANGE_RESERVED() would be more appropriate?
>
Thinking about this a bit more, EVENT_RANGE() and EVENT_RANGE_RESERVED()
aren't quite ideal either. The "EVENT" name collides with the files we
put in the event/ dir, which these macros generate files for the format/
dir. Maybe:
FORMAT_RANGE() and FORMAT_RANGE_RESERVED()
or
PMU_FORMAT_RANGE(), PMU_FORMAT_RANGE_RESERVED()
On Tue, 2014-02-25 at 13:31 -0800, Cody P Schafer wrote:
> On 02/24/2014 07:33 PM, Michael Ellerman wrote:
> > On Fri, 2014-14-02 at 22:02:14 UTC, Cody P Schafer wrote:
> >> Signed-off-by: Cody P Schafer <[email protected]>
> >> ---
> >> arch/powerpc/perf/Makefile | 2 ++
> >> arch/powerpc/platforms/Kconfig.cputype | 6 ++++++
> >> 2 files changed, 8 insertions(+)
> >>
> >> diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
> >> index 60d71ee..f9c083a 100644
> >> --- a/arch/powerpc/perf/Makefile
> >> +++ b/arch/powerpc/perf/Makefile
> >> @@ -11,5 +11,7 @@ obj32-$(CONFIG_PPC_PERF_CTRS) += mpc7450-pmu.o
> >> obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
> >> obj-$(CONFIG_FSL_EMB_PERF_EVENT_E500) += e500-pmu.o e6500-pmu.o
> >>
> >> +obj-$(CONFIG_HV_PERF_CTRS) += hv-24x7.o hv-gpci.o hv-common.o
> >> +
> >> obj-$(CONFIG_PPC64) += $(obj64-y)
> >> obj-$(CONFIG_PPC32) += $(obj32-y)
> >> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> >> index 434fda3..dcc67cd 100644
> >> --- a/arch/powerpc/platforms/Kconfig.cputype
> >> +++ b/arch/powerpc/platforms/Kconfig.cputype
> >> @@ -364,6 +364,12 @@ config PPC_PERF_CTRS
> >> help
> >> This enables the powerpc-specific perf_event back-end.
> >>
> >> +config HV_PERF_CTRS
> >> + def_bool y
> >
> > This was bool, why did you change it?
>
> No, it wasn't. v1 also had def_bool. https://lkml.org/lkml/2014/1/16/518
> Maybe you're confusing v2.1 and v2 of this patch?
Er yes. Point releases of a patch series confuse me :)
> >> + depends on PERF_EVENTS && PPC_HAVE_PMU_SUPPORT
> >
> > Should be:
> >
> > depends on PERF_EVENTS && PPC_PSERIES
> >
> >> + help
> >> + Enable access to perf counters provided by the hypervisor
> >> +
>
> Yep, the v2.1 patch (which I bungled and labeled as 9/11) already
> changes both of these.
> It'll end up rolled into v3.
Yes please.
cheers
On Tue, Feb 25, 2014 at 01:38:31PM -0800, Cody P Schafer wrote:
> On 02/25/2014 02:20 AM, Peter Zijlstra wrote:
> >On Tue, Feb 25, 2014 at 02:33:26PM +1100, Michael Ellerman wrote:
> >>On Fri, 2014-14-02 at 22:02:06 UTC, Cody P Schafer wrote:
> >>>Export the swevent hrtimer helpers currently only used in events/core.c
> >>>to allow the addition of architecture specific sw-like pmus.
> >>
> >>Peter, Ingo, can we get your ACK on this please?
> >
> >How are they used? I saw some usage in patch 9 or so; but its not
> >explained anywhere. All patches have non-existent Changelogs and the few
> >comments that are there are pretty hardware specific.
> >
> >So please do tell; what do you need this for?
>
> From this patch's change log:
>
> >Export the swevent hrtimer helpers currently only used in events/core.c to allow the addition of architecture specific sw-like pmus.
>
> The key part here is "architecture specific sw-like pmus", where the
> announcement explains why these pmus are sw-like:
I don't read announcements for crucial patch details; announcements are
lost and therefore unimportant.
> >The counters supplied by these interfaces are continually counting and never
> >need to be (and cannot be) disabled or enabled. They additionally do not
> >generate any interrupts. This makes them in some regards similar to software
> >counters, and as a result their implimentation shares some common code (which
> >an initial patch exposes) with the sw counters.
>
> Essentially, these pmus just provide access to a big array of counters which
> don't generate interrupts, and are all 64bit (and assumed to never
> overflow). Rather than duplicate the code that we already have for managing
> timing when reading from counters that don't have interrupts (the functions
> that are exposed by this patch), I've reused it.
So note that all the software counters generate interrupts in their own
measuring domain. The hrtimer ones measure time and generate time based
interrupts, the event based ones generate 'interrupts' on their events.
What you have here is a hw pmu without interrupt capability. That's
fine, they don't get to generate interrupt. We have plenty of those
already.
But what you propose to do is add interrupt in another domain entirely.
That's not fine. Don't do that.
You also try and conceal this information; so you suck.
On 02/26/2014 12:29 AM, Peter Zijlstra wrote:
> On Tue, Feb 25, 2014 at 01:38:31PM -0800, Cody P Schafer wrote:
>> On 02/25/2014 02:20 AM, Peter Zijlstra wrote:
>>> On Tue, Feb 25, 2014 at 02:33:26PM +1100, Michael Ellerman wrote:
>>>> On Fri, 2014-14-02 at 22:02:06 UTC, Cody P Schafer wrote:
>>>>> Export the swevent hrtimer helpers currently only used in events/core.c
>>>>> to allow the addition of architecture specific sw-like pmus.
>>>>
>>>> Peter, Ingo, can we get your ACK on this please?
>>>
>>> How are they used? I saw some usage in patch 9 or so; but its not
>>> explained anywhere. All patches have non-existent Changelogs and the few
>>> comments that are there are pretty hardware specific.
>>>
>>> So please do tell; what do you need this for?
>>
>> From this patch's change log:
>>
>>> Export the swevent hrtimer helpers currently only used in events/core.c to allow the addition of architecture specific sw-like pmus.
>>
>> The key part here is "architecture specific sw-like pmus", where the
>> announcement explains why these pmus are sw-like:
>
> I don't read announcements for crucial patch details; announcements are
> lost and therefore unimportant.
And I'll be sure to elaborate further in the changelog next time (if I
don't drop this change entirely).
This is the first comment I've got on this particular patch.
>>> The counters supplied by these interfaces are continually counting and never
>>> need to be (and cannot be) disabled or enabled. They additionally do not
>>> generate any interrupts. This makes them in some regards similar to software
>>> counters, and as a result their implimentation shares some common code (which
>>> an initial patch exposes) with the sw counters.
>>
>> Essentially, these pmus just provide access to a big array of counters which
>> don't generate interrupts, and are all 64bit (and assumed to never
>> overflow). Rather than duplicate the code that we already have for managing
>> timing when reading from counters that don't have interrupts (the functions
>> that are exposed by this patch), I've reused it.
>
> So note that all the software counters generate interrupts in their own
> measuring domain. The hrtimer ones measure time and generate time based
> interrupts, the event based ones generate 'interrupts' on their events.
>
> What you have here is a hw pmu without interrupt capability. That's
> fine, they don't get to generate interrupt. We have plenty of those
> already.
>
> But what you propose to do is add interrupt in another domain entirely.
> That's not fine. Don't do that.
Ok, so it looks like I misunderstood the need for an interrupt. The
intention in using the swevent_hrtimer code was to enable setting up the
events as frequency sampled. After taking another look at the gpci and
24x7 pmus, I'm forbidding sampling events anyhow in event init, so the
timer code isn't even taken advantage of. I'll drop this patch in the
next set.
>
> You also try and conceal this information; so you suck.
>