2017-04-03 14:59:56

by Madhavan Srinivasan

[permalink] [raw]
Subject: [PATCH v6 00/11] IMC Instrumentation Support

Power9 has In-Memory-Collection (IMC) infrastructure which contains
various Performance Monitoring Units (PMUs) at Nest level (these are
on-chip but off-core), Core level and Thread level.

The Nest PMU counters are handled by a Nest IMC microcode which runs
in the OCC (On-Chip Controller) complex. The microcode collects the
counter data and moves the nest IMC counter data to memory.

The Core and Thread IMC PMU counters are handled in the core. Core
level PMU counters give us the IMC counters' data per core and thread
level PMU counters give us the IMC counters' data per CPU thread.

This patchset enables the nest IMC, core IMC and thread IMC
PMUs and is based on the initial work done by Madhavan Srinivasan.
"Nest Instrumentation Support" :
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-August/132078.html

v1 for this patchset can be found here :
https://lwn.net/Articles/705475/

Nest events:
Per-chip nest instrumentation provides various per-chip metrics
such as memory, powerbus, Xlink and Alink bandwidth.

Core events:
Per-core IMC instrumentation provides various per-core metrics
such as non-idle cycles, non-idle instructions, various cache and
memory related metrics etc.

Thread events:
All the events for thread level are same as core level with the
difference being in the domain. These are per-cpu metrics.

PMU Events' Information:
OPAL obtains the IMC PMU and event information from the IMC Catalog
and passes on to the kernel via the device tree. The events' information
contains :
- Event name
- Event Offset
- Event description
and, maybe :
- Event scale
- Event unit

Some PMUs may have a common scale and unit values for all their
supported events. For those cases, the scale and unit properties for
those events must be inherited from the PMU.

The event offset in the memory is where the counter data gets
accumulated.

The OPAL-side patches are posted upstream :
https://lists.ozlabs.org/pipermail/skiboot/2017-March/006531.html

The kernel discovers the IMC counters information in the device tree
at the "imc-counters" device node which has a compatible field
"ibm,opal-in-memory-counters".

Parsing of the Events' information:
To parse the IMC PMUs and events information, the kernel has to
discover the "imc-counters" node and walk through the pmu and event
nodes.

Here is an excerpt of the dt showing the imc-counters with
mcs0 (nest), core and thread node:

https://github.com/open-power/ima-catalog/blob/master/81E00612.4E0100.dts

/dts-v1/;

[...]

/dts-v1/;

/ {
name = "";
compatible = "ibm,opal-in-memory-counters";
#address-cells = <0x1>;
#size-cells = <0x1>;
imc-nest-offset = <0x320000>;
imc-nest-size = <0x30000>;
version-id = "";

NEST_MCS: nest-mcs-events {
#address-cells = <0x1>;
#size-cells = <0x1>;

event@0 {
event-name = "RRTO_QFULL_NO_DISP" ;
reg = <0x0 0x8>;
desc = "RRTO not dispatched in MCS0 due to capacity - pulses once for each time a valid RRTO op is not dispatched due to a command list full condition" ;
};
event@8 {
event-name = "WRTO_QFULL_NO_DISP" ;
reg = <0x8 0x8>;
desc = "WRTO not dispatched in MCS0 due to capacity - pulses once for each time a valid WRTO op is not dispatched due to a command list full condition" ;
};
[...]
mcs0 {
compatible = "ibm,imc-counters-nest";
events-prefix = "PM_MCS0_";
unit = "";
scale = "";
reg = <0x118 0x8>;
events = < &NEST_MCS >;
};

mcs1 {
compatible = "ibm,imc-counters-nest";
events-prefix = "PM_MCS1_";
unit = "";
scale = "";
reg = <0x198 0x8>;
events = < &NEST_MCS >;
};
[...]

CORE_EVENTS: core-events {
#address-cells = <0x1>;
#size-cells = <0x1>;

event@e0 {
event-name = "0THRD_NON_IDLE_PCYC" ;
reg = <0xe0 0x8>;
desc = "The number of processor cycles when all threads are idle" ;
};
event@120 {
event-name = "1THRD_NON_IDLE_PCYC" ;
reg = <0x120 0x8>;
desc = "The number of processor cycles when exactly one SMT thread is executing non-idle code" ;
};
[...]
core {
compatible = "ibm,imc-counters-core";
events-prefix = "CPM_";
unit = "";
scale = "";
reg = <0x0 0x8>;
events = < &CORE_EVENTS >;
};

thread {
compatible = "ibm,imc-counters-core";
events-prefix = "CPM_";
unit = "";
scale = "";
reg = <0x0 0x8>;
events = < &CORE_EVENTS >;
};
};

>From the device tree, the kernel parses the PMUs and their events'
information.

After parsing the IMC PMUs and their events, the PMUs and their
attributes are registered in the kernel.

This patchset (patches 9 and 10) configure the thread level IMC PMUs
to count for tasks, which give us the thread level metric values per
task.

Example Usage :
# perf list

[...]
nest_mcs0/PM_MCS_DOWN_128B_DATA_XFER_MC0/ [Kernel PMU event]
nest_mcs0/PM_MCS_DOWN_128B_DATA_XFER_MC0_LAST_SAMPLE/ [Kernel PMU event]
[...]
core_imc/CPM_NON_IDLE_INST/ [Kernel PMU event]
core_imc/CPM_NON_IDLE_PCYC/ [Kernel PMU event]
[...]
thread_imc/CPM_NON_IDLE_INST/ [Kernel PMU event]
thread_imc/CPM_NON_IDLE_PCYC/ [Kernel PMU event]

To see per chip data for nest_mcs0/PM_MCS_DOWN_128B_DATA_XFER_MC0/ :
# perf stat -e "nest_mcs0/PM_MCS_DOWN_128B_DATA_XFER_MC0/" -a --per-socket

To see non-idle instructions for core 0 :
# ./perf stat -e "core_imc/CPM_NON_IDLE_INST/" -C 0 -I 1000

To see non-idle instructions for a "make" :
# ./perf stat -e "thread_imc/CPM_NON_IDLE_PCYC/" make

Comments/feedback/suggestions are welcome.

TODO:
1)Add a sysfs interface to disable the Core imc (both for ldbar and pdbar)
2)Add disable_imc_nest and diable_imc_core kernel parameters to help out in debug

Changelog:
v5 -> v6:
- merged few patches for the readability and code flow
- Updated the commit message and code comments.
- Added kdump check.
- updated cpuhotplug code and added checks for perf migration context
- Added READ_ONCE() when reading the counter data.
- replaced of_property_read_u32() with of_get_address() for "reg" property read
- replaced UNKNOWN_DOMAIN with IMC_DOMAIN_UNKNOWN
v4 -> v5:
- Updated opal call numbers
- Added a patch to disable Core-IMC device using shutdown callback
- Added patch to support cpuhotplug for thread-imc
- Added patch to disable and enable core imc engine in cpuhot plug path
v3 -> v4 :
- Changed the events parser code to discover the PMU and events because
of the changed format of the IMC DTS file (Patch 3).
- Implemented the two TODOs to include core and thread IMC support with
this patchset (Patches 7 through 10).
- Changed the CPU hotplug code of Nest IMC PMUs to include a new state
CPUHP_AP_PERF_POWERPC_NEST_ONLINE (Patch 6).
v2 -> v3 :
- Changed all references for IMA (In-Memory Accumulation) to IMC (In-Memory
Collection).
v1 -> v2 :
- Account for the cases where a PMU can have a common scale and unit
values for all its supported events (Patch 3/6).
- Fixed a Build error (for maple_defconfig) by enabling imc_pmu.o
only for CONFIG_PPC_POWERNV=y (Patch 4/6)
- Read from the "event-name" property instead of "name" for an event
node (Patch 3/6).

Anju T Sudhakar (1):
powerpc/perf: Thread imc cpuhotplug support

Hemant Kumar (10):
powerpc/powernv: Data structure and macros definitions
powerpc/powernv: Autoload IMC device driver module
powerpc/powernv: Detect supported IMC units and its events
powerpc/perf: Add event attribute and group to IMC pmus
powerpc/perf: Generic imc pmu event functions
powerpc/perf: IMC pmu cpumask and cpu hotplug support
powerpc/powernv: Core IMC events detection
powerpc/perf: PMU functions for Core IMC and hotplugging
powerpc/powernv: Thread IMC events detection
powerpc/perf: Thread IMC PMU functions

arch/powerpc/include/asm/imc-pmu.h | 81 +++
arch/powerpc/include/asm/opal-api.h | 21 +-
arch/powerpc/include/asm/opal.h | 5 +
arch/powerpc/perf/Makefile | 6 +-
arch/powerpc/perf/imc-pmu.c | 837 +++++++++++++++++++++++++
arch/powerpc/platforms/powernv/Makefile | 2 +-
arch/powerpc/platforms/powernv/opal-imc.c | 575 +++++++++++++++++
arch/powerpc/platforms/powernv/opal-wrappers.S | 2 +
arch/powerpc/platforms/powernv/opal.c | 14 +
include/linux/cpuhotplug.h | 3 +
10 files changed, 1543 insertions(+), 3 deletions(-)
create mode 100644 arch/powerpc/include/asm/imc-pmu.h
create mode 100644 arch/powerpc/perf/imc-pmu.c
create mode 100644 arch/powerpc/platforms/powernv/opal-imc.c

--
2.7.4


2017-04-03 14:57:49

by Madhavan Srinivasan

[permalink] [raw]
Subject: [PATCH v6 05/11] powerpc/perf: Generic imc pmu event functions

From: Hemant Kumar <[email protected]>

Since, the IMC counters' data are periodically fed to a memory location,
the functions to read/update, start/stop, add/del can be generic and can
be used by all IMC PMU units.

This patch adds a set of generic imc pmu related event functions to be
used by each imc pmu unit. Add code to setup format attribute and to
register imc pmus. Add a event_init function for nest_imc events.

Signed-off-by: Anju T Sudhakar <[email protected]>
Signed-off-by: Hemant Kumar <[email protected]>
Signed-off-by: Madhavan Srinivasan <[email protected]>
---
arch/powerpc/include/asm/imc-pmu.h | 1 +
arch/powerpc/perf/imc-pmu.c | 137 ++++++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/opal-imc.c | 30 ++++++-
3 files changed, 164 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index a3d4f1bf9492..e13f51047522 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -65,4 +65,5 @@ struct imc_pmu {
#define IMC_DOMAIN_NEST 1
#define IMC_DOMAIN_UNKNOWN -1

+int imc_get_domain(struct device_node *pmu_dev);
#endif /* PPC_POWERNV_IMC_PMU_DEF_H */
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index ec464c76b749..bd5bf66bd920 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -18,6 +18,132 @@
struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];

+/* Needed for sanity check */
+extern u64 nest_max_offset;
+
+PMU_FORMAT_ATTR(event, "config:0-20");
+static struct attribute *imc_format_attrs[] = {
+ &format_attr_event.attr,
+ NULL,
+};
+
+static struct attribute_group imc_format_group = {
+ .name = "format",
+ .attrs = imc_format_attrs,
+};
+
+static int nest_imc_event_init(struct perf_event *event)
+{
+ int chip_id;
+ u32 config = event->attr.config;
+ struct perchip_nest_info *pcni;
+
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ /* Sampling not supported */
+ if (event->hw.sample_period)
+ return -EINVAL;
+
+ /* unsupported modes and filters */
+ if (event->attr.exclude_user ||
+ event->attr.exclude_kernel ||
+ event->attr.exclude_hv ||
+ event->attr.exclude_idle ||
+ event->attr.exclude_host ||
+ event->attr.exclude_guest)
+ return -EINVAL;
+
+ if (event->cpu < 0)
+ return -EINVAL;
+
+ /* Sanity check for config (event offset) */
+ if (config > nest_max_offset)
+ return -EINVAL;
+
+ chip_id = topology_physical_package_id(event->cpu);
+ pcni = &nest_perchip_info[chip_id];
+
+ /*
+ * Memory for Nest HW counter data could be in multiple pages.
+ * Hence check and pick the right base page from the event offset,
+ * and then add to it.
+ */
+ event->hw.event_base = pcni->vbase[config/PAGE_SIZE] +
+ (config & ~PAGE_MASK);
+
+ return 0;
+}
+
+static void imc_read_counter(struct perf_event *event)
+{
+ u64 *addr, data;
+
+ addr = (u64 *)event->hw.event_base;
+ data = __be64_to_cpu(READ_ONCE(*addr));
+ local64_set(&event->hw.prev_count, data);
+}
+
+static void imc_perf_event_update(struct perf_event *event)
+{
+ u64 counter_prev, counter_new, final_count, *addr;
+
+ addr = (u64 *)event->hw.event_base;
+ counter_prev = local64_read(&event->hw.prev_count);
+ counter_new = __be64_to_cpu(READ_ONCE(*addr));
+ final_count = counter_new - counter_prev;
+
+ local64_set(&event->hw.prev_count, counter_new);
+ local64_add(final_count, &event->count);
+}
+
+static void imc_event_start(struct perf_event *event, int flags)
+{
+ /*
+ * In Memory Counters are free flowing counters. HW or the microcode
+ * keeps adding to the counter offset in memory. To get event
+ * counter value, we snapshot the value here and we calculate
+ * delta at later point.
+ */
+ imc_read_counter(event);
+}
+
+static void imc_event_stop(struct perf_event *event, int flags)
+{
+ /*
+ * Take a snapshot and calculate the delta and update
+ * the event counter values.
+ */
+ imc_perf_event_update(event);
+}
+
+static int imc_event_add(struct perf_event *event, int flags)
+{
+ if (flags & PERF_EF_START)
+ imc_event_start(event, flags);
+
+ return 0;
+}
+
+/* update_pmu_ops : Populate the appropriate operations for "pmu" */
+static int update_pmu_ops(struct imc_pmu *pmu)
+{
+ if (!pmu)
+ return -EINVAL;
+
+ pmu->pmu.task_ctx_nr = perf_invalid_context;
+ pmu->pmu.event_init = nest_imc_event_init;
+ pmu->pmu.add = imc_event_add;
+ pmu->pmu.del = imc_event_stop;
+ pmu->pmu.start = imc_event_start;
+ pmu->pmu.stop = imc_event_stop;
+ pmu->pmu.read = imc_perf_event_update;
+ pmu->attr_groups[1] = &imc_format_group;
+ pmu->pmu.attr_groups = pmu->attr_groups;
+
+ return 0;
+}
+
/* dev_str_attr : Populate event "name" and string "str" in attribute */
static struct attribute *dev_str_attr(const char *name, const char *str)
{
@@ -84,6 +210,17 @@ int init_imc_pmu(struct imc_events *events, int idx,
if (ret)
goto err_free;

+ ret = update_pmu_ops(pmu_ptr);
+ if (ret)
+ goto err_free;
+
+ ret = perf_pmu_register(&pmu_ptr->pmu, pmu_ptr->pmu.name, -1);
+ if (ret)
+ goto err_free;
+
+ pr_info("%s performance monitor hardware support registered\n",
+ pmu_ptr->pmu.name);
+
return 0;

err_free:
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
index 73c46703c2af..a98678266b0d 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -37,6 +37,7 @@ extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];

extern int init_imc_pmu(struct imc_events *events,
int idx, struct imc_pmu *pmu_ptr);
+u64 nest_max_offset;

static int imc_event_info(char *name, struct imc_events *events)
{
@@ -69,8 +70,25 @@ static int imc_event_info_str(struct property *pp, char *name,
return 0;
}

+/*
+ * Updates the maximum offset for an event in the pmu with domain
+ * "pmu_domain". Right now, only nest domain is supported.
+ */
+static void update_max_value(u32 value, int pmu_domain)
+{
+ switch (pmu_domain) {
+ case IMC_DOMAIN_NEST:
+ if (nest_max_offset < value)
+ nest_max_offset = value;
+ break;
+ default:
+ /* Unknown domain, return */
+ return;
+ }
+}
+
static int imc_event_info_val(char *name, u32 val,
- struct imc_events *events)
+ struct imc_events *events, int pmu_domain)
{
int ret;

@@ -78,6 +96,7 @@ static int imc_event_info_val(char *name, u32 val,
if (ret)
return ret;
sprintf(events->ev_value, "event=0x%x", val);
+ update_max_value(val, pmu_domain);

return 0;
}
@@ -114,7 +133,8 @@ static int imc_events_node_parser(struct device_node *dev,
struct property *event_scale,
struct property *event_unit,
struct property *name_prefix,
- u32 reg)
+ u32 reg,
+ int pmu_domain)
{
struct property *name, *pp;
char *ev_name;
@@ -159,7 +179,8 @@ static int imc_events_node_parser(struct device_node *dev,
if (strncmp(pp->name, "reg", 3) == 0) {
of_property_read_u32(dev, pp->name, &val);
val += reg;
- ret = imc_event_info_val(ev_name, val, &events[idx]);
+ ret = imc_event_info_val(ev_name, val, &events[idx],
+ pmu_domain);
if (ret) {
kfree(events[idx].ev_name);
kfree(events[idx].ev_value);
@@ -366,7 +387,8 @@ static int imc_pmu_create(struct device_node *parent, int pmu_index)
/* Loop through event nodes */
for_each_child_of_node(dir, ev_node) {
ret = imc_events_node_parser(ev_node, &events[idx], scale_pp,
- unit_pp, name_prefix, reg);
+ unit_pp, name_prefix, reg,
+ pmu_ptr->domain);
if (ret < 0) {
/* Unable to parse this event */
if (ret == -ENOMEM)
--
2.7.4

2017-04-03 14:57:52

by Madhavan Srinivasan

[permalink] [raw]
Subject: [PATCH v6 03/11] powerpc/powernv: Detect supported IMC units and its events

From: Hemant Kumar <[email protected]>

Parse device tree to detect IMC units. Traverse through each IMC unit
node to find supported events and corresponding unit/scale files (if any).

Here is the DTS file for reference:

https://github.com/open-power/ima-catalog/blob/master/81E00612.4E0100.dts

The device tree for IMC counters starts at the node "imc-counters".
This node contains all the IMC PMU nodes and event nodes
for these IMC PMUs. The PMU nodes have an "events" property which has a
phandle value for the actual events node. The events are separated from
the PMU nodes to abstract out the common events. For example, PMU node
"mcs0", "mcs1" etc. will contain a pointer to "nest-mcs-events" since,
the events are common between these PMUs. These events have a different
prefix based on their relation to different PMUs, and hence, the PMU
nodes themselves contain an "events-prefix" property. The value for this
property concatenated to the event name, forms the actual event
name. Also, the PMU have a "reg" field as the base offset for the events
which belong to this PMU. This "reg" field is added to an event in the
"events" node, which gives us the location of the counter data. Kernel
code uses this offset as event configuration value.

Device tree parser code also looks for scale/unit property in the event
node and passes on the value as an event attr for perf interface to use
in the post processing by the perf tool. Some PMUs may have common scale
and unit properties which implies that all events supported by this PMU
inherit the scale and unit properties of the PMU itself. For those
events, we need to set the common unit and scale values.

For failure to initialize any unit or any event, disable that unit and
continue setting up the rest of them.

Signed-off-by: Hemant Kumar <[email protected]>
Signed-off-by: Anju T Sudhakar <[email protected]>
Signed-off-by: Madhavan Srinivasan <[email protected]>
---
arch/powerpc/platforms/powernv/opal-imc.c | 386 ++++++++++++++++++++++++++++++
1 file changed, 386 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
index c476d596c6a8..ba0203e669c0 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -33,6 +33,388 @@
#include <asm/imc-pmu.h>

struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
+struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+
+static int imc_event_info(char *name, struct imc_events *events)
+{
+ char *buf;
+
+ /* memory for content */
+ buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ events->ev_name = name;
+ events->ev_value = buf;
+ return 0;
+}
+
+static int imc_event_info_str(struct property *pp, char *name,
+ struct imc_events *events)
+{
+ int ret;
+
+ ret = imc_event_info(name, events);
+ if (ret)
+ return ret;
+
+ if (!pp->value || (strnlen(pp->value, pp->length) == pp->length) ||
+ (pp->length > IMC_MAX_PMU_NAME_LEN))
+ return -EINVAL;
+ strncpy(events->ev_value, (const char *)pp->value, pp->length);
+
+ return 0;
+}
+
+static int imc_event_info_val(char *name, u32 val,
+ struct imc_events *events)
+{
+ int ret;
+
+ ret = imc_event_info(name, events);
+ if (ret)
+ return ret;
+ sprintf(events->ev_value, "event=0x%x", val);
+
+ return 0;
+}
+
+static int set_event_property(struct property *pp, char *event_prop,
+ struct imc_events *events, char *ev_name)
+{
+ char *buf;
+ int ret;
+
+ buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ sprintf(buf, "%s.%s", ev_name, event_prop);
+ ret = imc_event_info_str(pp, buf, events);
+ if (ret) {
+ kfree(events->ev_name);
+ kfree(events->ev_value);
+ }
+
+ return ret;
+}
+
+/*
+ * imc_events_node_parser: Parse the event node "dev" and assign the parsed
+ * information to event "events".
+ *
+ * Parses the "reg" property of this event. "reg" gives us the event offset.
+ * Also, parse the "scale" and "unit" properties, if any.
+ */
+static int imc_events_node_parser(struct device_node *dev,
+ struct imc_events *events,
+ struct property *event_scale,
+ struct property *event_unit,
+ struct property *name_prefix,
+ u32 reg)
+{
+ struct property *name, *pp;
+ char *ev_name;
+ u32 val;
+ int idx = 0, ret;
+
+ if (!dev)
+ return -EINVAL;
+
+ /*
+ * Loop through each property of an event node
+ */
+ name = of_find_property(dev, "event-name", NULL);
+ if (!name)
+ return -ENODEV;
+
+ if (!name->value ||
+ (strnlen(name->value, name->length) == name->length) ||
+ (name->length > IMC_MAX_PMU_NAME_LEN))
+ return -EINVAL;
+
+ ev_name = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
+ if (!ev_name)
+ return -ENOMEM;
+
+ snprintf(ev_name, IMC_MAX_PMU_NAME_LEN, "%s%s",
+ (char *)name_prefix->value,
+ (char *)name->value);
+
+ /*
+ * Parse each property of this event node "dev". Property "reg" has
+ * the offset which is assigned to the event name. Other properties
+ * like "scale" and "unit" are assigned to event.scale and event.unit
+ * accordingly.
+ */
+ for_each_property_of_node(dev, pp) {
+ /*
+ * If there is an issue in parsing a single property of
+ * this event, we just clean up the buffers, but we still
+ * continue to parse.
+ */
+ if (strncmp(pp->name, "reg", 3) == 0) {
+ of_property_read_u32(dev, pp->name, &val);
+ val += reg;
+ ret = imc_event_info_val(ev_name, val, &events[idx]);
+ if (ret) {
+ kfree(events[idx].ev_name);
+ kfree(events[idx].ev_value);
+ continue;
+ }
+ /*
+ * If the common scale and unit properties available,
+ * then, assign them to this event
+ */
+ if (event_scale) {
+ idx++;
+ ret = set_event_property(event_scale, "scale",
+ &events[idx],
+ ev_name);
+ if (ret)
+ continue;
+ idx++;
+ }
+ if (event_unit) {
+ ret = set_event_property(event_unit, "unit",
+ &events[idx],
+ ev_name);
+ if (ret)
+ continue;
+ }
+ idx++;
+ } else if (strncmp(pp->name, "unit", 4) == 0) {
+ ret = set_event_property(pp, "unit", &events[idx],
+ ev_name);
+ if (ret)
+ continue;
+ idx++;
+ } else if (strncmp(pp->name, "scale", 5) == 0) {
+ ret = set_event_property(pp, "scale", &events[idx],
+ ev_name);
+ if (ret)
+ continue;
+ idx++;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * imc_get_domain : Returns the domain for pmu "pmu_dev".
+ */
+int imc_get_domain(struct device_node *pmu_dev)
+{
+ if (of_device_is_compatible(pmu_dev, IMC_DTB_NEST_COMPAT))
+ return IMC_DOMAIN_NEST;
+ else
+ return IMC_DOMAIN_UNKNOWN;
+}
+
+/*
+ * get_nr_children : Returns the number of children for a pmu device node.
+ */
+static int get_nr_children(struct device_node *pmu_node)
+{
+ struct device_node *child;
+ int i = 0;
+
+ for_each_child_of_node(pmu_node, child)
+ i++;
+ return i;
+}
+
+/*
+ * imc_free_events : Cleanup the "events" list having "nr_entries" entries.
+ */
+static void imc_free_events(struct imc_events *events, int nr_entries)
+{
+ int i;
+
+ /* Nothing to clean, return */
+ if (!events)
+ return;
+ for (i = 0; i < nr_entries; i++) {
+ kfree(events[i].ev_name);
+ kfree(events[i].ev_value);
+ }
+
+ kfree(events);
+}
+
+/*
+ * imc_pmu_create : Takes the parent device which is the pmu unit and a
+ * pmu_index as the inputs.
+ * Allocates memory for the pmu, sets up its domain (NEST or CORE), and
+ * allocates memory for the events supported by this pmu. Assigns a name for
+ * the pmu. Calls imc_events_node_parser() to setup the individual events.
+ * If everything goes fine, it calls, init_imc_pmu() to setup the pmu device
+ * and register it.
+ */
+static int imc_pmu_create(struct device_node *parent, int pmu_index)
+{
+ struct device_node *ev_node = NULL, *dir = NULL;
+ struct imc_events *events;
+ struct imc_pmu *pmu_ptr;
+ u32 prop, reg;
+ struct property *pp, *scale_pp, *unit_pp, *name_prefix;
+ char *buf;
+ int idx = 0, ret = 0, nr_children = 0;
+
+ if (!parent)
+ return -EINVAL;
+
+ /* memory for pmu */
+ pmu_ptr = kzalloc(sizeof(struct imc_pmu), GFP_KERNEL);
+ if (!pmu_ptr)
+ return -ENOMEM;
+
+ pmu_ptr->domain = imc_get_domain(parent);
+ if (pmu_ptr->domain == IMC_DOMAIN_UNKNOWN)
+ goto free_pmu;
+
+ /* Needed for hotplug/migration */
+ per_nest_pmu_arr[pmu_index] = pmu_ptr;
+
+ /*
+ * "events" property inside a PMU node contains the phandle value
+ * for the actual events node. The "events" node for the IMC PMU
+ * is not in this node, rather inside "imc-counters" node, since,
+ * we want to factor out the common events (thereby, reducing the
+ * size of the device tree)
+ */
+ of_property_read_u32(parent, "events", &prop);
+ if (!prop)
+ return -EINVAL;
+
+ /*
+ * Fetch the actual node where the events for this PMU exist.
+ */
+ dir = of_find_node_by_phandle(prop);
+ if (!dir)
+ return -EINVAL;
+
+ /*
+ * Get the maximum no. of events in this node.
+ * Multiply by 3 to account for .scale and .unit properties
+ * This number suggests the amount of memory needed to setup the
+ * events for this pmu.
+ */
+ nr_children = get_nr_children(dir) * 3;
+
+ /* memory for pmu events */
+ events = kzalloc((sizeof(struct imc_events) * nr_children),
+ GFP_KERNEL);
+ if (!events) {
+ ret = -ENOMEM;
+ goto free_pmu;
+ }
+
+ pp = of_find_property(parent, "name", NULL);
+ if (!pp) {
+ ret = -ENODEV;
+ goto free_events;
+ }
+
+ if (!pp->value ||
+ (strnlen(pp->value, pp->length) == pp->length) ||
+ (pp->length > IMC_MAX_PMU_NAME_LEN)) {
+ ret = -EINVAL;
+ goto free_events;
+ }
+
+ buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
+ if (!buf) {
+ ret = -ENOMEM;
+ goto free_events;
+ }
+
+ /* Save the name to register it later */
+ sprintf(buf, "nest_%s", (char *)pp->value);
+ pmu_ptr->pmu.name = (char *)buf;
+
+ /*
+ * Check if there is a common "scale" and "unit" properties inside
+ * the PMU node for all the events supported by this PMU.
+ */
+ scale_pp = of_find_property(parent, "scale", NULL);
+ unit_pp = of_find_property(parent, "unit", NULL);
+
+ /*
+ * Get the event-prefix property from the PMU node
+ * which needs to be attached with the event names.
+ */
+ name_prefix = of_find_property(parent, "events-prefix", NULL);
+ if (!name_prefix)
+ return -ENODEV;
+
+ /*
+ * "reg" property gives out the base offset of the counters data
+ * for this PMU.
+ */
+ of_property_read_u32(parent, "reg", &reg);
+
+ if (!name_prefix->value ||
+ (strnlen(name_prefix->value, name_prefix->length) == name_prefix->length) ||
+ (name_prefix->length > IMC_MAX_PMU_NAME_LEN))
+ return -EINVAL;
+
+ /* Loop through event nodes */
+ for_each_child_of_node(dir, ev_node) {
+ ret = imc_events_node_parser(ev_node, &events[idx], scale_pp,
+ unit_pp, name_prefix, reg);
+ if (ret < 0) {
+ /* Unable to parse this event */
+ if (ret == -ENOMEM)
+ goto free_events;
+ continue;
+ }
+
+ /*
+ * imc_event_node_parser will return number of
+ * event entries created for this. This could include
+ * event scale and unit files also.
+ */
+ idx += ret;
+ }
+
+ return 0;
+
+free_events:
+ imc_free_events(events, idx);
+free_pmu:
+ kfree(pmu_ptr);
+ return ret;
+}
+
+/*
+ * imc_pmu_setup : Setup the IMC PMUs (children of "parent").
+ */
+static void imc_pmu_setup(struct device_node *parent)
+{
+ struct device_node *child;
+ int pmu_count = 0, rc = 0;
+ const struct property *pp;
+
+ if (!parent)
+ return;
+
+ /* Setup all the IMC pmus */
+ for_each_child_of_node(parent, child) {
+ pp = of_get_property(child, "compatible", NULL);
+ if (pp) {
+ /*
+ * If there is a node with a "compatible" field,
+ * that's a PMU node
+ */
+ rc = imc_pmu_create(child, pmu_count);
+ if (rc)
+ return;
+ pmu_count++;
+ }
+ }
+}

static int opal_imc_counters_probe(struct platform_device *pdev)
{
@@ -102,6 +484,10 @@ static int opal_imc_counters_probe(struct platform_device *pdev)
} while (i < (pcni->size / PAGE_SIZE));
}

+#ifdef CONFIG_PERF_EVENTS
+ imc_pmu_setup(imc_dev);
+#endif
+
return 0;
err:
return -ENODEV;
--
2.7.4

2017-04-03 14:57:50

by Madhavan Srinivasan

[permalink] [raw]
Subject: [PATCH v6 04/11] powerpc/perf: Add event attribute and group to IMC pmus

From: Hemant Kumar <[email protected]>

Device tree IMC driver code parses the IMC units and their events. It
passes the information to IMC pmu code which is placed in powerpc/perf
as "imc-pmu.c".

This patch creates only event attributes and attribute groups for the
IMC pmus.

Signed-off-by: Anju T Sudhakar <[email protected]>
Signed-off-by: Hemant Kumar <[email protected]>
Signed-off-by: Madhavan Srinivasan <[email protected]>
---
arch/powerpc/perf/Makefile | 6 +-
arch/powerpc/perf/imc-pmu.c | 97 +++++++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/opal-imc.c | 12 +++-
3 files changed, 112 insertions(+), 3 deletions(-)
create mode 100644 arch/powerpc/perf/imc-pmu.c

diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 4d606b99a5cb..d0d1f04203c7 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -2,10 +2,14 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror

obj-$(CONFIG_PERF_EVENTS) += callchain.o perf_regs.o

+imc-$(CONFIG_PPC_POWERNV) += imc-pmu.o
+
obj-$(CONFIG_PPC_PERF_CTRS) += core-book3s.o bhrb.o
obj64-$(CONFIG_PPC_PERF_CTRS) += power4-pmu.o ppc970-pmu.o power5-pmu.o \
power5+-pmu.o power6-pmu.o power7-pmu.o \
- isa207-common.o power8-pmu.o power9-pmu.o
+ isa207-common.o power8-pmu.o power9-pmu.o \
+ $(imc-y)
+
obj32-$(CONFIG_PPC_PERF_CTRS) += mpc7450-pmu.o

obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
new file mode 100644
index 000000000000..ec464c76b749
--- /dev/null
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -0,0 +1,97 @@
+/*
+ * Nest Performance Monitor counter support.
+ *
+ * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
+ * (C) 2016 Hemant K Shaw, IBM Corporation.
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+#include <linux/perf_event.h>
+#include <linux/slab.h>
+#include <asm/opal.h>
+#include <asm/imc-pmu.h>
+#include <asm/cputhreads.h>
+#include <asm/smp.h>
+#include <linux/string.h>
+
+struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
+struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+
+/* dev_str_attr : Populate event "name" and string "str" in attribute */
+static struct attribute *dev_str_attr(const char *name, const char *str)
+{
+ struct perf_pmu_events_attr *attr;
+
+ attr = kzalloc(sizeof(*attr), GFP_KERNEL);
+
+ sysfs_attr_init(&attr->attr.attr);
+
+ attr->event_str = str;
+ attr->attr.attr.name = name;
+ attr->attr.attr.mode = 0444;
+ attr->attr.show = perf_event_sysfs_show;
+
+ return &attr->attr.attr;
+}
+
+/*
+ * update_events_in_group: Update the "events" information in an attr_group
+ * and assign the attr_group to the pmu "pmu".
+ */
+static int update_events_in_group(struct imc_events *events,
+ int idx, struct imc_pmu *pmu)
+{
+ struct attribute_group *attr_group;
+ struct attribute **attrs;
+ int i;
+
+ /* Allocate memory for attribute group */
+ attr_group = kzalloc(sizeof(*attr_group), GFP_KERNEL);
+ if (!attr_group)
+ return -ENOMEM;
+
+ /* Allocate memory for attributes */
+ attrs = kzalloc((sizeof(struct attribute *) * (idx + 1)), GFP_KERNEL);
+ if (!attrs) {
+ kfree(attr_group);
+ return -ENOMEM;
+ }
+
+ attr_group->name = "events";
+ attr_group->attrs = attrs;
+ for (i = 0; i < idx; i++, events++) {
+ attrs[i] = dev_str_attr((char *)events->ev_name,
+ (char *)events->ev_value);
+ }
+
+ pmu->attr_groups[0] = attr_group;
+ return 0;
+}
+
+/*
+ * init_imc_pmu : Setup the IMC pmu device in "pmu_ptr" and its events
+ * "events".
+ * Setup the cpu mask information for these pmus and setup the state machine
+ * hotplug notifiers as well.
+ */
+int init_imc_pmu(struct imc_events *events, int idx,
+ struct imc_pmu *pmu_ptr)
+{
+ int ret = -ENODEV;
+
+ ret = update_events_in_group(events, idx, pmu_ptr);
+ if (ret)
+ goto err_free;
+
+ return 0;
+
+err_free:
+ /* Only free the attr_groups which are dynamically allocated */
+ if (pmu_ptr->attr_groups[0]) {
+ kfree(pmu_ptr->attr_groups[0]->attrs);
+ kfree(pmu_ptr->attr_groups[0]);
+ }
+
+ return ret;
+}
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
index ba0203e669c0..73c46703c2af 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -32,8 +32,11 @@
#include <asm/cputable.h>
#include <asm/imc-pmu.h>

-struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
-struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+extern struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
+extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+
+extern int init_imc_pmu(struct imc_events *events,
+ int idx, struct imc_pmu *pmu_ptr);

static int imc_event_info(char *name, struct imc_events *events)
{
@@ -379,6 +382,11 @@ static int imc_pmu_create(struct device_node *parent, int pmu_index)
idx += ret;
}

+ ret = init_imc_pmu(events, idx, pmu_ptr);
+ if (ret) {
+ pr_err("IMC PMU %s Register failed\n", pmu_ptr->pmu.name);
+ goto free_events;
+ }
return 0;

free_events:
--
2.7.4

2017-04-03 14:57:47

by Madhavan Srinivasan

[permalink] [raw]
Subject: [PATCH v6 06/11] powerpc/perf: IMC pmu cpumask and cpu hotplug support

From: Hemant Kumar <[email protected]>

Adds cpumask attribute to be used by each IMC pmu. Only one cpu (any
online CPU) from each chip for nest PMUs is designated to read counters.

On CPU hotplug, dying CPU is checked to see whether it is one of the
designated cpus, if yes, next online cpu from the same chip (for nest
units) is designated as new cpu to read counters. For this purpose, we
introduce a new state : CPUHP_AP_PERF_POWERPC_NEST_ONLINE.

Signed-off-by: Anju T Sudhakar <[email protected]>
Signed-off-by: Hemant Kumar <[email protected]>
Signed-off-by: Madhavan Srinivasan <[email protected]>
---
arch/powerpc/include/asm/opal-api.h | 13 ++-
arch/powerpc/include/asm/opal.h | 3 +
arch/powerpc/perf/imc-pmu.c | 155 ++++++++++++++++++++++++-
arch/powerpc/platforms/powernv/opal-wrappers.S | 1 +
include/linux/cpuhotplug.h | 1 +
5 files changed, 171 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index a0aa285869b5..23fc51e9d71d 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -168,7 +168,8 @@
#define OPAL_INT_SET_MFRR 125
#define OPAL_PCI_TCE_KILL 126
#define OPAL_NMMU_SET_PTCR 127
-#define OPAL_LAST 127
+#define OPAL_NEST_IMC_COUNTERS_CONTROL 149
+#define OPAL_LAST 149

/* Device tree flags */

@@ -928,6 +929,16 @@ enum {
OPAL_PCI_TCE_KILL_ALL,
};

+/* Operation argument to OPAL_NEST_IMC_COUNTERS_CONTROL */
+enum {
+ OPAL_NEST_IMC_PRODUCTION_MODE = 1,
+};
+
+enum {
+ OPAL_NEST_IMC_STOP,
+ OPAL_NEST_IMC_START,
+};
+
#endif /* __ASSEMBLY__ */

#endif /* __OPAL_API_H */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 1ff03a6da76e..d93d08204243 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -227,6 +227,9 @@ int64_t opal_pci_tce_kill(uint64_t phb_id, uint32_t kill_type,
uint64_t dma_addr, uint32_t npages);
int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr);

+int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1,
+ uint64_t value2, uint64_t value3);
+
/* Internal functions */
extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
int depth, void *data);
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index bd5bf66bd920..b28835b861b3 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -17,6 +17,7 @@

struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+static cpumask_t nest_imc_cpumask;

/* Needed for sanity check */
extern u64 nest_max_offset;
@@ -32,6 +33,152 @@ static struct attribute_group imc_format_group = {
.attrs = imc_format_attrs,
};

+/* Get the cpumask printed to a buffer "buf" */
+static ssize_t imc_pmu_cpumask_get_attr(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ cpumask_t *active_mask;
+
+ active_mask = &nest_imc_cpumask;
+ return cpumap_print_to_pagebuf(true, buf, active_mask);
+}
+
+static DEVICE_ATTR(cpumask, S_IRUGO, imc_pmu_cpumask_get_attr, NULL);
+
+static struct attribute *imc_pmu_cpumask_attrs[] = {
+ &dev_attr_cpumask.attr,
+ NULL,
+};
+
+static struct attribute_group imc_pmu_cpumask_attr_group = {
+ .attrs = imc_pmu_cpumask_attrs,
+};
+
+/*
+ * nest_init : Initializes the nest imc engine for the current chip.
+ */
+static void nest_init(int *loc)
+{
+ int rc;
+
+ rc = opal_nest_imc_counters_control(OPAL_NEST_IMC_PRODUCTION_MODE,
+ OPAL_NEST_IMC_START, 0, 0);
+ if (rc)
+ loc[smp_processor_id()] = 1;
+}
+
+static void nest_change_cpu_context(int old_cpu, int new_cpu)
+{
+ int i;
+
+ for (i = 0;
+ (per_nest_pmu_arr[i] != NULL) && (i < IMC_MAX_PMUS); i++)
+ perf_pmu_migrate_context(&per_nest_pmu_arr[i]->pmu,
+ old_cpu, new_cpu);
+}
+
+static int ppc_nest_imc_cpu_online(unsigned int cpu)
+{
+ int nid;
+ const struct cpumask *l_cpumask;
+ struct cpumask tmp_mask;
+
+ /* Find the cpumask of this node */
+ nid = cpu_to_node(cpu);
+ l_cpumask = cpumask_of_node(nid);
+
+ /*
+ * If any of the cpu from this node is already present in the mask,
+ * just return, if not, then set this cpu in the mask.
+ */
+ if (!cpumask_and(&tmp_mask, l_cpumask, &nest_imc_cpumask)) {
+ cpumask_set_cpu(cpu, &nest_imc_cpumask);
+ return 0;
+ }
+
+ return 0;
+}
+
+static int ppc_nest_imc_cpu_offline(unsigned int cpu)
+{
+ int nid, target = -1;
+ const struct cpumask *l_cpumask;
+
+ /*
+ * Check in the designated list for this cpu. Dont bother
+ * if not one of them.
+ */
+ if (!cpumask_test_and_clear_cpu(cpu, &nest_imc_cpumask))
+ return 0;
+
+ /*
+ * Now that this cpu is one of the designated,
+ * find a next cpu a) which is online and b) in same chip.
+ */
+ nid = cpu_to_node(cpu);
+ l_cpumask = cpumask_of_node(nid);
+ target = cpumask_next(cpu, l_cpumask);
+
+ /*
+ * Update the cpumask with the target cpu and
+ * migrate the context if needed
+ */
+ if (target >= 0 && target < nr_cpu_ids) {
+ cpumask_set_cpu(target, &nest_imc_cpumask);
+ nest_change_cpu_context(cpu, target);
+ }
+ return 0;
+}
+
+static int nest_pmu_cpumask_init(void)
+{
+ const struct cpumask *l_cpumask;
+ int cpu, nid;
+ int *cpus_opal_rc;
+
+ if (!cpumask_empty(&nest_imc_cpumask))
+ return 0;
+
+ /*
+ * Memory for OPAL call return value.
+ */
+ cpus_opal_rc = kzalloc((sizeof(int) * nr_cpu_ids), GFP_KERNEL);
+ if (!cpus_opal_rc)
+ goto fail;
+
+ /*
+ * Nest PMUs are per-chip counters. So designate a cpu
+ * from each chip for counter collection.
+ */
+ for_each_online_node(nid) {
+ l_cpumask = cpumask_of_node(nid);
+
+ /* designate first online cpu in this node */
+ cpu = cpumask_first(l_cpumask);
+ cpumask_set_cpu(cpu, &nest_imc_cpumask);
+ }
+
+ /* Initialize Nest PMUs in each node using designated cpus */
+ on_each_cpu_mask(&nest_imc_cpumask, (smp_call_func_t)nest_init,
+ (void *)cpus_opal_rc, 1);
+
+ /* Check return value array for any OPAL call failure */
+ for_each_cpu(cpu, &nest_imc_cpumask) {
+ if (cpus_opal_rc[cpu])
+ goto fail;
+ }
+
+ cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_NEST_ONLINE,
+ "POWER_NEST_IMC_ONLINE",
+ ppc_nest_imc_cpu_online,
+ ppc_nest_imc_cpu_offline);
+
+ return 0;
+
+fail:
+ return -ENODEV;
+}
+
static int nest_imc_event_init(struct perf_event *event)
{
int chip_id;
@@ -70,7 +217,7 @@ static int nest_imc_event_init(struct perf_event *event)
* and then add to it.
*/
event->hw.event_base = pcni->vbase[config/PAGE_SIZE] +
- (config & ~PAGE_MASK);
+ (config & ~PAGE_MASK);

return 0;
}
@@ -139,6 +286,7 @@ static int update_pmu_ops(struct imc_pmu *pmu)
pmu->pmu.stop = imc_event_stop;
pmu->pmu.read = imc_perf_event_update;
pmu->attr_groups[1] = &imc_format_group;
+ pmu->attr_groups[2] = &imc_pmu_cpumask_attr_group;
pmu->pmu.attr_groups = pmu->attr_groups;

return 0;
@@ -206,6 +354,11 @@ int init_imc_pmu(struct imc_events *events, int idx,
{
int ret = -ENODEV;

+ /* Add cpumask and register for hotplug notification */
+ ret = nest_pmu_cpumask_init();
+ if (ret)
+ return ret;
+
ret = update_events_in_group(events, idx, pmu_ptr);
if (ret)
goto err_free;
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index da8a0f7a035c..b7208d8e6cc0 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -301,3 +301,4 @@ OPAL_CALL(opal_int_eoi, OPAL_INT_EOI);
OPAL_CALL(opal_int_set_mfrr, OPAL_INT_SET_MFRR);
OPAL_CALL(opal_pci_tce_kill, OPAL_PCI_TCE_KILL);
OPAL_CALL(opal_nmmu_set_ptcr, OPAL_NMMU_SET_PTCR);
+OPAL_CALL(opal_nest_imc_counters_control, OPAL_NEST_IMC_COUNTERS_CONTROL);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 62d240e962f0..cfb0cedc72af 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -136,6 +136,7 @@ enum cpuhp_state {
CPUHP_AP_PERF_ARM_CCI_ONLINE,
CPUHP_AP_PERF_ARM_CCN_ONLINE,
CPUHP_AP_PERF_ARM_L2X0_ONLINE,
+ CPUHP_AP_PERF_POWERPC_NEST_ONLINE,
CPUHP_AP_PERF_ARM_QCOM_L2_ONLINE,
CPUHP_AP_WORKQUEUE_ONLINE,
CPUHP_AP_RCUTREE_ONLINE,
--
2.7.4

2017-04-03 14:57:46

by Madhavan Srinivasan

[permalink] [raw]
Subject: [PATCH v6 07/11] powerpc/powernv: Core IMC events detection

From: Hemant Kumar <[email protected]>

This patch adds support for detection of core IMC events along with the
Nest IMC events. It adds a new domain IMC_DOMAIN_CORE and its determined
with the help of the compatibility string "ibm,imc-counters-core" based
on the IMC device tree.

Signed-off-by: Anju T Sudhakar <[email protected]>
Signed-off-by: Hemant Kumar <[email protected]>
Signed-off-by: Madhavan Srinivasan <[email protected]>
---
arch/powerpc/include/asm/imc-pmu.h | 2 ++
arch/powerpc/perf/imc-pmu.c | 3 +++
arch/powerpc/platforms/powernv/opal-imc.c | 18 ++++++++++++++++--
3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index e13f51047522..2c39603ff3e7 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -26,6 +26,7 @@

#define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
#define IMC_DTB_NEST_COMPAT "ibm,imc-counters-nest"
+#define IMC_DTB_CORE_COMPAT "ibm,imc-counters-core"

/*
* Structure to hold per chip specific memory address
@@ -63,6 +64,7 @@ struct imc_pmu {
* Domains for IMC PMUs
*/
#define IMC_DOMAIN_NEST 1
+#define IMC_DOMAIN_CORE 2
#define IMC_DOMAIN_UNKNOWN -1

int imc_get_domain(struct device_node *pmu_dev);
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index b28835b861b3..728c67e139e0 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -19,8 +19,11 @@ struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
static cpumask_t nest_imc_cpumask;

+struct imc_pmu *core_imc_pmu;
+
/* Needed for sanity check */
extern u64 nest_max_offset;
+extern u64 core_max_offset;

PMU_FORMAT_ATTR(event, "config:0-20");
static struct attribute *imc_format_attrs[] = {
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
index a98678266b0d..f6f63399ab06 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -34,10 +34,12 @@

extern struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+extern struct imc_pmu *core_imc_pmu;

extern int init_imc_pmu(struct imc_events *events,
int idx, struct imc_pmu *pmu_ptr);
u64 nest_max_offset;
+u64 core_max_offset;

static int imc_event_info(char *name, struct imc_events *events)
{
@@ -81,6 +83,10 @@ static void update_max_value(u32 value, int pmu_domain)
if (nest_max_offset < value)
nest_max_offset = value;
break;
+ case IMC_DOMAIN_CORE:
+ if (core_max_offset < value)
+ core_max_offset = value;
+ break;
default:
/* Unknown domain, return */
return;
@@ -232,6 +238,8 @@ int imc_get_domain(struct device_node *pmu_dev)
{
if (of_device_is_compatible(pmu_dev, IMC_DTB_NEST_COMPAT))
return IMC_DOMAIN_NEST;
+ if (of_device_is_compatible(pmu_dev, IMC_DTB_CORE_COMPAT))
+ return IMC_DOMAIN_CORE;
else
return IMC_DOMAIN_UNKNOWN;
}
@@ -299,7 +307,10 @@ static int imc_pmu_create(struct device_node *parent, int pmu_index)
goto free_pmu;

/* Needed for hotplug/migration */
- per_nest_pmu_arr[pmu_index] = pmu_ptr;
+ if (pmu_ptr->domain == IMC_DOMAIN_CORE)
+ core_imc_pmu = pmu_ptr;
+ else if (pmu_ptr->domain == IMC_DOMAIN_NEST)
+ per_nest_pmu_arr[pmu_index] = pmu_ptr;

/*
* "events" property inside a PMU node contains the phandle value
@@ -355,7 +366,10 @@ static int imc_pmu_create(struct device_node *parent, int pmu_index)
}

/* Save the name to register it later */
- sprintf(buf, "nest_%s", (char *)pp->value);
+ if (pmu_ptr->domain == IMC_DOMAIN_NEST)
+ sprintf(buf, "nest_%s", (char *)pp->value);
+ else
+ sprintf(buf, "%s_imc", (char *)pp->value);
pmu_ptr->pmu.name = (char *)buf;

/*
--
2.7.4

2017-04-03 14:57:44

by Madhavan Srinivasan

[permalink] [raw]
Subject: [PATCH v6 08/11] powerpc/perf: PMU functions for Core IMC and hotplugging

From: Hemant Kumar <[email protected]>

This patch adds the PMU function to initialize a core IMC event. It also
adds cpumask initialization function for core IMC PMU. For
initialization, a 8KB of memory is allocated per core where the data
for core IMC counters will be accumulated. The base address for this
page is sent to OPAL via an OPAL call which initializes various SCOMs
related to Core IMC initialization. Upon any errors, the pages are
free'ed and core IMC counters are disabled using the same OPAL call.

For CPU hotplugging, a cpumask is initialized which contains an online
CPU from each core. If a cpu goes offline, we check whether that cpu
belongs to the core imc cpumask, if yes, then, we migrate the PMU
context to any other online cpu (if available) in that core. If a cpu
comes back online, then this cpu will be added to the core imc cpumask
only if there was no other cpu from that core in the previous cpumask.

To register the hotplug functions for core_imc, a new state
CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE is added to the list of existing
states.

Patch also adds OPAL device shutdown callback. Needed to disable the
IMC core engine to handle kexec.

Signed-off-by: Hemant Kumar <[email protected]>
[Anju: Changed the condition for setting cpumask for core
in imc_pmu_cpumask_get_attr() ]
Signed-off-by: Anju T Sudhakar <[email protected]>

Signed-off-by: Madhavan Srinivasan <[email protected]>
---
arch/powerpc/include/asm/imc-pmu.h | 3 +
arch/powerpc/include/asm/opal-api.h | 10 +-
arch/powerpc/include/asm/opal.h | 2 +
arch/powerpc/perf/imc-pmu.c | 267 ++++++++++++++++++++++++-
arch/powerpc/platforms/powernv/opal-imc.c | 13 +-
arch/powerpc/platforms/powernv/opal-wrappers.S | 1 +
include/linux/cpuhotplug.h | 1 +
7 files changed, 287 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index 2c39603ff3e7..4aa63191456a 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -21,8 +21,10 @@
#define IMC_MAX_CHIPS 32
#define IMC_MAX_PMUS 32
#define IMC_MAX_PMU_NAME_LEN 256
+#define IMC_MAX_CORES 32

#define IMC_NEST_MAX_PAGES 16
+#define IMC_CORE_COUNTER_MEM 8192

#define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
#define IMC_DTB_NEST_COMPAT "ibm,imc-counters-nest"
@@ -68,4 +70,5 @@ struct imc_pmu {
#define IMC_DOMAIN_UNKNOWN -1

int imc_get_domain(struct device_node *pmu_dev);
+void core_imc_disable(void);
#endif /* PPC_POWERNV_IMC_PMU_DEF_H */
diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 23fc51e9d71d..971918deb793 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -169,7 +169,8 @@
#define OPAL_PCI_TCE_KILL 126
#define OPAL_NMMU_SET_PTCR 127
#define OPAL_NEST_IMC_COUNTERS_CONTROL 149
-#define OPAL_LAST 149
+#define OPAL_CORE_IMC_COUNTERS_CONTROL 150
+#define OPAL_LAST 150

/* Device tree flags */

@@ -939,6 +940,13 @@ enum {
OPAL_NEST_IMC_START,
};

+/* Operation argument to Core IMC */
+enum {
+ OPAL_CORE_IMC_DISABLE,
+ OPAL_CORE_IMC_ENABLE,
+ OPAL_CORE_IMC_INIT,
+};
+
#endif /* __ASSEMBLY__ */

#endif /* __OPAL_API_H */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index d93d08204243..c4baa6d32037 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -229,6 +229,8 @@ int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr);

int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1,
uint64_t value2, uint64_t value3);
+int64_t opal_core_imc_counters_control(uint64_t operation, uint64_t addr,
+ uint64_t value2, uint64_t value3);

/* Internal functions */
extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 728c67e139e0..45f9b35142a7 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1,5 +1,5 @@
/*
- * Nest Performance Monitor counter support.
+ * IMC Performance Monitor counter support.
*
* Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
* (C) 2016 Hemant K Shaw, IBM Corporation.
@@ -19,6 +19,15 @@ struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
static cpumask_t nest_imc_cpumask;

+/*
+ * Maintains base addresses for all the cores.
+ * MAX chip and core are defined as 32. So we
+ * statically allocate 8K for this structure.
+ *
+ * TODO -- Could be made dynamic
+ */
+static u64 per_core_pdbar_add[IMC_MAX_CHIPS][IMC_MAX_CORES];
+static cpumask_t core_imc_cpumask;
struct imc_pmu *core_imc_pmu;

/* Needed for sanity check */
@@ -38,11 +47,18 @@ static struct attribute_group imc_format_group = {

/* Get the cpumask printed to a buffer "buf" */
static ssize_t imc_pmu_cpumask_get_attr(struct device *dev,
- struct device_attribute *attr, char *buf)
+ struct device_attribute *attr,
+ char *buf)
{
+ struct pmu *pmu = dev_get_drvdata(dev);
cpumask_t *active_mask;

- active_mask = &nest_imc_cpumask;
+ if (!strncmp(pmu->name, "nest_", strlen("nest_")))
+ active_mask = &nest_imc_cpumask;
+ else if (!strncmp(pmu->name, "core_", strlen("core_")))
+ active_mask = &core_imc_cpumask;
+ else
+ return 0;
return cpumap_print_to_pagebuf(true, buf, active_mask);
}

@@ -58,6 +74,101 @@ static struct attribute_group imc_pmu_cpumask_attr_group = {
};

/*
+ * core_imc_mem_init : Initializes memory for the current core.
+ *
+ * Uses alloc_pages_exact_nid() and uses the returned address as an argument to
+ * an opal call to configure the pdbar. The address sent as an argument is
+ * converted to physical address before the opal call is made. This is the
+ * base address at which the core imc counters are populated.
+ */
+static int core_imc_mem_init(void)
+{
+ int core_id, phys_id;
+ int rc = -1;
+
+ phys_id = topology_physical_package_id(smp_processor_id());
+ core_id = smp_processor_id() / threads_per_core;
+
+ per_core_pdbar_add[phys_id][core_id] = (u64) alloc_pages_exact_nid(phys_id,
+ (size_t) IMC_CORE_COUNTER_MEM, GFP_KERNEL | __GFP_ZERO);
+ rc = opal_core_imc_counters_control(OPAL_CORE_IMC_INIT,
+ (u64)virt_to_phys((void *)per_core_pdbar_add[phys_id][core_id]),
+ 0, 0);
+
+ return rc;
+}
+
+/*
+ * Calls core_imc_mem_init and checks the return value.
+ */
+static void core_imc_init(int *loc)
+{
+ int rc = 0;
+
+ rc = core_imc_mem_init();
+ if (rc)
+ loc[smp_processor_id()] = 1;
+}
+
+static void core_imc_change_cpu_context(int old_cpu, int new_cpu)
+{
+ /* Sanity check before we migrate */
+ if (!core_imc_pmu)
+ return;
+
+ perf_pmu_migrate_context(&core_imc_pmu->pmu, old_cpu, new_cpu);
+}
+
+
+static int ppc_core_imc_cpu_online(unsigned int cpu)
+{
+ int ret;
+
+ /* If a cpu for this core is already set, then, don't do anything */
+ ret = cpumask_any_and(&core_imc_cpumask,
+ cpu_sibling_mask(cpu));
+ if (ret < nr_cpu_ids)
+ return 0;
+
+ /*
+ * else, first cpu in this core, so set the cpu in the mask
+ * and enable Core imc.
+ */
+ cpumask_set_cpu(cpu, &core_imc_cpumask);
+ opal_core_imc_counters_control(OPAL_CORE_IMC_ENABLE, 0, 0, 0);
+ return 0;
+}
+
+static int ppc_core_imc_cpu_offline(unsigned int cpu)
+{
+ int target;
+ unsigned int ncpu;
+
+ /*
+ * clear this cpu out of the mask, if not present in the mask,
+ * don't bother doing anything.
+ */
+ if (!cpumask_test_and_clear_cpu(cpu, &core_imc_cpumask))
+ return 0;
+
+ /* Find any online cpu in that core except the current "cpu" */
+ ncpu = cpumask_any_but(cpu_sibling_mask(cpu), cpu);
+
+ if (ncpu < nr_cpu_ids) {
+ target = ncpu;
+ cpumask_set_cpu(target, &core_imc_cpumask);
+
+ /* migrate the context */
+ core_imc_change_cpu_context(cpu, target);
+ } else {
+ /* No online cpus in this core, disable core imc and return */
+ opal_core_imc_counters_control(OPAL_CORE_IMC_DISABLE, 0, 0, 0);
+ }
+
+ return 0;
+}
+
+/*
* nest_init : Initializes the nest imc engine for the current chip.
*/
static void nest_init(int *loc)
@@ -182,6 +293,92 @@ static int nest_pmu_cpumask_init(void)
return -ENODEV;
}

+static void cleanup_core_imc_memory(void)
+{
+ int phys_id, core_id;
+ u64 addr;
+
+ phys_id = topology_physical_package_id(smp_processor_id());
+ core_id = smp_processor_id() / threads_per_core;
+
+ addr = per_core_pdbar_add[phys_id][core_id];
+
+ /* Only if the address is non-zero shall, we free it */
+ if (addr)
+ free_pages(addr, 0);
+}
+
+static void cleanup_all_core_imc_memory(void)
+{
+ on_each_cpu_mask(&core_imc_cpumask,
+ (smp_call_func_t)cleanup_core_imc_memory, NULL, 1);
+}
+
+/*
+ * Disabling of IMC Core Engine needs a scom operation
+ */
+static void core_imc_control_disable(void)
+{
+ opal_core_imc_counters_control(OPAL_CORE_IMC_DISABLE, 0, 0, 0);
+}
+
+/*
+ * Function to diable the IMC Core engine using core imc cpumask
+ */
+void core_imc_disable(void)
+{
+ on_each_cpu_mask(&core_imc_cpumask,
+ (smp_call_func_t)core_imc_control_disable, NULL, 1);
+}
+
+static int core_imc_pmu_cpumask_init(void)
+{
+ int cpu, *cpus_opal_rc;
+
+ /*
+ * Get the mask of first online cpus for every core.
+ */
+ core_imc_cpumask = cpu_online_cores_map();
+
+ /*
+ * Memory for OPAL call return value.
+ */
+ cpus_opal_rc = kzalloc((sizeof(int) * nr_cpu_ids), GFP_KERNEL);
+ if (!cpus_opal_rc)
+ goto fail;
+
+ /*
+ * Initialize the core IMC PMU on each core using the
+ * core_imc_cpumask by calling core_imc_init().
+ */
+ on_each_cpu_mask(&core_imc_cpumask, (smp_call_func_t)core_imc_init,
+ (void *)cpus_opal_rc, 1);
+
+ /* Check return value array for any OPAL call failure */
+ for_each_cpu(cpu, &core_imc_cpumask) {
+ if (cpus_opal_rc[cpu]) {
+ kfree(cpus_opal_rc);
+ goto fail;
+ }
+ }
+
+ kfree(cpus_opal_rc);
+
+ cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE,
+ "POWER_CORE_IMC_ONLINE",
+ ppc_core_imc_cpu_online,
+ ppc_core_imc_cpu_offline);
+
+ return 0;
+
+fail:
+ /* First, disable the core imc engine */
+ core_imc_disable();
+ /* Then, free up the allocated pages */
+ cleanup_all_core_imc_memory();
+ return -ENODEV;
+}
+
static int nest_imc_event_init(struct perf_event *event)
{
int chip_id;
@@ -225,6 +422,44 @@ static int nest_imc_event_init(struct perf_event *event)
return 0;
}

+static int core_imc_event_init(struct perf_event *event)
+{
+ int core_id, phys_id;
+ u64 config = event->attr.config;
+
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ /* Sampling not supported */
+ if (event->hw.sample_period)
+ return -EINVAL;
+
+ /* unsupported modes and filters */
+ if (event->attr.exclude_user ||
+ event->attr.exclude_kernel ||
+ event->attr.exclude_hv ||
+ event->attr.exclude_idle ||
+ event->attr.exclude_host ||
+ event->attr.exclude_guest)
+ return -EINVAL;
+
+ if (event->cpu < 0)
+ return -EINVAL;
+
+ event->hw.idx = -1;
+
+ /* Sanity check for config (event offset) */
+ if (config > core_max_offset)
+ return -EINVAL;
+
+ core_id = event->cpu / threads_per_core;
+ phys_id = topology_physical_package_id(event->cpu);
+ event->hw.event_base =
+ per_core_pdbar_add[phys_id][core_id] + config;
+
+ return 0;
+}
+
static void imc_read_counter(struct perf_event *event)
{
u64 *addr, data;
@@ -282,7 +517,11 @@ static int update_pmu_ops(struct imc_pmu *pmu)
return -EINVAL;

pmu->pmu.task_ctx_nr = perf_invalid_context;
- pmu->pmu.event_init = nest_imc_event_init;
+ if (pmu->domain == IMC_DOMAIN_NEST) {
+ pmu->pmu.event_init = nest_imc_event_init;
+ } else if (pmu->domain == IMC_DOMAIN_CORE) {
+ pmu->pmu.event_init = core_imc_event_init;
+ }
pmu->pmu.add = imc_event_add;
pmu->pmu.del = imc_event_stop;
pmu->pmu.start = imc_event_start;
@@ -358,9 +597,20 @@ int init_imc_pmu(struct imc_events *events, int idx,
int ret = -ENODEV;

/* Add cpumask and register for hotplug notification */
- ret = nest_pmu_cpumask_init();
- if (ret)
- return ret;
+ switch (pmu_ptr->domain) {
+ case IMC_DOMAIN_NEST:
+ ret = nest_pmu_cpumask_init();
+ if (ret)
+ return ret;
+ break;
+ case IMC_DOMAIN_CORE:
+ ret = core_imc_pmu_cpumask_init();
+ if (ret)
+ return ret;
+ break;
+ default:
+ return -1; /* Unknown domain */
+ }

ret = update_events_in_group(events, idx, pmu_ptr);
if (ret)
@@ -385,6 +635,9 @@ int init_imc_pmu(struct imc_events *events, int idx,
kfree(pmu_ptr->attr_groups[0]->attrs);
kfree(pmu_ptr->attr_groups[0]);
}
+ /* For core_imc, we have allocated memory, we need to free it */
+ if (pmu_ptr->domain == IMC_DOMAIN_CORE)
+ cleanup_all_core_imc_memory();

return ret;
}
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
index f6f63399ab06..f261fc933959 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -192,12 +192,12 @@ static int imc_events_node_parser(struct device_node *dev,
kfree(events[idx].ev_value);
continue;
}
+ idx++;
/*
* If the common scale and unit properties available,
* then, assign them to this event
*/
if (event_scale) {
- idx++;
ret = set_event_property(event_scale, "scale",
&events[idx],
ev_name);
@@ -211,8 +211,8 @@ static int imc_events_node_parser(struct device_node *dev,
ev_name);
if (ret)
continue;
+ idx++;
}
- idx++;
} else if (strncmp(pp->name, "unit", 4) == 0) {
ret = set_event_property(pp, "unit", &events[idx],
ev_name);
@@ -537,6 +537,14 @@ static int opal_imc_counters_probe(struct platform_device *pdev)
return -ENODEV;
}

+static void opal_imc_counters_shutdown(struct platform_device *pdev)
+{
+#ifdef CONFIG_PERF_EVENTS
+ /* Disable the IMC Core functions */
+ core_imc_disable();
+#endif
+}
+
static const struct of_device_id opal_imc_match[] = {
{ .compatible = IMC_DTB_COMPAT },
{},
@@ -548,6 +556,7 @@ static struct platform_driver opal_imc_driver = {
.of_match_table = opal_imc_match,
},
.probe = opal_imc_counters_probe,
+ .shutdown = opal_imc_counters_shutdown,
};

MODULE_DEVICE_TABLE(of, opal_imc_match);
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index b7208d8e6cc0..672d26ba94b7 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -302,3 +302,4 @@ OPAL_CALL(opal_int_set_mfrr, OPAL_INT_SET_MFRR);
OPAL_CALL(opal_pci_tce_kill, OPAL_PCI_TCE_KILL);
OPAL_CALL(opal_nmmu_set_ptcr, OPAL_NMMU_SET_PTCR);
OPAL_CALL(opal_nest_imc_counters_control, OPAL_NEST_IMC_COUNTERS_CONTROL);
+OPAL_CALL(opal_core_imc_counters_control, OPAL_CORE_IMC_COUNTERS_CONTROL);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index cfb0cedc72af..abde85d9511a 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -137,6 +137,7 @@ enum cpuhp_state {
CPUHP_AP_PERF_ARM_CCN_ONLINE,
CPUHP_AP_PERF_ARM_L2X0_ONLINE,
CPUHP_AP_PERF_POWERPC_NEST_ONLINE,
+ CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE,
CPUHP_AP_PERF_ARM_QCOM_L2_ONLINE,
CPUHP_AP_WORKQUEUE_ONLINE,
CPUHP_AP_RCUTREE_ONLINE,
--
2.7.4

2017-04-03 14:57:42

by Madhavan Srinivasan

[permalink] [raw]
Subject: [PATCH v6 09/11] powerpc/powernv: Thread IMC events detection

From: Hemant Kumar <[email protected]>

Patch adds support for detection of thread IMC events. It adds a new
domain IMC_DOMAIN_THREAD and it is determined with the help of the
compatibility string "ibm,imc-counters-thread" based on the IMC device
tree.

Signed-off-by: Anju T Sudhakar <[email protected]>
Signed-off-by: Hemant Kumar <[email protected]>
Signed-off-by: Madhavan Srinivasan <[email protected]>
---
arch/powerpc/include/asm/imc-pmu.h | 2 ++
arch/powerpc/perf/imc-pmu.c | 1 +
arch/powerpc/platforms/powernv/opal-imc.c | 9 ++++++++-
3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index 4aa63191456a..c63bc78fd6f6 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -29,6 +29,7 @@
#define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
#define IMC_DTB_NEST_COMPAT "ibm,imc-counters-nest"
#define IMC_DTB_CORE_COMPAT "ibm,imc-counters-core"
+#define IMC_DTB_THREAD_COMPAT "ibm,imc-counters-thread"

/*
* Structure to hold per chip specific memory address
@@ -67,6 +68,7 @@ struct imc_pmu {
*/
#define IMC_DOMAIN_NEST 1
#define IMC_DOMAIN_CORE 2
+#define IMC_DOMAIN_THREAD 3
#define IMC_DOMAIN_UNKNOWN -1

int imc_get_domain(struct device_node *pmu_dev);
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 45f9b35142a7..35b3564747e2 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -33,6 +33,7 @@ struct imc_pmu *core_imc_pmu;
/* Needed for sanity check */
extern u64 nest_max_offset;
extern u64 core_max_offset;
+extern u64 thread_max_offset;

PMU_FORMAT_ATTR(event, "config:0-20");
static struct attribute *imc_format_attrs[] = {
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
index f261fc933959..ac625cf13875 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -40,6 +40,7 @@ extern int init_imc_pmu(struct imc_events *events,
int idx, struct imc_pmu *pmu_ptr);
u64 nest_max_offset;
u64 core_max_offset;
+u64 thread_max_offset;

static int imc_event_info(char *name, struct imc_events *events)
{
@@ -87,6 +88,10 @@ static void update_max_value(u32 value, int pmu_domain)
if (core_max_offset < value)
core_max_offset = value;
break;
+ case IMC_DOMAIN_THREAD:
+ if (thread_max_offset < value)
+ thread_max_offset = value;
+ break;
default:
/* Unknown domain, return */
return;
@@ -240,6 +245,8 @@ int imc_get_domain(struct device_node *pmu_dev)
return IMC_DOMAIN_NEST;
if (of_device_is_compatible(pmu_dev, IMC_DTB_CORE_COMPAT))
return IMC_DOMAIN_CORE;
+ if (of_device_is_compatible(pmu_dev, IMC_DTB_THREAD_COMPAT))
+ return IMC_DOMAIN_THREAD;
else
return IMC_DOMAIN_UNKNOWN;
}
@@ -278,7 +285,7 @@ static void imc_free_events(struct imc_events *events, int nr_entries)
/*
* imc_pmu_create : Takes the parent device which is the pmu unit and a
* pmu_index as the inputs.
- * Allocates memory for the pmu, sets up its domain (NEST or CORE), and
+ * Allocates memory for the pmu, sets up its domain (NEST/CORE/THREAD), and
* allocates memory for the events supported by this pmu. Assigns a name for
* the pmu. Calls imc_events_node_parser() to setup the individual events.
* If everything goes fine, it calls, init_imc_pmu() to setup the pmu device
--
2.7.4

2017-04-03 14:57:41

by Madhavan Srinivasan

[permalink] [raw]
Subject: [PATCH v6 10/11] powerpc/perf: Thread IMC PMU functions

From: Hemant Kumar <[email protected]>

This patch adds the PMU functions required for event initialization,
read, update, add, del etc. for thread IMC PMU. Thread IMC PMUs are used
for per-task monitoring. These PMUs don't need any hotplugging support.

For each CPU, a page of memory is allocated and is kept static i.e.,
these pages will exist till the machine shuts down. The base address of
this page is assigned to the ldbar of that cpu. As soon as we do that,
the thread IMC counters start running for that cpu and the data of these
counters are assigned to the page allocated. But we use this for
per-task monitoring. Whenever we start monitoring a task, the event is
added is onto the task. At that point, we read the initial value of the
event. Whenever, we stop monitoring the task, the final value is taken
and the difference is the event data.

Now, a task can move to a different cpu. Suppose a task X is moving from
cpu A to cpu B. When the task is scheduled out of A, we get an
event_del for A, and hence, the event data is updated. And, we stop
updating the X's event data. As soon as X moves on to B, event_add is
called for B, and we again update the event_data. And this is how it
keeps on updating the event data even when the task is scheduled on to
different cpus.

Signed-off-by: Anju T Sudhakar <[email protected]>
Signed-off-by: Hemant Kumar <[email protected]>
Signed-off-by: Madhavan Srinivasan <[email protected]>
---
arch/powerpc/include/asm/imc-pmu.h | 5 +
arch/powerpc/perf/imc-pmu.c | 173 +++++++++++++++++++++++++++++-
arch/powerpc/platforms/powernv/opal-imc.c | 3 +
3 files changed, 180 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index c63bc78fd6f6..42f0149886b4 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -25,12 +25,16 @@

#define IMC_NEST_MAX_PAGES 16
#define IMC_CORE_COUNTER_MEM 8192
+#define IMC_THREAD_COUNTER_MEM 8192

#define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
#define IMC_DTB_NEST_COMPAT "ibm,imc-counters-nest"
#define IMC_DTB_CORE_COMPAT "ibm,imc-counters-core"
#define IMC_DTB_THREAD_COMPAT "ibm,imc-counters-thread"

+#define THREAD_IMC_LDBAR_MASK 0x0003ffffffffe000
+#define THREAD_IMC_ENABLE 0x8000000000000000
+
/*
* Structure to hold per chip specific memory address
* information for nest pmus. Nest Counter data are exported
@@ -73,4 +77,5 @@ struct imc_pmu {

int imc_get_domain(struct device_node *pmu_dev);
void core_imc_disable(void);
+void thread_imc_disable(void);
#endif /* PPC_POWERNV_IMC_PMU_DEF_H */
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 35b3564747e2..3db637c6d3f4 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -30,6 +30,9 @@ static u64 per_core_pdbar_add[IMC_MAX_CHIPS][IMC_MAX_CORES];
static cpumask_t core_imc_cpumask;
struct imc_pmu *core_imc_pmu;

+/* Maintains base address for all the cpus */
+static u64 per_cpu_add[NR_CPUS];
+
/* Needed for sanity check */
extern u64 nest_max_offset;
extern u64 core_max_offset;
@@ -461,6 +464,56 @@ static int core_imc_event_init(struct perf_event *event)
return 0;
}

+static int thread_imc_event_init(struct perf_event *event)
+{
+ struct task_struct *target;
+
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ /* Sampling not supported */
+ if (event->hw.sample_period)
+ return -EINVAL;
+
+ event->hw.idx = -1;
+
+ /* Sanity check for config (event offset) */
+ if (event->attr.config > thread_max_offset)
+ return -EINVAL;
+
+ target = event->hw.target;
+
+ if (!target)
+ return -EINVAL;
+
+ event->pmu->task_ctx_nr = perf_sw_context;
+ return 0;
+}
+
+static void thread_imc_read_counter(struct perf_event *event)
+{
+ u64 *addr, data;
+ int cpu_id = smp_processor_id();
+
+ addr = (u64 *)(per_cpu_add[cpu_id] + event->attr.config);
+ data = __be64_to_cpu(READ_ONCE(*addr));
+ local64_set(&event->hw.prev_count, data);
+}
+
+static void thread_imc_perf_event_update(struct perf_event *event)
+{
+ u64 counter_prev, counter_new, final_count, *addr;
+ int cpu_id = smp_processor_id();
+
+ addr = (u64 *)(per_cpu_add[cpu_id] + event->attr.config);
+ counter_prev = local64_read(&event->hw.prev_count);
+ counter_new = __be64_to_cpu(READ_ONCE(*addr));
+ final_count = counter_new - counter_prev;
+
+ local64_set(&event->hw.prev_count, counter_new);
+ local64_add(final_count, &event->count);
+}
+
static void imc_read_counter(struct perf_event *event)
{
u64 *addr, data;
@@ -511,6 +564,53 @@ static int imc_event_add(struct perf_event *event, int flags)
return 0;
}

+static void thread_imc_event_start(struct perf_event *event, int flags)
+{
+ thread_imc_read_counter(event);
+}
+
+static void thread_imc_event_stop(struct perf_event *event, int flags)
+{
+ thread_imc_perf_event_update(event);
+}
+
+static void thread_imc_event_del(struct perf_event *event, int flags)
+{
+ thread_imc_perf_event_update(event);
+}
+
+static int thread_imc_event_add(struct perf_event *event, int flags)
+{
+ thread_imc_event_start(event, flags);
+
+ return 0;
+}
+
+static void thread_imc_pmu_start_txn(struct pmu *pmu,
+ unsigned int txn_flags)
+{
+ if (txn_flags & ~PERF_PMU_TXN_ADD)
+ return;
+ perf_pmu_disable(pmu);
+}
+
+static void thread_imc_pmu_cancel_txn(struct pmu *pmu)
+{
+ perf_pmu_enable(pmu);
+}
+
+static int thread_imc_pmu_commit_txn(struct pmu *pmu)
+{
+ perf_pmu_enable(pmu);
+ return 0;
+}
+
+static void thread_imc_pmu_sched_task(struct perf_event_context *ctx,
+ bool sched_in)
+{
+ return;
+}
+
/* update_pmu_ops : Populate the appropriate operations for "pmu" */
static int update_pmu_ops(struct imc_pmu *pmu)
{
@@ -520,17 +620,31 @@ static int update_pmu_ops(struct imc_pmu *pmu)
pmu->pmu.task_ctx_nr = perf_invalid_context;
if (pmu->domain == IMC_DOMAIN_NEST) {
pmu->pmu.event_init = nest_imc_event_init;
+ pmu->attr_groups[2] = &imc_pmu_cpumask_attr_group;
} else if (pmu->domain == IMC_DOMAIN_CORE) {
pmu->pmu.event_init = core_imc_event_init;
+ pmu->attr_groups[2] = &imc_pmu_cpumask_attr_group;
}
+
pmu->pmu.add = imc_event_add;
pmu->pmu.del = imc_event_stop;
pmu->pmu.start = imc_event_start;
pmu->pmu.stop = imc_event_stop;
pmu->pmu.read = imc_perf_event_update;
pmu->attr_groups[1] = &imc_format_group;
- pmu->attr_groups[2] = &imc_pmu_cpumask_attr_group;
pmu->pmu.attr_groups = pmu->attr_groups;
+ if (pmu->domain == IMC_DOMAIN_THREAD) {
+ pmu->pmu.event_init = thread_imc_event_init;
+ pmu->pmu.start = thread_imc_event_start;
+ pmu->pmu.add = thread_imc_event_add;
+ pmu->pmu.del = thread_imc_event_del;
+ pmu->pmu.stop = thread_imc_event_stop;
+ pmu->pmu.read = thread_imc_perf_event_update;
+ pmu->pmu.start_txn = thread_imc_pmu_start_txn;
+ pmu->pmu.cancel_txn = thread_imc_pmu_cancel_txn;
+ pmu->pmu.commit_txn = thread_imc_pmu_commit_txn;
+ pmu->pmu.sched_task = thread_imc_pmu_sched_task;
+ }

return 0;
}
@@ -586,6 +700,56 @@ static int update_events_in_group(struct imc_events *events,
return 0;
}

+static void thread_imc_ldbar_disable(void *dummy)
+{
+ /* LDBAR spr is a per-thread */
+ mtspr(SPRN_LDBAR, 0);
+}
+
+void thread_imc_disable(void)
+{
+ on_each_cpu(thread_imc_ldbar_disable, NULL, 1);
+}
+
+static void cleanup_thread_imc_memory(void *dummy)
+{
+ int cpu_id = smp_processor_id();
+ u64 addr = per_cpu_add[cpu_id];
+
+ /* Only if the address is non-zero, shall we free it */
+ if (addr)
+ free_pages(addr, 0);
+}
+
+static void cleanup_all_thread_imc_memory(void)
+{
+ on_each_cpu(cleanup_thread_imc_memory, NULL, 1);
+}
+
+/*
+ * Allocates a page of memory for each of the online cpus, and, writes the
+ * physical base address of that page to the LDBAR for that cpu. This starts
+ * the thread IMC counters.
+ */
+static void thread_imc_mem_alloc(void *dummy)
+{
+ u64 ldbar_addr, ldbar_value;
+ int cpu_id = smp_processor_id();
+ int phys_id = topology_physical_package_id(smp_processor_id());
+
+ per_cpu_add[cpu_id] = (u64)alloc_pages_exact_nid(phys_id,
+ (size_t)IMC_THREAD_COUNTER_MEM, GFP_KERNEL | __GFP_ZERO);
+ ldbar_addr = (u64)virt_to_phys((void *)per_cpu_add[cpu_id]);
+ ldbar_value = (ldbar_addr & (u64)THREAD_IMC_LDBAR_MASK) |
+ (u64)THREAD_IMC_ENABLE;
+ mtspr(SPRN_LDBAR, ldbar_value);
+}
+
+void thread_imc_cpu_init(void)
+{
+ on_each_cpu(thread_imc_mem_alloc, NULL, 1);
+}
+
/*
* init_imc_pmu : Setup the IMC pmu device in "pmu_ptr" and its events
* "events".
@@ -609,6 +773,9 @@ int init_imc_pmu(struct imc_events *events, int idx,
if (ret)
return ret;
break;
+ case IMC_DOMAIN_THREAD:
+ thread_imc_cpu_init();
+ break;
default:
return -1; /* Unknown domain */
}
@@ -640,5 +807,9 @@ int init_imc_pmu(struct imc_events *events, int idx,
if (pmu_ptr->domain == IMC_DOMAIN_CORE)
cleanup_all_core_imc_memory();

+ /* For thread_imc, we have allocated memory, we need to free it */
+ if (pmu_ptr->domain == IMC_DOMAIN_THREAD)
+ cleanup_all_thread_imc_memory();
+
return ret;
}
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
index ac625cf13875..a0cc4467fb30 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -549,6 +549,9 @@ static void opal_imc_counters_shutdown(struct platform_device *pdev)
#ifdef CONFIG_PERF_EVENTS
/* Disable the IMC Core functions */
core_imc_disable();
+
+ /* Disable the IMC Thread functions */
+ thread_imc_disable();
#endif
}

--
2.7.4

2017-04-03 14:57:39

by Madhavan Srinivasan

[permalink] [raw]
Subject: [PATCH v6 11/11] powerpc/perf: Thread imc cpuhotplug support

From: Anju T Sudhakar <[email protected]>

This patch adds support for thread IMC on cpuhotplug.

When a cpu goes offline, the LDBAR for that cpu is disabled, and when it comes
back online the previous ldbar value is written back to the LDBAR for that cpu.

To register the hotplug functions for thread_imc, a new state
CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE is added to the list of existing
states.

Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Anju T Sudhakar <[email protected]>
Signed-off-by: Madhavan Srinivasan <[email protected]>
---
arch/powerpc/perf/imc-pmu.c | 32 +++++++++++++++++++++++++++-----
include/linux/cpuhotplug.h | 1 +
2 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 3db637c6d3f4..944e568b52ff 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -726,6 +726,16 @@ static void cleanup_all_thread_imc_memory(void)
on_each_cpu(cleanup_thread_imc_memory, NULL, 1);
}

+static void thread_imc_update_ldbar(unsigned int cpu_id)
+{
+ u64 ldbar_addr, ldbar_value;
+
+ ldbar_addr = (u64)virt_to_phys((void *)per_cpu_add[cpu_id]);
+ ldbar_value = (ldbar_addr & (u64)THREAD_IMC_LDBAR_MASK) |
+ (u64)THREAD_IMC_ENABLE;
+ mtspr(SPRN_LDBAR, ldbar_value);
+}
+
/*
* Allocates a page of memory for each of the online cpus, and, writes the
* physical base address of that page to the LDBAR for that cpu. This starts
@@ -733,21 +743,33 @@ static void cleanup_all_thread_imc_memory(void)
*/
static void thread_imc_mem_alloc(void *dummy)
{
- u64 ldbar_addr, ldbar_value;
int cpu_id = smp_processor_id();
int phys_id = topology_physical_package_id(smp_processor_id());

per_cpu_add[cpu_id] = (u64)alloc_pages_exact_nid(phys_id,
(size_t)IMC_THREAD_COUNTER_MEM, GFP_KERNEL | __GFP_ZERO);
- ldbar_addr = (u64)virt_to_phys((void *)per_cpu_add[cpu_id]);
- ldbar_value = (ldbar_addr & (u64)THREAD_IMC_LDBAR_MASK) |
- (u64)THREAD_IMC_ENABLE;
- mtspr(SPRN_LDBAR, ldbar_value);
+ thread_imc_update_ldbar(cpu_id);
+}
+
+static int ppc_thread_imc_cpu_online(unsigned int cpu)
+{
+ thread_imc_update_ldbar(cpu);
+ return 0;
}

+static int ppc_thread_imc_cpu_offline(unsigned int cpu)
+{
+ mtspr(SPRN_LDBAR, 0);
+ return 0;
+ }
+
void thread_imc_cpu_init(void)
{
on_each_cpu(thread_imc_mem_alloc, NULL, 1);
+ cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE,
+ "POWER_THREAD_IMC_ONLINE",
+ ppc_thread_imc_cpu_online,
+ ppc_thread_imc_cpu_offline);
}

/*
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index abde85d9511a..724df46b2c3c 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -138,6 +138,7 @@ enum cpuhp_state {
CPUHP_AP_PERF_ARM_L2X0_ONLINE,
CPUHP_AP_PERF_POWERPC_NEST_ONLINE,
CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE,
+ CPUHP_AP_PERF_POWERPC_THREADIMC_ONLINE,
CPUHP_AP_PERF_ARM_QCOM_L2_ONLINE,
CPUHP_AP_WORKQUEUE_ONLINE,
CPUHP_AP_RCUTREE_ONLINE,
--
2.7.4

2017-04-03 14:59:55

by Madhavan Srinivasan

[permalink] [raw]
Subject: [PATCH v6 01/11] powerpc/powernv: Data structure and macros definitions

From: Hemant Kumar <[email protected]>

Create new header file "imc-pmu.h" to add the data structures
and macros needed for IMC pmu support.

Signed-off-by: Anju T Sudhakar <[email protected]>
Signed-off-by: Hemant Kumar <[email protected]>
Signed-off-by: Madhavan Srinivasan <[email protected]>
---
arch/powerpc/include/asm/imc-pmu.h | 68 ++++++++++++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)
create mode 100644 arch/powerpc/include/asm/imc-pmu.h

diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
new file mode 100644
index 000000000000..a3d4f1bf9492
--- /dev/null
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -0,0 +1,68 @@
+#ifndef PPC_POWERNV_IMC_PMU_DEF_H
+#define PPC_POWERNV_IMC_PMU_DEF_H
+
+/*
+ * IMC Nest Performance Monitor counter support.
+ *
+ * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
+ * (C) 2016 Hemant K Shaw, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include <linux/perf_event.h>
+#include <linux/slab.h>
+#include <linux/of.h>
+#include <linux/io.h>
+#include <asm/opal.h>
+
+#define IMC_MAX_CHIPS 32
+#define IMC_MAX_PMUS 32
+#define IMC_MAX_PMU_NAME_LEN 256
+
+#define IMC_NEST_MAX_PAGES 16
+
+#define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
+#define IMC_DTB_NEST_COMPAT "ibm,imc-counters-nest"
+
+/*
+ * Structure to hold per chip specific memory address
+ * information for nest pmus. Nest Counter data are exported
+ * in per-chip reserved memory region by the PORE Engine.
+ */
+struct perchip_nest_info {
+ u32 chip_id;
+ u64 pbase;
+ u64 vbase[IMC_NEST_MAX_PAGES];
+ u64 size;
+};
+
+/*
+ * Place holder for nest pmu events and values.
+ */
+struct imc_events {
+ char *ev_name;
+ char *ev_value;
+};
+
+/*
+ * Device tree parser code detects IMC pmu support and
+ * registers new IMC pmus. This structure will
+ * hold the pmu functions and attrs for each imc pmu and
+ * will be referenced at the time of pmu registration.
+ */
+struct imc_pmu {
+ struct pmu pmu;
+ int domain;
+ const struct attribute_group *attr_groups[4];
+};
+
+/*
+ * Domains for IMC PMUs
+ */
+#define IMC_DOMAIN_NEST 1
+#define IMC_DOMAIN_UNKNOWN -1
+
+#endif /* PPC_POWERNV_IMC_PMU_DEF_H */
--
2.7.4

2017-04-03 14:59:53

by Madhavan Srinivasan

[permalink] [raw]
Subject: [PATCH v6 02/11] powerpc/powernv: Autoload IMC device driver module

From: Hemant Kumar <[email protected]>

This patch does three things :
- Enables "opal.c" to create a platform device for the IMC interface
according to the appropriate compatibility string.
- Find the reserved-memory region details from the system device tree
and get the base address of HOMER (Reserved memory) region address for each chip.
- We also get the Nest PMU counter data offsets (in the HOMER region)
and their sizes. The offsets for the counters' data are fixed and
won't change from chip to chip.

The device tree parsing logic is separated from the PMU creation
functions (which is done in subsequent patches).

Signed-off-by: Anju T Sudhakar <[email protected]>
Signed-off-by: Hemant Kumar <[email protected]>
Signed-off-by: Madhavan Srinivasan <[email protected]>
---
arch/powerpc/platforms/powernv/Makefile | 2 +-
arch/powerpc/platforms/powernv/opal-imc.c | 126 ++++++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/opal.c | 14 ++++
3 files changed, 141 insertions(+), 1 deletion(-)
create mode 100644 arch/powerpc/platforms/powernv/opal-imc.c

diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index b5d98cb3f482..44909fec1121 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -2,7 +2,7 @@ obj-y += setup.o opal-wrappers.o opal.o opal-async.o idle.o
obj-y += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
obj-y += rng.o opal-elog.o opal-dump.o opal-sysparam.o opal-sensor.o
obj-y += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
-obj-y += opal-kmsg.o
+obj-y += opal-kmsg.o opal-imc.o

obj-$(CONFIG_SMP) += smp.o subcore.o subcore-asm.o
obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
new file mode 100644
index 000000000000..c476d596c6a8
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -0,0 +1,126 @@
+/*
+ * OPAL IMC interface detection driver
+ * Supported on POWERNV platform
+ *
+ * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
+ * (C) 2016 Hemant K Shaw, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/miscdevice.h>
+#include <linux/fs.h>
+#include <linux/of.h>
+#include <linux/of_address.h>
+#include <linux/of_platform.h>
+#include <linux/poll.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/crash_dump.h>
+#include <asm/opal.h>
+#include <asm/io.h>
+#include <asm/uaccess.h>
+#include <asm/cputable.h>
+#include <asm/imc-pmu.h>
+
+struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
+
+static int opal_imc_counters_probe(struct platform_device *pdev)
+{
+ struct device_node *child, *imc_dev, *rm_node = NULL;
+ struct perchip_nest_info *pcni;
+ u32 pages, nest_offset, nest_size, idx;
+ int i = 0;
+ const char *node_name;
+ const __be32 *addrp;
+ u64 reg_addr, reg_size;
+
+ if (!pdev || !pdev->dev.of_node)
+ return -ENODEV;
+
+ /*
+ * Check whether this kdump kernel. If yes, just return.
+ */
+ if (is_kdump_kernel())
+ return -ENODEV;
+
+ imc_dev = pdev->dev.of_node;
+
+ /*
+ * nest_offset : where the nest-counters' data start.
+ * size : size of the entire nest-counters region
+ */
+ if (of_property_read_u32(imc_dev, "imc-nest-offset", &nest_offset))
+ goto err;
+
+ if (of_property_read_u32(imc_dev, "imc-nest-size", &nest_size))
+ goto err;
+
+ /* Find the "homer region" for each chip */
+ rm_node = of_find_node_by_path("/reserved-memory");
+ if (!rm_node)
+ goto err;
+
+ for_each_child_of_node(rm_node, child) {
+ if (of_property_read_string_index(child, "name", 0,
+ &node_name))
+ continue;
+ if (strncmp("ibm,homer-image", node_name,
+ strlen("ibm,homer-image")))
+ continue;
+
+ /* Get the chip id to which the above homer region belongs to */
+ if (of_property_read_u32(child, "ibm,chip-id", &idx))
+ goto err;
+
+ pcni = &nest_perchip_info[idx];
+ addrp = of_get_address(child, 0, &reg_size, NULL);
+ if (!addrp)
+ goto err;
+
+ /* Fetch the homer region base address */
+ reg_addr = of_read_number(addrp, 2);
+ pcni->pbase = reg_addr;
+ /* Add the nest IMC Base offset */
+ pcni->pbase = pcni->pbase + nest_offset;
+ /* Fetch the size of the homer region */
+ pcni->size = nest_size;
+
+ do {
+ pages = PAGE_SIZE * i;
+ pcni->vbase[i++] = (u64)phys_to_virt(pcni->pbase +
+ pages);
+ } while (i < (pcni->size / PAGE_SIZE));
+ }
+
+ return 0;
+err:
+ return -ENODEV;
+}
+
+static const struct of_device_id opal_imc_match[] = {
+ { .compatible = IMC_DTB_COMPAT },
+ {},
+};
+
+static struct platform_driver opal_imc_driver = {
+ .driver = {
+ .name = "opal-imc-counters",
+ .of_match_table = opal_imc_match,
+ },
+ .probe = opal_imc_counters_probe,
+};
+
+MODULE_DEVICE_TABLE(of, opal_imc_match);
+module_platform_driver(opal_imc_driver);
+MODULE_DESCRIPTION("PowerNV OPAL IMC driver");
+MODULE_LICENSE("GPL");
diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index e0f856bfbfe8..85ea1296f030 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -14,6 +14,7 @@
#include <linux/printk.h>
#include <linux/types.h>
#include <linux/of.h>
+#include <linux/of_address.h>
#include <linux/of_fdt.h>
#include <linux/of_platform.h>
#include <linux/interrupt.h>
@@ -30,6 +31,7 @@
#include <asm/opal.h>
#include <asm/firmware.h>
#include <asm/mce.h>
+#include <asm/imc-pmu.h>

#include "powernv.h"

@@ -631,6 +633,15 @@ static void opal_pdev_init(const char *compatible)
of_platform_device_create(np, NULL, NULL);
}

+static void opal_imc_init_dev(void)
+{
+ struct device_node *np;
+
+ np = of_find_compatible_node(NULL, NULL, IMC_DTB_COMPAT);
+ if (np)
+ of_platform_device_create(np, NULL, NULL);
+}
+
static int kopald(void *unused)
{
unsigned long timeout = msecs_to_jiffies(opal_heartbeat) + 1;
@@ -704,6 +715,9 @@ static int __init opal_init(void)
/* Setup a heatbeat thread if requested by OPAL */
opal_init_heartbeat();

+ /* Detect IMC pmu counters support and create PMUs */
+ opal_imc_init_dev();
+
/* Create leds platform devices */
leds = of_find_node_by_path("/ibm,opal/leds");
if (leds) {
--
2.7.4

2017-04-04 00:58:54

by Daniel Axtens

[permalink] [raw]
Subject: Re: [PATCH v6 02/11] powerpc/powernv: Autoload IMC device driver module

Hi all,

I'm trying to get my head around these patches - at this point I'm just
doing a first pass, so I may have more substantive structural comments
later on. In the mean time - here are some minor C nits:

> + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
> + * (C) 2016 Hemant K Shaw, IBM Corporation.

Should these be bumped to 2017?

> +
> + do {
> + pages = PAGE_SIZE * i;
> + pcni->vbase[i++] = (u64)phys_to_virt(pcni->pbase +
> + pages);
> + } while (i < (pcni->size / PAGE_SIZE));
> + }
I had to scroll back up to the top of this function to make sure I
understood what this loop does. Would it be better to write it as:

for (i = 0; i < (pcni->size / PAGE_SIZE); i++) {
pages = PAGE_SIZE * i;
pcni->vbase[i] = (u64)....
}

And, just checking - this is expected to work on both 4 and 64kB pages?

> +
> + return 0;
> +err:
> + return -ENODEV;

You're not releasing any resources here - would it be better to just
replace the gotos with this return? I haven't checked to see if you
change the function later on to allocate memory - if so please ignore :)

> +}
> +
> +static const struct of_device_id opal_imc_match[] = {
> + { .compatible = IMC_DTB_COMPAT },
> + {},
> +};
> +
> +static struct platform_driver opal_imc_driver = {
> + .driver = {
> + .name = "opal-imc-counters",
> + .of_match_table = opal_imc_match,
> + },
> + .probe = opal_imc_counters_probe,
> +};
> +
> +MODULE_DEVICE_TABLE(of, opal_imc_match);
> +module_platform_driver(opal_imc_driver);
> +MODULE_DESCRIPTION("PowerNV OPAL IMC driver");
> +MODULE_LICENSE("GPL");
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index e0f856bfbfe8..85ea1296f030 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -14,6 +14,7 @@
> #include <linux/printk.h>
> #include <linux/types.h>
> #include <linux/of.h>
> +#include <linux/of_address.h>
> #include <linux/of_fdt.h>
> #include <linux/of_platform.h>
> #include <linux/interrupt.h>
> @@ -30,6 +31,7 @@
> #include <asm/opal.h>
> #include <asm/firmware.h>
> #include <asm/mce.h>
> +#include <asm/imc-pmu.h>
>
> #include "powernv.h"
>
> @@ -631,6 +633,15 @@ static void opal_pdev_init(const char *compatible)
> of_platform_device_create(np, NULL, NULL);
> }
>
> +static void opal_imc_init_dev(void)
> +{
> + struct device_node *np;
> +
> + np = of_find_compatible_node(NULL, NULL, IMC_DTB_COMPAT);
> + if (np)
> + of_platform_device_create(np, NULL, NULL);
> +}

Should this function be tagged __init?

> +
> static int kopald(void *unused)
> {
> unsigned long timeout = msecs_to_jiffies(opal_heartbeat) + 1;
> @@ -704,6 +715,9 @@ static int __init opal_init(void)
> /* Setup a heatbeat thread if requested by OPAL */
> opal_init_heartbeat();
>
> + /* Detect IMC pmu counters support and create PMUs */
> + opal_imc_init_dev();
> +
> /* Create leds platform devices */
> leds = of_find_node_by_path("/ibm,opal/leds");
> if (leds) {
> --
> 2.7.4

2017-04-04 01:42:00

by Daniel Axtens

[permalink] [raw]
Subject: Re: [PATCH v6 03/11] powerpc/powernv: Detect supported IMC units and its events

Madhavan Srinivasan <[email protected]> writes:

> From: Hemant Kumar <[email protected]>
>
> Parse device tree to detect IMC units. Traverse through each IMC unit
> node to find supported events and corresponding unit/scale files (if any).
>
> Here is the DTS file for reference:
>
> https://github.com/open-power/ima-catalog/blob/master/81E00612.4E0100.dts
>
> The device tree for IMC counters starts at the node "imc-counters".
> This node contains all the IMC PMU nodes and event nodes
> for these IMC PMUs. The PMU nodes have an "events" property which has a
> phandle value for the actual events node. The events are separated from
> the PMU nodes to abstract out the common events. For example, PMU node
> "mcs0", "mcs1" etc. will contain a pointer to "nest-mcs-events" since,
> the events are common between these PMUs. These events have a different
> prefix based on their relation to different PMUs, and hence, the PMU
> nodes themselves contain an "events-prefix" property. The value for this
> property concatenated to the event name, forms the actual event
> name. Also, the PMU have a "reg" field as the base offset for the events
> which belong to this PMU. This "reg" field is added to an event in the
> "events" node, which gives us the location of the counter data. Kernel
> code uses this offset as event configuration value.

This is probably because I haven't looked at this area recently, but
what do you mean by 'event configuration value'? If I understand
correctly, you're saying something like "kernel code stores this
calculated location in the event configuration data" - is that about
right?

Apologies if this is a dumb question.

>
> Device tree parser code also looks for scale/unit property in the event
> node and passes on the value as an event attr for perf interface to use
> in the post processing by the perf tool. Some PMUs may have common scale
> and unit properties which implies that all events supported by this PMU
> inherit the scale and unit properties of the PMU itself. For those
> events, we need to set the common unit and scale values.
>
> For failure to initialize any unit or any event, disable that unit and
> continue setting up the rest of them.
>
> Signed-off-by: Hemant Kumar <[email protected]>
> Signed-off-by: Anju T Sudhakar <[email protected]>
> Signed-off-by: Madhavan Srinivasan <[email protected]>
> ---
> arch/powerpc/platforms/powernv/opal-imc.c | 386 ++++++++++++++++++++++++++++++
> 1 file changed, 386 insertions(+)
>
> diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
> index c476d596c6a8..ba0203e669c0 100644
> --- a/arch/powerpc/platforms/powernv/opal-imc.c
> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
> @@ -33,6 +33,388 @@
> #include <asm/imc-pmu.h>
>
> struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
> +struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
> +
> +static int imc_event_info(char *name, struct imc_events *events)
This seems to be an allocation/init function: should that be reflected
in the name?

> +{
> + char *buf;
> +
> + /* memory for content */
> + buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + events->ev_name = name;
We trust the name will never become a dangling pointer? We don't need to
duplicate it?

Should the pointer be const?

> + events->ev_value = buf;
> + return 0;
> +}
> +
> +static int imc_event_info_str(struct property *pp, char *name,
> + struct imc_events *events)
This creates and sets an initial value based on a property, ...

> +static int imc_event_info_val(char *name, u32 val,
> + struct imc_events *events)
... this creates and sets based on a u32.

Could you call them something that suggests their purpose a bit better?
Maybe event_info_from_{property,val}?

> +{
> + int ret;
> +
> + ret = imc_event_info(name, events);
> + if (ret)
> + return ret;
> + sprintf(events->ev_value, "event=0x%x", val);
> +
> + return 0;
> +}
> +
> +static int set_event_property(struct property *pp, char *event_prop,
> + struct imc_events *events, char *ev_name)
> +{
> + char *buf;
> + int ret;
> +
> + buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + sprintf(buf, "%s.%s", ev_name, event_prop);
> + ret = imc_event_info_str(pp, buf, events);
> + if (ret) {
> + kfree(events->ev_name);
> + kfree(events->ev_value);
How do you know the buffer for ev_value was successfully allocated? Can
you be sure it is safe to free? (Consider imc_event_info returning -ENOMEM)

> + }
> +
> + return ret;
> +}
> +
> +/*
> + * imc_events_node_parser: Parse the event node "dev" and assign the parsed
> + * information to event "events".
> + *
> + * Parses the "reg" property of this event. "reg" gives us the event offset.
> + * Also, parse the "scale" and "unit" properties, if any.
> + */
> +static int imc_events_node_parser(struct device_node *dev,
> + struct imc_events *events,
> + struct property *event_scale,
> + struct property *event_unit,
> + struct property *name_prefix,
> + u32 reg)
> +{
> + struct property *name, *pp;
> + char *ev_name;
> + u32 val;
> + int idx = 0, ret;
> +
> + if (!dev)
> + return -EINVAL;
> +
> + /*
> + * Loop through each property of an event node
> + */
> + name = of_find_property(dev, "event-name", NULL);
> + if (!name)
> + return -ENODEV;
> +
> + if (!name->value ||
> + (strnlen(name->value, name->length) == name->length) ||
> + (name->length > IMC_MAX_PMU_NAME_LEN))
> + return -EINVAL;
> +
> + ev_name = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
> + if (!ev_name)
> + return -ENOMEM;
I see you allocating memory here and in set_event_property for the
name. Would it be better to move that into imc_event_info and make that
function responsible for the whole set of allocations regarding the
event?

(I'm happy to be talked out of this - it just seems like it would make
it much easier to reason about the lifecycles of the memory allocations.)

> +
> + snprintf(ev_name, IMC_MAX_PMU_NAME_LEN, "%s%s",
> + (char *)name_prefix->value,
> + (char *)name->value);
> +
> + /*
> + * Parse each property of this event node "dev". Property "reg" has
> + * the offset which is assigned to the event name. Other properties
> + * like "scale" and "unit" are assigned to event.scale and event.unit
> + * accordingly.
> + */
> + for_each_property_of_node(dev, pp) {
> + /*
> + * If there is an issue in parsing a single property of
> + * this event, we just clean up the buffers, but we still
> + * continue to parse.
> + */
> + if (strncmp(pp->name, "reg", 3) == 0) {
> + of_property_read_u32(dev, pp->name, &val);
> + val += reg;
> + ret = imc_event_info_val(ev_name, val, &events[idx]);
> + if (ret) {
> + kfree(events[idx].ev_name);
> + kfree(events[idx].ev_value);
> + continue;
> + }
> + /*
> + * If the common scale and unit properties available,
> + * then, assign them to this event
> + */
> + if (event_scale) {
> + idx++;
> + ret = set_event_property(event_scale, "scale",
> + &events[idx],
> + ev_name);
> + if (ret)
> + continue;
> + idx++;

Why do we increment the idx an extra two times if event_scale is
provided? We don't do that for the other ones...

> + }
> + if (event_unit) {
> + ret = set_event_property(event_unit, "unit",
> + &events[idx],
> + ev_name);
> + if (ret)
> + continue;
> + }
> + idx++;
> + } else if (strncmp(pp->name, "unit", 4) == 0) {
> + ret = set_event_property(pp, "unit", &events[idx],
> + ev_name);
> + if (ret)
> + continue;
> + idx++;
> + } else if (strncmp(pp->name, "scale", 5) == 0) {
> + ret = set_event_property(pp, "scale", &events[idx],
> + ev_name);
> + if (ret)
> + continue;
> + idx++;
> + }
> + }

I find it really difficult to convince myself of the control flow in
that loop.

In particular, the comment at the top claims:
> + * If there is an issue in parsing a single property of
> + * this event, we just clean up the buffers, but we still
> + * continue to parse.

I am not quite sure precisely what that means. I first thought it meant
that if you cannot parse one of the properties of the event
(scale/unit), you would clean up the entire event, and parse the next
entire event. It looks like the code in fact simply cleans up that
*property* and continues to parse other properties.

Does that make sense to do? If you can't parse the scale of an event for
instance, does it make sense to continue with the rest of that event?

Initially I was concerned that 'continue' wasn't sufficient to even
clean up the property, but I worked through the error conditions of
set_event_property and now I am satisfied.

> +
> + return idx;
> +}
> +
> +/*
> + * imc_get_domain : Returns the domain for pmu "pmu_dev".
> + */
> +int imc_get_domain(struct device_node *pmu_dev)
> +{
> + if (of_device_is_compatible(pmu_dev, IMC_DTB_NEST_COMPAT))
> + return IMC_DOMAIN_NEST;
> + else
> + return IMC_DOMAIN_UNKNOWN;
> +}
> +
> +/*
> + * get_nr_children : Returns the number of children for a pmu device node.
> + */
> +static int get_nr_children(struct device_node *pmu_node)
> +{
> + struct device_node *child;
> + int i = 0;
> +
> + for_each_child_of_node(pmu_node, child)
> + i++;
> + return i;
> +}
> +
> +/*
> + * imc_free_events : Cleanup the "events" list having "nr_entries" entries.
> + */
> +static void imc_free_events(struct imc_events *events, int nr_entries)
> +{
> + int i;
> +
> + /* Nothing to clean, return */
> + if (!events)
> + return;
> + for (i = 0; i < nr_entries; i++) {
> + kfree(events[i].ev_name);
> + kfree(events[i].ev_value);
> + }
> +
> + kfree(events);
> +}
> +
> +/*
> + * imc_pmu_create : Takes the parent device which is the pmu unit and a
> + * pmu_index as the inputs.
> + * Allocates memory for the pmu, sets up its domain (NEST or CORE), and

> + * allocates memory for the events supported by this pmu. Assigns a name for
> + * the pmu. Calls imc_events_node_parser() to setup the individual events.
> + * If everything goes fine, it calls, init_imc_pmu() to setup the pmu device
> + * and register it.
> + */
> +static int imc_pmu_create(struct device_node *parent, int pmu_index)
> +{
> + struct device_node *ev_node = NULL, *dir = NULL;
> + struct imc_events *events;
> + struct imc_pmu *pmu_ptr;
> + u32 prop, reg;
> + struct property *pp, *scale_pp, *unit_pp, *name_prefix;
> + char *buf;
> + int idx = 0, ret = 0, nr_children = 0;
> +
> + if (!parent)
> + return -EINVAL;
> +
> + /* memory for pmu */
> + pmu_ptr = kzalloc(sizeof(struct imc_pmu), GFP_KERNEL);
> + if (!pmu_ptr)
> + return -ENOMEM;
> +
> + pmu_ptr->domain = imc_get_domain(parent);
> + if (pmu_ptr->domain == IMC_DOMAIN_UNKNOWN)
> + goto free_pmu;
> +
> + /* Needed for hotplug/migration */
> + per_nest_pmu_arr[pmu_index] = pmu_ptr;
> +
> + /*
> + * "events" property inside a PMU node contains the phandle value
> + * for the actual events node. The "events" node for the IMC PMU
> + * is not in this node, rather inside "imc-counters" node, since,
> + * we want to factor out the common events (thereby, reducing the
> + * size of the device tree)
> + */
> + of_property_read_u32(parent, "events", &prop);
> + if (!prop)
> + return -EINVAL;
> +
> + /*
> + * Fetch the actual node where the events for this PMU exist.
> + */
> + dir = of_find_node_by_phandle(prop);
> + if (!dir)
> + return -EINVAL;
> +
> + /*
> + * Get the maximum no. of events in this node.
> + * Multiply by 3 to account for .scale and .unit properties
> + * This number suggests the amount of memory needed to setup the
> + * events for this pmu.
> + */
> + nr_children = get_nr_children(dir) * 3;
> +
> + /* memory for pmu events */
> + events = kzalloc((sizeof(struct imc_events) * nr_children),
> + GFP_KERNEL);
> + if (!events) {
> + ret = -ENOMEM;
> + goto free_pmu;
> + }
> +
> + pp = of_find_property(parent, "name", NULL);
> + if (!pp) {
> + ret = -ENODEV;
> + goto free_events;
> + }
> +
> + if (!pp->value ||
> + (strnlen(pp->value, pp->length) == pp->length) ||
> + (pp->length > IMC_MAX_PMU_NAME_LEN)) {
> + ret = -EINVAL;
> + goto free_events;
> + }
> +
> + buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
> + if (!buf) {
> + ret = -ENOMEM;
> + goto free_events;
> + }
> +
> + /* Save the name to register it later */
> + sprintf(buf, "nest_%s", (char *)pp->value);
> + pmu_ptr->pmu.name = (char *)buf;
> +
> + /*
> + * Check if there is a common "scale" and "unit" properties inside
> + * the PMU node for all the events supported by this PMU.
> + */
> + scale_pp = of_find_property(parent, "scale", NULL);
> + unit_pp = of_find_property(parent, "unit", NULL);
> +
> + /*
> + * Get the event-prefix property from the PMU node
> + * which needs to be attached with the event names.
> + */
> + name_prefix = of_find_property(parent, "events-prefix", NULL);
> + if (!name_prefix)
> + return -ENODEV;
> +
> + /*
> + * "reg" property gives out the base offset of the counters data
> + * for this PMU.
> + */
> + of_property_read_u32(parent, "reg", &reg);
> +
> + if (!name_prefix->value ||
> + (strnlen(name_prefix->value, name_prefix->length) == name_prefix->length) ||
> + (name_prefix->length > IMC_MAX_PMU_NAME_LEN))
> + return -EINVAL;
> +
> + /* Loop through event nodes */
> + for_each_child_of_node(dir, ev_node) {
> + ret = imc_events_node_parser(ev_node, &events[idx], scale_pp,
> + unit_pp, name_prefix, reg);
> + if (ret < 0) {
> + /* Unable to parse this event */
> + if (ret == -ENOMEM)
> + goto free_events;
> + continue;
> + }
> +
> + /*
> + * imc_event_node_parser will return number of
> + * event entries created for this. This could include
> + * event scale and unit files also.
> + */
> + idx += ret;
> + }
> +
> + return 0;
> +
> +free_events:
> + imc_free_events(events, idx);
> +free_pmu:
> + kfree(pmu_ptr);
> + return ret;
> +}
> +
> +/*
> + * imc_pmu_setup : Setup the IMC PMUs (children of "parent").
> + */
> +static void imc_pmu_setup(struct device_node *parent)

Should this be marked __init?
> +{
> + struct device_node *child;
> + int pmu_count = 0, rc = 0;
> + const struct property *pp;
> +
> + if (!parent)
> + return;
> +
> + /* Setup all the IMC pmus */
> + for_each_child_of_node(parent, child) {
> + pp = of_get_property(child, "compatible", NULL);
> + if (pp) {
> + /*
> + * If there is a node with a "compatible" field,
> + * that's a PMU node
> + */
> + rc = imc_pmu_create(child, pmu_count);
> + if (rc)
> + return;
> + pmu_count++;
> + }
> + }
> +}
>
> static int opal_imc_counters_probe(struct platform_device *pdev)
> {
> @@ -102,6 +484,10 @@ static int opal_imc_counters_probe(struct platform_device *pdev)
> } while (i < (pcni->size / PAGE_SIZE));
> }
>
> +#ifdef CONFIG_PERF_EVENTS
> + imc_pmu_setup(imc_dev);
> +#endif
> +
> return 0;
> err:
> return -ENODEV;
> --
> 2.7.4

Regards,
Daniel

2017-04-04 01:48:42

by Daniel Axtens

[permalink] [raw]
Subject: Re: [PATCH v6 02/11] powerpc/powernv: Autoload IMC device driver module

Hi,

> + do {
> + pages = PAGE_SIZE * i;
> + pcni->vbase[i++] = (u64)phys_to_virt(pcni->pbase +
> + pages);
> + } while (i < (pcni->size / PAGE_SIZE));

I also just noticed that there's no check here against
IMC_NEST_MAX_PAGES: should that be inserted? (If for no other reason
than to stop every static analysis tool complaining!)

Daniel

> + }
> +
> + return 0;
> +err:
> + return -ENODEV;
> +}
> +
> +static const struct of_device_id opal_imc_match[] = {
> + { .compatible = IMC_DTB_COMPAT },
> + {},
> +};
> +
> +static struct platform_driver opal_imc_driver = {
> + .driver = {
> + .name = "opal-imc-counters",
> + .of_match_table = opal_imc_match,
> + },
> + .probe = opal_imc_counters_probe,
> +};
> +
> +MODULE_DEVICE_TABLE(of, opal_imc_match);
> +module_platform_driver(opal_imc_driver);
> +MODULE_DESCRIPTION("PowerNV OPAL IMC driver");
> +MODULE_LICENSE("GPL");
> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
> index e0f856bfbfe8..85ea1296f030 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -14,6 +14,7 @@
> #include <linux/printk.h>
> #include <linux/types.h>
> #include <linux/of.h>
> +#include <linux/of_address.h>
> #include <linux/of_fdt.h>
> #include <linux/of_platform.h>
> #include <linux/interrupt.h>
> @@ -30,6 +31,7 @@
> #include <asm/opal.h>
> #include <asm/firmware.h>
> #include <asm/mce.h>
> +#include <asm/imc-pmu.h>
>
> #include "powernv.h"
>
> @@ -631,6 +633,15 @@ static void opal_pdev_init(const char *compatible)
> of_platform_device_create(np, NULL, NULL);
> }
>
> +static void opal_imc_init_dev(void)
> +{
> + struct device_node *np;
> +
> + np = of_find_compatible_node(NULL, NULL, IMC_DTB_COMPAT);
> + if (np)
> + of_platform_device_create(np, NULL, NULL);
> +}
> +
> static int kopald(void *unused)
> {
> unsigned long timeout = msecs_to_jiffies(opal_heartbeat) + 1;
> @@ -704,6 +715,9 @@ static int __init opal_init(void)
> /* Setup a heatbeat thread if requested by OPAL */
> opal_init_heartbeat();
>
> + /* Detect IMC pmu counters support and create PMUs */
> + opal_imc_init_dev();
> +
> /* Create leds platform devices */
> leds = of_find_node_by_path("/ibm,opal/leds");
> if (leds) {
> --
> 2.7.4

2017-04-04 01:49:00

by Daniel Axtens

[permalink] [raw]
Subject: Re: [PATCH v6 01/11] powerpc/powernv: Data structure and macros definitions

Hi,

> +#define IMC_MAX_CHIPS 32
> +#define IMC_MAX_PMUS 32
> +#define IMC_MAX_PMU_NAME_LEN 256
I've noticed this is used as both the maximum length for event names and
event value strings. Would another name suit better?

> +
> +#define IMC_NEST_MAX_PAGES 16
> +
> +#define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
> +#define IMC_DTB_NEST_COMPAT "ibm,imc-counters-nest"
> +
> +/*
> + * Structure to hold per chip specific memory address
> + * information for nest pmus. Nest Counter data are exported
> + * in per-chip reserved memory region by the PORE Engine.
> + */
> +struct perchip_nest_info {
> + u32 chip_id;
> + u64 pbase;
> + u64 vbase[IMC_NEST_MAX_PAGES];
> + u64 size;
> +};
> +
> +/*
> + * Place holder for nest pmu events and values.
> + */
> +struct imc_events {
> + char *ev_name;
> + char *ev_value;
> +};
> +
> +/*
> + * Device tree parser code detects IMC pmu support and
> + * registers new IMC pmus. This structure will
> + * hold the pmu functions and attrs for each imc pmu and
> + * will be referenced at the time of pmu registration.
> + */
> +struct imc_pmu {
> + struct pmu pmu;
> + int domain;
> + const struct attribute_group *attr_groups[4];
> +};
> +
> +/*
> + * Domains for IMC PMUs
> + */
> +#define IMC_DOMAIN_NEST 1
> +#define IMC_DOMAIN_UNKNOWN -1
> +
> +#endif /* PPC_POWERNV_IMC_PMU_DEF_H */
> --
> 2.7.4

2017-04-04 02:11:32

by Daniel Axtens

[permalink] [raw]
Subject: Re: [PATCH v6 04/11] powerpc/perf: Add event attribute and group to IMC pmus

Hi,

> Device tree IMC driver code parses the IMC units and their events. It
> passes the information to IMC pmu code which is placed in powerpc/perf
> as "imc-pmu.c".
>
> This patch creates only event attributes and attribute groups for the
> IMC pmus.
>
> Signed-off-by: Anju T Sudhakar <[email protected]>
> Signed-off-by: Hemant Kumar <[email protected]>
> Signed-off-by: Madhavan Srinivasan <[email protected]>
> ---
> arch/powerpc/perf/Makefile | 6 +-
> arch/powerpc/perf/imc-pmu.c | 97 +++++++++++++++++++++++++++++++
> arch/powerpc/platforms/powernv/opal-imc.c | 12 +++-
> 3 files changed, 112 insertions(+), 3 deletions(-)
> create mode 100644 arch/powerpc/perf/imc-pmu.c
>
> diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
> index 4d606b99a5cb..d0d1f04203c7 100644
> --- a/arch/powerpc/perf/Makefile
> +++ b/arch/powerpc/perf/Makefile
> @@ -2,10 +2,14 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
>
> obj-$(CONFIG_PERF_EVENTS) += callchain.o perf_regs.o
>
> +imc-$(CONFIG_PPC_POWERNV) += imc-pmu.o
> +
> obj-$(CONFIG_PPC_PERF_CTRS) += core-book3s.o bhrb.o
> obj64-$(CONFIG_PPC_PERF_CTRS) += power4-pmu.o ppc970-pmu.o power5-pmu.o \
> power5+-pmu.o power6-pmu.o power7-pmu.o \
> - isa207-common.o power8-pmu.o power9-pmu.o
> + isa207-common.o power8-pmu.o power9-pmu.o \
> + $(imc-y)

Hmm, this seems... obtuse. Will the IMC stuff fail to compile on
non-powerNV? Can we just link it in like we do with every other sort of
performance counter and let runtime detection do its thing?

> +
> obj32-$(CONFIG_PPC_PERF_CTRS) += mpc7450-pmu.o
>
> obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
> diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
> new file mode 100644
> index 000000000000..ec464c76b749
> --- /dev/null
> +++ b/arch/powerpc/perf/imc-pmu.c
> @@ -0,0 +1,97 @@
> +/*
> + * Nest Performance Monitor counter support.
> + *
> + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
> + * (C) 2016 Hemant K Shaw, IBM Corporation.
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +#include <linux/perf_event.h>
> +#include <linux/slab.h>
> +#include <asm/opal.h>
> +#include <asm/imc-pmu.h>
> +#include <asm/cputhreads.h>
> +#include <asm/smp.h>
> +#include <linux/string.h>
> +
> +struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
> +struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
> +
> +/* dev_str_attr : Populate event "name" and string "str" in attribute */
> +static struct attribute *dev_str_attr(const char *name, const char *str)
> +{
> + struct perf_pmu_events_attr *attr;
> +
> + attr = kzalloc(sizeof(*attr), GFP_KERNEL);
> +
> + sysfs_attr_init(&attr->attr.attr);
> +
> + attr->event_str = str;
> + attr->attr.attr.name = name;
> + attr->attr.attr.mode = 0444;
> + attr->attr.show = perf_event_sysfs_show;
> +
> + return &attr->attr.attr;
> +}
> +
> +/*
> + * update_events_in_group: Update the "events" information in an attr_group
> + * and assign the attr_group to the pmu "pmu".
> + */
> +static int update_events_in_group(struct imc_events *events,
> + int idx, struct imc_pmu *pmu)

So I've probably missed a key point in the earlier patches, but for the
life of me I cannot figure out what idx is supposed to represent.

> +{
> + struct attribute_group *attr_group;
> + struct attribute **attrs;
> + int i;
> +
> + /* Allocate memory for attribute group */
> + attr_group = kzalloc(sizeof(*attr_group), GFP_KERNEL);
> + if (!attr_group)
> + return -ENOMEM;
> +
> + /* Allocate memory for attributes */
> + attrs = kzalloc((sizeof(struct attribute *) * (idx + 1)), GFP_KERNEL);

Also, idx is an int, so:
- things will go wrong if idx is -1
- things will go very wrong if idx is -2
- do you need to do some sort of validation to make sure it's within
bounds? If not in this function, what function ensures the necessary
'sanity check' preconditions for idx?

> + if (!attrs) {
> + kfree(attr_group);
> + return -ENOMEM;
> + }
> +
> + attr_group->name = "events";
> + attr_group->attrs = attrs;
> + for (i = 0; i < idx; i++, events++) {
> + attrs[i] = dev_str_attr((char *)events->ev_name,
> + (char *)events->ev_value);
> + }
> +
> + pmu->attr_groups[0] = attr_group;

Again, I may have missed something, but what's special about group 0?
Patch 1 lets you have up to 4 groups - are they used elsewhere?

> + return 0;
> +}
> +
> +/*
> + * init_imc_pmu : Setup the IMC pmu device in "pmu_ptr" and its events
> + * "events".
> + * Setup the cpu mask information for these pmus and setup the state machine
> + * hotplug notifiers as well.

I cannot figure out how this function sets cpu mask information or sets
up hotplug notifiers. Is this done in a later patch? (looking at the
subject lines - perhaps patch 6?)

> + */
> +int init_imc_pmu(struct imc_events *events, int idx,
> + struct imc_pmu *pmu_ptr)

Should this be marked __init, or can it be called in a hotplug path?

> +{
> + int ret = -ENODEV;
> +
> + ret = update_events_in_group(events, idx, pmu_ptr);
> + if (ret)
> + goto err_free;
> +
> + return 0;
> +
> +err_free:
> + /* Only free the attr_groups which are dynamically allocated */
> + if (pmu_ptr->attr_groups[0]) {
> + kfree(pmu_ptr->attr_groups[0]->attrs);
> + kfree(pmu_ptr->attr_groups[0]);
> + }
> +
> + return ret;
> +}
> diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
> index ba0203e669c0..73c46703c2af 100644
> --- a/arch/powerpc/platforms/powernv/opal-imc.c
> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
> @@ -32,8 +32,11 @@
> #include <asm/cputable.h>
> #include <asm/imc-pmu.h>
>
> -struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
> -struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
> +extern struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
> +extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];

Why are these definitions being moved?
If they're still needed in this file, should the extern lines live in a
header file?

> +
> +extern int init_imc_pmu(struct imc_events *events,
> + int idx, struct imc_pmu *pmu_ptr);
>
> static int imc_event_info(char *name, struct imc_events *events)
> {
> @@ -379,6 +382,11 @@ static int imc_pmu_create(struct device_node *parent, int pmu_index)
> idx += ret;
> }
>
> + ret = init_imc_pmu(events, idx, pmu_ptr);
> + if (ret) {
> + pr_err("IMC PMU %s Register failed\n", pmu_ptr->pmu.name);
> + goto free_events;
> + }
> return 0;
>
> free_events:
> --
> 2.7.4

Regards,
Daniel

2017-04-04 03:55:35

by Daniel Axtens

[permalink] [raw]
Subject: Re: [PATCH v6 05/11] powerpc/perf: Generic imc pmu event functions

Hi,

So a major complaint I have is that you're changing prototypes of
functions from earlier patches.

This makes my life a lot harder: I get my head around what a function is
and does and then suddenly the prototype changes, the behaviour changes,
and I have to re-evaluate everything I thought I knew about the
function. The diffs are also usually quite unhelpful.

It would be far better, from my point of view, to squash related commits
together. You're adding a large-ish driver: we might as well review one
large, complete commit rather than several smaller incomplete commits.

There are a lot of interrelated benefits to this - just from the patches
I've reviewed so far, if there were fewer, larger commits then:

- I would only have to read a function once to get a full picture of
what it does

- comments in function headers wouldn't get out of sync with function
bodies

- there would be less movement of variables from file to file

- earlier comments wouldn't be invalidated by things I learn later. For
example I suggested different names for imc_event_info_{str,val}
based on their behaviour when first added in patch 3. Here you change
the behaviour of imc_event_info_val - it would have been helpful to
see the fuller picture when I was first reviewing as I would have
been able to make more helpful suggestions.

and so on.

Anyway, further comments in line.

> From: Hemant Kumar <[email protected]>
>
> Since, the IMC counters' data are periodically fed to a memory location,
> the functions to read/update, start/stop, add/del can be generic and can
> be used by all IMC PMU units.
>
> This patch adds a set of generic imc pmu related event functions to be
> used by each imc pmu unit. Add code to setup format attribute and to
> register imc pmus. Add a event_init function for nest_imc events.
>
> Signed-off-by: Anju T Sudhakar <[email protected]>
> Signed-off-by: Hemant Kumar <[email protected]>
> Signed-off-by: Madhavan Srinivasan <[email protected]>
> ---
> arch/powerpc/include/asm/imc-pmu.h | 1 +
> arch/powerpc/perf/imc-pmu.c | 137 ++++++++++++++++++++++++++++++
> arch/powerpc/platforms/powernv/opal-imc.c | 30 ++++++-
> 3 files changed, 164 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
> index a3d4f1bf9492..e13f51047522 100644
> --- a/arch/powerpc/include/asm/imc-pmu.h
> +++ b/arch/powerpc/include/asm/imc-pmu.h
> @@ -65,4 +65,5 @@ struct imc_pmu {
> #define IMC_DOMAIN_NEST 1
> #define IMC_DOMAIN_UNKNOWN -1
>
> +int imc_get_domain(struct device_node *pmu_dev);
> #endif /* PPC_POWERNV_IMC_PMU_DEF_H */
> diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
> index ec464c76b749..bd5bf66bd920 100644
> --- a/arch/powerpc/perf/imc-pmu.c
> +++ b/arch/powerpc/perf/imc-pmu.c
> @@ -18,6 +18,132 @@
> struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
> struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
>
> +/* Needed for sanity check */
> +extern u64 nest_max_offset;
Should this be in a header file?

> +
> +PMU_FORMAT_ATTR(event, "config:0-20");
> +static struct attribute *imc_format_attrs[] = {
> + &format_attr_event.attr,
> + NULL,
> +};
> +
> +static struct attribute_group imc_format_group = {
> + .name = "format",
> + .attrs = imc_format_attrs,
> +};
> +
> +static int nest_imc_event_init(struct perf_event *event)
> +{
> + int chip_id;
> + u32 config = event->attr.config;
> + struct perchip_nest_info *pcni;
> +
> + if (event->attr.type != event->pmu->type)
> + return -ENOENT;
> +
> + /* Sampling not supported */
> + if (event->hw.sample_period)
> + return -EINVAL;
> +
> + /* unsupported modes and filters */
> + if (event->attr.exclude_user ||
> + event->attr.exclude_kernel ||
> + event->attr.exclude_hv ||
> + event->attr.exclude_idle ||
> + event->attr.exclude_host ||
> + event->attr.exclude_guest)
> + return -EINVAL;
> +
> + if (event->cpu < 0)
> + return -EINVAL;
> +
> + /* Sanity check for config (event offset) */
> + if (config > nest_max_offset)
> + return -EINVAL;
config is a u32, nest_max_offset is a u64. Is there a reason for this?
(Also, config is an unintuitive name for the local variable - the data
is stored in the attribute config space but the value represents an
offset into the reserved memory region.)

> +
> + chip_id = topology_physical_package_id(event->cpu);
> + pcni = &nest_perchip_info[chip_id];
> +
> + /*
> + * Memory for Nest HW counter data could be in multiple pages.
> + * Hence check and pick the right base page from the event offset,
> + * and then add to it.
> + */
> + event->hw.event_base = pcni->vbase[config/PAGE_SIZE] +
> + (config & ~PAGE_MASK);
> +
> + return 0;
> +}
> +
> +static void imc_read_counter(struct perf_event *event)
> +{
> + u64 *addr, data;
> +
> + addr = (u64 *)event->hw.event_base;
> + data = __be64_to_cpu(READ_ONCE(*addr));

It looks like this would trigger a sparse violation.
I would have thought addr is (__be64 *) and data is a u64.
Likewise all following uses of addr.

Have you checked this code for sparse warnings? I wrote a script to make
it a bit simpler: https://github.com/daxtens/smart-sparse-diff
Usage is simple - build your kernel without patches with C=2, apply
patches, build with C=2 again, use script to compare logs. (Be aware
that you will need a pretty recent version of python to run the script -
my apologies, I haven't got around to fixing that.)

> + local64_set(&event->hw.prev_count, data);
> +}
> +
> +static void imc_perf_event_update(struct perf_event *event)
> +{
> + u64 counter_prev, counter_new, final_count, *addr;
> +
> + addr = (u64 *)event->hw.event_base;
> + counter_prev = local64_read(&event->hw.prev_count);
> + counter_new = __be64_to_cpu(READ_ONCE(*addr));
> + final_count = counter_new - counter_prev;
> +
> + local64_set(&event->hw.prev_count, counter_new);
> + local64_add(final_count, &event->count);
> +}
> +
> +static void imc_event_start(struct perf_event *event, int flags)
> +{
> + /*
> + * In Memory Counters are free flowing counters. HW or the microcode
> + * keeps adding to the counter offset in memory. To get event
> + * counter value, we snapshot the value here and we calculate
> + * delta at later point.
> + */
> + imc_read_counter(event);
> +}
> +
> +static void imc_event_stop(struct perf_event *event, int flags)
> +{
> + /*
> + * Take a snapshot and calculate the delta and update
> + * the event counter values.
> + */
> + imc_perf_event_update(event);
> +}

These functions confused me for a bit. I think I mostly understand them
now, but I have a few questions:

0) everything deals with u64s. What happens when an in-memory value
overflows? I assume it all works, but there's just a little bit too
much modular arithmetic for me to be comfortable.

1) shouldn't imc_event_start() set event->count to 0? Or is this done
implicitly somewhere?

2) I took quite a while to get understand imc_event_update(), largely
because of the use of final_count as a variable name. If I
understand correctly, final_count represents the delta between the
previous value and the current value. And the logic of _update is:
- read the previous value from local storage
- read the current value from reserved memory
- calculate the delta
- save the measured value to local storage as the new prev_value
- add the delta to the accumulated event count

I think update is free from accounting errors - I don't think it will
ever miss events that occur during calculation, but it took me a while
to get there. I'm still not convinced it wouldn't be better to do this
instead:

- on start, save the current value to local storage
- on update:
* read the local storage
* read the current value from hw
* _set_ event->count to (current value - local storage)
* do _not_ save back the current value to local storage

This saves a bunch of writes to local storage; not sure if there's any
reason it would be a worse design. I'm not 100% convinced of its
behaviour in the case of a long-running high-volume counter where the
count exceeds 2^64, but I think it would share any such issues with the
current design.

I'm not saying you have to adopt this design, I was just wondering if
you'd considered it and if there was a reason not to do it.

> +
> +static int imc_event_add(struct perf_event *event, int flags)
> +{
> + if (flags & PERF_EF_START)
> + imc_event_start(event, flags);
> +
> + return 0;
> +}
> +
> +/* update_pmu_ops : Populate the appropriate operations for "pmu" */
> +static int update_pmu_ops(struct imc_pmu *pmu)
> +{
> + if (!pmu)
> + return -EINVAL;
> +
> + pmu->pmu.task_ctx_nr = perf_invalid_context;
> + pmu->pmu.event_init = nest_imc_event_init;
> + pmu->pmu.add = imc_event_add;
> + pmu->pmu.del = imc_event_stop;
> + pmu->pmu.start = imc_event_start;
> + pmu->pmu.stop = imc_event_stop;
> + pmu->pmu.read = imc_perf_event_update;
> + pmu->attr_groups[1] = &imc_format_group;
> + pmu->pmu.attr_groups = pmu->attr_groups;
> +
> + return 0;
> +}
> +
> /* dev_str_attr : Populate event "name" and string "str" in attribute */
> static struct attribute *dev_str_attr(const char *name, const char *str)
> {
> @@ -84,6 +210,17 @@ int init_imc_pmu(struct imc_events *events, int idx,
> if (ret)
> goto err_free;
>
> + ret = update_pmu_ops(pmu_ptr);
> + if (ret)
> + goto err_free;
> +
> + ret = perf_pmu_register(&pmu_ptr->pmu, pmu_ptr->pmu.name, -1);
> + if (ret)
> + goto err_free;
> +
> + pr_info("%s performance monitor hardware support registered\n",
> + pmu_ptr->pmu.name);
Do we need to be (or should we be) this chatty?

> +
> return 0;
>
> err_free:
> diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
> index 73c46703c2af..a98678266b0d 100644
> --- a/arch/powerpc/platforms/powernv/opal-imc.c
> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
> @@ -37,6 +37,7 @@ extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
>
> extern int init_imc_pmu(struct imc_events *events,
> int idx, struct imc_pmu *pmu_ptr);
> +u64 nest_max_offset;
>
> static int imc_event_info(char *name, struct imc_events *events)
> {
> @@ -69,8 +70,25 @@ static int imc_event_info_str(struct property *pp, char *name,
> return 0;
> }
>
> +/*
> + * Updates the maximum offset for an event in the pmu with domain
> + * "pmu_domain". Right now, only nest domain is supported.
> + */
> +static void update_max_value(u32 value, int pmu_domain)
> +{
> + switch (pmu_domain) {
> + case IMC_DOMAIN_NEST:
> + if (nest_max_offset < value)
> + nest_max_offset = value;
> + break;
> + default:
> + /* Unknown domain, return */
> + return;
Should this get a warning? WARN_ON_ONCE might be a bit much but maybe
pr_warn_ratelimited? pr_debug perhaps? It seems like something we should
know about as it would indicate a programming error...
> + }
> +}
> +
> static int imc_event_info_val(char *name, u32 val,
> - struct imc_events *events)
> + struct imc_events *events, int pmu_domain)
> {
> int ret;
>
> @@ -78,6 +96,7 @@ static int imc_event_info_val(char *name, u32 val,
> if (ret)
> return ret;
> sprintf(events->ev_value, "event=0x%x", val);
> + update_max_value(val, pmu_domain);
>

I'm not sure this is the best place to call update_max_value. It's quite
an unexpected side-effect of a function which otherwise creates an event.

> return 0;
> }
> @@ -114,7 +133,8 @@ static int imc_events_node_parser(struct device_node *dev,
> struct property *event_scale,
> struct property *event_unit,
> struct property *name_prefix,
> - u32 reg)
> + u32 reg,
> + int pmu_domain)
> {
> struct property *name, *pp;
> char *ev_name;
> @@ -159,7 +179,8 @@ static int imc_events_node_parser(struct device_node *dev,
> if (strncmp(pp->name, "reg", 3) == 0) {
> of_property_read_u32(dev, pp->name, &val);
> val += reg;
> - ret = imc_event_info_val(ev_name, val, &events[idx]);
> + ret = imc_event_info_val(ev_name, val, &events[idx],
> + pmu_domain);
I would just put the call to update_max_value here.

> if (ret) {
> kfree(events[idx].ev_name);
> kfree(events[idx].ev_value);
> @@ -366,7 +387,8 @@ static int imc_pmu_create(struct device_node *parent, int pmu_index)
> /* Loop through event nodes */
> for_each_child_of_node(dir, ev_node) {
> ret = imc_events_node_parser(ev_node, &events[idx], scale_pp,
> - unit_pp, name_prefix, reg);
> + unit_pp, name_prefix, reg,
> + pmu_ptr->domain);
> if (ret < 0) {
> /* Unable to parse this event */
> if (ret == -ENOMEM)
> --
> 2.7.4

Regards,
Daniel

2017-04-04 04:34:02

by Daniel Axtens

[permalink] [raw]
Subject: Re: [PATCH v6 06/11] powerpc/perf: IMC pmu cpumask and cpu hotplug support

Madhavan Srinivasan <[email protected]> writes:

> From: Hemant Kumar <[email protected]>
>
> Adds cpumask attribute to be used by each IMC pmu. Only one cpu (any
> online CPU) from each chip for nest PMUs is designated to read counters.
>
> On CPU hotplug, dying CPU is checked to see whether it is one of the
> designated cpus, if yes, next online cpu from the same chip (for nest
> units) is designated as new cpu to read counters. For this purpose, we
> introduce a new state : CPUHP_AP_PERF_POWERPC_NEST_ONLINE.
>
> Signed-off-by: Anju T Sudhakar <[email protected]>
> Signed-off-by: Hemant Kumar <[email protected]>
> Signed-off-by: Madhavan Srinivasan <[email protected]>
> ---
> arch/powerpc/include/asm/opal-api.h | 13 ++-
> arch/powerpc/include/asm/opal.h | 3 +
> arch/powerpc/perf/imc-pmu.c | 155 ++++++++++++++++++++++++-
> arch/powerpc/platforms/powernv/opal-wrappers.S | 1 +
> include/linux/cpuhotplug.h | 1 +
> 5 files changed, 171 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
> index a0aa285869b5..23fc51e9d71d 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -168,7 +168,8 @@
> #define OPAL_INT_SET_MFRR 125
> #define OPAL_PCI_TCE_KILL 126
> #define OPAL_NMMU_SET_PTCR 127
> -#define OPAL_LAST 127
> +#define OPAL_NEST_IMC_COUNTERS_CONTROL 149

A couple of things:

- why is this 149 rather than 128?

- CONTROL seems like it's inviting a very broad and under-specified
API. I notice most of the other opal calls are reasonably narrow:
often defining two calls for get/set rather than a single control
call.

I don't have cycles to review the OPAL/skiboot side and I'm very much
open to being convinced here, I just want to avoid the situation where
we're passing around complicated data structures in a way that is
difficult to synchronise if we could avoid it.

> +#define OPAL_LAST 149
>
> /* Device tree flags */
>
> @@ -928,6 +929,16 @@ enum {
> OPAL_PCI_TCE_KILL_ALL,
> };
>
> +/* Operation argument to OPAL_NEST_IMC_COUNTERS_CONTROL */
> +enum {
> + OPAL_NEST_IMC_PRODUCTION_MODE = 1,
> +};
If there's only one mode, why do we need to specify it? As far as I can
tell no extra mode is added later in the series...

... looking at the opal-side patches, would it be better to just specify
the modes you intend to support in future here, and rely on opal
returning OPAL_PARAMETER when they're not supported?
> +
> +enum {
> + OPAL_NEST_IMC_STOP,
> + OPAL_NEST_IMC_START,
> +};
Again, would it be better to have a stop/start call rather than a
generic control call?

Also, the enum values (STOP=0, START=1) seem to be switched vs OPAL,
where https://patchwork.ozlabs.org/patch/746292/ has STOP=1, START=0. Or
am I missing something?

> +
> #endif /* __ASSEMBLY__ */
>
> #endif /* __OPAL_API_H */
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 1ff03a6da76e..d93d08204243 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -227,6 +227,9 @@ int64_t opal_pci_tce_kill(uint64_t phb_id, uint32_t kill_type,
> uint64_t dma_addr, uint32_t npages);
> int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr);
>
> +int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1,
> + uint64_t value2, uint64_t value3);
> +

This prototype does not make me feel better about the state of the API.

Looking at the opal side, I would be much more comfortable if you could
nail down what you intend to have these support now, even if OPAL bails
with OPAL_PARAMETER if they're specified.

Also I think u64 and u32 are preferred, although I see the rest of the
file uses the long form.

> /* Internal functions */
> extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
> int depth, void *data);
> diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
> index bd5bf66bd920..b28835b861b3 100644
> --- a/arch/powerpc/perf/imc-pmu.c
> +++ b/arch/powerpc/perf/imc-pmu.c
> @@ -17,6 +17,7 @@
>
> struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
> struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
> +static cpumask_t nest_imc_cpumask;
>
> /* Needed for sanity check */
> extern u64 nest_max_offset;
> @@ -32,6 +33,152 @@ static struct attribute_group imc_format_group = {
> .attrs = imc_format_attrs,
> };
>
> +/* Get the cpumask printed to a buffer "buf" */
> +static ssize_t imc_pmu_cpumask_get_attr(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + cpumask_t *active_mask;
> +
> + active_mask = &nest_imc_cpumask;
> + return cpumap_print_to_pagebuf(true, buf, active_mask);
> +}
> +
> +static DEVICE_ATTR(cpumask, S_IRUGO, imc_pmu_cpumask_get_attr, NULL);
> +
Out of curiosity, how do you know imc_pmu_cpumask_get_attr is called
with a large enough buffer? (How does cpumap_print_to_pagebuf know it's
not overrunning the buffer when it's not taking a length parameter? I
realise this is not your code but it's midly terrifying.)

> +static struct attribute *imc_pmu_cpumask_attrs[] = {
> + &dev_attr_cpumask.attr,
> + NULL,
> +};
> +
> +static struct attribute_group imc_pmu_cpumask_attr_group = {
> + .attrs = imc_pmu_cpumask_attrs,
> +};
> +
> +/*
> + * nest_init : Initializes the nest imc engine for the current chip.
> + */
> +static void nest_init(int *loc)
> +{
> + int rc;
> +
> + rc = opal_nest_imc_counters_control(OPAL_NEST_IMC_PRODUCTION_MODE,
> + OPAL_NEST_IMC_START, 0, 0);
So OPAL is supposed to figure out which CPU to start based on the CPU
that is currently running when you call into OPAL? I assume that's fine,
but you probably want to document that. (Perhaps on the OPAL side - I
spent a while looking for a parameter indicating the chip to work on!)

> + if (rc)
> + loc[smp_processor_id()] = 1;

loc is not a very helpful name here. Looking further down the patch it
seems you pass cpus_opal_rc in as a parameter; maybe something like
cpu_rcs or something?

> +}
> +
> +static void nest_change_cpu_context(int old_cpu, int new_cpu)
> +{
> + int i;
> +
> + for (i = 0;
> + (per_nest_pmu_arr[i] != NULL) && (i < IMC_MAX_PMUS); i++)
> + perf_pmu_migrate_context(&per_nest_pmu_arr[i]->pmu,
> + old_cpu, new_cpu);
> +}

This may be an incredibly misinformed question, but if you have 2
sockets, will this migrate things from both nests onto one? (I'm happy
to take you on trust if you just want to say 'no' without a detailed
explanation.)

> +
> +static int ppc_nest_imc_cpu_online(unsigned int cpu)
> +{
> + int nid;
> + const struct cpumask *l_cpumask;
> + struct cpumask tmp_mask;
> +
> + /* Find the cpumask of this node */
> + nid = cpu_to_node(cpu);
> + l_cpumask = cpumask_of_node(nid);
> +
> + /*
> + * If any of the cpu from this node is already present in the mask,
> + * just return, if not, then set this cpu in the mask.
> + */
> + if (!cpumask_and(&tmp_mask, l_cpumask, &nest_imc_cpumask)) {
> + cpumask_set_cpu(cpu, &nest_imc_cpumask);
> + return 0;
> + }
> +
> + return 0;
> +}
> +
> +static int ppc_nest_imc_cpu_offline(unsigned int cpu)
> +{
> + int nid, target = -1;
> + const struct cpumask *l_cpumask;
> +
> + /*
> + * Check in the designated list for this cpu. Dont bother
> + * if not one of them.
> + */
> + if (!cpumask_test_and_clear_cpu(cpu, &nest_imc_cpumask))
> + return 0;
> +
> + /*
> + * Now that this cpu is one of the designated,
> + * find a next cpu a) which is online and b) in same chip.
> + */
> + nid = cpu_to_node(cpu);
> + l_cpumask = cpumask_of_node(nid);
> + target = cpumask_next(cpu, l_cpumask);
> +
> + /*
> + * Update the cpumask with the target cpu and
> + * migrate the context if needed
> + */
> + if (target >= 0 && target < nr_cpu_ids) {
> + cpumask_set_cpu(target, &nest_imc_cpumask);
> + nest_change_cpu_context(cpu, target);
> + }
> + return 0;
> +}
> +
> +static int nest_pmu_cpumask_init(void)
> +{
> + const struct cpumask *l_cpumask;
> + int cpu, nid;
> + int *cpus_opal_rc;
> +
> + if (!cpumask_empty(&nest_imc_cpumask))
> + return 0;
> +
> + /*
> + * Memory for OPAL call return value.
> + */
> + cpus_opal_rc = kzalloc((sizeof(int) * nr_cpu_ids), GFP_KERNEL);
> + if (!cpus_opal_rc)
> + goto fail;
> +
> + /*
> + * Nest PMUs are per-chip counters. So designate a cpu
> + * from each chip for counter collection.
> + */
> + for_each_online_node(nid) {
> + l_cpumask = cpumask_of_node(nid);
> +
> + /* designate first online cpu in this node */
> + cpu = cpumask_first(l_cpumask);
> + cpumask_set_cpu(cpu, &nest_imc_cpumask);
> + }
> +
> + /* Initialize Nest PMUs in each node using designated cpus */
> + on_each_cpu_mask(&nest_imc_cpumask, (smp_call_func_t)nest_init,
> + (void *)cpus_opal_rc, 1);
> +
> + /* Check return value array for any OPAL call failure */
> + for_each_cpu(cpu, &nest_imc_cpumask) {
> + if (cpus_opal_rc[cpu])
> + goto fail;
> + }
> +
> + cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_NEST_ONLINE,
> + "POWER_NEST_IMC_ONLINE",
> + ppc_nest_imc_cpu_online,
> + ppc_nest_imc_cpu_offline);
> +
Shouldn't you be freeing cpus_opal_rc here?
> + return 0;
> +
> +fail:
Also, shouldn't you be freeing cpus_opal_rc here? (if allocated)

> + return -ENODEV;
> +}
> +
> static int nest_imc_event_init(struct perf_event *event)
> {
> int chip_id;
> @@ -70,7 +217,7 @@ static int nest_imc_event_init(struct perf_event *event)
> * and then add to it.
> */
> event->hw.event_base = pcni->vbase[config/PAGE_SIZE] +
> - (config & ~PAGE_MASK);
> + (config & ~PAGE_MASK);

This hunk should either be dropped or squashed.

>
> return 0;
> }
> @@ -139,6 +286,7 @@ static int update_pmu_ops(struct imc_pmu *pmu)
> pmu->pmu.stop = imc_event_stop;
> pmu->pmu.read = imc_perf_event_update;
> pmu->attr_groups[1] = &imc_format_group;
> + pmu->attr_groups[2] = &imc_pmu_cpumask_attr_group;
> pmu->pmu.attr_groups = pmu->attr_groups;
>
> return 0;
> @@ -206,6 +354,11 @@ int init_imc_pmu(struct imc_events *events, int idx,
> {
> int ret = -ENODEV;
>
> + /* Add cpumask and register for hotplug notification */
> + ret = nest_pmu_cpumask_init();
If cpumask_init() also registers for hotplug, perhaps it should have a
more inclusive name.

Or, maybe better to move hotplug setup out of cpumask_init.
> + if (ret)
> + return ret;
> +
> ret = update_events_in_group(events, idx, pmu_ptr);
> if (ret)
> goto err_free;
> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
> index da8a0f7a035c..b7208d8e6cc0 100644
> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
> @@ -301,3 +301,4 @@ OPAL_CALL(opal_int_eoi, OPAL_INT_EOI);
> OPAL_CALL(opal_int_set_mfrr, OPAL_INT_SET_MFRR);
> OPAL_CALL(opal_pci_tce_kill, OPAL_PCI_TCE_KILL);
> OPAL_CALL(opal_nmmu_set_ptcr, OPAL_NMMU_SET_PTCR);
> +OPAL_CALL(opal_nest_imc_counters_control, OPAL_NEST_IMC_COUNTERS_CONTROL);
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 62d240e962f0..cfb0cedc72af 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -136,6 +136,7 @@ enum cpuhp_state {
> CPUHP_AP_PERF_ARM_CCI_ONLINE,
> CPUHP_AP_PERF_ARM_CCN_ONLINE,
> CPUHP_AP_PERF_ARM_L2X0_ONLINE,
> + CPUHP_AP_PERF_POWERPC_NEST_ONLINE,
This powerpc enum is in the middle of a set of ARM ones. Should it be
after all the arm ones?

If it's an enum, hopefully order doesn't matter...

> CPUHP_AP_PERF_ARM_QCOM_L2_ONLINE,
> CPUHP_AP_WORKQUEUE_ONLINE,
> CPUHP_AP_RCUTREE_ONLINE,
> --
> 2.7.4

Regards,
Daniel

2017-04-05 04:22:38

by Madhavan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v6 01/11] powerpc/powernv: Data structure and macros definitions



On Tuesday 04 April 2017 07:18 AM, Daniel Axtens wrote:
> Hi,
>
>> +#define IMC_MAX_CHIPS 32
>> +#define IMC_MAX_PMUS 32
>> +#define IMC_MAX_PMU_NAME_LEN 256
> I've noticed this is used as both the maximum length for event names and
> event value strings. Would another name suit better?

This is used in the value string length comparison also. So yes, will
change the name to suit better.

Thanks for review
Maddy

>
>> +
>> +#define IMC_NEST_MAX_PAGES 16
>> +
>> +#define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
>> +#define IMC_DTB_NEST_COMPAT "ibm,imc-counters-nest"
>> +
>> +/*
>> + * Structure to hold per chip specific memory address
>> + * information for nest pmus. Nest Counter data are exported
>> + * in per-chip reserved memory region by the PORE Engine.
>> + */
>> +struct perchip_nest_info {
>> + u32 chip_id;
>> + u64 pbase;
>> + u64 vbase[IMC_NEST_MAX_PAGES];
>> + u64 size;
>> +};
>> +
>> +/*
>> + * Place holder for nest pmu events and values.
>> + */
>> +struct imc_events {
>> + char *ev_name;
>> + char *ev_value;
>> +};
>> +
>> +/*
>> + * Device tree parser code detects IMC pmu support and
>> + * registers new IMC pmus. This structure will
>> + * hold the pmu functions and attrs for each imc pmu and
>> + * will be referenced at the time of pmu registration.
>> + */
>> +struct imc_pmu {
>> + struct pmu pmu;
>> + int domain;
>> + const struct attribute_group *attr_groups[4];
>> +};
>> +
>> +/*
>> + * Domains for IMC PMUs
>> + */
>> +#define IMC_DOMAIN_NEST 1
>> +#define IMC_DOMAIN_UNKNOWN -1
>> +
>> +#endif /* PPC_POWERNV_IMC_PMU_DEF_H */
>> --
>> 2.7.4

2017-04-05 06:35:29

by Madhavan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v6 02/11] powerpc/powernv: Autoload IMC device driver module



On Tuesday 04 April 2017 06:28 AM, Daniel Axtens wrote:
> Hi all,
>
> I'm trying to get my head around these patches - at this point I'm just
> doing a first pass, so I may have more substantive structural comments
> later on. In the mean time - here are some minor C nits:
>
>> + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
>> + * (C) 2016 Hemant K Shaw, IBM Corporation.
> Should these be bumped to 2017?

Facepalm. my bad. Will fix it.
>> +
>> + do {
>> + pages = PAGE_SIZE * i;
>> + pcni->vbase[i++] = (u64)phys_to_virt(pcni->pbase +
>> + pages);
>> + } while (i < (pcni->size / PAGE_SIZE));
>> + }
> I had to scroll back up to the top of this function to make sure I
> understood what this loop does. Would it be better to write it as:
>
> for (i = 0; i < (pcni->size / PAGE_SIZE); i++) {
> pages = PAGE_SIZE * i;
> pcni->vbase[i] = (u64)....
> }
Idea is to map all of the nest counter area since event
offset could be anywhere within this region.
Will document that here.

>
> And, just checking - this is expected to work on both 4 and 64kB pages?
Yes. thats the intended. That said, i need to fix the
IMC_NEST_MAX_PAGES value for 4K page size.
Reason being, there was a recent change in the
size of memory allocated for nest counters in the
HOMER region for the microcode.

>
>> +
>> + return 0;
>> +err:
>> + return -ENODEV;
> You're not releasing any resources here - would it be better to just
> replace the gotos with this return? I haven't checked to see if you
> change the function later on to allocate memory - if so please ignore :)

We check in multiple places in the function and return on fail.
Thats why we made it as a generic return with goto.


>> +}
>> +
>> +static const struct of_device_id opal_imc_match[] = {
>> + { .compatible = IMC_DTB_COMPAT },
>> + {},
>> +};
>> +
>> +static struct platform_driver opal_imc_driver = {
>> + .driver = {
>> + .name = "opal-imc-counters",
>> + .of_match_table = opal_imc_match,
>> + },
>> + .probe = opal_imc_counters_probe,
>> +};
>> +
>> +MODULE_DEVICE_TABLE(of, opal_imc_match);
>> +module_platform_driver(opal_imc_driver);
>> +MODULE_DESCRIPTION("PowerNV OPAL IMC driver");
>> +MODULE_LICENSE("GPL");
>> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
>> index e0f856bfbfe8..85ea1296f030 100644
>> --- a/arch/powerpc/platforms/powernv/opal.c
>> +++ b/arch/powerpc/platforms/powernv/opal.c
>> @@ -14,6 +14,7 @@
>> #include <linux/printk.h>
>> #include <linux/types.h>
>> #include <linux/of.h>
>> +#include <linux/of_address.h>
>> #include <linux/of_fdt.h>
>> #include <linux/of_platform.h>
>> #include <linux/interrupt.h>
>> @@ -30,6 +31,7 @@
>> #include <asm/opal.h>
>> #include <asm/firmware.h>
>> #include <asm/mce.h>
>> +#include <asm/imc-pmu.h>
>>
>> #include "powernv.h"
>>
>> @@ -631,6 +633,15 @@ static void opal_pdev_init(const char *compatible)
>> of_platform_device_create(np, NULL, NULL);
>> }
>>
>> +static void opal_imc_init_dev(void)
>> +{
>> + struct device_node *np;
>> +
>> + np = of_find_compatible_node(NULL, NULL, IMC_DTB_COMPAT);
>> + if (np)
>> + of_platform_device_create(np, NULL, NULL);
>> +}
> Should this function be tagged __init?

Yes. Thats right. Will make the changes.

Thanks for review
Maddy

>
>> +
>> static int kopald(void *unused)
>> {
>> unsigned long timeout = msecs_to_jiffies(opal_heartbeat) + 1;
>> @@ -704,6 +715,9 @@ static int __init opal_init(void)
>> /* Setup a heatbeat thread if requested by OPAL */
>> opal_init_heartbeat();
>>
>> + /* Detect IMC pmu counters support and create PMUs */
>> + opal_imc_init_dev();
>> +
>> /* Create leds platform devices */
>> leds = of_find_node_by_path("/ibm,opal/leds");
>> if (leds) {
>> --
>> 2.7.4

2017-04-05 06:37:51

by Madhavan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v6 02/11] powerpc/powernv: Autoload IMC device driver module



On Tuesday 04 April 2017 07:18 AM, Daniel Axtens wrote:
> Hi,
>
>> + do {
>> + pages = PAGE_SIZE * i;
>> + pcni->vbase[i++] = (u64)phys_to_virt(pcni->pbase +
>> + pages);
>> + } while (i < (pcni->size / PAGE_SIZE));
> I also just noticed that there's no check here against
> IMC_NEST_MAX_PAGES: should that be inserted? (If for no other reason
> than to stop every static analysis tool complaining!)
Yes make sense. Can add that in the next version.

Thanks for review
Maddy


> Daniel
>
>> + }
>> +
>> + return 0;
>> +err:
>> + return -ENODEV;
>> +}
>> +
>> +static const struct of_device_id opal_imc_match[] = {
>> + { .compatible = IMC_DTB_COMPAT },
>> + {},
>> +};
>> +
>> +static struct platform_driver opal_imc_driver = {
>> + .driver = {
>> + .name = "opal-imc-counters",
>> + .of_match_table = opal_imc_match,
>> + },
>> + .probe = opal_imc_counters_probe,
>> +};
>> +
>> +MODULE_DEVICE_TABLE(of, opal_imc_match);
>> +module_platform_driver(opal_imc_driver);
>> +MODULE_DESCRIPTION("PowerNV OPAL IMC driver");
>> +MODULE_LICENSE("GPL");
>> diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
>> index e0f856bfbfe8..85ea1296f030 100644
>> --- a/arch/powerpc/platforms/powernv/opal.c
>> +++ b/arch/powerpc/platforms/powernv/opal.c
>> @@ -14,6 +14,7 @@
>> #include <linux/printk.h>
>> #include <linux/types.h>
>> #include <linux/of.h>
>> +#include <linux/of_address.h>
>> #include <linux/of_fdt.h>
>> #include <linux/of_platform.h>
>> #include <linux/interrupt.h>
>> @@ -30,6 +31,7 @@
>> #include <asm/opal.h>
>> #include <asm/firmware.h>
>> #include <asm/mce.h>
>> +#include <asm/imc-pmu.h>
>>
>> #include "powernv.h"
>>
>> @@ -631,6 +633,15 @@ static void opal_pdev_init(const char *compatible)
>> of_platform_device_create(np, NULL, NULL);
>> }
>>
>> +static void opal_imc_init_dev(void)
>> +{
>> + struct device_node *np;
>> +
>> + np = of_find_compatible_node(NULL, NULL, IMC_DTB_COMPAT);
>> + if (np)
>> + of_platform_device_create(np, NULL, NULL);
>> +}
>> +
>> static int kopald(void *unused)
>> {
>> unsigned long timeout = msecs_to_jiffies(opal_heartbeat) + 1;
>> @@ -704,6 +715,9 @@ static int __init opal_init(void)
>> /* Setup a heatbeat thread if requested by OPAL */
>> opal_init_heartbeat();
>>
>> + /* Detect IMC pmu counters support and create PMUs */
>> + opal_imc_init_dev();
>> +
>> /* Create leds platform devices */
>> leds = of_find_node_by_path("/ibm,opal/leds");
>> if (leds) {
>> --
>> 2.7.4

2017-04-05 12:30:18

by Madhavan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v6 03/11] powerpc/powernv: Detect supported IMC units and its events



On Tuesday 04 April 2017 07:11 AM, Daniel Axtens wrote:
> Madhavan Srinivasan <[email protected]> writes:
>
>> From: Hemant Kumar <[email protected]>
>>
>> Parse device tree to detect IMC units. Traverse through each IMC unit
>> node to find supported events and corresponding unit/scale files (if any).
>>
>> Here is the DTS file for reference:
>>
>> https://github.com/open-power/ima-catalog/blob/master/81E00612.4E0100.dts
>>
>> The device tree for IMC counters starts at the node "imc-counters".
>> This node contains all the IMC PMU nodes and event nodes
>> for these IMC PMUs. The PMU nodes have an "events" property which has a
>> phandle value for the actual events node. The events are separated from
>> the PMU nodes to abstract out the common events. For example, PMU node
>> "mcs0", "mcs1" etc. will contain a pointer to "nest-mcs-events" since,
>> the events are common between these PMUs. These events have a different
>> prefix based on their relation to different PMUs, and hence, the PMU
>> nodes themselves contain an "events-prefix" property. The value for this
>> property concatenated to the event name, forms the actual event
>> name. Also, the PMU have a "reg" field as the base offset for the events
>> which belong to this PMU. This "reg" field is added to an event in the
>> "events" node, which gives us the location of the counter data. Kernel
>> code uses this offset as event configuration value.
> This is probably because I haven't looked at this area recently, but
> what do you mean by 'event configuration value'? If I understand
> correctly, you're saying something like "kernel code stores this
> calculated location in the event configuration data" - is that about
> right?
Yes. Thats right. So incase of IMC, event are stored in the
memory and each event has specific offset in the memory.
We are using this event offset as the "event configuration value".

Here is the example, consider event
/sys/devices/nest_mcs0/events# cat PM_MCS0_DOWN_128B_DATA_XFER
event=0x198
#

>
> Apologies if this is a dumb question.
>
>> Device tree parser code also looks for scale/unit property in the event
>> node and passes on the value as an event attr for perf interface to use
>> in the post processing by the perf tool. Some PMUs may have common scale
>> and unit properties which implies that all events supported by this PMU
>> inherit the scale and unit properties of the PMU itself. For those
>> events, we need to set the common unit and scale values.
>>
>> For failure to initialize any unit or any event, disable that unit and
>> continue setting up the rest of them.
>>
>> Signed-off-by: Hemant Kumar <[email protected]>
>> Signed-off-by: Anju T Sudhakar <[email protected]>
>> Signed-off-by: Madhavan Srinivasan <[email protected]>
>> ---
>> arch/powerpc/platforms/powernv/opal-imc.c | 386 ++++++++++++++++++++++++++++++
>> 1 file changed, 386 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
>> index c476d596c6a8..ba0203e669c0 100644
>> --- a/arch/powerpc/platforms/powernv/opal-imc.c
>> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
>> @@ -33,6 +33,388 @@
>> #include <asm/imc-pmu.h>
>>
>> struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
>> +struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
>> +
>> +static int imc_event_info(char *name, struct imc_events *events)
> This seems to be an allocation/init function: should that be reflected
> in the name?
>
>> +{
>> + char *buf;
>> +
>> + /* memory for content */
>> + buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
>> + if (!buf)
>> + return -ENOMEM;
>> +
>> + events->ev_name = name;
> We trust the name will never become a dangling pointer? We don't need to
> duplicate it?
Yes. Event names are from the DTS and can add a check
for it. Secondly we allocate memory for this in imc_events_node_parser()
and duplicate it.


>
> Should the pointer be const?
>
>> + events->ev_value = buf;
>> + return 0;
>> +}
>> +
>> +static int imc_event_info_str(struct property *pp, char *name,
>> + struct imc_events *events)
> This creates and sets an initial value based on a property, ...
>
>> +static int imc_event_info_val(char *name, u32 val,
>> + struct imc_events *events)
> ... this creates and sets based on a u32.
>
> Could you call them something that suggests their purpose a bit better?
> Maybe event_info_from_{property,val}?

Yes. Will fix the name.

>
>> +{
>> + int ret;
>> +
>> + ret = imc_event_info(name, events);
>> + if (ret)
>> + return ret;
>> + sprintf(events->ev_value, "event=0x%x", val);
>> +
>> + return 0;
>> +}
>> +
>> +static int set_event_property(struct property *pp, char *event_prop,
>> + struct imc_events *events, char *ev_name)
>> +{
>> + char *buf;
>> + int ret;
>> +
>> + buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
>> + if (!buf)
>> + return -ENOMEM;
>> +
>> + sprintf(buf, "%s.%s", ev_name, event_prop);
>> + ret = imc_event_info_str(pp, buf, events);
>> + if (ret) {
>> + kfree(events->ev_name);
>> + kfree(events->ev_value);
> How do you know the buffer for ev_value was successfully allocated? Can
> you be sure it is safe to free? (Consider imc_event_info returning -ENOMEM)

Nice catch. Right, cant assume it. Will rework this code.

>
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +/*
>> + * imc_events_node_parser: Parse the event node "dev" and assign the parsed
>> + * information to event "events".
>> + *
>> + * Parses the "reg" property of this event. "reg" gives us the event offset.
>> + * Also, parse the "scale" and "unit" properties, if any.
>> + */
>> +static int imc_events_node_parser(struct device_node *dev,
>> + struct imc_events *events,
>> + struct property *event_scale,
>> + struct property *event_unit,
>> + struct property *name_prefix,
>> + u32 reg)
>> +{
>> + struct property *name, *pp;
>> + char *ev_name;
>> + u32 val;
>> + int idx = 0, ret;
>> +
>> + if (!dev)
>> + return -EINVAL;
>> +
>> + /*
>> + * Loop through each property of an event node
>> + */
>> + name = of_find_property(dev, "event-name", NULL);
>> + if (!name)
>> + return -ENODEV;
>> +
>> + if (!name->value ||
>> + (strnlen(name->value, name->length) == name->length) ||
>> + (name->length > IMC_MAX_PMU_NAME_LEN))
>> + return -EINVAL;
>> +
>> + ev_name = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
>> + if (!ev_name)
>> + return -ENOMEM;
> I see you allocating memory here and in set_event_property for the
> name. Would it be better to move that into imc_event_info and make that
> function responsible for the whole set of allocations regarding the
> event?
No, I would prefer the current way. Reason being, incase of
allocation in set_event_property is only incase of scale and
unit property since perf tool expects scale file to be
exported as <eventname>.scale. And secondly memory
allocated in imc_event_info is to carry the event configuretion
value and it is not for event_name.


>
> (I'm happy to be talked out of this - it just seems like it would make
> it much easier to reason about the lifecycles of the memory allocations.)
>
>> +
>> + snprintf(ev_name, IMC_MAX_PMU_NAME_LEN, "%s%s",
>> + (char *)name_prefix->value,
>> + (char *)name->value);
>> +
>> + /*
>> + * Parse each property of this event node "dev". Property "reg" has
>> + * the offset which is assigned to the event name. Other properties
>> + * like "scale" and "unit" are assigned to event.scale and event.unit
>> + * accordingly.
>> + */
>> + for_each_property_of_node(dev, pp) {
>> + /*
>> + * If there is an issue in parsing a single property of
>> + * this event, we just clean up the buffers, but we still
>> + * continue to parse.
>> + */
>> + if (strncmp(pp->name, "reg", 3) == 0) {
>> + of_property_read_u32(dev, pp->name, &val);
>> + val += reg;
>> + ret = imc_event_info_val(ev_name, val, &events[idx]);
>> + if (ret) {
>> + kfree(events[idx].ev_name);
>> + kfree(events[idx].ev_value);
>> + continue;
>> + }
>> + /*
>> + * If the common scale and unit properties available,
>> + * then, assign them to this event
>> + */
>> + if (event_scale) {
>> + idx++;
>> + ret = set_event_property(event_scale, "scale",
>> + &events[idx],
>> + ev_name);
>> + if (ret)
>> + continue;
>> + idx++;
> Why do we increment the idx an extra two times if event_scale is
> provided? We don't do that for the other ones...
OK, patch 8 fixed this. Let me pull that hunk and
merge it here.


>
>> + }
>> + if (event_unit) {
>> + ret = set_event_property(event_unit, "unit",
>> + &events[idx],
>> + ev_name);
>> + if (ret)
>> + continue;
>> + }
>> + idx++;
>> + } else if (strncmp(pp->name, "unit", 4) == 0) {
>> + ret = set_event_property(pp, "unit", &events[idx],
>> + ev_name);
>> + if (ret)
>> + continue;
>> + idx++;
>> + } else if (strncmp(pp->name, "scale", 5) == 0) {
>> + ret = set_event_property(pp, "scale", &events[idx],
>> + ev_name);
>> + if (ret)
>> + continue;
>> + idx++;
>> + }
>> + }
> I find it really difficult to convince myself of the control flow in
> that loop.
>
> In particular, the comment at the top claims:
>> + * If there is an issue in parsing a single property of
>> + * this event, we just clean up the buffers, but we still
>> + * continue to parse.
> I am not quite sure precisely what that means. I first thought it meant
> that if you cannot parse one of the properties of the event
> (scale/unit), you would clean up the entire event, and parse the next
> entire event. It looks like the code in fact simply cleans up that
> *property* and continues to parse other properties.
>
> Does that make sense to do? If you can't parse the scale of an event for
> instance, does it make sense to continue with the rest of that event?
OK this is definitely true if we cant parse the "reg" property.
Will fix it.



>
> Initially I was concerned that 'continue' wasn't sufficient to even
> clean up the property, but I worked through the error conditions of
> set_event_property and now I am satisfied.
>
>> +
>> + return idx;
>> +}
>> +
>> +/*
>> + * imc_get_domain : Returns the domain for pmu "pmu_dev".
>> + */
>> +int imc_get_domain(struct device_node *pmu_dev)
>> +{
>> + if (of_device_is_compatible(pmu_dev, IMC_DTB_NEST_COMPAT))
>> + return IMC_DOMAIN_NEST;
>> + else
>> + return IMC_DOMAIN_UNKNOWN;
>> +}
>> +
>> +/*
>> + * get_nr_children : Returns the number of children for a pmu device node.
>> + */
>> +static int get_nr_children(struct device_node *pmu_node)
>> +{
>> + struct device_node *child;
>> + int i = 0;
>> +
>> + for_each_child_of_node(pmu_node, child)
>> + i++;
>> + return i;
>> +}
>> +
>> +/*
>> + * imc_free_events : Cleanup the "events" list having "nr_entries" entries.
>> + */
>> +static void imc_free_events(struct imc_events *events, int nr_entries)
>> +{
>> + int i;
>> +
>> + /* Nothing to clean, return */
>> + if (!events)
>> + return;
>> + for (i = 0; i < nr_entries; i++) {
>> + kfree(events[i].ev_name);
>> + kfree(events[i].ev_value);
>> + }
>> +
>> + kfree(events);
>> +}
>> +
>> +/*
>> + * imc_pmu_create : Takes the parent device which is the pmu unit and a
>> + * pmu_index as the inputs.
>> + * Allocates memory for the pmu, sets up its domain (NEST or CORE), and
>> + * allocates memory for the events supported by this pmu. Assigns a name for
>> + * the pmu. Calls imc_events_node_parser() to setup the individual events.
>> + * If everything goes fine, it calls, init_imc_pmu() to setup the pmu device
>> + * and register it.
>> + */
>> +static int imc_pmu_create(struct device_node *parent, int pmu_index)
>> +{
>> + struct device_node *ev_node = NULL, *dir = NULL;
>> + struct imc_events *events;
>> + struct imc_pmu *pmu_ptr;
>> + u32 prop, reg;
>> + struct property *pp, *scale_pp, *unit_pp, *name_prefix;
>> + char *buf;
>> + int idx = 0, ret = 0, nr_children = 0;
>> +
>> + if (!parent)
>> + return -EINVAL;
>> +
>> + /* memory for pmu */
>> + pmu_ptr = kzalloc(sizeof(struct imc_pmu), GFP_KERNEL);
>> + if (!pmu_ptr)
>> + return -ENOMEM;
>> +
>> + pmu_ptr->domain = imc_get_domain(parent);
>> + if (pmu_ptr->domain == IMC_DOMAIN_UNKNOWN)
>> + goto free_pmu;
>> +
>> + /* Needed for hotplug/migration */
>> + per_nest_pmu_arr[pmu_index] = pmu_ptr;
>> +
>> + /*
>> + * "events" property inside a PMU node contains the phandle value
>> + * for the actual events node. The "events" node for the IMC PMU
>> + * is not in this node, rather inside "imc-counters" node, since,
>> + * we want to factor out the common events (thereby, reducing the
>> + * size of the device tree)
>> + */
>> + of_property_read_u32(parent, "events", &prop);
>> + if (!prop)
>> + return -EINVAL;
>> +
>> + /*
>> + * Fetch the actual node where the events for this PMU exist.
>> + */
>> + dir = of_find_node_by_phandle(prop);
>> + if (!dir)
>> + return -EINVAL;
>> +
>> + /*
>> + * Get the maximum no. of events in this node.
>> + * Multiply by 3 to account for .scale and .unit properties
>> + * This number suggests the amount of memory needed to setup the
>> + * events for this pmu.
>> + */
>> + nr_children = get_nr_children(dir) * 3;
>> +
>> + /* memory for pmu events */
>> + events = kzalloc((sizeof(struct imc_events) * nr_children),
>> + GFP_KERNEL);
>> + if (!events) {
>> + ret = -ENOMEM;
>> + goto free_pmu;
>> + }
>> +
>> + pp = of_find_property(parent, "name", NULL);
>> + if (!pp) {
>> + ret = -ENODEV;
>> + goto free_events;
>> + }
>> +
>> + if (!pp->value ||
>> + (strnlen(pp->value, pp->length) == pp->length) ||
>> + (pp->length > IMC_MAX_PMU_NAME_LEN)) {
>> + ret = -EINVAL;
>> + goto free_events;
>> + }
>> +
>> + buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
>> + if (!buf) {
>> + ret = -ENOMEM;
>> + goto free_events;
>> + }
>> +
>> + /* Save the name to register it later */
>> + sprintf(buf, "nest_%s", (char *)pp->value);
>> + pmu_ptr->pmu.name = (char *)buf;
>> +
>> + /*
>> + * Check if there is a common "scale" and "unit" properties inside
>> + * the PMU node for all the events supported by this PMU.
>> + */
>> + scale_pp = of_find_property(parent, "scale", NULL);
>> + unit_pp = of_find_property(parent, "unit", NULL);
>> +
>> + /*
>> + * Get the event-prefix property from the PMU node
>> + * which needs to be attached with the event names.
>> + */
>> + name_prefix = of_find_property(parent, "events-prefix", NULL);
>> + if (!name_prefix)
>> + return -ENODEV;
>> +
>> + /*
>> + * "reg" property gives out the base offset of the counters data
>> + * for this PMU.
>> + */
>> + of_property_read_u32(parent, "reg", &reg);
>> +
>> + if (!name_prefix->value ||
>> + (strnlen(name_prefix->value, name_prefix->length) == name_prefix->length) ||
>> + (name_prefix->length > IMC_MAX_PMU_NAME_LEN))
>> + return -EINVAL;
>> +
>> + /* Loop through event nodes */
>> + for_each_child_of_node(dir, ev_node) {
>> + ret = imc_events_node_parser(ev_node, &events[idx], scale_pp,
>> + unit_pp, name_prefix, reg);
>> + if (ret < 0) {
>> + /* Unable to parse this event */
>> + if (ret == -ENOMEM)
>> + goto free_events;
>> + continue;
>> + }
>> +
>> + /*
>> + * imc_event_node_parser will return number of
>> + * event entries created for this. This could include
>> + * event scale and unit files also.
>> + */
>> + idx += ret;
>> + }
>> +
>> + return 0;
>> +
>> +free_events:
>> + imc_free_events(events, idx);
>> +free_pmu:
>> + kfree(pmu_ptr);
>> + return ret;
>> +}
>> +
>> +/*
>> + * imc_pmu_setup : Setup the IMC PMUs (children of "parent").
>> + */
>> +static void imc_pmu_setup(struct device_node *parent)
> Should this be marked __init?

Yes we can.

Thanks for review
Maddy
>> +{
>> + struct device_node *child;
>> + int pmu_count = 0, rc = 0;
>> + const struct property *pp;
>> +
>> + if (!parent)
>> + return;
>> +
>> + /* Setup all the IMC pmus */
>> + for_each_child_of_node(parent, child) {
>> + pp = of_get_property(child, "compatible", NULL);
>> + if (pp) {
>> + /*
>> + * If there is a node with a "compatible" field,
>> + * that's a PMU node
>> + */
>> + rc = imc_pmu_create(child, pmu_count);
>> + if (rc)
>> + return;
>> + pmu_count++;
>> + }
>> + }
>> +}
>>
>> static int opal_imc_counters_probe(struct platform_device *pdev)
>> {
>> @@ -102,6 +484,10 @@ static int opal_imc_counters_probe(struct platform_device *pdev)
>> } while (i < (pcni->size / PAGE_SIZE));
>> }
>>
>> +#ifdef CONFIG_PERF_EVENTS
>> + imc_pmu_setup(imc_dev);
>> +#endif
>> +
>> return 0;
>> err:
>> return -ENODEV;
>> --
>> 2.7.4
> Regards,
> Daniel
>

2017-04-06 06:43:45

by Madhavan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v6 04/11] powerpc/perf: Add event attribute and group to IMC pmus



On Tuesday 04 April 2017 07:41 AM, Daniel Axtens wrote:
> Hi,
>
>> Device tree IMC driver code parses the IMC units and their events. It
>> passes the information to IMC pmu code which is placed in powerpc/perf
>> as "imc-pmu.c".
>>
>> This patch creates only event attributes and attribute groups for the
>> IMC pmus.
>>
>> Signed-off-by: Anju T Sudhakar <[email protected]>
>> Signed-off-by: Hemant Kumar <[email protected]>
>> Signed-off-by: Madhavan Srinivasan <[email protected]>
>> ---
>> arch/powerpc/perf/Makefile | 6 +-
>> arch/powerpc/perf/imc-pmu.c | 97 +++++++++++++++++++++++++++++++
>> arch/powerpc/platforms/powernv/opal-imc.c | 12 +++-
>> 3 files changed, 112 insertions(+), 3 deletions(-)
>> create mode 100644 arch/powerpc/perf/imc-pmu.c
>>
>> diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
>> index 4d606b99a5cb..d0d1f04203c7 100644
>> --- a/arch/powerpc/perf/Makefile
>> +++ b/arch/powerpc/perf/Makefile
>> @@ -2,10 +2,14 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
>>
>> obj-$(CONFIG_PERF_EVENTS) += callchain.o perf_regs.o
>>
>> +imc-$(CONFIG_PPC_POWERNV) += imc-pmu.o
>> +
>> obj-$(CONFIG_PPC_PERF_CTRS) += core-book3s.o bhrb.o
>> obj64-$(CONFIG_PPC_PERF_CTRS) += power4-pmu.o ppc970-pmu.o power5-pmu.o \
>> power5+-pmu.o power6-pmu.o power7-pmu.o \
>> - isa207-common.o power8-pmu.o power9-pmu.o
>> + isa207-common.o power8-pmu.o power9-pmu.o \
>> + $(imc-y)
> Hmm, this seems... obtuse. Will the IMC stuff fail to compile on
> non-powerNV? Can we just link it in like we do with every other sort of
> performance counter and let runtime detection do its thing?
Yes. This is true. Also, in powernv_defconfig, we could
have perf_events config disabled. So will address this.

>
>> +
>> obj32-$(CONFIG_PPC_PERF_CTRS) += mpc7450-pmu.o
>>
>> obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
>> diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
>> new file mode 100644
>> index 000000000000..ec464c76b749
>> --- /dev/null
>> +++ b/arch/powerpc/perf/imc-pmu.c
>> @@ -0,0 +1,97 @@
>> +/*
>> + * Nest Performance Monitor counter support.
>> + *
>> + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
>> + * (C) 2016 Hemant K Shaw, IBM Corporation.
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published
>> + * by the Free Software Foundation.
>> + */
>> +#include <linux/perf_event.h>
>> +#include <linux/slab.h>
>> +#include <asm/opal.h>
>> +#include <asm/imc-pmu.h>
>> +#include <asm/cputhreads.h>
>> +#include <asm/smp.h>
>> +#include <linux/string.h>
>> +
>> +struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
>> +struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
>> +
>> +/* dev_str_attr : Populate event "name" and string "str" in attribute */
>> +static struct attribute *dev_str_attr(const char *name, const char *str)
>> +{
>> + struct perf_pmu_events_attr *attr;
>> +
>> + attr = kzalloc(sizeof(*attr), GFP_KERNEL);
>> +
>> + sysfs_attr_init(&attr->attr.attr);
>> +
>> + attr->event_str = str;
>> + attr->attr.attr.name = name;
>> + attr->attr.attr.mode = 0444;
>> + attr->attr.show = perf_event_sysfs_show;
>> +
>> + return &attr->attr.attr;
>> +}
>> +
>> +/*
>> + * update_events_in_group: Update the "events" information in an attr_group
>> + * and assign the attr_group to the pmu "pmu".
>> + */
>> +static int update_events_in_group(struct imc_events *events,
>> + int idx, struct imc_pmu *pmu)
> So I've probably missed a key point in the earlier patches, but for the
> life of me I cannot figure out what idx is supposed to represent.
Ok will add comment. Idea is to keep track of number of events
to update when we create the event_attribute.

>
>> +{
>> + struct attribute_group *attr_group;
>> + struct attribute **attrs;
>> + int i;
>> +
>> + /* Allocate memory for attribute group */
>> + attr_group = kzalloc(sizeof(*attr_group), GFP_KERNEL);
>> + if (!attr_group)
>> + return -ENOMEM;
>> +
>> + /* Allocate memory for attributes */
>> + attrs = kzalloc((sizeof(struct attribute *) * (idx + 1)), GFP_KERNEL);
> Also, idx is an int, so:
> - things will go wrong if idx is -1
> - things will go very wrong if idx is -2
> - do you need to do some sort of validation to make sure it's within
> bounds? If not in this function, what function ensures the necessary
> 'sanity check' preconditions for idx?

Yes, valid point. Should add sanity check for this variable.


>
>> + if (!attrs) {
>> + kfree(attr_group);
>> + return -ENOMEM;
>> + }
>> +
>> + attr_group->name = "events";
>> + attr_group->attrs = attrs;
>> + for (i = 0; i < idx; i++, events++) {
>> + attrs[i] = dev_str_attr((char *)events->ev_name,
>> + (char *)events->ev_value);
>> + }
>> +
>> + pmu->attr_groups[0] = attr_group;
> Again, I may have missed something, but what's special about group 0?
> Patch 1 lets you have up to 4 groups - are they used elsewhere?

Each entry in the group is a specific attribute for the PMU unit.
So we use group 0 as event and group 1/2 as format and cpumask
attributes.

Will add comment to explain that.

>
>> + return 0;
>> +}
>> +
>> +/*
>> + * init_imc_pmu : Setup the IMC pmu device in "pmu_ptr" and its events
>> + * "events".
>> + * Setup the cpu mask information for these pmus and setup the state machine
>> + * hotplug notifiers as well.
> I cannot figure out how this function sets cpu mask information or sets
> up hotplug notifiers. Is this done in a later patch? (looking at the
> subject lines - perhaps patch 6?)

Yes. Its done in patch6, Will remove that text since it confusing here.

>
>> + */
>> +int init_imc_pmu(struct imc_events *events, int idx,
>> + struct imc_pmu *pmu_ptr)
> Should this be marked __init, or can it be called in a hotplug path?

Yes. I will.
>
>> +{
>> + int ret = -ENODEV;
>> +
>> + ret = update_events_in_group(events, idx, pmu_ptr);
>> + if (ret)
>> + goto err_free;
>> +
>> + return 0;
>> +
>> +err_free:
>> + /* Only free the attr_groups which are dynamically allocated */
>> + if (pmu_ptr->attr_groups[0]) {
>> + kfree(pmu_ptr->attr_groups[0]->attrs);
>> + kfree(pmu_ptr->attr_groups[0]);
>> + }
>> +
>> + return ret;
>> +}
>> diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
>> index ba0203e669c0..73c46703c2af 100644
>> --- a/arch/powerpc/platforms/powernv/opal-imc.c
>> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
>> @@ -32,8 +32,11 @@
>> #include <asm/cputable.h>
>> #include <asm/imc-pmu.h>
>>
>> -struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
>> -struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
>> +extern struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
>> +extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
> Why are these definitions being moved?
> If they're still needed in this file, should the extern lines live in a
> header file?
We primarily use these values in the pmu functions. But will add
the extern to the header file.

Thanks for review

Maddy
>
>> +
>> +extern int init_imc_pmu(struct imc_events *events,
>> + int idx, struct imc_pmu *pmu_ptr);
>>
>> static int imc_event_info(char *name, struct imc_events *events)
>> {
>> @@ -379,6 +382,11 @@ static int imc_pmu_create(struct device_node *parent, int pmu_index)
>> idx += ret;
>> }
>>
>> + ret = init_imc_pmu(events, idx, pmu_ptr);
>> + if (ret) {
>> + pr_err("IMC PMU %s Register failed\n", pmu_ptr->pmu.name);
>> + goto free_events;
>> + }
>> return 0;
>>
>> free_events:
>> --
>> 2.7.4
> Regards,
> Daniel
>

2017-04-06 07:05:13

by Stewart Smith

[permalink] [raw]
Subject: Re: [PATCH v6 02/11] powerpc/powernv: Autoload IMC device driver module

Madhavan Srinivasan <[email protected]> writes:
> From: Hemant Kumar <[email protected]>
>
> This patch does three things :
> - Enables "opal.c" to create a platform device for the IMC interface
> according to the appropriate compatibility string.
> - Find the reserved-memory region details from the system device tree
> and get the base address of HOMER (Reserved memory) region address for each chip.
> - We also get the Nest PMU counter data offsets (in the HOMER region)
> and their sizes. The offsets for the counters' data are fixed and
> won't change from chip to chip.
>
> The device tree parsing logic is separated from the PMU creation
> functions (which is done in subsequent patches).
>
> Signed-off-by: Anju T Sudhakar <[email protected]>
> Signed-off-by: Hemant Kumar <[email protected]>
> Signed-off-by: Madhavan Srinivasan <[email protected]>
> ---
> arch/powerpc/platforms/powernv/Makefile | 2 +-
> arch/powerpc/platforms/powernv/opal-imc.c | 126 ++++++++++++++++++++++++++++++
> arch/powerpc/platforms/powernv/opal.c | 14 ++++
> 3 files changed, 141 insertions(+), 1 deletion(-)
> create mode 100644 arch/powerpc/platforms/powernv/opal-imc.c
>
> diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
> index b5d98cb3f482..44909fec1121 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -2,7 +2,7 @@ obj-y += setup.o opal-wrappers.o opal.o opal-async.o idle.o
> obj-y += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
> obj-y += rng.o opal-elog.o opal-dump.o opal-sysparam.o opal-sensor.o
> obj-y += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
> -obj-y += opal-kmsg.o
> +obj-y += opal-kmsg.o opal-imc.o
>
> obj-$(CONFIG_SMP) += smp.o subcore.o subcore-asm.o
> obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o
> diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
> new file mode 100644
> index 000000000000..c476d596c6a8
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
> @@ -0,0 +1,126 @@
> +/*
> + * OPAL IMC interface detection driver
> + * Supported on POWERNV platform
> + *
> + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
> + * (C) 2016 Hemant K Shaw, IBM Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/platform_device.h>
> +#include <linux/miscdevice.h>
> +#include <linux/fs.h>
> +#include <linux/of.h>
> +#include <linux/of_address.h>
> +#include <linux/of_platform.h>
> +#include <linux/poll.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/crash_dump.h>
> +#include <asm/opal.h>
> +#include <asm/io.h>
> +#include <asm/uaccess.h>
> +#include <asm/cputable.h>
> +#include <asm/imc-pmu.h>
> +
> +struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
> +
> +static int opal_imc_counters_probe(struct platform_device *pdev)
> +{
> + struct device_node *child, *imc_dev, *rm_node = NULL;
> + struct perchip_nest_info *pcni;
> + u32 pages, nest_offset, nest_size, idx;
> + int i = 0;
> + const char *node_name;
> + const __be32 *addrp;
> + u64 reg_addr, reg_size;
> +
> + if (!pdev || !pdev->dev.of_node)
> + return -ENODEV;
> +
> + /*
> + * Check whether this kdump kernel. If yes, just return.
> + */
> + if (is_kdump_kernel())
> + return -ENODEV;
> +
> + imc_dev = pdev->dev.of_node;
> +
> + /*
> + * nest_offset : where the nest-counters' data start.
> + * size : size of the entire nest-counters region
> + */
> + if (of_property_read_u32(imc_dev, "imc-nest-offset", &nest_offset))
> + goto err;
> +
> + if (of_property_read_u32(imc_dev, "imc-nest-size", &nest_size))
> + goto err;
> +
> + /* Find the "homer region" for each chip */
> + rm_node = of_find_node_by_path("/reserved-memory");
> + if (!rm_node)
> + goto err;
> +
> + for_each_child_of_node(rm_node, child) {
> + if (of_property_read_string_index(child, "name", 0,
> + &node_name))
> + continue;
> + if (strncmp("ibm,homer-image", node_name,
> + strlen("ibm,homer-image")))
> + continue;

A better way to do this would be to reference the memory region, like
what's shown in
Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt

just reference the phandle of the memory region.

seeing as these are per chip, why not just have something linking
together chip-id and the IMC layout node?

--
Stewart Smith
OPAL Architect, IBM.

2017-04-06 08:06:06

by Madhavan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH v6 06/11] powerpc/perf: IMC pmu cpumask and cpu hotplug support



On Tuesday 04 April 2017 10:03 AM, Daniel Axtens wrote:
> Madhavan Srinivasan <[email protected]> writes:
>
>> From: Hemant Kumar <[email protected]>
>>
>> Adds cpumask attribute to be used by each IMC pmu. Only one cpu (any
>> online CPU) from each chip for nest PMUs is designated to read counters.
>>
>> On CPU hotplug, dying CPU is checked to see whether it is one of the
>> designated cpus, if yes, next online cpu from the same chip (for nest
>> units) is designated as new cpu to read counters. For this purpose, we
>> introduce a new state : CPUHP_AP_PERF_POWERPC_NEST_ONLINE.
>>
>> Signed-off-by: Anju T Sudhakar <[email protected]>
>> Signed-off-by: Hemant Kumar <[email protected]>
>> Signed-off-by: Madhavan Srinivasan <[email protected]>
>> ---
>> arch/powerpc/include/asm/opal-api.h | 13 ++-
>> arch/powerpc/include/asm/opal.h | 3 +
>> arch/powerpc/perf/imc-pmu.c | 155 ++++++++++++++++++++++++-
>> arch/powerpc/platforms/powernv/opal-wrappers.S | 1 +
>> include/linux/cpuhotplug.h | 1 +
>> 5 files changed, 171 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
>> index a0aa285869b5..23fc51e9d71d 100644
>> --- a/arch/powerpc/include/asm/opal-api.h
>> +++ b/arch/powerpc/include/asm/opal-api.h
>> @@ -168,7 +168,8 @@
>> #define OPAL_INT_SET_MFRR 125
>> #define OPAL_PCI_TCE_KILL 126
>> #define OPAL_NMMU_SET_PTCR 127
>> -#define OPAL_LAST 127
>> +#define OPAL_NEST_IMC_COUNTERS_CONTROL 149
> A couple of things:
>
> - why is this 149 rather than 128?
Current opal-api.h in the skiboot side has more opal call
defined. AT this point of time, the OPAL # defined in skiboot
for IMC patchset is 149 and 150.


>
> - CONTROL seems like it's inviting a very broad and under-specified
> API. I notice most of the other opal calls are reasonably narrow:
> often defining two calls for get/set rather than a single control
> call.

Reason being going forward, microcode could support multiple
modes to aid debugging. Currently patchset exploits (production mode)
normal accumulation for a subset of events in each unit.

But in future, microcode could support switching of accumulation
mode to different mode (like debug mode) for a units. In which case
we could have new set of events configured. And we do not
want to have multiple calls for the different mode.
And hence we wanted to have one OPAL_NEST_* call
which can be extended to support these modes when available.

>
> I don't have cycles to review the OPAL/skiboot side and I'm very much
> open to being convinced here, I just want to avoid the situation where
> we're passing around complicated data structures in a way that is
> difficult to synchronise if we could avoid it.
>
>> +#define OPAL_LAST 149
>>
>> /* Device tree flags */
>>
>> @@ -928,6 +929,16 @@ enum {
>> OPAL_PCI_TCE_KILL_ALL,
>> };
>>
>> +/* Operation argument to OPAL_NEST_IMC_COUNTERS_CONTROL */
>> +enum {
>> + OPAL_NEST_IMC_PRODUCTION_MODE = 1,
>> +};
> If there's only one mode, why do we need to specify it? As far as I can
> tell no extra mode is added later in the series...
Will documents intended mode to be supported in future.

> ... looking at the opal-side patches, would it be better to just specify
> the modes you intend to support in future here, and rely on opal
> returning OPAL_PARAMETER when they're not supported?
>> +
>> +enum {
>> + OPAL_NEST_IMC_STOP,
>> + OPAL_NEST_IMC_START,
>> +};
> Again, would it be better to have a stop/start call rather than a
> generic control call?
As explain earlier, we would prefer to have one opal call
and take operation to perform as a parameter to it.


>
> Also, the enum values (STOP=0, START=1) seem to be switched vs OPAL,
> where https://patchwork.ozlabs.org/patch/746292/ has STOP=1, START=0. Or
> am I missing something?
ah, this is mess I created with the latest rebase. Yes I switched the
values and updated the same to kernel side. But i missed to update
the same in the comment message when I sent out the opal side patchset.

Will repost with this change in the opal side.



>
>> +
>> #endif /* __ASSEMBLY__ */
>>
>> #endif /* __OPAL_API_H */
>> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
>> index 1ff03a6da76e..d93d08204243 100644
>> --- a/arch/powerpc/include/asm/opal.h
>> +++ b/arch/powerpc/include/asm/opal.h
>> @@ -227,6 +227,9 @@ int64_t opal_pci_tce_kill(uint64_t phb_id, uint32_t kill_type,
>> uint64_t dma_addr, uint32_t npages);
>> int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr);
>>
>> +int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1,
>> + uint64_t value2, uint64_t value3);
>> +
> This prototype does not make me feel better about the state of the API.
>
> Looking at the opal side, I would be much more comfortable if you could
> nail down what you intend to have these support now, even if OPAL bails
> with OPAL_PARAMETER if they're specified.
Ok i will add more comments/Documents on for the OPAL API
prototype.


>
> Also I think u64 and u32 are preferred, although I see the rest of the
> file uses the long form.
>
>> /* Internal functions */
>> extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
>> int depth, void *data);
>> diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
>> index bd5bf66bd920..b28835b861b3 100644
>> --- a/arch/powerpc/perf/imc-pmu.c
>> +++ b/arch/powerpc/perf/imc-pmu.c
>> @@ -17,6 +17,7 @@
>>
>> struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
>> struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
>> +static cpumask_t nest_imc_cpumask;
>>
>> /* Needed for sanity check */
>> extern u64 nest_max_offset;
>> @@ -32,6 +33,152 @@ static struct attribute_group imc_format_group = {
>> .attrs = imc_format_attrs,
>> };
>>
>> +/* Get the cpumask printed to a buffer "buf" */
>> +static ssize_t imc_pmu_cpumask_get_attr(struct device *dev,
>> + struct device_attribute *attr, char *buf)
>> +{
>> + cpumask_t *active_mask;
>> +
>> + active_mask = &nest_imc_cpumask;
>> + return cpumap_print_to_pagebuf(true, buf, active_mask);
>> +}
>> +
>> +static DEVICE_ATTR(cpumask, S_IRUGO, imc_pmu_cpumask_get_attr, NULL);
>> +
> Out of curiosity, how do you know imc_pmu_cpumask_get_attr is called
> with a large enough buffer? (How does cpumap_print_to_pagebuf know it's
> not overrunning the buffer when it's not taking a length parameter? I
> realise this is not your code but it's midly terrifying.)
>
>> +static struct attribute *imc_pmu_cpumask_attrs[] = {
>> + &dev_attr_cpumask.attr,
>> + NULL,
>> +};
>> +
>> +static struct attribute_group imc_pmu_cpumask_attr_group = {
>> + .attrs = imc_pmu_cpumask_attrs,
>> +};
>> +
>> +/*
>> + * nest_init : Initializes the nest imc engine for the current chip.
>> + */
>> +static void nest_init(int *loc)
>> +{
>> + int rc;
>> +
>> + rc = opal_nest_imc_counters_control(OPAL_NEST_IMC_PRODUCTION_MODE,
>> + OPAL_NEST_IMC_START, 0, 0);
> So OPAL is supposed to figure out which CPU to start based on the CPU
> that is currently running when you call into OPAL? I assume that's fine,
> but you probably want to document that. (Perhaps on the OPAL side - I
> spent a while looking for a parameter indicating the chip to work on!)
Yes. sure. Will document the same here.


>> + if (rc)
>> + loc[smp_processor_id()] = 1;
> loc is not a very helpful name here. Looking further down the patch it
> seems you pass cpus_opal_rc in as a parameter; maybe something like
> cpu_rcs or something?

Yes make sense. Will change it.

>
>> +}
>> +
>> +static void nest_change_cpu_context(int old_cpu, int new_cpu)
>> +{
>> + int i;
>> +
>> + for (i = 0;
>> + (per_nest_pmu_arr[i] != NULL) && (i < IMC_MAX_PMUS); i++)
>> + perf_pmu_migrate_context(&per_nest_pmu_arr[i]->pmu,
>> + old_cpu, new_cpu);
>> +}
> This may be an incredibly misinformed question, but if you have 2
> sockets, will this migrate things from both nests onto one? (I'm happy
> to take you on trust if you just want to say 'no' without a detailed
> explanation.)

No, we try to find the target cpu from the same chip which is handled
int the _offline code.

>> +
>> +static int ppc_nest_imc_cpu_online(unsigned int cpu)
>> +{
>> + int nid;
>> + const struct cpumask *l_cpumask;
>> + struct cpumask tmp_mask;
>> +
>> + /* Find the cpumask of this node */
>> + nid = cpu_to_node(cpu);
>> + l_cpumask = cpumask_of_node(nid);
>> +
>> + /*
>> + * If any of the cpu from this node is already present in the mask,
>> + * just return, if not, then set this cpu in the mask.
>> + */
>> + if (!cpumask_and(&tmp_mask, l_cpumask, &nest_imc_cpumask)) {
>> + cpumask_set_cpu(cpu, &nest_imc_cpumask);
>> + return 0;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int ppc_nest_imc_cpu_offline(unsigned int cpu)
>> +{
>> + int nid, target = -1;
>> + const struct cpumask *l_cpumask;
>> +
>> + /*
>> + * Check in the designated list for this cpu. Dont bother
>> + * if not one of them.
>> + */
>> + if (!cpumask_test_and_clear_cpu(cpu, &nest_imc_cpumask))
>> + return 0;
>> +
>> + /*
>> + * Now that this cpu is one of the designated,
>> + * find a next cpu a) which is online and b) in same chip.
>> + */
>> + nid = cpu_to_node(cpu);
>> + l_cpumask = cpumask_of_node(nid);
>> + target = cpumask_next(cpu, l_cpumask);
>> +
>> + /*
>> + * Update the cpumask with the target cpu and
>> + * migrate the context if needed
>> + */
>> + if (target >= 0 && target < nr_cpu_ids) {
>> + cpumask_set_cpu(target, &nest_imc_cpumask);
>> + nest_change_cpu_context(cpu, target);
>> + }
>> + return 0;
>> +}
>> +
>> +static int nest_pmu_cpumask_init(void)
>> +{
>> + const struct cpumask *l_cpumask;
>> + int cpu, nid;
>> + int *cpus_opal_rc;
>> +
>> + if (!cpumask_empty(&nest_imc_cpumask))
>> + return 0;
>> +
>> + /*
>> + * Memory for OPAL call return value.
>> + */
>> + cpus_opal_rc = kzalloc((sizeof(int) * nr_cpu_ids), GFP_KERNEL);
>> + if (!cpus_opal_rc)
>> + goto fail;
>> +
>> + /*
>> + * Nest PMUs are per-chip counters. So designate a cpu
>> + * from each chip for counter collection.
>> + */
>> + for_each_online_node(nid) {
>> + l_cpumask = cpumask_of_node(nid);
>> +
>> + /* designate first online cpu in this node */
>> + cpu = cpumask_first(l_cpumask);
>> + cpumask_set_cpu(cpu, &nest_imc_cpumask);
>> + }
>> +
>> + /* Initialize Nest PMUs in each node using designated cpus */
>> + on_each_cpu_mask(&nest_imc_cpumask, (smp_call_func_t)nest_init,
>> + (void *)cpus_opal_rc, 1);
>> +
>> + /* Check return value array for any OPAL call failure */
>> + for_each_cpu(cpu, &nest_imc_cpumask) {
>> + if (cpus_opal_rc[cpu])
>> + goto fail;
>> + }
>> +
>> + cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_NEST_ONLINE,
>> + "POWER_NEST_IMC_ONLINE",
>> + ppc_nest_imc_cpu_online,
>> + ppc_nest_imc_cpu_offline);
>> +
> Shouldn't you be freeing cpus_opal_rc here?
>> + return 0;
>> +
>> +fail:
> Also, shouldn't you be freeing cpus_opal_rc here? (if allocated)

Nice catch. Yes we should be freeing it here. Will add that check.

>
>> + return -ENODEV;
>> +}
>> +
>> static int nest_imc_event_init(struct perf_event *event)
>> {
>> int chip_id;
>> @@ -70,7 +217,7 @@ static int nest_imc_event_init(struct perf_event *event)
>> * and then add to it.
>> */
>> event->hw.event_base = pcni->vbase[config/PAGE_SIZE] +
>> - (config & ~PAGE_MASK);
>> + (config & ~PAGE_MASK);
> This hunk should either be dropped or squashed.
ok sure.

>
>>
>> return 0;
>> }
>> @@ -139,6 +286,7 @@ static int update_pmu_ops(struct imc_pmu *pmu)
>> pmu->pmu.stop = imc_event_stop;
>> pmu->pmu.read = imc_perf_event_update;
>> pmu->attr_groups[1] = &imc_format_group;
>> + pmu->attr_groups[2] = &imc_pmu_cpumask_attr_group;
>> pmu->pmu.attr_groups = pmu->attr_groups;
>>
>> return 0;
>> @@ -206,6 +354,11 @@ int init_imc_pmu(struct imc_events *events, int idx,
>> {
>> int ret = -ENODEV;
>>
>> + /* Add cpumask and register for hotplug notification */
>> + ret = nest_pmu_cpumask_init();
> If cpumask_init() also registers for hotplug, perhaps it should have a
> more inclusive name.

Let me think over since we already have bigger functions name in here.

>
> Or, maybe better to move hotplug setup out of cpumask_init.
>> + if (ret)
>> + return ret;
>> +
>> ret = update_events_in_group(events, idx, pmu_ptr);
>> if (ret)
>> goto err_free;
>> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
>> index da8a0f7a035c..b7208d8e6cc0 100644
>> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
>> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
>> @@ -301,3 +301,4 @@ OPAL_CALL(opal_int_eoi, OPAL_INT_EOI);
>> OPAL_CALL(opal_int_set_mfrr, OPAL_INT_SET_MFRR);
>> OPAL_CALL(opal_pci_tce_kill, OPAL_PCI_TCE_KILL);
>> OPAL_CALL(opal_nmmu_set_ptcr, OPAL_NMMU_SET_PTCR);
>> +OPAL_CALL(opal_nest_imc_counters_control, OPAL_NEST_IMC_COUNTERS_CONTROL);
>> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
>> index 62d240e962f0..cfb0cedc72af 100644
>> --- a/include/linux/cpuhotplug.h
>> +++ b/include/linux/cpuhotplug.h
>> @@ -136,6 +136,7 @@ enum cpuhp_state {
>> CPUHP_AP_PERF_ARM_CCI_ONLINE,
>> CPUHP_AP_PERF_ARM_CCN_ONLINE,
>> CPUHP_AP_PERF_ARM_L2X0_ONLINE,
>> + CPUHP_AP_PERF_POWERPC_NEST_ONLINE,
> This powerpc enum is in the middle of a set of ARM ones. Should it be
> after all the arm ones?
>
> If it's an enum, hopefully order doesn't matter...
Yes. I will make the change.

Thanks for the review comments
Maddy

>
>> CPUHP_AP_PERF_ARM_QCOM_L2_ONLINE,
>> CPUHP_AP_WORKQUEUE_ONLINE,
>> CPUHP_AP_RCUTREE_ONLINE,
>> --
>> 2.7.4
> Regards,
> Daniel
>

2017-04-06 08:08:03

by Stewart Smith

[permalink] [raw]
Subject: Re: [PATCH v6 01/11] powerpc/powernv: Data structure and macros definitions

Madhavan Srinivasan <[email protected]> writes:

> From: Hemant Kumar <[email protected]>
>
> Create new header file "imc-pmu.h" to add the data structures
> and macros needed for IMC pmu support.
>
> Signed-off-by: Anju T Sudhakar <[email protected]>
> Signed-off-by: Hemant Kumar <[email protected]>
> Signed-off-by: Madhavan Srinivasan <[email protected]>
> ---
> arch/powerpc/include/asm/imc-pmu.h | 68 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 68 insertions(+)
> create mode 100644 arch/powerpc/include/asm/imc-pmu.h
>
> diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
> new file mode 100644
> index 000000000000..a3d4f1bf9492
> --- /dev/null
> +++ b/arch/powerpc/include/asm/imc-pmu.h
> @@ -0,0 +1,68 @@
> +#ifndef PPC_POWERNV_IMC_PMU_DEF_H
> +#define PPC_POWERNV_IMC_PMU_DEF_H
> +
> +/*
> + * IMC Nest Performance Monitor counter support.
> + *
> + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
> + * (C) 2016 Hemant K Shaw, IBM Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +
> +#include <linux/perf_event.h>
> +#include <linux/slab.h>
> +#include <linux/of.h>
> +#include <linux/io.h>
> +#include <asm/opal.h>
> +
> +#define IMC_MAX_CHIPS 32
> +#define IMC_MAX_PMUS 32
> +#define IMC_MAX_PMU_NAME_LEN 256

Why do we need a max length? We get the actual lengths from the device
tree, so we know at each point in time what the length of any new string
should be, right?

Otherwise you appear to be, in the general case, using 10x the memory
than you could.

--
Stewart Smith
OPAL Architect, IBM.

2017-04-06 08:37:58

by Stewart Smith

[permalink] [raw]
Subject: Re: [PATCH v6 03/11] powerpc/powernv: Detect supported IMC units and its events

Madhavan Srinivasan <[email protected]> writes:
> --- a/arch/powerpc/platforms/powernv/opal-imc.c
> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
> @@ -33,6 +33,388 @@
<snip>
> +static void imc_pmu_setup(struct device_node *parent)
> +{
> + struct device_node *child;
> + int pmu_count = 0, rc = 0;
> + const struct property *pp;
> +
> + if (!parent)
> + return;
> +
> + /* Setup all the IMC pmus */
> + for_each_child_of_node(parent, child) {
> + pp = of_get_property(child, "compatible", NULL);
> + if (pp) {
> + /*
> + * If there is a node with a "compatible" field,
> + * that's a PMU node
> + */
> + rc = imc_pmu_create(child, pmu_count);
> + if (rc)
> + return;
> + pmu_count++;
> + }
> + }
> +}

This doesn't strike me as the right kind of structure, the presence of a
compatible property really just says "hey, there's this device and it's
compatible with these ways of accessing it".

I'm guessing the idea behind having imc-nest-offset/size in a top level
node is because it's common to everything under it and the aim is to not
blow up the device tree to be enormous.

So why not go after each ibm,imc-counters-nest compatible node under the
top level ibm,opal-in-memory-counters node? (i'm not convinced that
having ibm,ibmc-counters-nest versus ibm,imc-counters-core and
ibm,imc-counters-thread as I see in the dts is correct though, as
they're all accessed exactly the same way?)

--
Stewart Smith
OPAL Architect, IBM.

2017-04-06 08:40:29

by Stewart Smith

[permalink] [raw]
Subject: Re: [PATCH v6 01/11] powerpc/powernv: Data structure and macros definitions

Madhavan Srinivasan <[email protected]> writes:
> +#define IMC_MAX_CHIPS 32
> +#define IMC_MAX_PMUS 32

The max chips and PMUs we'd be able to work out from the device tre
though, right? We could just allocate the correct amount of memory on
boot.

We may hot plug/unplug CPUs, but we're not doing that from a hardware
level, what CPUs you get in the DT on PowerNV on boot is all you're
getting.

--
Stewart Smith
OPAL Architect, IBM.

2017-04-06 09:33:26

by Anju T Sudhakar

[permalink] [raw]
Subject: Re: [PATCH v6 03/11] powerpc/powernv: Detect supported IMC units and its events

Hi Stewart,


Thanks for the review.


On Thursday 06 April 2017 02:07 PM, Stewart Smith wrote:
> Madhavan Srinivasan <[email protected]> writes:
>> --- a/arch/powerpc/platforms/powernv/opal-imc.c
>> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
>> @@ -33,6 +33,388 @@
> <snip>
>> +static void imc_pmu_setup(struct device_node *parent)
>> +{
>> + struct device_node *child;
>> + int pmu_count = 0, rc = 0;
>> + const struct property *pp;
>> +
>> + if (!parent)
>> + return;
>> +
>> + /* Setup all the IMC pmus */
>> + for_each_child_of_node(parent, child) {
>> + pp = of_get_property(child, "compatible", NULL);
>> + if (pp) {
>> + /*
>> + * If there is a node with a "compatible" field,
>> + * that's a PMU node
>> + */
>> + rc = imc_pmu_create(child, pmu_count);
>> + if (rc)
>> + return;
>> + pmu_count++;
>> + }
>> + }
>> +}
> This doesn't strike me as the right kind of structure, the presence of a
> compatible property really just says "hey, there's this device and it's
> compatible with these ways of accessing it".
>
> I'm guessing the idea behind having imc-nest-offset/size in a top level
> node is because it's common to everything under it and the aim is to not
> blow up the device tree to be enormous.
>
> So why not go after each ibm,imc-counters-nest compatible node under the
> top level ibm,opal-in-memory-counters node? (i'm not convinced that
> having ibm,ibmc-counters-nest versus ibm,imc-counters-core and
> ibm,imc-counters-thread as I see in the dts is correct though, as
> they're all accessed exactly the same way?)
>

The idea here is, we have one directory which contains common events
information for nest(same incase of core and thread), and one directory
for each nest(/core/thread) pmu.
So while parsing we need to make sure that the node which we are parsing
is the pmu node, not the node which contains the common event
information. We use the "compatible" property here for that purpose.
Because we don't have a compatible property for the node which contains
events info.




Regards,
Anju

2017-04-13 11:43:28

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v6 03/11] powerpc/powernv: Detect supported IMC units and its events

Anju T Sudhakar <[email protected]> writes:
> On Thursday 06 April 2017 02:07 PM, Stewart Smith wrote:
>> Madhavan Srinivasan <[email protected]> writes:
>>> --- a/arch/powerpc/platforms/powernv/opal-imc.c
>>> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
>>> @@ -33,6 +33,388 @@
>> <snip>
>>> +static void imc_pmu_setup(struct device_node *parent)
>>> +{
>>> + struct device_node *child;
>>> + int pmu_count = 0, rc = 0;
>>> + const struct property *pp;
>>> +
>>> + if (!parent)
>>> + return;
>>> +
>>> + /* Setup all the IMC pmus */
>>> + for_each_child_of_node(parent, child) {
>>> + pp = of_get_property(child, "compatible", NULL);
>>> + if (pp) {
>>> + /*
>>> + * If there is a node with a "compatible" field,
>>> + * that's a PMU node
>>> + */
>>> + rc = imc_pmu_create(child, pmu_count);
>>> + if (rc)
>>> + return;
>>> + pmu_count++;
>>> + }
>>> + }
>>> +}
>> This doesn't strike me as the right kind of structure, the presence of a
>> compatible property really just says "hey, there's this device and it's
>> compatible with these ways of accessing it".
>>
>> I'm guessing the idea behind having imc-nest-offset/size in a top level
>> node is because it's common to everything under it and the aim is to not
>> blow up the device tree to be enormous.
>>
>> So why not go after each ibm,imc-counters-nest compatible node under the
>> top level ibm,opal-in-memory-counters node? (i'm not convinced that
>> having ibm,ibmc-counters-nest versus ibm,imc-counters-core and
>> ibm,imc-counters-thread as I see in the dts is correct though, as
>> they're all accessed exactly the same way?)
>
> The idea here is, we have one directory which contains common events
> information for nest(same incase of core and thread), and one directory
> for each nest(/core/thread) pmu.

> So while parsing we need to make sure that the node which we are parsing
> is the pmu node, not the node which contains the common event
> information. We use the "compatible" property here for that purpose.
> Because we don't have a compatible property for the node which contains
> events info.

That's a really bad hack.

You can use the compatible property to detect the node you're looking
for, but you need to look at the *value* of the property and check it's
what you expect. Just checking that it's there is fragile.

cheers

2017-04-17 08:08:25

by Anju T Sudhakar

[permalink] [raw]
Subject: Re: [PATCH v6 03/11] powerpc/powernv: Detect supported IMC units and its events

Hi Michael,


On Thursday 13 April 2017 05:13 PM, Michael Ellerman wrote:
> Anju T Sudhakar <[email protected]> writes:
>> On Thursday 06 April 2017 02:07 PM, Stewart Smith wrote:
>>> Madhavan Srinivasan <[email protected]> writes:
>>>> --- a/arch/powerpc/platforms/powernv/opal-imc.c
>>>> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
>>>> @@ -33,6 +33,388 @@
>>> <snip>
>>>> +static void imc_pmu_setup(struct device_node *parent)
>>>> +{
>>>> + struct device_node *child;
>>>> + int pmu_count = 0, rc = 0;
>>>> + const struct property *pp;
>>>> +
>>>> + if (!parent)
>>>> + return;
>>>> +
>>>> + /* Setup all the IMC pmus */
>>>> + for_each_child_of_node(parent, child) {
>>>> + pp = of_get_property(child, "compatible", NULL);
>>>> + if (pp) {
>>>> + /*
>>>> + * If there is a node with a "compatible" field,
>>>> + * that's a PMU node
>>>> + */
>>>> + rc = imc_pmu_create(child, pmu_count);
>>>> + if (rc)
>>>> + return;
>>>> + pmu_count++;
>>>> + }
>>>> + }
>>>> +}
>>> This doesn't strike me as the right kind of structure, the presence of a
>>> compatible property really just says "hey, there's this device and it's
>>> compatible with these ways of accessing it".
>>>
>>> I'm guessing the idea behind having imc-nest-offset/size in a top level
>>> node is because it's common to everything under it and the aim is to not
>>> blow up the device tree to be enormous.
>>>
>>> So why not go after each ibm,imc-counters-nest compatible node under the
>>> top level ibm,opal-in-memory-counters node? (i'm not convinced that
>>> having ibm,ibmc-counters-nest versus ibm,imc-counters-core and
>>> ibm,imc-counters-thread as I see in the dts is correct though, as
>>> they're all accessed exactly the same way?)
>> The idea here is, we have one directory which contains common events
>> information for nest(same incase of core and thread), and one directory
>> for each nest(/core/thread) pmu.
>> So while parsing we need to make sure that the node which we are parsing
>> is the pmu node, not the node which contains the common event
>> information. We use the "compatible" property here for that purpose.
>> Because we don't have a compatible property for the node which contains
>> events info.
> That's a really bad hack.
>
> You can use the compatible property to detect the node you're looking
> for, but you need to look at the *value* of the property and check it's
> what you expect. Just checking that it's there is fragile.
>
> cheers
>



ok. I will rework this code.



Thanks,
Anju