2015-07-16 08:05:11

by Shilpasri G Bhat

Subject: [PATCH v5 0/6] powernv: cpufreq: Report frequency throttle by OCC

This patchset adds a frequency-throttle reporting mechanism to the
powernv-cpufreq driver for cases where the OCC throttles the frequency.
The OCC (On-Chip Controller) takes care of the power and thermal safety
of the chip. The CPU frequency can be throttled during an OCC reset or
when the OCC limits the maximum allowed frequency. The patchset reports
such conditions so that the user knows why workload performance drops
when the frequency is throttled.

Changes from v4:
- Addressed Joel Stanley's comment in patch 3: replace memcpy() with
be64_to_cpu(). No functional change.

Changes from v3:
- Rebased on top of 4.2-rc1
- Minor changes in patches 2, 3, 4 and 6; these do not change the
functionality of the code
- Commit 594fcb9ec9e ("powerpc/powernv: Expose OPAL APIs required by PRD
interface") fixes the build error that caused this series to be
dropped initially:
ERROR: ".opal_message_notifier_register"
[drivers/cpufreq/powernv-cpufreq.ko] undefined!

Changes from v2:
- Split into multiple patches
- Semantic fixes

Shilpasri G Bhat (6):
cpufreq: powernv: Handle throttling due to Pmax capping at chip level
powerpc/powernv: Add definition of OPAL_MSG_OCC message type
cpufreq: powernv: Register for OCC related opal_message notification
cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
cpufreq: powernv: Report Psafe only if PMSR.psafe_mode_active bit is
set
cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling

arch/powerpc/include/asm/opal-api.h | 12 +++
drivers/cpufreq/powernv-cpufreq.c | 198 +++++++++++++++++++++++++++++++++---
2 files changed, 195 insertions(+), 15 deletions(-)

--
1.9.3


2015-07-16 08:05:27

by Shilpasri G Bhat

Subject: [PATCH v5 1/6] cpufreq: powernv: Handle throttling due to Pmax capping at chip level

The On-Chip Controller (OCC) can throttle the CPU frequency by reducing
the maximum allowed frequency for a chip if the chip exceeds its power
or temperature limits. Because Pmax capping is a chip-level condition,
report this throttling at chip level: on Pmax capping, set the per-chip
'throttled' variable instead of the global 'throttled'. Report
unthrottling if Pmax is restored after throttling.

This patch adds a structure to store chip id and throttled state of
the chip.

Signed-off-by: Shilpasri G Bhat <[email protected]>
Reviewed-by: Preeti U Murthy <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
No change from v4

drivers/cpufreq/powernv-cpufreq.c | 59 ++++++++++++++++++++++++++++++++++++---
1 file changed, 55 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index ebef0d8..d0c18c9 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -27,6 +27,7 @@
#include <linux/smp.h>
#include <linux/of.h>
#include <linux/reboot.h>
+#include <linux/slab.h>

#include <asm/cputhreads.h>
#include <asm/firmware.h>
@@ -42,6 +43,13 @@
static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
static bool rebooting, throttled;

+static struct chip {
+ unsigned int id;
+ bool throttled;
+} *chips;
+
+static int nr_chips;
+
/*
* Note: The set of pstates consists of contiguous integers, the
* smallest of which is indicated by powernv_pstate_info.min, the
@@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
static void powernv_cpufreq_throttle_check(unsigned int cpu)
{
unsigned long pmsr;
- int pmsr_pmax, pmsr_lp;
+ int pmsr_pmax, pmsr_lp, i;

pmsr = get_pmspr(SPRN_PMSR);

+ for (i = 0; i < nr_chips; i++)
+ if (chips[i].id == cpu_to_chip_id(cpu))
+ break;
+
/* Check for Pmax Capping */
pmsr_pmax = (s8)PMSR_MAX(pmsr);
if (pmsr_pmax != powernv_pstate_info.max) {
- throttled = true;
- pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
- pr_info("Max allowed Pstate is capped\n");
+ if (chips[i].throttled)
+ goto next;
+ chips[i].throttled = true;
+ pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
+ chips[i].id, pmsr_pmax);
+ } else if (chips[i].throttled) {
+ chips[i].throttled = false;
+ pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
+ chips[i].id, pmsr_pmax);
}

/*
* Check for Psafe by reading LocalPstate
* or check if Psafe_mode_active is set in PMSR.
*/
+next:
pmsr_lp = (s8)PMSR_LP(pmsr);
if ((pmsr_lp < powernv_pstate_info.min) ||
(pmsr & PMSR_PSAFE_ENABLE)) {
@@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
.attr = powernv_cpu_freq_attr,
};

+static int init_chip_info(void)
+{
+ unsigned int chip[256];
+ unsigned int cpu, i;
+ unsigned int prev_chip_id = UINT_MAX;
+
+ for_each_possible_cpu(cpu) {
+ unsigned int id = cpu_to_chip_id(cpu);
+
+ if (prev_chip_id != id) {
+ prev_chip_id = id;
+ chip[nr_chips++] = id;
+ }
+ }
+
+ chips = kmalloc_array(nr_chips, sizeof(struct chip), GFP_KERNEL);
+ if (!chips)
+ return -ENOMEM;
+
+ for (i = 0; i < nr_chips; i++) {
+ chips[i].id = chip[i];
+ chips[i].throttled = false;
+ }
+
+ return 0;
+}
+
static int __init powernv_cpufreq_init(void)
{
int rc = 0;
@@ -429,6 +475,11 @@ static int __init powernv_cpufreq_init(void)
return rc;
}

+ /* Populate chip info */
+ rc = init_chip_info();
+ if (rc)
+ return rc;
+
register_reboot_notifier(&powernv_cpufreq_reboot_nb);
return cpufreq_register_driver(&powernv_cpufreq_driver);
}
--
1.9.3
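
Note on patch 1/6: the per-chip bookkeeping described above can be
tried out in userspace. The following is a minimal sketch, not the
driver code; struct chip_state, find_chip() and update_pmax() are
hypothetical names. It assumes only what the patch shows: a linear
search by chip id and a message on each throttled/unthrottled
transition. Unlike the patch, the lookup returns NULL for an unknown
chip id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace mirror of the per-chip state added by the patch. */
struct chip_state {
    unsigned int id;
    bool throttled;
};

enum pmax_event { PMAX_NO_CHANGE, PMAX_REDUCED, PMAX_RESTORED };

/* Linear search by chip id; returns NULL when the id is unknown. */
static struct chip_state *find_chip(struct chip_state *chips, size_t nr,
                                    unsigned int id)
{
    for (size_t i = 0; i < nr; i++)
        if (chips[i].id == id)
            return &chips[i];
    return NULL;
}

/*
 * Compare the Pmax read from the chip with the nominal maximum and
 * report only transitions, so that repeated checks on an already
 * throttled chip do not repeat the message.
 */
static enum pmax_event update_pmax(struct chip_state *chip, int pmsr_pmax,
                                   int nominal_pmax)
{
    assert(chip != NULL);

    if (pmsr_pmax != nominal_pmax) {
        if (chip->throttled)
            return PMAX_NO_CHANGE;
        chip->throttled = true;
        return PMAX_REDUCED;
    }
    if (chip->throttled) {
        chip->throttled = false;
        return PMAX_RESTORED;
    }
    return PMAX_NO_CHANGE;
}
```

Calling update_pmax() twice with the same capped value yields
PMAX_REDUCED and then PMAX_NO_CHANGE, which corresponds to the
"goto next" path in the patch.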

2015-07-16 08:05:07

by Shilpasri G Bhat

Subject: [PATCH v5 2/6] powerpc/powernv: Add definition of OPAL_MSG_OCC message type

Add OPAL_MSG_OCC message definition to opal_message_type to receive
OCC events like reset, load and throttled. Host performance can be
affected when OCC is reset or OCC throttles the max Pstate.
We can register to opal_message_notifier to receive OPAL_MSG_OCC type
of message and report it to the userspace so as to keep the user
informed about the reason for a performance drop in workloads.

The reset and load OCC events are notified to kernel when FSP sends
OCC_RESET and OCC_LOAD commands. Both reset and load messages are
sent to kernel on successful completion of reset and load operation
respectively.

The throttle OCC event indicates that the Pmax of the chip is reduced.
The chip_id and throttle reason for reducing Pmax is also queued along
with the message.

CC: Stewart Smith <[email protected]>
Signed-off-by: Shilpasri G Bhat <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
No change from v4

Changes from v3:
- '0d7cd8550d3 powerpc/powernv: Add opal-prd channel' this patch adds
the definition of OPAL_MSG_PRD, so remove it and update the
changelog.
- Move the definitions of OCC_RESET, OCC_LOAD and OCC_THROTTLE from
drivers/cpufreq/powernv-cpufreq.c to arch/powerpc/include/asm/opal-api.h
- Define OCC_MAX_THROTTLE_STATUS
- Add a wrapper structure 'opal_occ_msg' to copy 'struct opal_msg.params[0..2]'
This structure will define the parameters received from firmware to
maintain compatibility for any future additions.

No change from v2

Change from v1:
- Update the commit changelog

arch/powerpc/include/asm/opal-api.h | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index e9e4c52..64dc9f5 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -361,6 +361,7 @@ enum opal_msg_type {
OPAL_MSG_HMI_EVT,
OPAL_MSG_DPO,
OPAL_MSG_PRD,
+ OPAL_MSG_OCC,
OPAL_MSG_TYPE_MAX,
};

@@ -700,6 +701,17 @@ struct opal_prd_msg_header {

struct opal_prd_msg;

+#define OCC_RESET 0
+#define OCC_LOAD 1
+#define OCC_THROTTLE 2
+#define OCC_MAX_THROTTLE_STATUS 5
+
+struct opal_occ_msg {
+ __be64 type;
+ __be64 chip;
+ __be64 throttle_status;
+};
+
/*
* SG entries
*
--
1.9.3
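
Note on patch 2/6: the three __be64 fields of 'struct opal_occ_msg'
arrive in big-endian order and have to be converted before use. The
sketch below shows that conversion in portable userspace C. It is an
illustration only: struct occ_msg_host, be64_to_host(), host_to_be64()
and decode_occ_msg() are hypothetical names, and the kernel itself
would use be64_to_cpu().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values taken from the patch. */
#define OCC_RESET               0
#define OCC_LOAD                1
#define OCC_THROTTLE            2
#define OCC_MAX_THROTTLE_STATUS 5

/* Host-endian view of the three message parameters (hypothetical name). */
struct occ_msg_host {
    uint64_t type;
    uint64_t chip;
    uint64_t throttle_status;
};

/* Interpret the bytes of 'be' as a big-endian 64-bit integer. */
static uint64_t be64_to_host(uint64_t be)
{
    unsigned char b[8];
    uint64_t v = 0;

    memcpy(b, &be, sizeof(b));
    for (int i = 0; i < 8; i++)
        v = (v << 8) | b[i];
    return v;
}

/* Inverse helper, used here only to build sample input. */
static uint64_t host_to_be64(uint64_t v)
{
    unsigned char b[8];
    uint64_t be;

    for (int i = 7; i >= 0; i--) {
        b[i] = (unsigned char)(v & 0xFF);
        v >>= 8;
    }
    memcpy(&be, b, sizeof(be));
    return be;
}

/* Convert params[0..2] from wire order to host order. */
static struct occ_msg_host decode_occ_msg(const uint64_t params_be[3])
{
    assert(params_be != NULL);

    struct occ_msg_host m = {
        .type = be64_to_host(params_be[0]),
        .chip = be64_to_host(params_be[1]),
        .throttle_status = be64_to_host(params_be[2]),
    };
    return m;
}
```

Working on bytes rather than relying on a platform header keeps the
sketch independent of the host's endianness.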

2015-07-16 08:06:28

by Shilpasri G Bhat

Subject: [PATCH v5 3/6] cpufreq: powernv: Register for OCC related opal_message notification

The OCC is an On-Chip Controller which takes care of the power and
thermal safety of the chip. At runtime, the OCC may throttle the
frequencies of the CPUs because of a power failure or overtemperature,
to remain within the power budget.

We want the cpufreq driver to be aware of such situations so that it
can report the reason to the user. We register with
opal_message_notifier to receive OCC messages from OPAL.

powernv_cpufreq_throttle_check() reports any frequency throttling and
this patch will report the reason or event that caused throttling. We
can be throttled if OCC is reset or OCC limits Pmax due to power or
thermal reasons. We are also notified of unthrottling after an OCC
reset or if OCC restores Pmax on the chip.

Signed-off-by: Shilpasri G Bhat <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
Changes from v4:
- Replace memcpy() with be64_to_cpu() to copy the msg->params[]

Changes from v3:
- Move the macro definitions of OCC_RESET, OCC_LOAD, OCC_THROTTLE to
arch/powerpc/include/asm/opal-api.h
- Use 'struct opal_occ_msg' to copy the 'opal_msg->params[]' and refer
the members of this structure in the code; Replace 'chip_id',
'token' and 'reason' with omsg.chip, omsg.type, omsg.throttle_status
- Use OCC_MAX_THROTTLE_STATUS instead of the magic number.
- Add opal_message_notifier_unregister()

Changes from v2:
- Patch split in to multiple patches.
- This patch contains only the opal_message notification handler

Changes from v1:
- Add macros to define OCC_RESET, OCC_LOAD and OCC_THROTTLE
- Define a structure to store chip id, chip mask which has bits set
for cpus present in the chip, throttled state and a work_struct.
- Modify powernv_cpufreq_throttle_check() to be called via smp_call()
- On Pmax throttling/unthrottling update 'chip.throttled' and not the
global 'throttled' as Pmax capping is local to the chip.
- Remove the condition which checks if local pstate is less than Pmin
while checking for Psafe frequency. When OCC becomes active after
reset we update 'throttled' to false and when the cpufreq governor
initiates a pstate change, the local pstate will be in Psafe and we
will be reporting a false positive when we are not throttled.
- Schedule a kworker on receiving throttling/unthrottling OCC message
for that chip and schedule on all chips after receiving active.
- After an OCC reset all the cpus will be in Psafe frequency. So call
target() and restore the frequency to policy->cur after OCC_ACTIVE
and Pmax unthrottling
- Taken care of Viresh and Preeti's comments.

drivers/cpufreq/powernv-cpufreq.c | 74 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 73 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index d0c18c9..a634199 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -33,6 +33,7 @@
#include <asm/firmware.h>
#include <asm/reg.h>
#include <asm/smp.h> /* Required for cpu_sibling_mask() in UP configs */
+#include <asm/opal.h>

#define POWERNV_MAX_PSTATES 256
#define PMSR_PSAFE_ENABLE (1UL << 30)
@@ -41,7 +42,7 @@
#define PMSR_LP(x) ((x >> 48) & 0xFF)

static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
-static bool rebooting, throttled;
+static bool rebooting, throttled, occ_reset;

static struct chip {
unsigned int id;
@@ -414,6 +415,74 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
.notifier_call = powernv_cpufreq_reboot_notifier,
};

+static char throttle_reason[][30] = {
+ "No throttling",
+ "Power Cap",
+ "Processor Over Temperature",
+ "Power Supply Failure",
+ "Over Current",
+ "OCC Reset"
+ };
+
+static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
+ unsigned long msg_type, void *_msg)
+{
+ struct opal_msg *msg = _msg;
+ struct opal_occ_msg omsg;
+
+ if (msg_type != OPAL_MSG_OCC)
+ return 0;
+
+ omsg.type = be64_to_cpu(msg->params[0]);
+
+ switch (omsg.type) {
+ case OCC_RESET:
+ occ_reset = true;
+ /*
+ * powernv_cpufreq_throttle_check() is called in
+ * target() callback which can detect the throttle state
+ * for governors like ondemand.
+ * But static governors will not call target() often thus
+ * report throttling here.
+ */
+ if (!throttled) {
+ throttled = true;
+ pr_crit("CPU Frequency is throttled\n");
+ }
+ pr_info("OCC: Reset\n");
+ break;
+ case OCC_LOAD:
+ pr_info("OCC: Loaded\n");
+ break;
+ case OCC_THROTTLE:
+ omsg.chip = be64_to_cpu(msg->params[1]);
+ omsg.throttle_status = be64_to_cpu(msg->params[2]);
+
+ if (occ_reset) {
+ occ_reset = false;
+ throttled = false;
+ pr_info("OCC: Active\n");
+ return 0;
+ }
+
+ if (omsg.throttle_status &&
+ omsg.throttle_status <= OCC_MAX_THROTTLE_STATUS)
+ pr_info("OCC: Chip %u Pmax reduced due to %s\n",
+ (unsigned int)omsg.chip,
+ throttle_reason[omsg.throttle_status]);
+ else if (!omsg.throttle_status)
+ pr_info("OCC: Chip %u %s\n", (unsigned int)omsg.chip,
+ throttle_reason[omsg.throttle_status]);
+ }
+ return 0;
+}
+
+static struct notifier_block powernv_cpufreq_opal_nb = {
+ .notifier_call = powernv_cpufreq_occ_msg,
+ .next = NULL,
+ .priority = 0,
+};
+
static void powernv_cpufreq_stop_cpu(struct cpufreq_policy *policy)
{
struct powernv_smp_call_data freq_data;
@@ -481,6 +550,7 @@ static int __init powernv_cpufreq_init(void)
return rc;

register_reboot_notifier(&powernv_cpufreq_reboot_nb);
+ opal_message_notifier_register(OPAL_MSG_OCC, &powernv_cpufreq_opal_nb);
return cpufreq_register_driver(&powernv_cpufreq_driver);
}
module_init(powernv_cpufreq_init);
@@ -488,6 +558,8 @@ module_init(powernv_cpufreq_init);
static void __exit powernv_cpufreq_exit(void)
{
unregister_reboot_notifier(&powernv_cpufreq_reboot_nb);
+ opal_message_notifier_unregister(OPAL_MSG_OCC,
+ &powernv_cpufreq_opal_nb);
cpufreq_unregister_driver(&powernv_cpufreq_driver);
}
module_exit(powernv_cpufreq_exit);
--
1.9.3
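
Note on patch 3/6: the handler is a small state machine. A reset
marks the frequency as throttled, and the first OCC_THROTTLE message
after a reset is read as "OCC active again". The userspace sketch
below mirrors that logic under stated assumptions: struct occ_state,
enum occ_report, reason_for() and occ_handle() are hypothetical names,
and the handler returns a code instead of printing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { OCC_RESET = 0, OCC_LOAD = 1, OCC_THROTTLE = 2 };
#define OCC_MAX_THROTTLE_STATUS 5

enum occ_report {
    REPORT_NONE,
    REPORT_RESET,
    REPORT_LOADED,
    REPORT_ACTIVE,
    REPORT_PMAX_REDUCED,
    REPORT_NO_THROTTLING,
};

struct occ_state {
    bool throttled;  /* global throttled flag */
    bool occ_reset;  /* a reset was seen and the OCC is not yet active */
};

/* Reason strings as listed in the patch, indexed by throttle status. */
static const char *const throttle_reason[OCC_MAX_THROTTLE_STATUS + 1] = {
    "No throttling",
    "Power Cap",
    "Processor Over Temperature",
    "Power Supply Failure",
    "Over Current",
    "OCC Reset",
};

/* Bounds-checked lookup; returns NULL for an unknown status. */
static const char *reason_for(uint64_t status)
{
    if (status > OCC_MAX_THROTTLE_STATUS)
        return NULL;
    return throttle_reason[status];
}

static enum occ_report occ_handle(struct occ_state *s, uint64_t type,
                                  uint64_t status)
{
    assert(s != NULL);

    switch (type) {
    case OCC_RESET:
        s->occ_reset = true;
        s->throttled = true;
        return REPORT_RESET;
    case OCC_LOAD:
        return REPORT_LOADED;
    case OCC_THROTTLE:
        if (s->occ_reset) {
            /* First throttle message after a reset: OCC is active. */
            s->occ_reset = false;
            s->throttled = false;
            return REPORT_ACTIVE;
        }
        if (status == 0)
            return REPORT_NO_THROTTLING;
        if (status <= OCC_MAX_THROTTLE_STATUS)
            return REPORT_PMAX_REDUCED;
        return REPORT_NONE;
    default:
        return REPORT_NONE;
    }
}
```

A status above OCC_MAX_THROTTLE_STATUS is ignored in both the patch
and the sketch, so an unexpected value from firmware never indexes
past the end of the reason table.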

2015-07-16 08:15:22

by Shilpasri G Bhat

Subject: [PATCH v5 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

Re-evaluate the chip's throttled state on receiving an OCC_THROTTLE
notification by executing *throttle_check() on any one of the CPUs on
the chip. This is a sanity check to verify that we were indeed
throttled/unthrottled after receiving the OCC_THROTTLE notification.

We cannot call *throttle_check() directly from the notification
handler because we could be handling chip1's notification on chip2. So
initiate an smp_call to execute *throttle_check(). We are irq-disabled
in the notification handler, so use a worker thread to smp_call
throttle_check() on any of the CPUs in the chip's mask.

Signed-off-by: Shilpasri G Bhat <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
No changes from v4

Changes from v3:
- Refer to the members of 'struct opal_occ_msg' in the patch.
Replace 'chip_id' with 'omsg.chip'

drivers/cpufreq/powernv-cpufreq.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index a634199..22f33ff 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -47,6 +47,8 @@ static bool rebooting, throttled, occ_reset;
static struct chip {
unsigned int id;
bool throttled;
+ cpumask_t mask;
+ struct work_struct throttle;
} *chips;

static int nr_chips;
@@ -307,8 +309,9 @@ static inline unsigned int get_nominal_index(void)
return powernv_pstate_info.max - powernv_pstate_info.nominal;
}

-static void powernv_cpufreq_throttle_check(unsigned int cpu)
+static void powernv_cpufreq_throttle_check(void *data)
{
+ unsigned int cpu = smp_processor_id();
unsigned long pmsr;
int pmsr_pmax, pmsr_lp, i;

@@ -370,7 +373,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
return 0;

if (!throttled)
- powernv_cpufreq_throttle_check(smp_processor_id());
+ powernv_cpufreq_throttle_check(NULL);

freq_data.pstate_id = powernv_freqs[new_index].driver_data;

@@ -415,6 +418,14 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
.notifier_call = powernv_cpufreq_reboot_notifier,
};

+void powernv_cpufreq_work_fn(struct work_struct *work)
+{
+ struct chip *chip = container_of(work, struct chip, throttle);
+
+ smp_call_function_any(&chip->mask,
+ powernv_cpufreq_throttle_check, NULL, 0);
+}
+
static char throttle_reason[][30] = {
"No throttling",
"Power Cap",
@@ -429,6 +440,7 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
{
struct opal_msg *msg = _msg;
struct opal_occ_msg omsg;
+ int i;

if (msg_type != OPAL_MSG_OCC)
return 0;
@@ -462,6 +474,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
occ_reset = false;
throttled = false;
pr_info("OCC: Active\n");
+
+ for (i = 0; i < nr_chips; i++)
+ schedule_work(&chips[i].throttle);
+
return 0;
}

@@ -473,6 +489,12 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
else if (!omsg.throttle_status)
pr_info("OCC: Chip %u %s\n", (unsigned int)omsg.chip,
throttle_reason[omsg.throttle_status]);
+ else
+ return 0;
+
+ for (i = 0; i < nr_chips; i++)
+ if (chips[i].id == omsg.chip)
+ schedule_work(&chips[i].throttle);
}
return 0;
}
@@ -524,6 +546,8 @@ static int init_chip_info(void)
for (i = 0; i < nr_chips; i++) {
chips[i].id = chip[i];
chips[i].throttled = false;
+ cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
+ INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
}

return 0;
--
1.9.3

2015-07-16 08:15:35

by Shilpasri G Bhat

Subject: [PATCH v5 5/6] cpufreq: powernv: Report Psafe only if PMSR.psafe_mode_active bit is set

During an OCC reset cycle, the system leaves the safe frequency state,
but the local pstate is not restored to Pmin or the last requested
pstate. If the cpufreq governor then initiates a pstate change, the
local pstate will still be at Psafe and we will report a false
positive although we are not throttled.

So in powernv_cpufreq_throttle_check() remove the condition which
checks if local pstate is less than Pmin while checking for Psafe
frequency. If the cpus are forced to Psafe then PMSR.psafe_mode_active
bit will be set. So, when OCCs become active this bit will be cleared.
Let us just rely on this bit for reporting throttling.

Signed-off-by: Shilpasri G Bhat <[email protected]>
Reviewed-by: Preeti U Murthy <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
No changes from v4

drivers/cpufreq/powernv-cpufreq.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index 22f33ff..90b4293 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -39,7 +39,6 @@
#define PMSR_PSAFE_ENABLE (1UL << 30)
#define PMSR_SPR_EM_DISABLE (1UL << 31)
#define PMSR_MAX(x) ((x >> 32) & 0xFF)
-#define PMSR_LP(x) ((x >> 48) & 0xFF)

static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
static bool rebooting, throttled, occ_reset;
@@ -313,7 +312,7 @@ static void powernv_cpufreq_throttle_check(void *data)
{
unsigned int cpu = smp_processor_id();
unsigned long pmsr;
- int pmsr_pmax, pmsr_lp, i;
+ int pmsr_pmax, i;

pmsr = get_pmspr(SPRN_PMSR);

@@ -335,14 +334,9 @@ static void powernv_cpufreq_throttle_check(void *data)
chips[i].id, pmsr_pmax);
}

- /*
- * Check for Psafe by reading LocalPstate
- * or check if Psafe_mode_active is set in PMSR.
- */
+ /* Check if Psafe_mode_active is set in PMSR. */
next:
- pmsr_lp = (s8)PMSR_LP(pmsr);
- if ((pmsr_lp < powernv_pstate_info.min) ||
- (pmsr & PMSR_PSAFE_ENABLE)) {
+ if (pmsr & PMSR_PSAFE_ENABLE) {
throttled = true;
pr_info("Pstate set to safe frequency\n");
}
--
1.9.3
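
Note on patch 5/6: after this change the throttle check reads two
fields from the PMSR: the Pmax byte at bits 32..39, sign-extended, and
the Psafe-mode-active bit at position 30. The sketch below decodes
both from a sample value in userspace. The macros follow the patch,
except that 1ULL replaces 1UL so that the sketch also works where long
is 32 bits wide; pmsr_pmax(), pmsr_psafe_active() and pmax_capped()
are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMSR_PSAFE_ENABLE (1ULL << 30)
#define PMSR_MAX(x)       (((x) >> 32) & 0xFF)

/*
 * Pstate ids are signed 8-bit values, so the extracted byte is
 * sign-extended. The conversion to int8_t wraps on common compilers,
 * which is what the (s8) cast in the driver relies on.
 */
static int pmsr_pmax(uint64_t pmsr)
{
    return (int8_t)PMSR_MAX(pmsr);
}

/* True when the Psafe_mode_active bit is set. */
static bool pmsr_psafe_active(uint64_t pmsr)
{
    return (pmsr & PMSR_PSAFE_ENABLE) != 0;
}

/* Pmax is capped when it differs from the nominal maximum pstate. */
static bool pmax_capped(uint64_t pmsr, int nominal_max)
{
    assert(nominal_max >= INT8_MIN && nominal_max <= INT8_MAX);
    return pmsr_pmax(pmsr) != nominal_max;
}
```

As a worked value: a PMSR with 0xFD in the Pmax byte decodes to
pstate -3, which counts as capped when the nominal maximum is 0.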

2015-07-16 08:15:20

by Shilpasri G Bhat

Subject: [PATCH v5 6/6] cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling

If the frequency is throttled due to an OCC reset, the CPUs will be at
the Psafe frequency, so restore the frequency on all CPUs to
policy->cur when the OCCs are active again. If the frequency is
throttled due to Pmax capping, restore the frequency of all the CPUs
in the chip on unthrottling.

Signed-off-by: Shilpasri G Bhat <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
---
No changes from v4

Changes from v3:
- Refer to the members of 'struct opal_occ_msg' in the patch.
Replace 'reason' with 'omsg.throttle_status'

drivers/cpufreq/powernv-cpufreq.c | 31 +++++++++++++++++++++++++++++--
1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index 90b4293..546e056 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -48,6 +48,7 @@ static struct chip {
bool throttled;
cpumask_t mask;
struct work_struct throttle;
+ bool restore;
} *chips;

static int nr_chips;
@@ -415,9 +416,29 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
void powernv_cpufreq_work_fn(struct work_struct *work)
{
struct chip *chip = container_of(work, struct chip, throttle);
+ unsigned int cpu;
+ cpumask_var_t mask;

smp_call_function_any(&chip->mask,
powernv_cpufreq_throttle_check, NULL, 0);
+
+ if (!chip->restore)
+ return;
+
+ chip->restore = false;
+ cpumask_copy(mask, &chip->mask);
+ for_each_cpu_and(cpu, mask, cpu_online_mask) {
+ int index, tcpu;
+ struct cpufreq_policy policy;
+
+ cpufreq_get_policy(&policy, cpu);
+ cpufreq_frequency_table_target(&policy, policy.freq_table,
+ policy.cur,
+ CPUFREQ_RELATION_C, &index);
+ powernv_cpufreq_target_index(&policy, index);
+ for_each_cpu(tcpu, policy.cpus)
+ cpumask_clear_cpu(tcpu, mask);
+ }
}

static char throttle_reason[][30] = {
@@ -469,8 +490,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
throttled = false;
pr_info("OCC: Active\n");

- for (i = 0; i < nr_chips; i++)
+ for (i = 0; i < nr_chips; i++) {
+ chips[i].restore = true;
schedule_work(&chips[i].throttle);
+ }

return 0;
}
@@ -487,8 +510,11 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
return 0;

for (i = 0; i < nr_chips; i++)
- if (chips[i].id == omsg.chip)
+ if (chips[i].id == omsg.chip) {
+ if (!omsg.throttle_status)
+ chips[i].restore = true;
schedule_work(&chips[i].throttle);
+ }
}
return 0;
}
@@ -542,6 +568,7 @@ static int init_chip_info(void)
chips[i].throttled = false;
cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
+ chips[i].restore = false;
}

return 0;
--
1.9.3
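
Note on patch 6/6: the restore loop walks the online CPUs of a chip
and clears every CPU of a policy from the working mask once that
policy has been handled, so each policy is restored only once. The
sketch below shows that pattern with a 64-bit bitmask in userspace.
It assumes, for illustration only, that a policy covers the 8 threads
of one core; policy_cpus() and restore_chip() are hypothetical names
and no frequency is actually set.

```c
#include <assert.h>
#include <stdint.h>

#define THREADS_PER_CORE 8  /* assumption made for this sketch */

/* CPUs sharing a policy with 'cpu': all threads of the same core. */
static uint64_t policy_cpus(unsigned int cpu)
{
    assert(cpu < 64);

    unsigned int first = cpu - (cpu % THREADS_PER_CORE);
    return 0xFFULL << first;
}

/*
 * Visit every online CPU of the chip that has not been covered yet.
 * Returns the number of policies that would have been restored.
 */
static unsigned int restore_chip(uint64_t chip_mask, uint64_t online_mask)
{
    uint64_t pending = chip_mask;
    unsigned int restored = 0;

    for (unsigned int cpu = 0; cpu < 64; cpu++) {
        if ((((pending & online_mask) >> cpu) & 1) == 0)
            continue;
        /* The driver would re-request policy->cur at this point. */
        restored++;
        /* Drop the sibling CPUs so the policy is not handled twice. */
        pending &= ~policy_cpus(cpu);
    }
    return restored;
}
```

With two fully online cores in the chip mask, restore_chip() counts
two restores rather than sixteen.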

2015-08-10 00:24:30

by Stewart Smith

Subject: Re: [PATCH v5 2/6] powerpc/powernv: Add definition of OPAL_MSG_OCC message type

Shilpasri G Bhat <[email protected]> writes:
> Add OPAL_MSG_OCC message definition to opal_message_type to receive
> OCC events like reset, load and throttled. Host performance can be
> affected when OCC is reset or OCC throttles the max Pstate.
> We can register to opal_message_notifier to receive OPAL_MSG_OCC type
> of message and report it to the userspace so as to keep the user
> informed about the reason for a performance drop in workloads.
>
> The reset and load OCC events are notified to kernel when FSP sends
> OCC_RESET and OCC_LOAD commands. Both reset and load messages are
> sent to kernel on successful completion of reset and load operation
> respectively.

How is this done on OpenPower systems? Explanation involving just what
OPAL does is likely better, rather than explaining in context of FSP,
which Linux has no real knowledge of (OPAL provides all abstraction of
it).

2015-08-10 01:41:19

by Stewart Smith

Subject: Re: [PATCH v5 3/6] cpufreq: powernv: Register for OCC related opal_message notification

Shilpasri G Bhat <[email protected]> writes:
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index d0c18c9..a634199 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -33,6 +33,7 @@
> #include <asm/firmware.h>
> #include <asm/reg.h>
> #include <asm/smp.h> /* Required for cpu_sibling_mask() in UP configs */
> +#include <asm/opal.h>
>
> #define POWERNV_MAX_PSTATES 256
> #define PMSR_PSAFE_ENABLE (1UL << 30)
> @@ -41,7 +42,7 @@
> #define PMSR_LP(x) ((x >> 48) & 0xFF)
>
> static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
> -static bool rebooting, throttled;
> +static bool rebooting, throttled, occ_reset;
>
> static struct chip {
> unsigned int id;
> @@ -414,6 +415,74 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
> .notifier_call = powernv_cpufreq_reboot_notifier,
> };
>
> +static char throttle_reason[][30] = {
> + "No throttling",
> + "Power Cap",
> + "Processor Over Temperature",
> + "Power Supply Failure",
> + "Over Current",
> + "OCC Reset"
> + };
> +
> +static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
> + unsigned long msg_type, void *_msg)
> +{
> + struct opal_msg *msg = _msg;
> + struct opal_occ_msg omsg;
> +
> + if (msg_type != OPAL_MSG_OCC)
> + return 0;
> +
> + omsg.type = be64_to_cpu(msg->params[0]);
> +
> + switch (omsg.type) {
> + case OCC_RESET:
> + occ_reset = true;
> + /*
> + * powernv_cpufreq_throttle_check() is called in
> + * target() callback which can detect the throttle state
> + * for governors like ondemand.
> + * But static governors will not call target() often thus
> + * report throttling here.
> + */
> + if (!throttled) {
> + throttled = true;
> + pr_crit("CPU Frequency is throttled\n");
> + }
> + pr_info("OCC: Reset\n");
> + break;
> + case OCC_LOAD:
> + pr_info("OCC: Loaded\n");
> + break;

I wonder if we could have the log messages be a bit clearer here, odds
are, unless you're one of the people reading this code, you have no idea
what an OCC is or what on earth "OCC: Loaded" means and why this
*doesn't* mean that your CPUs are no longer throttled so that your
computer doesn't catch fire/break/add 1+1 and get 4.

Also, do we export this information via sysfs somewhere? It would seem
to want to go along with other cpufreq/cpu info there.

It feels like we could do much better at informing users as to what is
going on.... maybe something like:

"OCC (On Chip Controller - enforces hard thermal/power limits) Resetting: CPU frequency throttled for duration"
"OCC Loading, CPU frequency throttled until OCC started"
"OCC Active, CPU frequency no longer throttled"

2015-08-10 07:38:30

by Shilpasri G Bhat

Subject: Re: [PATCH v5 2/6] powerpc/powernv: Add definition of OPAL_MSG_OCC message type

Hi Stewart,

On 08/10/2015 05:53 AM, Stewart Smith wrote:
> Shilpasri G Bhat <[email protected]> writes:
>> Add OPAL_MSG_OCC message definition to opal_message_type to receive
>> OCC events like reset, load and throttled. Host performance can be
>> affected when OCC is reset or OCC throttles the max Pstate.
>> We can register to opal_message_notifier to receive OPAL_MSG_OCC type
>> of message and report it to the userspace so as to keep the user
>> informed about the reason for a performance drop in workloads.
>>
>> The reset and load OCC events are notified to kernel when FSP sends
>> OCC_RESET and OCC_LOAD commands. Both reset and load messages are
>> sent to kernel on successful completion of reset and load operation
>> respectively.
>
> How is this done on OpenPower systems? Explanation involving just what
> OPAL does is likely better, rather than explaining in context of FSP,
> which Linux has no real knowledge of (OPAL provides all abstraction of
> it).
>

On OpenPower systems, OPAL sends only the OCC throttle event. OCC reset
and load messages are not sent to the kernel.

How about the following git log message?

Add OPAL_MSG_OCC message definition to opal_message_type to receive
OCC events like reset, load and throttled. Host performance can be
affected when OCC is reset or OCC throttles the max Pstate. We can
register to opal_message_notifier to receive OPAL_MSG_OCC type of
message and report it to the userspace so as to keep the user informed
about the reason for a performance drop in workloads.

OPAL will send reset and load events to the kernel on successful
completion of the OCC reset and load operations. During this period
the CPU frequency will be throttled until the OCC is started. OPAL will
send a throttle message during the OCC reset cycle to indicate that
the OCC is active.

OPAL will send a throttle message to the kernel when the OCC is active
to indicate that the Pmax of the chip is reduced. The chip_id and the
throttle reason for reducing Pmax are queued along with the message.

Thanks and Regards,
Shilpa

2015-08-10 07:51:36

by Shilpasri G Bhat

Subject: Re: [PATCH v5 3/6] cpufreq: powernv: Register for OCC related opal_message notification



On 08/10/2015 07:11 AM, Stewart Smith wrote:
> Shilpasri G Bhat <[email protected]> writes:
>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>> index d0c18c9..a634199 100644
>> --- a/drivers/cpufreq/powernv-cpufreq.c
>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>> @@ -33,6 +33,7 @@
>> #include <asm/firmware.h>
>> #include <asm/reg.h>
>> #include <asm/smp.h> /* Required for cpu_sibling_mask() in UP configs */
>> +#include <asm/opal.h>
>>
>> #define POWERNV_MAX_PSTATES 256
>> #define PMSR_PSAFE_ENABLE (1UL << 30)
>> @@ -41,7 +42,7 @@
>> #define PMSR_LP(x) ((x >> 48) & 0xFF)
>>
>> static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>> -static bool rebooting, throttled;
>> +static bool rebooting, throttled, occ_reset;
>>
>> static struct chip {
>> unsigned int id;
>> @@ -414,6 +415,74 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
>> .notifier_call = powernv_cpufreq_reboot_notifier,
>> };
>>
>> +static char throttle_reason[][30] = {
>> + "No throttling",
>> + "Power Cap",
>> + "Processor Over Temperature",
>> + "Power Supply Failure",
>> + "Over Current",
>> + "OCC Reset"
>> + };
>> +
>> +static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>> + unsigned long msg_type, void *_msg)
>> +{
>> + struct opal_msg *msg = _msg;
>> + struct opal_occ_msg omsg;
>> +
>> + if (msg_type != OPAL_MSG_OCC)
>> + return 0;
>> +
>> + omsg.type = be64_to_cpu(msg->params[0]);
>> +
>> + switch (omsg.type) {
>> + case OCC_RESET:
>> + occ_reset = true;
>> + /*
>> + * powernv_cpufreq_throttle_check() is called in
>> + * target() callback which can detect the throttle state
>> + * for governors like ondemand.
>> + * But static governors will not call target() often thus
>> + * report throttling here.
>> + */
>> + if (!throttled) {
>> + throttled = true;
>> + pr_crit("CPU Frequency is throttled\n");
>> + }
>> + pr_info("OCC: Reset\n");
>> + break;
>> + case OCC_LOAD:
>> + pr_info("OCC: Loaded\n");
>> + break;
>
> I wonder if we could have the log messages be a bit clearer here, odds
> are, unless you're one of the people reading this code, you have no idea
> what an OCC is or what on earth "OCC: Loaded" means and why this
> *doesn't* mean that your CPUs are no longer throttled so that your
> computer doesn't catch fire/break/add 1+1 and get 4.
>
> Also, do we export this information via sysfs somewhere? It would seem
> to want to go along with other cpufreq/cpu info there.

No we don't export the throttling status of the cpu via sysfs. Since the
throttling state is common across the chip, the per_cpu export will be
redundant. Did you mean something like one of the below:

1)/sys/devices/system/cpu/cpufreq/chipN_throttle

2)/sys/devices/system/cpu/cpuN/cpufreq/throttle

>
> It feels like we could do much better at informing users as to what is
> going on.... maybe something like:
>
> "OCC (On Chip Controller - enforces hard thermal/power limits) Resetting: CPU frequency throttled for duration"
> "OCC Loading, CPU frequency throttled until OCC started"
> "OCC Active, CPU frequency no longer throttled"
>

Okay will change the messages.

Thanks and Regards,
Shilpa

2015-08-10 07:55:30

by Viresh Kumar

Subject: Re: [PATCH v5 3/6] cpufreq: powernv: Register for OCC related opal_message notification

On 10-08-15, 13:21, Shilpasri G Bhat wrote:
> Okay will change the messages.

This series is already applied by Rafael. So send a new patch if there
is any code change. Else, just let it go :)

--
viresh

2015-08-10 08:19:18

by Stewart Smith

Subject: Re: [PATCH v5 3/6] cpufreq: powernv: Register for OCC related opal_message notification

Shilpasri G Bhat <[email protected]> writes:
>> Also, do we export this information via sysfs somewhere? It would seem
>> to want to go along with other cpufreq/cpu info there.
>
> No we don't export the throttling status of the cpu via sysfs. Since the
> throttling state is common across the chip, the per_cpu export will be
> redundant. Did you mean something like one of the below:
>
> 1)/sys/devices/system/cpu/cpufreq/chipN_throttle
>
> 2)/sys/devices/system/cpu/cpuN/cpufreq/throttle

yeah, I was thinking something like that.