2015-05-04 09:00:31

by Shilpasri G Bhat

[permalink] [raw]
Subject: [PATCH v3 0/6] powernv: cpufreq: Report frequency throttle by OCC

This patchset intends to add frequency throttle reporting mechanism
to powernv-cpufreq driver when OCC throttles the frequency. OCC is an
On-Chip-Controller which takes care of the power and thermal safety of
the chip. The CPU frequency can be throttled during an OCC reset or
when OCC tries to limit the max allowed frequency. The patchset will
report such conditions so as to keep the user informed about reason
for the drop in performance of workloads when frequency is throttled.

Changes from v2:
- Split into multiple patches
- Semantic fixes

Shilpasri G Bhat (6):
cpufreq: poowernv: Handle throttling due to Pmax capping at chip level
powerpc/powernv: Add definition of OPAL_MSG_OCC message type
cpufreq: powernv: Register for OCC related opal_message notification
cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
cpufreq: powernv: Report Psafe only if PMSR.psafe_mode_active bit is
set
cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling

arch/powerpc/include/asm/opal-api.h | 8 ++
drivers/cpufreq/powernv-cpufreq.c | 199 +++++++++++++++++++++++++++++++++---
2 files changed, 192 insertions(+), 15 deletions(-)

--
1.9.3


2015-05-04 09:01:16

by Shilpasri G Bhat

[permalink] [raw]
Subject: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level

The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the
max allowed frequency for that chip if the chip exceeds its power or
temperature limits. As Pmax capping is a chip level condition report
this throttling behavior at chip level and also do not set the global
'throttled' on Pmax capping instead set the per-chip throttled
variable. Report unthrottling if Pmax is restored after throttling.

This patch adds a structure to store chip id and throttled state of
the chip.

Signed-off-by: Shilpasri G Bhat <[email protected]>
---
drivers/cpufreq/powernv-cpufreq.c | 59 ++++++++++++++++++++++++++++++++++++---
1 file changed, 55 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index ebef0d8..d0c18c9 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -27,6 +27,7 @@
#include <linux/smp.h>
#include <linux/of.h>
#include <linux/reboot.h>
+#include <linux/slab.h>

#include <asm/cputhreads.h>
#include <asm/firmware.h>
@@ -42,6 +43,13 @@
static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
static bool rebooting, throttled;

+static struct chip {
+ unsigned int id;
+ bool throttled;
+} *chips;
+
+static int nr_chips;
+
/*
* Note: The set of pstates consists of contiguous integers, the
* smallest of which is indicated by powernv_pstate_info.min, the
@@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
static void powernv_cpufreq_throttle_check(unsigned int cpu)
{
unsigned long pmsr;
- int pmsr_pmax, pmsr_lp;
+ int pmsr_pmax, pmsr_lp, i;

pmsr = get_pmspr(SPRN_PMSR);

+ for (i = 0; i < nr_chips; i++)
+ if (chips[i].id == cpu_to_chip_id(cpu))
+ break;
+
/* Check for Pmax Capping */
pmsr_pmax = (s8)PMSR_MAX(pmsr);
if (pmsr_pmax != powernv_pstate_info.max) {
- throttled = true;
- pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
- pr_info("Max allowed Pstate is capped\n");
+ if (chips[i].throttled)
+ goto next;
+ chips[i].throttled = true;
+ pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
+ chips[i].id, pmsr_pmax);
+ } else if (chips[i].throttled) {
+ chips[i].throttled = false;
+ pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
+ chips[i].id, pmsr_pmax);
}

/*
* Check for Psafe by reading LocalPstate
* or check if Psafe_mode_active is set in PMSR.
*/
+next:
pmsr_lp = (s8)PMSR_LP(pmsr);
if ((pmsr_lp < powernv_pstate_info.min) ||
(pmsr & PMSR_PSAFE_ENABLE)) {
@@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
.attr = powernv_cpu_freq_attr,
};

+static int init_chip_info(void)
+{
+ unsigned int chip[256];
+ unsigned int cpu, i;
+ unsigned int prev_chip_id = UINT_MAX;
+
+ for_each_possible_cpu(cpu) {
+ unsigned int id = cpu_to_chip_id(cpu);
+
+ if (prev_chip_id != id) {
+ prev_chip_id = id;
+ chip[nr_chips++] = id;
+ }
+ }
+
+ chips = kmalloc_array(nr_chips, sizeof(struct chip), GFP_KERNEL);
+ if (!chips)
+ return -ENOMEM;
+
+ for (i = 0; i < nr_chips; i++) {
+ chips[i].id = chip[i];
+ chips[i].throttled = false;
+ }
+
+ return 0;
+}
+
static int __init powernv_cpufreq_init(void)
{
int rc = 0;
@@ -429,6 +475,11 @@ static int __init powernv_cpufreq_init(void)
return rc;
}

+ /* Populate chip info */
+ rc = init_chip_info();
+ if (rc)
+ return rc;
+
register_reboot_notifier(&powernv_cpufreq_reboot_nb);
return cpufreq_register_driver(&powernv_cpufreq_driver);
}
--
1.9.3

2015-05-04 09:01:07

by Shilpasri G Bhat

[permalink] [raw]
Subject: [PATCH v3 2/6] powerpc/powernv: Add definition of OPAL_MSG_OCC message type

Add OPAL_MSG_OCC message definition to opal_message_type to receive
OCC events like reset, load and throttled. Host performance can be
affected when OCC is reset or OCC throttles the max Pstate.
We can register to opal_message_notifier to receive OPAL_MSG_OCC type
of message and report it to the userspace so as to keep the user
informed about the reason for a performance drop in workloads.

The reset and load OCC events are notified to kernel when FSP sends
OCC_RESET and OCC_LOAD commands. Both reset and load messages are
sent to kernel on successful completion of reset and load operation
respectively.

The throttle OCC event indicates that the Pmax of the chip is reduced.
The chip_id and throttle reason for reducing Pmax is also queued along
with the message.

Additional opal message type OPAL_MSG_PRD is added to maintain
compatibility between opal and kernel definition of opal_message_type.

Signed-off-by: Shilpasri G Bhat <[email protected]>
Reviewed-by: Preeti U Murthy <[email protected]>
---
No change from V2

Change from v1:
- Update the commit changelog

arch/powerpc/include/asm/opal-api.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 0321a90..50053b7 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -352,6 +352,14 @@ enum opal_msg_type {
OPAL_MSG_SHUTDOWN, /* params[0] = 1 reboot, 0 shutdown */
OPAL_MSG_HMI_EVT,
OPAL_MSG_DPO,
+ OPAL_MSG_PRD,
+ OPAL_MSG_OCC, /*
+ * params[0] = 0 reset,
+ * 1 load,
+ * 2 throttle
+ * params[1] = chip_id
+ * params[2] = throttle_status
+ */
OPAL_MSG_TYPE_MAX,
};

--
1.9.3

2015-05-04 09:00:46

by Shilpasri G Bhat

[permalink] [raw]
Subject: [PATCH v3 3/6] cpufreq: powernv: Register for OCC related opal_message notification

OCC is an On-Chip-Controller which takes care of power and thermal
safety of the chip. During runtime due to power failure or
overtemperature the OCC may throttle the frequencies of the CPUs to
remain within the power budget.

We want the cpufreq driver to be aware of such situations to be able
to report the reason to the user. We register to opal_message_notifier
to receive OCC messages from opal.

powernv_cpufreq_throttle_check() reports any frequency throttling and
this patch will report the reason or event that caused throttling. We
can be throttled if OCC is reset or OCC limits Pmax due to power or
thermal reasons. We are also notified of unthrottling after an OCC
reset or if OCC restores Pmax on the chip.

Signed-off-by: Shilpasri G Bhat <[email protected]>
---
Changes from v2:
- Patch split in to multiple patches.
- This patch contains only the opal_message notification handler

Changes from v1:
- Add macros to define OCC_RESET, OCC_LOAD and OCC_THROTTLE
- Define a structure to store chip id, chip mask which has bits set
for cpus present in the chip, throttled state and a work_struct.
- Modify powernv_cpufreq_throttle_check() to be called via smp_call()
- On Pmax throttling/unthrottling update 'chip.throttled' and not the
global 'throttled' as Pmax capping is local to the chip.
- Remove the condition which checks if local pstate is less than Pmin
while checking for Psafe frequency. When OCC becomes active after
reset we update 'thottled' to false and when the cpufreq governor
initiates a pstate change, the local pstate will be in Psafe and we
will be reporting a false positive when we are not throttled.
- Schedule a kworker on receiving throttling/unthrottling OCC message
for that chip and schedule on all chips after receiving active.
- After an OCC reset all the cpus will be in Psafe frequency. So call
target() and restore the frequency to policy->cur after OCC_ACTIVE
and Pmax unthrottling
- Taken care of Viresh and Preeti's comments.

drivers/cpufreq/powernv-cpufreq.c | 75 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 74 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index d0c18c9..9268424 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -33,15 +33,19 @@
#include <asm/firmware.h>
#include <asm/reg.h>
#include <asm/smp.h> /* Required for cpu_sibling_mask() in UP configs */
+#include <asm/opal.h>

#define POWERNV_MAX_PSTATES 256
#define PMSR_PSAFE_ENABLE (1UL << 30)
#define PMSR_SPR_EM_DISABLE (1UL << 31)
#define PMSR_MAX(x) ((x >> 32) & 0xFF)
#define PMSR_LP(x) ((x >> 48) & 0xFF)
+#define OCC_RESET 0
+#define OCC_LOAD 1
+#define OCC_THROTTLE 2

static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
-static bool rebooting, throttled;
+static bool rebooting, throttled, occ_reset;

static struct chip {
unsigned int id;
@@ -414,6 +418,74 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
.notifier_call = powernv_cpufreq_reboot_notifier,
};

+static char throttle_reason[][30] = {
+ "No throttling",
+ "Power Cap",
+ "Processor Over Temperature",
+ "Power Supply Failure",
+ "Over Current",
+ "OCC Reset"
+ };
+
+static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
+ unsigned long msg_type, void *msg)
+{
+ struct opal_msg *occ_msg = msg;
+ uint64_t token;
+ uint64_t chip_id, reason;
+
+ if (msg_type != OPAL_MSG_OCC)
+ return 0;
+
+ token = be64_to_cpu(occ_msg->params[0]);
+
+ switch (token) {
+ case OCC_RESET:
+ occ_reset = true;
+ /*
+ * powernv_cpufreq_throttle_check() is called in
+ * target() callback which can detect the throttle state
+ * for governors like ondemand.
+ * But static governors will not call target() often thus
+ * report throttling here.
+ */
+ if (!throttled) {
+ throttled = true;
+ pr_crit("CPU Frequency is throttled\n");
+ }
+ pr_info("OCC: Reset\n");
+ break;
+ case OCC_LOAD:
+ pr_info("OCC: Loaded\n");
+ break;
+ case OCC_THROTTLE:
+ chip_id = be64_to_cpu(occ_msg->params[1]);
+ reason = be64_to_cpu(occ_msg->params[2]);
+
+ if (occ_reset) {
+ occ_reset = false;
+ throttled = false;
+ pr_info("OCC: Active\n");
+ return 0;
+ }
+
+ if (reason && reason <= 5)
+ pr_info("OCC: Chip %u Pmax reduced due to %s\n",
+ (unsigned int)chip_id,
+ throttle_reason[reason]);
+ else if (!reason)
+ pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
+ throttle_reason[reason]);
+ }
+ return 0;
+}
+
+static struct notifier_block powernv_cpufreq_opal_nb = {
+ .notifier_call = powernv_cpufreq_occ_msg,
+ .next = NULL,
+ .priority = 0,
+};
+
static void powernv_cpufreq_stop_cpu(struct cpufreq_policy *policy)
{
struct powernv_smp_call_data freq_data;
@@ -481,6 +553,7 @@ static int __init powernv_cpufreq_init(void)
return rc;

register_reboot_notifier(&powernv_cpufreq_reboot_nb);
+ opal_message_notifier_register(OPAL_MSG_OCC, &powernv_cpufreq_opal_nb);
return cpufreq_register_driver(&powernv_cpufreq_driver);
}
module_init(powernv_cpufreq_init);
--
1.9.3

2015-05-04 09:00:57

by Shilpasri G Bhat

[permalink] [raw]
Subject: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

Re-evaluate the chip's throttled state on recieving OCC_THROTTLE
notification by executing *throttle_check() on any one of the cpu on
the chip. This is a sanity check to verify if we were indeed
throttled/unthrottled after receiving OCC_THROTTLE notification.

We cannot call *throttle_check() directly from the notification
handler because we could be handling chip1's notification in chip2. So
initiate an smp_call to execute *throttle_check(). We are irq-disabled
in the notification handler, so use a worker thread to smp_call
throttle_check() on any of the cpu in the chipmask.

Signed-off-by: Shilpasri G Bhat <[email protected]>
---
drivers/cpufreq/powernv-cpufreq.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index 9268424..9618813 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset;
static struct chip {
unsigned int id;
bool throttled;
+ cpumask_t mask;
+ struct work_struct throttle;
} *chips;

static int nr_chips;
@@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void)
return powernv_pstate_info.max - powernv_pstate_info.nominal;
}

-static void powernv_cpufreq_throttle_check(unsigned int cpu)
+static void powernv_cpufreq_throttle_check(void *data)
{
+ unsigned int cpu = smp_processor_id();
unsigned long pmsr;
int pmsr_pmax, pmsr_lp, i;

@@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
return 0;

if (!throttled)
- powernv_cpufreq_throttle_check(smp_processor_id());
+ powernv_cpufreq_throttle_check(NULL);

freq_data.pstate_id = powernv_freqs[new_index].driver_data;

@@ -418,6 +421,14 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
.notifier_call = powernv_cpufreq_reboot_notifier,
};

+void powernv_cpufreq_work_fn(struct work_struct *work)
+{
+ struct chip *chip = container_of(work, struct chip, throttle);
+
+ smp_call_function_any(&chip->mask,
+ powernv_cpufreq_throttle_check, NULL, 0);
+}
+
static char throttle_reason[][30] = {
"No throttling",
"Power Cap",
@@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
struct opal_msg *occ_msg = msg;
uint64_t token;
uint64_t chip_id, reason;
+ int i;

if (msg_type != OPAL_MSG_OCC)
return 0;
@@ -466,6 +478,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
occ_reset = false;
throttled = false;
pr_info("OCC: Active\n");
+
+ for (i = 0; i < nr_chips; i++)
+ schedule_work(&chips[i].throttle);
+
return 0;
}

@@ -476,6 +492,12 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
else if (!reason)
pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
throttle_reason[reason]);
+ else
+ return 0;
+
+ for (i = 0; i < nr_chips; i++)
+ if (chips[i].id == chip_id)
+ schedule_work(&chips[i].throttle);
}
return 0;
}
@@ -527,6 +549,8 @@ static int init_chip_info(void)
for (i = 0; i < nr_chips; i++) {
chips[i].id = chip[i];
chips[i].throttled = false;
+ cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
+ INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
}

return 0;
--
1.9.3

2015-05-04 09:01:24

by Shilpasri G Bhat

[permalink] [raw]
Subject: [PATCH v3 5/6] cpufreq: powernv: Report Psafe only if PMSR.psafe_mode_active bit is set

On a reset cycle of OCC, although the system retires from safe
frequency state the local pstate is not restored to Pmin or last
requested pstate. Now if the cpufreq governor initiates a pstate
change, the local pstate will be in Psafe and we will be reporting a
false positive when we are not throttled.

So in powernv_cpufreq_throttle_check() remove the condition which
checks if local pstate is less than Pmin while checking for Psafe
frequency. If the cpus are forced to Psafe then PMSR.psafe_mode_active
bit will be set. So, when OCCs become active this bit will be cleared.
Let us just rely on this bit for reporting throttling.

Signed-off-by: Shilpasri G Bhat <[email protected]>
---
drivers/cpufreq/powernv-cpufreq.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index 9618813..0a59d5b 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -39,7 +39,6 @@
#define PMSR_PSAFE_ENABLE (1UL << 30)
#define PMSR_SPR_EM_DISABLE (1UL << 31)
#define PMSR_MAX(x) ((x >> 32) & 0xFF)
-#define PMSR_LP(x) ((x >> 48) & 0xFF)
#define OCC_RESET 0
#define OCC_LOAD 1
#define OCC_THROTTLE 2
@@ -316,7 +315,7 @@ static void powernv_cpufreq_throttle_check(void *data)
{
unsigned int cpu = smp_processor_id();
unsigned long pmsr;
- int pmsr_pmax, pmsr_lp, i;
+ int pmsr_pmax, i;

pmsr = get_pmspr(SPRN_PMSR);

@@ -338,14 +337,9 @@ static void powernv_cpufreq_throttle_check(void *data)
chips[i].id, pmsr_pmax);
}

- /*
- * Check for Psafe by reading LocalPstate
- * or check if Psafe_mode_active is set in PMSR.
- */
+ /* Check if Psafe_mode_active is set in PMSR. */
next:
- pmsr_lp = (s8)PMSR_LP(pmsr);
- if ((pmsr_lp < powernv_pstate_info.min) ||
- (pmsr & PMSR_PSAFE_ENABLE)) {
+ if (pmsr & PMSR_PSAFE_ENABLE) {
throttled = true;
pr_info("Pstate set to safe frequency\n");
}
--
1.9.3

2015-05-04 09:02:02

by Shilpasri G Bhat

[permalink] [raw]
Subject: [PATCH v3 6/6] cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling

If frequency is throttled due to OCC reset then cpus will be in Psafe
frequency, so restore the frequency on all cpus to policy->cur when
OCCs are active again. And if frequency is throttled due to Pmax
capping then restore the frequency of all the cpus in the chip on
unthrottling.

Signed-off-by: Shilpasri G Bhat <[email protected]>
---
drivers/cpufreq/powernv-cpufreq.c | 31 +++++++++++++++++++++++++++++--
1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index 0a59d5b..b2915bc 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -51,6 +51,7 @@ static struct chip {
bool throttled;
cpumask_t mask;
struct work_struct throttle;
+ bool restore;
} *chips;

static int nr_chips;
@@ -418,9 +419,29 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
void powernv_cpufreq_work_fn(struct work_struct *work)
{
struct chip *chip = container_of(work, struct chip, throttle);
+ unsigned int cpu;
+ cpumask_var_t mask;

smp_call_function_any(&chip->mask,
powernv_cpufreq_throttle_check, NULL, 0);
+
+ if (!chip->restore)
+ return;
+
+ chip->restore = false;
+ cpumask_copy(mask, &chip->mask);
+ for_each_cpu_and(cpu, mask, cpu_online_mask) {
+ int index, tcpu;
+ struct cpufreq_policy policy;
+
+ cpufreq_get_policy(&policy, cpu);
+ cpufreq_frequency_table_target(&policy, policy.freq_table,
+ policy.cur,
+ CPUFREQ_RELATION_C, &index);
+ powernv_cpufreq_target_index(&policy, index);
+ for_each_cpu(tcpu, policy.cpus)
+ cpumask_clear_cpu(tcpu, mask);
+ }
}

static char throttle_reason[][30] = {
@@ -473,8 +494,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
throttled = false;
pr_info("OCC: Active\n");

- for (i = 0; i < nr_chips; i++)
+ for (i = 0; i < nr_chips; i++) {
+ chips[i].restore = true;
schedule_work(&chips[i].throttle);
+ }

return 0;
}
@@ -490,8 +513,11 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
return 0;

for (i = 0; i < nr_chips; i++)
- if (chips[i].id == chip_id)
+ if (chips[i].id == chip_id) {
+ if (!reason)
+ chips[i].restore = true;
schedule_work(&chips[i].throttle);
+ }
}
return 0;
}
@@ -545,6 +571,7 @@ static int init_chip_info(void)
chips[i].throttled = false;
cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
+ chips[i].restore = false;
}

return 0;
--
1.9.3

2015-05-05 03:43:14

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] cpufreq: powernv: Register for OCC related opal_message notification

On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> OCC is an On-Chip-Controller which takes care of power and thermal
> safety of the chip. During runtime due to power failure or
> overtemperature the OCC may throttle the frequencies of the CPUs to
> remain within the power budget.
>
> We want the cpufreq driver to be aware of such situations to be able
> to report the reason to the user. We register to opal_message_notifier
> to receive OCC messages from opal.
>
> powernv_cpufreq_throttle_check() reports any frequency throttling and
> this patch will report the reason or event that caused throttling. We
> can be throttled if OCC is reset or OCC limits Pmax due to power or
> thermal reasons. We are also notified of unthrottling after an OCC
> reset or if OCC restores Pmax on the chip.
>
> Signed-off-by: Shilpasri G Bhat <[email protected]>
> ---
> Changes from v2:
> - Patch split in to multiple patches.
> - This patch contains only the opal_message notification handler
>
> Changes from v1:
> - Add macros to define OCC_RESET, OCC_LOAD and OCC_THROTTLE
> - Define a structure to store chip id, chip mask which has bits set
> for cpus present in the chip, throttled state and a work_struct.
> - Modify powernv_cpufreq_throttle_check() to be called via smp_call()
> - On Pmax throttling/unthrottling update 'chip.throttled' and not the
> global 'throttled' as Pmax capping is local to the chip.
> - Remove the condition which checks if local pstate is less than Pmin
> while checking for Psafe frequency. When OCC becomes active after
> reset we update 'thottled' to false and when the cpufreq governor
> initiates a pstate change, the local pstate will be in Psafe and we
> will be reporting a false positive when we are not throttled.
> - Schedule a kworker on receiving throttling/unthrottling OCC message
> for that chip and schedule on all chips after receiving active.
> - After an OCC reset all the cpus will be in Psafe frequency. So call
> target() and restore the frequency to policy->cur after OCC_ACTIVE
> and Pmax unthrottling
> - Taken care of Viresh and Preeti's comments.
>
> drivers/cpufreq/powernv-cpufreq.c | 75 ++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 74 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index d0c18c9..9268424 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -33,15 +33,19 @@
> #include <asm/firmware.h>
> #include <asm/reg.h>
> #include <asm/smp.h> /* Required for cpu_sibling_mask() in UP configs */
> +#include <asm/opal.h>
>
> #define POWERNV_MAX_PSTATES 256
> #define PMSR_PSAFE_ENABLE (1UL << 30)
> #define PMSR_SPR_EM_DISABLE (1UL << 31)
> #define PMSR_MAX(x) ((x >> 32) & 0xFF)
> #define PMSR_LP(x) ((x >> 48) & 0xFF)
> +#define OCC_RESET 0
> +#define OCC_LOAD 1
> +#define OCC_THROTTLE 2
>
> static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
> -static bool rebooting, throttled;
> +static bool rebooting, throttled, occ_reset;
>
> static struct chip {
> unsigned int id;
> @@ -414,6 +418,74 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
> .notifier_call = powernv_cpufreq_reboot_notifier,
> };
>
> +static char throttle_reason[][30] = {
> + "No throttling",
> + "Power Cap",
> + "Processor Over Temperature",
> + "Power Supply Failure",
> + "Over Current",
> + "OCC Reset"
> + };
> +
> +static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
> + unsigned long msg_type, void *msg)
> +{
> + struct opal_msg *occ_msg = msg;
> + uint64_t token;
> + uint64_t chip_id, reason;
> +
> + if (msg_type != OPAL_MSG_OCC)
> + return 0;
> +
> + token = be64_to_cpu(occ_msg->params[0]);
> +
> + switch (token) {
> + case OCC_RESET:
> + occ_reset = true;
> + /*
> + * powernv_cpufreq_throttle_check() is called in
> + * target() callback which can detect the throttle state
> + * for governors like ondemand.
> + * But static governors will not call target() often thus
> + * report throttling here.
> + */
> + if (!throttled) {
> + throttled = true;
> + pr_crit("CPU Frequency is throttled\n");
> + }
> + pr_info("OCC: Reset\n");
> + break;
> + case OCC_LOAD:
> + pr_info("OCC: Loaded\n");
> + break;
> + case OCC_THROTTLE:
> + chip_id = be64_to_cpu(occ_msg->params[1]);
> + reason = be64_to_cpu(occ_msg->params[2]);
> +
> + if (occ_reset) {
> + occ_reset = false;
> + throttled = false;
> + pr_info("OCC: Active\n");
> + return 0;
> + }
> +
> + if (reason && reason <= 5)
> + pr_info("OCC: Chip %u Pmax reduced due to %s\n",
> + (unsigned int)chip_id,
> + throttle_reason[reason]);
> + else if (!reason)
> + pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
> + throttle_reason[reason]);
> + }
> + return 0;
> +}
> +
> +static struct notifier_block powernv_cpufreq_opal_nb = {
> + .notifier_call = powernv_cpufreq_occ_msg,
> + .next = NULL,
> + .priority = 0,
> +};
> +
> static void powernv_cpufreq_stop_cpu(struct cpufreq_policy *policy)
> {
> struct powernv_smp_call_data freq_data;
> @@ -481,6 +553,7 @@ static int __init powernv_cpufreq_init(void)
> return rc;
>
> register_reboot_notifier(&powernv_cpufreq_reboot_nb);
> + opal_message_notifier_register(OPAL_MSG_OCC, &powernv_cpufreq_opal_nb);
> return cpufreq_register_driver(&powernv_cpufreq_driver);
> }
> module_init(powernv_cpufreq_init);
>

Looks good.

Reviewed-by: Preeti U Murthy <[email protected]>

2015-05-05 03:46:22

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] cpufreq: powernv: Report Psafe only if PMSR.psafe_mode_active bit is set

On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> On a reset cycle of OCC, although the system retires from safe
> frequency state the local pstate is not restored to Pmin or last
> requested pstate. Now if the cpufreq governor initiates a pstate
> change, the local pstate will be in Psafe and we will be reporting a
> false positive when we are not throttled.
>
> So in powernv_cpufreq_throttle_check() remove the condition which
> checks if local pstate is less than Pmin while checking for Psafe
> frequency. If the cpus are forced to Psafe then PMSR.psafe_mode_active
> bit will be set. So, when OCCs become active this bit will be cleared.
> Let us just rely on this bit for reporting throttling.
>
> Signed-off-by: Shilpasri G Bhat <[email protected]>
> ---
> drivers/cpufreq/powernv-cpufreq.c | 12 +++---------
> 1 file changed, 3 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index 9618813..0a59d5b 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -39,7 +39,6 @@
> #define PMSR_PSAFE_ENABLE (1UL << 30)
> #define PMSR_SPR_EM_DISABLE (1UL << 31)
> #define PMSR_MAX(x) ((x >> 32) & 0xFF)
> -#define PMSR_LP(x) ((x >> 48) & 0xFF)
> #define OCC_RESET 0
> #define OCC_LOAD 1
> #define OCC_THROTTLE 2
> @@ -316,7 +315,7 @@ static void powernv_cpufreq_throttle_check(void *data)
> {
> unsigned int cpu = smp_processor_id();
> unsigned long pmsr;
> - int pmsr_pmax, pmsr_lp, i;
> + int pmsr_pmax, i;
>
> pmsr = get_pmspr(SPRN_PMSR);
>
> @@ -338,14 +337,9 @@ static void powernv_cpufreq_throttle_check(void *data)
> chips[i].id, pmsr_pmax);
> }
>
> - /*
> - * Check for Psafe by reading LocalPstate
> - * or check if Psafe_mode_active is set in PMSR.
> - */
> + /* Check if Psafe_mode_active is set in PMSR. */
> next:
> - pmsr_lp = (s8)PMSR_LP(pmsr);
> - if ((pmsr_lp < powernv_pstate_info.min) ||
> - (pmsr & PMSR_PSAFE_ENABLE)) {
> + if (pmsr & PMSR_PSAFE_ENABLE) {
> throttled = true;
> pr_info("Pstate set to safe frequency\n");
> }
>

Reviewed-by: Preeti U Murthy <[email protected]>

2015-05-05 03:51:55

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level

Hi Shilpa,

On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the
> max allowed frequency for that chip if the chip exceeds its power or
> temperature limits. As Pmax capping is a chip level condition report
> this throttling behavior at chip level and also do not set the global
> 'throttled' on Pmax capping instead set the per-chip throttled
> variable. Report unthrottling if Pmax is restored after throttling.
>
> This patch adds a structure to store chip id and throttled state of
> the chip.
>
> Signed-off-by: Shilpasri G Bhat <[email protected]>
> ---
> drivers/cpufreq/powernv-cpufreq.c | 59 ++++++++++++++++++++++++++++++++++++---
> 1 file changed, 55 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index ebef0d8..d0c18c9 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -27,6 +27,7 @@
> #include <linux/smp.h>
> #include <linux/of.h>
> #include <linux/reboot.h>
> +#include <linux/slab.h>
>
> #include <asm/cputhreads.h>
> #include <asm/firmware.h>
> @@ -42,6 +43,13 @@
> static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
> static bool rebooting, throttled;
>
> +static struct chip {
> + unsigned int id;
> + bool throttled;
> +} *chips;
> +
> +static int nr_chips;
> +
> /*
> * Note: The set of pstates consists of contiguous integers, the
> * smallest of which is indicated by powernv_pstate_info.min, the
> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
> static void powernv_cpufreq_throttle_check(unsigned int cpu)
> {
> unsigned long pmsr;
> - int pmsr_pmax, pmsr_lp;
> + int pmsr_pmax, pmsr_lp, i;
>
> pmsr = get_pmspr(SPRN_PMSR);
>
> + for (i = 0; i < nr_chips; i++)
> + if (chips[i].id == cpu_to_chip_id(cpu))
> + break;
> +
> /* Check for Pmax Capping */
> pmsr_pmax = (s8)PMSR_MAX(pmsr);
> if (pmsr_pmax != powernv_pstate_info.max) {
> - throttled = true;
> - pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
> - pr_info("Max allowed Pstate is capped\n");
> + if (chips[i].throttled)
> + goto next;
> + chips[i].throttled = true;
> + pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
> + chips[i].id, pmsr_pmax);
> + } else if (chips[i].throttled) {
> + chips[i].throttled = false;

Is this check on pmax sufficient to indicate that the chip is unthrottled ?

> + pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
> + chips[i].id, pmsr_pmax);
> }
>
> /*
> * Check for Psafe by reading LocalPstate
> * or check if Psafe_mode_active is set in PMSR.
> */
> +next:
> pmsr_lp = (s8)PMSR_LP(pmsr);
> if ((pmsr_lp < powernv_pstate_info.min) ||
> (pmsr & PMSR_PSAFE_ENABLE)) {
> @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
> .attr = powernv_cpu_freq_attr,

What about the situation where although occ is active, this particular
chip has been throttled and we end up repeatedly reporting "pstate set
to safe" and "frequency control disabled from OS" ? Should we not have a
check on (chips[i].throttled) before reporting an anomaly for these two
scenarios as well just like you have for pmsr_pmax ?

Regards
Preeti U Murthy

2015-05-05 04:01:06

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

Hi Shilpa,

On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> Re-evaluate the chip's throttled state on recieving OCC_THROTTLE
> notification by executing *throttle_check() on any one of the cpu on
> the chip. This is a sanity check to verify if we were indeed
> throttled/unthrottled after receiving OCC_THROTTLE notification.
>
> We cannot call *throttle_check() directly from the notification
> handler because we could be handling chip1's notification in chip2. So
> initiate an smp_call to execute *throttle_check(). We are irq-disabled
> in the notification handler, so use a worker thread to smp_call
> throttle_check() on any of the cpu in the chipmask.

I see that the first patch takes care of reporting *per-chip* throttling
for pmax capping condition. But where are we taking care of reporting
"pstate set to safe" and "freq control disabled" scenarios per-chip ?

>
> Signed-off-by: Shilpasri G Bhat <[email protected]>
> ---
> drivers/cpufreq/powernv-cpufreq.c | 28 ++++++++++++++++++++++++++--
> 1 file changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index 9268424..9618813 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset;
> static struct chip {
> unsigned int id;
> bool throttled;
> + cpumask_t mask;
> + struct work_struct throttle;
> } *chips;
>
> static int nr_chips;
> @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void)
> return powernv_pstate_info.max - powernv_pstate_info.nominal;
> }
>
> -static void powernv_cpufreq_throttle_check(unsigned int cpu)
> +static void powernv_cpufreq_throttle_check(void *data)
> {
> + unsigned int cpu = smp_processor_id();
> unsigned long pmsr;
> int pmsr_pmax, pmsr_lp, i;
>
> @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
> return 0;
>
> if (!throttled)
> - powernv_cpufreq_throttle_check(smp_processor_id());
> + powernv_cpufreq_throttle_check(NULL);
>
> freq_data.pstate_id = powernv_freqs[new_index].driver_data;
>
> @@ -418,6 +421,14 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
> .notifier_call = powernv_cpufreq_reboot_notifier,
> };
>
> +void powernv_cpufreq_work_fn(struct work_struct *work)
> +{
> + struct chip *chip = container_of(work, struct chip, throttle);
> +
> + smp_call_function_any(&chip->mask,
> + powernv_cpufreq_throttle_check, NULL, 0);
> +}
> +
> static char throttle_reason[][30] = {
> "No throttling",
> "Power Cap",
> @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
> struct opal_msg *occ_msg = msg;
> uint64_t token;
> uint64_t chip_id, reason;
> + int i;
>
> if (msg_type != OPAL_MSG_OCC)
> return 0;
> @@ -466,6 +478,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
> occ_reset = false;
> throttled = false;
> pr_info("OCC: Active\n");
> +
> + for (i = 0; i < nr_chips; i++)
> + schedule_work(&chips[i].throttle);
> +
> return 0;
> }
>
> @@ -476,6 +492,12 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
> else if (!reason)
> pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
> throttle_reason[reason]);
> + else
> + return 0;

Why the else section ? The code can never reach here, can it ?

> +
> + for (i = 0; i < nr_chips; i++)
> + if (chips[i].id == chip_id)
> + schedule_work(&chips[i].throttle);
> }

Should we not do this only when we get unthrottled so as to cross verify
if it is indeed the case ? In case of throttling notification, opal's
verdict is final and there is no need to cross verify right ?

Perhaps the one thing that needs to be taken care in addition to
reporting throttling is setting the chip's throttled parameter to true.
This should do right ? I don't see the need to call throttle_check() here.

Regards
Preeti U Murthy


> return 0;
> }
> @@ -527,6 +549,8 @@ static int init_chip_info(void)
> for (i = 0; i < nr_chips; i++) {
> chips[i].id = chip[i];
> chips[i].throttled = false;
> + cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
> + INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
> }
>
> return 0;
>

2015-05-05 06:07:11

by Shilpasri G Bhat

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level

Hi Preeti,

On 05/05/2015 09:21 AM, Preeti U Murthy wrote:
> Hi Shilpa,
>
> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>> The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the
>> max allowed frequency for that chip if the chip exceeds its power or
>> temperature limits. As Pmax capping is a chip level condition report
>> this throttling behavior at chip level and also do not set the global
>> 'throttled' on Pmax capping instead set the per-chip throttled
>> variable. Report unthrottling if Pmax is restored after throttling.
>>
>> This patch adds a structure to store chip id and throttled state of
>> the chip.
>>
>> Signed-off-by: Shilpasri G Bhat <[email protected]>
>> ---
>> drivers/cpufreq/powernv-cpufreq.c | 59 ++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 55 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>> index ebef0d8..d0c18c9 100644
>> --- a/drivers/cpufreq/powernv-cpufreq.c
>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>> @@ -27,6 +27,7 @@
>> #include <linux/smp.h>
>> #include <linux/of.h>
>> #include <linux/reboot.h>
>> +#include <linux/slab.h>
>>
>> #include <asm/cputhreads.h>
>> #include <asm/firmware.h>
>> @@ -42,6 +43,13 @@
>> static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>> static bool rebooting, throttled;
>>
>> +static struct chip {
>> + unsigned int id;
>> + bool throttled;
>> +} *chips;
>> +
>> +static int nr_chips;
>> +
>> /*
>> * Note: The set of pstates consists of contiguous integers, the
>> * smallest of which is indicated by powernv_pstate_info.min, the
>> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
>> static void powernv_cpufreq_throttle_check(unsigned int cpu)
>> {
>> unsigned long pmsr;
>> - int pmsr_pmax, pmsr_lp;
>> + int pmsr_pmax, pmsr_lp, i;
>>
>> pmsr = get_pmspr(SPRN_PMSR);
>>
>> + for (i = 0; i < nr_chips; i++)
>> + if (chips[i].id == cpu_to_chip_id(cpu))
>> + break;
>> +
>> /* Check for Pmax Capping */
>> pmsr_pmax = (s8)PMSR_MAX(pmsr);
>> if (pmsr_pmax != powernv_pstate_info.max) {
>> - throttled = true;
>> - pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
>> - pr_info("Max allowed Pstate is capped\n");
>> + if (chips[i].throttled)
>> + goto next;
>> + chips[i].throttled = true;
>> + pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
>> + chips[i].id, pmsr_pmax);
>> + } else if (chips[i].throttled) {
>> + chips[i].throttled = false;
>
> Is this check on pmax sufficient to indicate that the chip is unthrottled ?

Unthrottling due to Pmax uncapping here is specific to a chip. So it is
sufficient to decide throttling/unthrottling when OCC is active for that chip.

>
>> + pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
>> + chips[i].id, pmsr_pmax);
>> }
>>
>> /*
>> * Check for Psafe by reading LocalPstate
>> * or check if Psafe_mode_active is set in PMSR.
>> */
>> +next:
>> pmsr_lp = (s8)PMSR_LP(pmsr);
>> if ((pmsr_lp < powernv_pstate_info.min) ||
>> (pmsr & PMSR_PSAFE_ENABLE)) {
>> @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>> .attr = powernv_cpu_freq_attr,
>
> What about the situation where although occ is active, this particular
> chip has been throttled and we end up repeatedly reporting "pstate set
> to safe" and "frequency control disabled from OS" ? Should we not have a
> check on (chips[i].throttled) before reporting an anomaly for these two
> scenarios as well just like you have for pmsr_pmax ?

We will not have "Psafe" and "frequency control disabled" repeatedly printed
because of global variable 'throttled', which is set to true on passing any of
these two conditions.

It is quite unlikely behavior to have only one chip in "Psafe" or "frequency
control disabled" state. These two conditions are most likely to happen during
an OCC reset cycle which will occur across all chips.

Thanks and Regards,
Shilpa

2015-05-05 06:34:08

by Shilpasri G Bhat

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

Hi Preeti,

On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
> Hi Shilpa,
>
> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>> Re-evaluate the chip's throttled state on recieving OCC_THROTTLE
>> notification by executing *throttle_check() on any one of the cpu on
>> the chip. This is a sanity check to verify if we were indeed
>> throttled/unthrottled after receiving OCC_THROTTLE notification.
>>
>> We cannot call *throttle_check() directly from the notification
>> handler because we could be handling chip1's notification in chip2. So
>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
>> in the notification handler, so use a worker thread to smp_call
>> throttle_check() on any of the cpu in the chipmask.
>
> I see that the first patch takes care of reporting *per-chip* throttling
> for pmax capping condition. But where are we taking care of reporting
> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
>

IMO let us not have "psafe" and "freq control disabled" states managed per-chip.
Because when the above two conditions occur it is likely to happen across all
chips during an OCC reset cycle. So I am setting 'throttled' to false on
OCC_ACTIVE and re-verifying if it actually is the case by invoking
*throttle_check().

>>
>> Signed-off-by: Shilpasri G Bhat <[email protected]>
>> ---
>> drivers/cpufreq/powernv-cpufreq.c | 28 ++++++++++++++++++++++++++--
>> 1 file changed, 26 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>> index 9268424..9618813 100644
>> --- a/drivers/cpufreq/powernv-cpufreq.c
>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>> @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset;
>> static struct chip {
>> unsigned int id;
>> bool throttled;
>> + cpumask_t mask;
>> + struct work_struct throttle;
>> } *chips;
>>
>> static int nr_chips;
>> @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void)
>> return powernv_pstate_info.max - powernv_pstate_info.nominal;
>> }
>>
>> -static void powernv_cpufreq_throttle_check(unsigned int cpu)
>> +static void powernv_cpufreq_throttle_check(void *data)
>> {
>> + unsigned int cpu = smp_processor_id();
>> unsigned long pmsr;
>> int pmsr_pmax, pmsr_lp, i;
>>
>> @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
>> return 0;
>>
>> if (!throttled)
>> - powernv_cpufreq_throttle_check(smp_processor_id());
>> + powernv_cpufreq_throttle_check(NULL);
>>
>> freq_data.pstate_id = powernv_freqs[new_index].driver_data;
>>
>> @@ -418,6 +421,14 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
>> .notifier_call = powernv_cpufreq_reboot_notifier,
>> };
>>
>> +void powernv_cpufreq_work_fn(struct work_struct *work)
>> +{
>> + struct chip *chip = container_of(work, struct chip, throttle);
>> +
>> + smp_call_function_any(&chip->mask,
>> + powernv_cpufreq_throttle_check, NULL, 0);
>> +}
>> +
>> static char throttle_reason[][30] = {
>> "No throttling",
>> "Power Cap",
>> @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>> struct opal_msg *occ_msg = msg;
>> uint64_t token;
>> uint64_t chip_id, reason;
>> + int i;
>>
>> if (msg_type != OPAL_MSG_OCC)
>> return 0;
>> @@ -466,6 +478,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>> occ_reset = false;
>> throttled = false;
>> pr_info("OCC: Active\n");
>> +
>> + for (i = 0; i < nr_chips; i++)
>> + schedule_work(&chips[i].throttle);
>> +
>> return 0;
>> }
>>
>> @@ -476,6 +492,12 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>> else if (!reason)
>> pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
>> throttle_reason[reason]);
>> + else
>> + return 0;
>
> Why the else section ? The code can never reach here, can it ?

When reason > 5 , we dont want to handle it.

>
>> +
>> + for (i = 0; i < nr_chips; i++)
>> + if (chips[i].id == chip_id)
>> + schedule_work(&chips[i].throttle);
>> }
>
> Should we not do this only when we get unthrottled so as to cross verify
> if it is indeed the case ? In case of throttling notification, opal's
> verdict is final and there is no need to cross verify right ?

Two reasons for invoking *throttle_check() on throttling:
1) We just got to know the reason and not the Pmax value we are getting
throttled to.
2) It could be a spurious message caused due to late/lost delivery. My point
here is let us not completely rely on the notification to declare throttling
unless we verify it from reading PMSR.

>
> Perhaps the one thing that needs to be taken care in addition to
> reporting throttling is setting the chip's throttled parameter to true.
> This should do right ? I don't see the need to call throttle_check() here.
>
>

Thanks and Regards,
Shilpa

2015-05-05 08:38:49

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level

On 05/05/2015 11:36 AM, Shilpasri G Bhat wrote:
> Hi Preeti,
>
> On 05/05/2015 09:21 AM, Preeti U Murthy wrote:
>> Hi Shilpa,
>>
>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>> The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the
>>> max allowed frequency for that chip if the chip exceeds its power or
>>> temperature limits. As Pmax capping is a chip level condition report
>>> this throttling behavior at chip level and also do not set the global
>>> 'throttled' on Pmax capping instead set the per-chip throttled
>>> variable. Report unthrottling if Pmax is restored after throttling.
>>>
>>> This patch adds a structure to store chip id and throttled state of
>>> the chip.
>>>
>>> Signed-off-by: Shilpasri G Bhat <[email protected]>
>>> ---
>>> drivers/cpufreq/powernv-cpufreq.c | 59 ++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 55 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>>> index ebef0d8..d0c18c9 100644
>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>> @@ -27,6 +27,7 @@
>>> #include <linux/smp.h>
>>> #include <linux/of.h>
>>> #include <linux/reboot.h>
>>> +#include <linux/slab.h>
>>>
>>> #include <asm/cputhreads.h>
>>> #include <asm/firmware.h>
>>> @@ -42,6 +43,13 @@
>>> static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>>> static bool rebooting, throttled;
>>>
>>> +static struct chip {
>>> + unsigned int id;
>>> + bool throttled;
>>> +} *chips;
>>> +
>>> +static int nr_chips;
>>> +
>>> /*
>>> * Note: The set of pstates consists of contiguous integers, the
>>> * smallest of which is indicated by powernv_pstate_info.min, the
>>> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
>>> static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>> {
>>> unsigned long pmsr;
>>> - int pmsr_pmax, pmsr_lp;
>>> + int pmsr_pmax, pmsr_lp, i;
>>>
>>> pmsr = get_pmspr(SPRN_PMSR);
>>>
>>> + for (i = 0; i < nr_chips; i++)
>>> + if (chips[i].id == cpu_to_chip_id(cpu))
>>> + break;
>>> +
>>> /* Check for Pmax Capping */
>>> pmsr_pmax = (s8)PMSR_MAX(pmsr);
>>> if (pmsr_pmax != powernv_pstate_info.max) {
>>> - throttled = true;
>>> - pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
>>> - pr_info("Max allowed Pstate is capped\n");
>>> + if (chips[i].throttled)
>>> + goto next;
>>> + chips[i].throttled = true;
>>> + pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
>>> + chips[i].id, pmsr_pmax);
>>> + } else if (chips[i].throttled) {
>>> + chips[i].throttled = false;
>>
>> Is this check on pmax sufficient to indicate that the chip is unthrottled ?
>
> Unthrottling due to Pmax uncapping here is specific to a chip. So it is
> sufficient to decide throttling/unthrottling when OCC is active for that chip.

Ok then we can perhaps exit after detecting unthrottling here.
>
>>
>>> + pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
>>> + chips[i].id, pmsr_pmax);
>>> }
>>>
>>> /*
>>> * Check for Psafe by reading LocalPstate
>>> * or check if Psafe_mode_active is set in PMSR.
>>> */
>>> +next:
>>> pmsr_lp = (s8)PMSR_LP(pmsr);
>>> if ((pmsr_lp < powernv_pstate_info.min) ||
>>> (pmsr & PMSR_PSAFE_ENABLE)) {
>>> @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>>> .attr = powernv_cpu_freq_attr,
>>
>> What about the situation where although occ is active, this particular
>> chip has been throttled and we end up repeatedly reporting "pstate set
>> to safe" and "frequency control disabled from OS" ? Should we not have a
>> check on (chips[i].throttled) before reporting an anomaly for these two
>> scenarios as well just like you have for pmsr_pmax ?
>
> We will not have "Psafe" and "frequency control disabled" repeatedly printed
> because of global variable 'throttled', which is set to true on passing any of
> these two conditions.
>
> It is quite unlikely behavior to have only one chip in "Psafe" or "frequency
> control disabled" state. These two conditions are most likely to happen during
> an OCC reset cycle which will occur across all chips.

Let us then add a comment to indicate that Psafe and frequency control
disabled conditions will fail *only if OCC is inactive* and not
otherwise and that this is a system wide phenomenon.

Regards
Preeti U Murthy

>
> Thanks and Regards,
> Shilpa
>
> _______________________________________________
> Linuxppc-dev mailing list
> [email protected]
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>

2015-05-05 08:42:04

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
> Hi Preeti,
>
> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
>> Hi Shilpa,
>>
>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>> Re-evaluate the chip's throttled state on recieving OCC_THROTTLE
>>> notification by executing *throttle_check() on any one of the cpu on
>>> the chip. This is a sanity check to verify if we were indeed
>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
>>>
>>> We cannot call *throttle_check() directly from the notification
>>> handler because we could be handling chip1's notification in chip2. So
>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
>>> in the notification handler, so use a worker thread to smp_call
>>> throttle_check() on any of the cpu in the chipmask.
>>
>> I see that the first patch takes care of reporting *per-chip* throttling
>> for pmax capping condition. But where are we taking care of reporting
>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
>>
>
> IMO let us not have "psafe" and "freq control disabled" states managed per-chip.
> Because when the above two conditions occur it is likely to happen across all
> chips during an OCC reset cycle. So I am setting 'throttled' to false on
> OCC_ACTIVE and re-verifying if it actually is the case by invoking
> *throttle_check().

Alright like I pointed in the previous reply, a comment to indicate that
psafe and freq control disabled conditions will fail when occ is
inactive and that all chips face the consequence of this will help.

>
>>>
>>> Signed-off-by: Shilpasri G Bhat <[email protected]>
>>> ---
>>> drivers/cpufreq/powernv-cpufreq.c | 28 ++++++++++++++++++++++++++--
>>> 1 file changed, 26 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>>> index 9268424..9618813 100644
>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>> @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset;
>>> static struct chip {
>>> unsigned int id;
>>> bool throttled;
>>> + cpumask_t mask;
>>> + struct work_struct throttle;
>>> } *chips;
>>>
>>> static int nr_chips;
>>> @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void)
>>> return powernv_pstate_info.max - powernv_pstate_info.nominal;
>>> }
>>>
>>> -static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>> +static void powernv_cpufreq_throttle_check(void *data)
>>> {
>>> + unsigned int cpu = smp_processor_id();
>>> unsigned long pmsr;
>>> int pmsr_pmax, pmsr_lp, i;
>>>
>>> @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
>>> return 0;
>>>
>>> if (!throttled)
>>> - powernv_cpufreq_throttle_check(smp_processor_id());
>>> + powernv_cpufreq_throttle_check(NULL);
>>>
>>> freq_data.pstate_id = powernv_freqs[new_index].driver_data;
>>>
>>> @@ -418,6 +421,14 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
>>> .notifier_call = powernv_cpufreq_reboot_notifier,
>>> };
>>>
>>> +void powernv_cpufreq_work_fn(struct work_struct *work)
>>> +{
>>> + struct chip *chip = container_of(work, struct chip, throttle);
>>> +
>>> + smp_call_function_any(&chip->mask,
>>> + powernv_cpufreq_throttle_check, NULL, 0);
>>> +}
>>> +
>>> static char throttle_reason[][30] = {
>>> "No throttling",
>>> "Power Cap",
>>> @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>> struct opal_msg *occ_msg = msg;
>>> uint64_t token;
>>> uint64_t chip_id, reason;
>>> + int i;
>>>
>>> if (msg_type != OPAL_MSG_OCC)
>>> return 0;
>>> @@ -466,6 +478,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>> occ_reset = false;
>>> throttled = false;
>>> pr_info("OCC: Active\n");
>>> +
>>> + for (i = 0; i < nr_chips; i++)
>>> + schedule_work(&chips[i].throttle);
>>> +
>>> return 0;
>>> }
>>>
>>> @@ -476,6 +492,12 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>> else if (!reason)
>>> pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
>>> throttle_reason[reason]);
>>> + else
>>> + return 0;
>>
>> Why the else section ? The code can never reach here, can it ?
>
> When reason > 5 , we dont want to handle it.

Of course! My bad!
>
>>
>>> +
>>> + for (i = 0; i < nr_chips; i++)
>>> + if (chips[i].id == chip_id)
>>> + schedule_work(&chips[i].throttle);
>>> }
>>
>> Should we not do this only when we get unthrottled so as to cross verify
>> if it is indeed the case ? In case of throttling notification, opal's
>> verdict is final and there is no need to cross verify right ?
>
> Two reasons for invoking *throttle_check() on throttling:
> 1) We just got to know the reason and not the Pmax value we are getting
> throttled to.
> 2) It could be a spurious message caused due to late/lost delivery. My point
> here is let us not completely rely on the notification to declare throttling
> unless we verify it from reading PMSR.

Sounds good.

Regards
Preeti U Murthy

2015-05-05 09:40:10

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling

On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> If frequency is throttled due to OCC reset then cpus will be in Psafe
> frequency, so restore the frequency on all cpus to policy->cur when
> OCCs are active again. And if frequency is throttled due to Pmax
> capping then restore the frequency of all the cpus in the chip on
> unthrottling.
>
> Signed-off-by: Shilpasri G Bhat <[email protected]>
> ---
> drivers/cpufreq/powernv-cpufreq.c | 31 +++++++++++++++++++++++++++++--
> 1 file changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index 0a59d5b..b2915bc 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -51,6 +51,7 @@ static struct chip {
> bool throttled;
> cpumask_t mask;
> struct work_struct throttle;
> + bool restore;
> } *chips;
>
> static int nr_chips;
> @@ -418,9 +419,29 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
> void powernv_cpufreq_work_fn(struct work_struct *work)
> {
> struct chip *chip = container_of(work, struct chip, throttle);
> + unsigned int cpu;
> + cpumask_var_t mask;
>
> smp_call_function_any(&chip->mask,
> powernv_cpufreq_throttle_check, NULL, 0);
> +
> + if (!chip->restore)
> + return;
> +
> + chip->restore = false;
> + cpumask_copy(mask, &chip->mask);
> + for_each_cpu_and(cpu, mask, cpu_online_mask) {
> + int index, tcpu;
> + struct cpufreq_policy policy;
> +
> + cpufreq_get_policy(&policy, cpu);
> + cpufreq_frequency_table_target(&policy, policy.freq_table,
> + policy.cur,
> + CPUFREQ_RELATION_C, &index);
> + powernv_cpufreq_target_index(&policy, index);
> + for_each_cpu(tcpu, policy.cpus)
> + cpumask_clear_cpu(tcpu, mask);
> + }
> }
>
> static char throttle_reason[][30] = {
> @@ -473,8 +494,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
> throttled = false;
> pr_info("OCC: Active\n");
>
> - for (i = 0; i < nr_chips; i++)
> + for (i = 0; i < nr_chips; i++) {
> + chips[i].restore = true;
> schedule_work(&chips[i].throttle);
> + }
>
> return 0;
> }
> @@ -490,8 +513,11 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
> return 0;
>
> for (i = 0; i < nr_chips; i++)
> - if (chips[i].id == chip_id)
> + if (chips[i].id == chip_id) {
> + if (!reason)
> + chips[i].restore = true;
> schedule_work(&chips[i].throttle);
> + }
> }
> return 0;
> }
> @@ -545,6 +571,7 @@ static int init_chip_info(void)
> chips[i].throttled = false;
> cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
> INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
> + chips[i].restore = false;
> }
>
> return 0;
>

Reviewed-by: Preeti U Murthy <[email protected]>

2015-05-07 10:35:54

by Shilpasri G Bhat

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level



On 05/05/2015 02:08 PM, Preeti U Murthy wrote:
> On 05/05/2015 11:36 AM, Shilpasri G Bhat wrote:
>> Hi Preeti,
>>
>> On 05/05/2015 09:21 AM, Preeti U Murthy wrote:
>>> Hi Shilpa,
>>>
>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>> The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the
>>>> max allowed frequency for that chip if the chip exceeds its power or
>>>> temperature limits. As Pmax capping is a chip level condition report
>>>> this throttling behavior at chip level and also do not set the global
>>>> 'throttled' on Pmax capping instead set the per-chip throttled
>>>> variable. Report unthrottling if Pmax is restored after throttling.
>>>>
>>>> This patch adds a structure to store chip id and throttled state of
>>>> the chip.
>>>>
>>>> Signed-off-by: Shilpasri G Bhat <[email protected]>
>>>> ---
>>>> drivers/cpufreq/powernv-cpufreq.c | 59 ++++++++++++++++++++++++++++++++++++---
>>>> 1 file changed, 55 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>>>> index ebef0d8..d0c18c9 100644
>>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>>> @@ -27,6 +27,7 @@
>>>> #include <linux/smp.h>
>>>> #include <linux/of.h>
>>>> #include <linux/reboot.h>
>>>> +#include <linux/slab.h>
>>>>
>>>> #include <asm/cputhreads.h>
>>>> #include <asm/firmware.h>
>>>> @@ -42,6 +43,13 @@
>>>> static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>>>> static bool rebooting, throttled;
>>>>
>>>> +static struct chip {
>>>> + unsigned int id;
>>>> + bool throttled;
>>>> +} *chips;
>>>> +
>>>> +static int nr_chips;
>>>> +
>>>> /*
>>>> * Note: The set of pstates consists of contiguous integers, the
>>>> * smallest of which is indicated by powernv_pstate_info.min, the
>>>> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
>>>> static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>>> {
>>>> unsigned long pmsr;
>>>> - int pmsr_pmax, pmsr_lp;
>>>> + int pmsr_pmax, pmsr_lp, i;
>>>>
>>>> pmsr = get_pmspr(SPRN_PMSR);
>>>>
>>>> + for (i = 0; i < nr_chips; i++)
>>>> + if (chips[i].id == cpu_to_chip_id(cpu))
>>>> + break;
>>>> +
>>>> /* Check for Pmax Capping */
>>>> pmsr_pmax = (s8)PMSR_MAX(pmsr);
>>>> if (pmsr_pmax != powernv_pstate_info.max) {
>>>> - throttled = true;
>>>> - pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
>>>> - pr_info("Max allowed Pstate is capped\n");
>>>> + if (chips[i].throttled)
>>>> + goto next;
>>>> + chips[i].throttled = true;
>>>> + pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
>>>> + chips[i].id, pmsr_pmax);
>>>> + } else if (chips[i].throttled) {
>>>> + chips[i].throttled = false;
>>>
>>> Is this check on pmax sufficient to indicate that the chip is unthrottled ?
>>
>> Unthrottling due to Pmax uncapping here is specific to a chip. So it is
>> sufficient to decide throttling/unthrottling when OCC is active for that chip.
>
> Ok then we can perhaps exit after detecting unthrottling here.

This won't work for older firmwares which do not clear "Frequency control
enabled bit" on OCC reset cycle. So let us check for remaining two conditions on
unthrottling as well.

>>
>>>
>>>> + pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
>>>> + chips[i].id, pmsr_pmax);
>>>> }
>>>>
>>>> /*
>>>> * Check for Psafe by reading LocalPstate
>>>> * or check if Psafe_mode_active is set in PMSR.
>>>> */
>>>> +next:
>>>> pmsr_lp = (s8)PMSR_LP(pmsr);
>>>> if ((pmsr_lp < powernv_pstate_info.min) ||
>>>> (pmsr & PMSR_PSAFE_ENABLE)) {
>>>> @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>>>> .attr = powernv_cpu_freq_attr,
>>>
>>> What about the situation where although occ is active, this particular
>>> chip has been throttled and we end up repeatedly reporting "pstate set
>>> to safe" and "frequency control disabled from OS" ? Should we not have a
>>> check on (chips[i].throttled) before reporting an anomaly for these two
>>> scenarios as well just like you have for pmsr_pmax ?
>>
>> We will not have "Psafe" and "frequency control disabled" repeatedly printed
>> because of global variable 'throttled', which is set to true on passing any of
>> these two conditions.
>>
>> It is quite unlikely behavior to have only one chip in "Psafe" or "frequency
>> control disabled" state. These two conditions are most likely to happen during
>> an OCC reset cycle which will occur across all chips.
>
> Let us then add a comment to indicate that Psafe and frequency control
> disabled conditions will fail *only if OCC is inactive* and not
> otherwise and that this is a system wide phenomenon.
>

I agree that adding a comment here will clear global vs local throttling
scenarios, but this will contradict the architectural design of OCC wherein it
can independently go to "Psafe" and "frequency control disabled" state. It is
the implementation in FSP today that has made the above two states global. My
point is adding a comment here may be confusing if at all for the future
firmwares this implementation is changed. Having said that the current patch set
still seems fit for the newer implementation for the following reason:
1) The aim here is to identify any sort of throttling and report it to the user
with least flooding of error messages, which will happen even if OCC can
independently reset and restore.
2) On unthrottling verify throttling on the chips with the exception of Pmax
capping is also taken care by this patch set.

Thanks and Regards,
Shilpa

2015-05-07 12:15:30

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] cpufreq: poowernv: Handle throttling due to Pmax capping at chip level

On 05/07/2015 04:05 PM, Shilpasri G Bhat wrote:
>
>
> On 05/05/2015 02:08 PM, Preeti U Murthy wrote:
>> On 05/05/2015 11:36 AM, Shilpasri G Bhat wrote:
>>> Hi Preeti,
>>>
>>> On 05/05/2015 09:21 AM, Preeti U Murthy wrote:
>>>> Hi Shilpa,
>>>>
>>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>>> The On-Chip-Controller(OCC) can throttle cpu frequency by reducing the
>>>>> max allowed frequency for that chip if the chip exceeds its power or
>>>>> temperature limits. As Pmax capping is a chip level condition report
>>>>> this throttling behavior at chip level and also do not set the global
>>>>> 'throttled' on Pmax capping instead set the per-chip throttled
>>>>> variable. Report unthrottling if Pmax is restored after throttling.
>>>>>
>>>>> This patch adds a structure to store chip id and throttled state of
>>>>> the chip.
>>>>>
>>>>> Signed-off-by: Shilpasri G Bhat <[email protected]>
>>>>> ---
>>>>> drivers/cpufreq/powernv-cpufreq.c | 59 ++++++++++++++++++++++++++++++++++++---
>>>>> 1 file changed, 55 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>>>>> index ebef0d8..d0c18c9 100644
>>>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>>>> @@ -27,6 +27,7 @@
>>>>> #include <linux/smp.h>
>>>>> #include <linux/of.h>
>>>>> #include <linux/reboot.h>
>>>>> +#include <linux/slab.h>
>>>>>
>>>>> #include <asm/cputhreads.h>
>>>>> #include <asm/firmware.h>
>>>>> @@ -42,6 +43,13 @@
>>>>> static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>>>>> static bool rebooting, throttled;
>>>>>
>>>>> +static struct chip {
>>>>> + unsigned int id;
>>>>> + bool throttled;
>>>>> +} *chips;
>>>>> +
>>>>> +static int nr_chips;
>>>>> +
>>>>> /*
>>>>> * Note: The set of pstates consists of contiguous integers, the
>>>>> * smallest of which is indicated by powernv_pstate_info.min, the
>>>>> @@ -301,22 +309,33 @@ static inline unsigned int get_nominal_index(void)
>>>>> static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>>>> {
>>>>> unsigned long pmsr;
>>>>> - int pmsr_pmax, pmsr_lp;
>>>>> + int pmsr_pmax, pmsr_lp, i;
>>>>>
>>>>> pmsr = get_pmspr(SPRN_PMSR);
>>>>>
>>>>> + for (i = 0; i < nr_chips; i++)
>>>>> + if (chips[i].id == cpu_to_chip_id(cpu))
>>>>> + break;
>>>>> +
>>>>> /* Check for Pmax Capping */
>>>>> pmsr_pmax = (s8)PMSR_MAX(pmsr);
>>>>> if (pmsr_pmax != powernv_pstate_info.max) {
>>>>> - throttled = true;
>>>>> - pr_info("CPU %d Pmax is reduced to %d\n", cpu, pmsr_pmax);
>>>>> - pr_info("Max allowed Pstate is capped\n");
>>>>> + if (chips[i].throttled)
>>>>> + goto next;
>>>>> + chips[i].throttled = true;
>>>>> + pr_info("CPU %d on Chip %u has Pmax reduced to %d\n", cpu,
>>>>> + chips[i].id, pmsr_pmax);
>>>>> + } else if (chips[i].throttled) {
>>>>> + chips[i].throttled = false;
>>>>
>>>> Is this check on pmax sufficient to indicate that the chip is unthrottled ?
>>>
>>> Unthrottling due to Pmax uncapping here is specific to a chip. So it is
>>> sufficient to decide throttling/unthrottling when OCC is active for that chip.
>>
>> Ok then we can perhaps exit after detecting unthrottling here.
>
> This won't work for older firmwares which do not clear "Frequency control
> enabled bit" on OCC reset cycle. So let us check for remaining two conditions on
> unthrottling as well.

ok.

>
>>>
>>>>
>>>>> + pr_info("CPU %d on Chip %u has Pmax restored to %d\n", cpu,
>>>>> + chips[i].id, pmsr_pmax);
>>>>> }
>>>>>
>>>>> /*
>>>>> * Check for Psafe by reading LocalPstate
>>>>> * or check if Psafe_mode_active is set in PMSR.
>>>>> */
>>>>> +next:
>>>>> pmsr_lp = (s8)PMSR_LP(pmsr);
>>>>> if ((pmsr_lp < powernv_pstate_info.min) ||
>>>>> (pmsr & PMSR_PSAFE_ENABLE)) {
>>>>> @@ -414,6 +433,33 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>>>>> .attr = powernv_cpu_freq_attr,
>>>>
>>>> What about the situation where although occ is active, this particular
>>>> chip has been throttled and we end up repeatedly reporting "pstate set
>>>> to safe" and "frequency control disabled from OS" ? Should we not have a
>>>> check on (chips[i].throttled) before reporting an anomaly for these two
>>>> scenarios as well just like you have for pmsr_pmax ?
>>>
>>> We will not have "Psafe" and "frequency control disabled" repeatedly printed
>>> because of global variable 'throttled', which is set to true on passing any of
>>> these two conditions.
>>>
>>> It is quite unlikely behavior to have only one chip in "Psafe" or "frequency
>>> control disabled" state. These two conditions are most likely to happen during
>>> an OCC reset cycle which will occur across all chips.
>>
>> Let us then add a comment to indicate that Psafe and frequency control
>> disabled conditions will fail *only if OCC is inactive* and not
>> otherwise and that this is a system wide phenomenon.
>>
>
> I agree that adding a comment here will clear global vs local throttling
> scenarios, but this will contradict the architectural design of OCC wherein it
> can independently go to "Psafe" and "frequency control disabled" state. It is
> the implementation in FSP today that has made the above two states global. My
> point is adding a comment here may be confusing if at all for the future
> firmwares this implementation is changed. Having said that the current patch set
> still seems fit for the newer implementation for the following reason:
> 1) The aim here is to identify any sort of throttling and report it to the user
> with least flooding of error messages, which will happen even if OCC can
> independently reset and restore.
> 2) On unthrottling verify throttling on the chips with the exception of Pmax
> capping is also taken care by this patch set.

Ok as long as we are reporting right, its fine.

Reviewed-by: Preeti U Murthy <[email protected]>
>
> Thanks and Regards,
> Shilpa
>
> _______________________________________________
> Linuxppc-dev mailing list
> [email protected]
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>

2015-05-07 12:19:35

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

On 05/05/2015 02:11 PM, Preeti U Murthy wrote:
> On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
>> Hi Preeti,
>>
>> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
>>> Hi Shilpa,
>>>
>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>> Re-evaluate the chip's throttled state on recieving OCC_THROTTLE
>>>> notification by executing *throttle_check() on any one of the cpu on
>>>> the chip. This is a sanity check to verify if we were indeed
>>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
>>>>
>>>> We cannot call *throttle_check() directly from the notification
>>>> handler because we could be handling chip1's notification in chip2. So
>>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
>>>> in the notification handler, so use a worker thread to smp_call
>>>> throttle_check() on any of the cpu in the chipmask.
>>>
>>> I see that the first patch takes care of reporting *per-chip* throttling
>>> for pmax capping condition. But where are we taking care of reporting
>>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
>>>
>>
>> IMO let us not have "psafe" and "freq control disabled" states managed per-chip.
>> Because when the above two conditions occur it is likely to happen across all
>> chips during an OCC reset cycle. So I am setting 'throttled' to false on
>> OCC_ACTIVE and re-verifying if it actually is the case by invoking
>> *throttle_check().
>
> Alright like I pointed in the previous reply, a comment to indicate that
> psafe and freq control disabled conditions will fail when occ is
> inactive and that all chips face the consequence of this will help.

>From your explanation on the thread of the first patch of this series,
this will not be required.

So,
Reviewed-by: Preeti U Murthy <[email protected]>

Regards
Preeti U Murthy
>
>>
>>>>
>>>> Signed-off-by: Shilpasri G Bhat <[email protected]>
>>>> ---
>>>> drivers/cpufreq/powernv-cpufreq.c | 28 ++++++++++++++++++++++++++--
>>>> 1 file changed, 26 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>>>> index 9268424..9618813 100644
>>>> --- a/drivers/cpufreq/powernv-cpufreq.c
>>>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>>>> @@ -50,6 +50,8 @@ static bool rebooting, throttled, occ_reset;
>>>> static struct chip {
>>>> unsigned int id;
>>>> bool throttled;
>>>> + cpumask_t mask;
>>>> + struct work_struct throttle;
>>>> } *chips;
>>>>
>>>> static int nr_chips;
>>>> @@ -310,8 +312,9 @@ static inline unsigned int get_nominal_index(void)
>>>> return powernv_pstate_info.max - powernv_pstate_info.nominal;
>>>> }
>>>>
>>>> -static void powernv_cpufreq_throttle_check(unsigned int cpu)
>>>> +static void powernv_cpufreq_throttle_check(void *data)
>>>> {
>>>> + unsigned int cpu = smp_processor_id();
>>>> unsigned long pmsr;
>>>> int pmsr_pmax, pmsr_lp, i;
>>>>
>>>> @@ -373,7 +376,7 @@ static int powernv_cpufreq_target_index(struct cpufreq_policy *policy,
>>>> return 0;
>>>>
>>>> if (!throttled)
>>>> - powernv_cpufreq_throttle_check(smp_processor_id());
>>>> + powernv_cpufreq_throttle_check(NULL);
>>>>
>>>> freq_data.pstate_id = powernv_freqs[new_index].driver_data;
>>>>
>>>> @@ -418,6 +421,14 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
>>>> .notifier_call = powernv_cpufreq_reboot_notifier,
>>>> };
>>>>
>>>> +void powernv_cpufreq_work_fn(struct work_struct *work)
>>>> +{
>>>> + struct chip *chip = container_of(work, struct chip, throttle);
>>>> +
>>>> + smp_call_function_any(&chip->mask,
>>>> + powernv_cpufreq_throttle_check, NULL, 0);
>>>> +}
>>>> +
>>>> static char throttle_reason[][30] = {
>>>> "No throttling",
>>>> "Power Cap",
>>>> @@ -433,6 +444,7 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>>> struct opal_msg *occ_msg = msg;
>>>> uint64_t token;
>>>> uint64_t chip_id, reason;
>>>> + int i;
>>>>
>>>> if (msg_type != OPAL_MSG_OCC)
>>>> return 0;
>>>> @@ -466,6 +478,10 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>>> occ_reset = false;
>>>> throttled = false;
>>>> pr_info("OCC: Active\n");
>>>> +
>>>> + for (i = 0; i < nr_chips; i++)
>>>> + schedule_work(&chips[i].throttle);
>>>> +
>>>> return 0;
>>>> }
>>>>
>>>> @@ -476,6 +492,12 @@ static int powernv_cpufreq_occ_msg(struct notifier_block *nb,
>>>> else if (!reason)
>>>> pr_info("OCC: Chip %u %s\n", (unsigned int)chip_id,
>>>> throttle_reason[reason]);
>>>> + else
>>>> + return 0;
>>>
>>> Why the else section ? The code can never reach here, can it ?
>>
>> When reason > 5 , we dont want to handle it.
>
> Of course! My bad!
>>
>>>
>>>> +
>>>> + for (i = 0; i < nr_chips; i++)
>>>> + if (chips[i].id == chip_id)
>>>> + schedule_work(&chips[i].throttle);
>>>> }
>>>
>>> Should we not do this only when we get unthrottled so as to cross verify
>>> if it is indeed the case ? In case of throttling notification, opal's
>>> verdict is final and there is no need to cross verify right ?
>>
>> Two reasons for invoking *throttle_check() on throttling:
>> 1) We just got to know the reason and not the Pmax value we are getting
>> throttled to.
>> 2) It could be a spurious message caused due to late/lost delivery. My point
>> here is let us not completely rely on the notification to declare throttling
>> unless we verify it from reading PMSR.
>
> Sounds good.
>
> Regards
> Preeti U Murthy
>
> _______________________________________________
> Linuxppc-dev mailing list
> [email protected]
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>

2015-05-07 20:34:50

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

On Thursday, May 07, 2015 05:49:22 PM Preeti U Murthy wrote:
> On 05/05/2015 02:11 PM, Preeti U Murthy wrote:
> > On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
> >> Hi Preeti,
> >>
> >> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
> >>> Hi Shilpa,
> >>>
> >>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> >>>> Re-evaluate the chip's throttled state on recieving OCC_THROTTLE
> >>>> notification by executing *throttle_check() on any one of the cpu on
> >>>> the chip. This is a sanity check to verify if we were indeed
> >>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
> >>>>
> >>>> We cannot call *throttle_check() directly from the notification
> >>>> handler because we could be handling chip1's notification in chip2. So
> >>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
> >>>> in the notification handler, so use a worker thread to smp_call
> >>>> throttle_check() on any of the cpu in the chipmask.
> >>>
> >>> I see that the first patch takes care of reporting *per-chip* throttling
> >>> for pmax capping condition. But where are we taking care of reporting
> >>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
> >>>
> >>
> >> IMO let us not have "psafe" and "freq control disabled" states managed per-chip.
> >> Because when the above two conditions occur it is likely to happen across all
> >> chips during an OCC reset cycle. So I am setting 'throttled' to false on
> >> OCC_ACTIVE and re-verifying if it actually is the case by invoking
> >> *throttle_check().
> >
> > Alright like I pointed in the previous reply, a comment to indicate that
> > psafe and freq control disabled conditions will fail when occ is
> > inactive and that all chips face the consequence of this will help.
>
> From your explanation on the thread of the first patch of this series,
> this will not be required.
>
> So,
> Reviewed-by: Preeti U Murthy <[email protected]>

OK, so is the whole series reviewed now?


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2015-05-08 03:46:55

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

On 05/08/2015 02:29 AM, Rafael J. Wysocki wrote:
> On Thursday, May 07, 2015 05:49:22 PM Preeti U Murthy wrote:
>> On 05/05/2015 02:11 PM, Preeti U Murthy wrote:
>>> On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
>>>> Hi Preeti,
>>>>
>>>> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
>>>>> Hi Shilpa,
>>>>>
>>>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
>>>>>> Re-evaluate the chip's throttled state on recieving OCC_THROTTLE
>>>>>> notification by executing *throttle_check() on any one of the cpu on
>>>>>> the chip. This is a sanity check to verify if we were indeed
>>>>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
>>>>>>
>>>>>> We cannot call *throttle_check() directly from the notification
>>>>>> handler because we could be handling chip1's notification in chip2. So
>>>>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
>>>>>> in the notification handler, so use a worker thread to smp_call
>>>>>> throttle_check() on any of the cpu in the chipmask.
>>>>>
>>>>> I see that the first patch takes care of reporting *per-chip* throttling
>>>>> for pmax capping condition. But where are we taking care of reporting
>>>>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
>>>>>
>>>>
>>>> IMO let us not have "psafe" and "freq control disabled" states managed per-chip.
>>>> Because when the above two conditions occur it is likely to happen across all
>>>> chips during an OCC reset cycle. So I am setting 'throttled' to false on
>>>> OCC_ACTIVE and re-verifying if it actually is the case by invoking
>>>> *throttle_check().
>>>
>>> Alright like I pointed in the previous reply, a comment to indicate that
>>> psafe and freq control disabled conditions will fail when occ is
>>> inactive and that all chips face the consequence of this will help.
>>
>> From your explanation on the thread of the first patch of this series,
>> this will not be required.
>>
>> So,
>> Reviewed-by: Preeti U Murthy <[email protected]>
>
> OK, so is the whole series reviewed now?

Yes the whole series has been reviewed.

Regards
Preeti U Murthy


>
>

2015-05-08 05:12:37

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] powernv: cpufreq: Report frequency throttle by OCC

On 4 May 2015 at 14:24, Shilpasri G Bhat <[email protected]> wrote:
> This patchset intends to add frequency throttle reporting mechanism
> to powernv-cpufreq driver when OCC throttles the frequency. OCC is an
> On-Chip-Controller which takes care of the power and thermal safety of
> the chip. The CPU frequency can be throttled during an OCC reset or
> when OCC tries to limit the max allowed frequency. The patchset will
> report such conditions so as to keep the user informed about reason
> for the drop in performance of workloads when frequency is throttled.
>
> Changes from v2:
> - Split into multiple patches
> - Semantic fixes
>
> Shilpasri G Bhat (6):
> cpufreq: poowernv: Handle throttling due to Pmax capping at chip level
> powerpc/powernv: Add definition of OPAL_MSG_OCC message type
> cpufreq: powernv: Register for OCC related opal_message notification
> cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE
> cpufreq: powernv: Report Psafe only if PMSR.psafe_mode_active bit is
> set
> cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling
>
> arch/powerpc/include/asm/opal-api.h | 8 ++
> drivers/cpufreq/powernv-cpufreq.c | 199 +++++++++++++++++++++++++++++++++---
> 2 files changed, 192 insertions(+), 15 deletions(-)

Acked-by: Viresh Kumar <[email protected]>

2015-05-08 13:46:54

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] cpufreq: powernv: Call throttle_check() on receiving OCC_THROTTLE

On Friday, May 08, 2015 09:16:44 AM Preeti U Murthy wrote:
> On 05/08/2015 02:29 AM, Rafael J. Wysocki wrote:
> > On Thursday, May 07, 2015 05:49:22 PM Preeti U Murthy wrote:
> >> On 05/05/2015 02:11 PM, Preeti U Murthy wrote:
> >>> On 05/05/2015 12:03 PM, Shilpasri G Bhat wrote:
> >>>> Hi Preeti,
> >>>>
> >>>> On 05/05/2015 09:30 AM, Preeti U Murthy wrote:
> >>>>> Hi Shilpa,
> >>>>>
> >>>>> On 05/04/2015 02:24 PM, Shilpasri G Bhat wrote:
> >>>>>> Re-evaluate the chip's throttled state on recieving OCC_THROTTLE
> >>>>>> notification by executing *throttle_check() on any one of the cpu on
> >>>>>> the chip. This is a sanity check to verify if we were indeed
> >>>>>> throttled/unthrottled after receiving OCC_THROTTLE notification.
> >>>>>>
> >>>>>> We cannot call *throttle_check() directly from the notification
> >>>>>> handler because we could be handling chip1's notification in chip2. So
> >>>>>> initiate an smp_call to execute *throttle_check(). We are irq-disabled
> >>>>>> in the notification handler, so use a worker thread to smp_call
> >>>>>> throttle_check() on any of the cpu in the chipmask.
> >>>>>
> >>>>> I see that the first patch takes care of reporting *per-chip* throttling
> >>>>> for pmax capping condition. But where are we taking care of reporting
> >>>>> "pstate set to safe" and "freq control disabled" scenarios per-chip ?
> >>>>>
> >>>>
> >>>> IMO let us not have "psafe" and "freq control disabled" states managed per-chip.
> >>>> Because when the above two conditions occur it is likely to happen across all
> >>>> chips during an OCC reset cycle. So I am setting 'throttled' to false on
> >>>> OCC_ACTIVE and re-verifying if it actually is the case by invoking
> >>>> *throttle_check().
> >>>
> >>> Alright like I pointed in the previous reply, a comment to indicate that
> >>> psafe and freq control disabled conditions will fail when occ is
> >>> inactive and that all chips face the consequence of this will help.
> >>
> >> From your explanation on the thread of the first patch of this series,
> >> this will not be required.
> >>
> >> So,
> >> Reviewed-by: Preeti U Murthy <[email protected]>
> >
> > OK, so is the whole series reviewed now?
>
> Yes the whole series has been reviewed.

OK, I'll queue it up for 4.2, then, thanks!


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.