2015-08-27 08:23:19

by Huang Rui

[permalink] [raw]
Subject: [PATCH 00/15] hwmon, fam15h_power: introduce an accumulated power reporting algorithm

Hi all,

This serial of patches introduces an accumulated power reporting
algorithm. It will calculate the average power consumption for the
processor. The cpu feature flag is CPUID.8000_0007H:EDX[12].

This algorithm is used to test the comparison of processor power
consumption with between MWAITX delay and TSC delay on AMD Carrizo
platforms.

Reference:
http://marc.info/?l=linux-kernel&m=143874573111310&w=2

Commit f96756 at tip ("x86/asm: Add MONITORX/MWAITX instruction support")
Commit b466bd at tip ("x86/asm/delay: Introduce an MWAITX-based delay with a configurable timer")

Thanks,
Rui

Huang Rui (15):
hwmon, fam15h_power: add support for AMD Carrizo
hwmon, fam15h_power: rename fam15h_power_is_internal_node0 function
hwmon, fam15h_power: refactor attributes for dynamically added
hwmon, fam15h_power: update running_avg_capture bit field to 28
hwmon, fam15h_power: enable power1_input on AMD Carrizo
hwmon, fam15h_power: add documentation for new processors support
hwmon, fam15h_power: add ratio of Tsample to the PTSC period
hwmon, fam15h_power: add max compute unit accumulated power
x86, amd: add accessor for number of cores per compute unit
hwmon, fam15h_power: add compute unit accumulated power
hwmon, fam15h_power: add ptsc counter value for accumulated power
hwmon, fam15h_power: introduce a cpu accumulated power reporting
algorithm
hwmon, fam15h_power: add documentation for previous TDP reporting
hwmon, fam15h_power: add documentation for accumulated power algorithm
MAINTAINERS: change the maintainer of fam15h_power driver

Documentation/hwmon/fam15h_power | 63 +++++++++++-
MAINTAINERS | 4 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/processor.h | 1 +
arch/x86/kernel/cpu/amd.c | 19 +++-
drivers/hwmon/fam15h_power.c | 204 ++++++++++++++++++++++++++++++++++-----
6 files changed, 258 insertions(+), 34 deletions(-)

--
1.9.1


2015-08-27 08:24:22

by Huang Rui

[permalink] [raw]
Subject: [PATCH 01/15] hwmon, fam15h_power: add support for AMD Carrizo

AMD Carrizo(Fam15h, M60h) processors can report power1_crit
(ProcessorPwrWatts) and power1_input (CurrPwrWatts) values.
And this patch adds support for CZ.

Signed-off-by: Huang Rui <[email protected]>
---
drivers/hwmon/fam15h_power.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
index 3057dfc..0753810 100644
--- a/drivers/hwmon/fam15h_power.c
+++ b/drivers/hwmon/fam15h_power.c
@@ -233,6 +233,7 @@ static int fam15h_power_probe(struct pci_dev *pdev,
static const struct pci_device_id fam15h_power_id_table[] = {
{ PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_15H_NB_F4) },
{ PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_15H_M30H_NB_F4) },
+ { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_15H_M60H_NB_F4) },
{ PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_16H_NB_F4) },
{ PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_16H_M30H_NB_F4) },
{}
--
1.9.1

2015-08-27 08:09:32

by Huang Rui

[permalink] [raw]
Subject: [PATCH 02/15] hwmon, fam15h_power: rename fam15h_power_is_internal_node0 function

We rename fam15h_power_is_internal_node0() function to
should_load_on_this_node(), because it may not be node0 from KV and
on, and they are single-node processors.

Signed-off-by: Huang Rui <[email protected]>
---
drivers/hwmon/fam15h_power.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
index 0753810..820adf1 100644
--- a/drivers/hwmon/fam15h_power.c
+++ b/drivers/hwmon/fam15h_power.c
@@ -117,7 +117,7 @@ static const struct attribute_group fam15h_power_group = {
};
__ATTRIBUTE_GROUPS(fam15h_power);

-static bool fam15h_power_is_internal_node0(struct pci_dev *f4)
+static bool should_load_on_this_node(struct pci_dev *f4)
{
u32 val;

@@ -214,7 +214,7 @@ static int fam15h_power_probe(struct pci_dev *pdev,
*/
tweak_runavg_range(pdev);

- if (!fam15h_power_is_internal_node0(pdev))
+ if (!should_load_on_this_node(pdev))
return -ENODEV;

data = devm_kzalloc(dev, sizeof(struct fam15h_power_data), GFP_KERNEL);
--
1.9.1

2015-08-27 08:24:17

by Huang Rui

[permalink] [raw]
Subject: [PATCH 03/15] hwmon, fam15h_power: refactor attributes for dynamically added

Attributes depend on the CPU model the driver gets loaded on.
Therefore, add those attributes dynamically at init time. This is more
flexible to control the different attributes on different platforms.

Signed-off-by: Huang Rui <[email protected]>
---
drivers/hwmon/fam15h_power.c | 49 +++++++++++++++++++++++++++-----------------
1 file changed, 30 insertions(+), 19 deletions(-)

diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
index 820adf1..65ffb06 100644
--- a/drivers/hwmon/fam15h_power.c
+++ b/drivers/hwmon/fam15h_power.c
@@ -41,6 +41,8 @@ MODULE_LICENSE("GPL");
#define REG_TDP_RUNNING_AVERAGE 0xe0
#define REG_TDP_LIMIT3 0xe8

+#define FAM15H_MIN_POWER_GROUPS 2
+
struct fam15h_power_data {
struct pci_dev *pdev;
unsigned int tdp_to_watts;
@@ -93,29 +95,35 @@ static ssize_t show_power_crit(struct device *dev,
}
static DEVICE_ATTR(power1_crit, S_IRUGO, show_power_crit, NULL);

-static umode_t fam15h_power_is_visible(struct kobject *kobj,
- struct attribute *attr,
- int index)
+static struct attribute_group fam15h_power_group;
+__ATTRIBUTE_GROUPS(fam15h_power);
+
+static int fam15h_power_init_attrs(struct pci_dev *pdev)
{
- /* power1_input is only reported for Fam15h, Models 00h-0fh */
- if (attr == &dev_attr_power1_input.attr &&
- (boot_cpu_data.x86 != 0x15 || boot_cpu_data.x86_model > 0xf))
- return 0;
+ int n = FAM15H_MIN_POWER_GROUPS;
+ struct attribute **fam15h_power_attrs;

- return attr->mode;
-}
+ if (boot_cpu_data.x86 == 0x15 && boot_cpu_data.x86_model <= 0xf)
+ n += 1;

-static struct attribute *fam15h_power_attrs[] = {
- &dev_attr_power1_input.attr,
- &dev_attr_power1_crit.attr,
- NULL
-};
+ fam15h_power_attrs = devm_kcalloc(&pdev->dev, n,
+ sizeof(*fam15h_power_attrs),
+ GFP_KERNEL);

-static const struct attribute_group fam15h_power_group = {
- .attrs = fam15h_power_attrs,
- .is_visible = fam15h_power_is_visible,
-};
-__ATTRIBUTE_GROUPS(fam15h_power);
+ if (!fam15h_power_attrs) {
+ dev_err(&pdev->dev, "failed to alloc fam15h_power_attrs\n");
+ return -ENOMEM;
+ }
+
+ n = 0;
+ fam15h_power_attrs[n++] = &dev_attr_power1_crit.attr;
+ if (boot_cpu_data.x86 == 0x15 && boot_cpu_data.x86_model <= 0xf)
+ fam15h_power_attrs[n++] = &dev_attr_power1_input.attr;
+
+ fam15h_power_group.attrs = fam15h_power_attrs;
+
+ return 0;
+}

static bool should_load_on_this_node(struct pci_dev *f4)
{
@@ -221,6 +229,9 @@ static int fam15h_power_probe(struct pci_dev *pdev,
if (!data)
return -ENOMEM;

+ if (fam15h_power_init_attrs(pdev))
+ return -ENOMEM;
+
fam15h_power_init_data(pdev, data);
data->pdev = pdev;

--
1.9.1

2015-08-27 08:24:53

by Huang Rui

[permalink] [raw]
Subject: [PATCH 04/15] hwmon, fam15h_power: update running_avg_capture bit field to 28

On Carrizo and later platforms, running_avg_capture bit field is
extended to 4:31 (28 bits) from 4:25.

Signed-off-by: Huang Rui <[email protected]>
---
drivers/hwmon/fam15h_power.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
index 65ffb06..3e9b3b9 100644
--- a/drivers/hwmon/fam15h_power.c
+++ b/drivers/hwmon/fam15h_power.c
@@ -61,8 +61,19 @@ static ssize_t show_power(struct device *dev,

pci_bus_read_config_dword(f4->bus, PCI_DEVFN(PCI_SLOT(f4->devfn), 5),
REG_TDP_RUNNING_AVERAGE, &val);
- running_avg_capture = (val >> 4) & 0x3fffff;
- running_avg_capture = sign_extend32(running_avg_capture, 21);
+
+ /*
+ * On Carrizo and later platforms, TdpRunAvgAccCap bit field
+ * is extended to 4:31 from 4:25.
+ */
+ if (boot_cpu_data.x86 == 0x15 && boot_cpu_data.x86_model >= 0x60) {
+ running_avg_capture = val >> 4;
+ running_avg_capture = sign_extend32(running_avg_capture, 27);
+ } else {
+ running_avg_capture = (val >> 4) & 0x3fffff;
+ running_avg_capture = sign_extend32(running_avg_capture, 21);
+ }
+
running_avg_range = (val & 0xf) + 1;

pci_bus_read_config_dword(f4->bus, PCI_DEVFN(PCI_SLOT(f4->devfn), 5),
--
1.9.1

2015-08-27 08:24:56

by Huang Rui

[permalink] [raw]
Subject: [PATCH 05/15] hwmon, fam15h_power: enable power1_input on AMD Carrizo

This patch enables power1_input attribute for Carrizo platform.

Signed-off-by: Huang Rui <[email protected]>
---
drivers/hwmon/fam15h_power.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
index 3e9b3b9..55b25ef 100644
--- a/drivers/hwmon/fam15h_power.c
+++ b/drivers/hwmon/fam15h_power.c
@@ -113,8 +113,11 @@ static int fam15h_power_init_attrs(struct pci_dev *pdev)
{
int n = FAM15H_MIN_POWER_GROUPS;
struct attribute **fam15h_power_attrs;
+ struct cpuinfo_x86 *c = &boot_cpu_data;

- if (boot_cpu_data.x86 == 0x15 && boot_cpu_data.x86_model <= 0xf)
+ if (c->x86 == 0x15 &&
+ ((c->x86_model <= 0xf) ||
+ (c->x86_model >= 0x60 && c->x86_model <= 0x6f)))
n += 1;

fam15h_power_attrs = devm_kcalloc(&pdev->dev, n,
@@ -128,7 +131,9 @@ static int fam15h_power_init_attrs(struct pci_dev *pdev)

n = 0;
fam15h_power_attrs[n++] = &dev_attr_power1_crit.attr;
- if (boot_cpu_data.x86 == 0x15 && boot_cpu_data.x86_model <= 0xf)
+ if (c->x86 == 0x15 &&
+ ((c->x86_model <= 0xf) ||
+ (c->x86_model >= 0x60 && c->x86_model <= 0x6f)))
fam15h_power_attrs[n++] = &dev_attr_power1_input.attr;

fam15h_power_group.attrs = fam15h_power_attrs;
--
1.9.1

2015-08-27 08:25:58

by Huang Rui

[permalink] [raw]
Subject: [PATCH 06/15] hwmon, fam15h_power: add documentation for new processors support

This patch updates description of fam15h_power driver, its scope is
extended to family 16h processsors.

Signed-off-by: Huang Rui <[email protected]>
---
Documentation/hwmon/fam15h_power | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/Documentation/hwmon/fam15h_power b/Documentation/hwmon/fam15h_power
index 8065481..e2b1b69 100644
--- a/Documentation/hwmon/fam15h_power
+++ b/Documentation/hwmon/fam15h_power
@@ -3,12 +3,13 @@ Kernel driver fam15h_power

Supported chips:
* AMD Family 15h Processors
+* AMD Family 16h Processors

Prefix: 'fam15h_power'
Addresses scanned: PCI space
Datasheets:
BIOS and Kernel Developer's Guide (BKDG) For AMD Family 15h Processors
- (not yet published)
+ BIOS and Kernel Developer's Guide (BKDG) For AMD Family 16h Processors

Author: Andreas Herrmann <[email protected]>

@@ -16,10 +17,11 @@ Description
-----------

This driver permits reading of registers providing power information
-of AMD Family 15h processors.
+of AMD Family 15h and 16h processors.

-For AMD Family 15h processors the following power values can be
-calculated using different processor northbridge function registers:
+For AMD Family 15h and 16h processors the following power values can
+be calculated using different processor northbridge function
+registers:

* BasePwrWatts: Specifies in watts the maximum amount of power
consumed by the processor for NB and logic external to the core.
--
1.9.1

2015-08-27 08:25:11

by Huang Rui

[permalink] [raw]
Subject: [PATCH 07/15] hwmon, fam15h_power: add ratio of Tsample to the PTSC period

This patch adds a member (cpu_pwr_sample_ratio) of fam15h_power_data,
that represents the ratio of compute unit power accumulator sample
period to the PTSC counter period.

Tsample: compute unit power accumulator sample period
Tref: the performance timestamp counter period
PTSC: performance timestamp counter

Signed-off-by: Huang Rui <[email protected]>
---
drivers/hwmon/fam15h_power.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
index 55b25ef..d6efcf6 100644
--- a/drivers/hwmon/fam15h_power.c
+++ b/drivers/hwmon/fam15h_power.c
@@ -48,6 +48,7 @@ struct fam15h_power_data {
unsigned int tdp_to_watts;
unsigned int base_tdp;
unsigned int processor_pwr_watts;
+ unsigned int cpu_pwr_sample_ratio;
};

static ssize_t show_power(struct device *dev,
@@ -201,7 +202,7 @@ static int fam15h_power_resume(struct pci_dev *pdev)
static void fam15h_power_init_data(struct pci_dev *f4,
struct fam15h_power_data *data)
{
- u32 val;
+ u32 val, eax, ebx, ecx, edx;
u64 tmp;

pci_read_config_dword(f4, REG_PROCESSOR_TDP, &val);
@@ -222,6 +223,19 @@ static void fam15h_power_init_data(struct pci_dev *f4,

/* convert to microWatt */
data->processor_pwr_watts = (tmp * 15625) >> 10;
+
+ cpuid(0x80000007, &eax, &ebx, &ecx, &edx);
+
+ /* CPUID Fn8000_0007:EDX[12] indicates to support accumulated power */
+ if (!(edx & BIT(12)))
+ return;
+
+ /*
+ * detemine the ratio of the compute unit power accumulator
+ * sample period to the PTSC counter period by executing CPUID
+ * Fn8000_0007:ECX
+ */
+ data->cpu_pwr_sample_ratio = ecx;
}

static int fam15h_power_probe(struct pci_dev *pdev,
--
1.9.1

2015-08-27 08:25:36

by Huang Rui

[permalink] [raw]
Subject: [PATCH 08/15] hwmon, fam15h_power: add max compute unit accumulated power

This patch adds a member in fam15h_power_data which specifies the
maximum accumulated power in a compute unit.

Signed-off-by: Huang Rui <[email protected]>
---
drivers/hwmon/fam15h_power.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
index d6efcf6..fdfa18e 100644
--- a/drivers/hwmon/fam15h_power.c
+++ b/drivers/hwmon/fam15h_power.c
@@ -26,6 +26,7 @@
#include <linux/pci.h>
#include <linux/bitops.h>
#include <asm/processor.h>
+#include <asm/msr.h>

MODULE_DESCRIPTION("AMD Family 15h CPU processor power monitor");
MODULE_AUTHOR("Andreas Herrmann <[email protected]>");
@@ -43,12 +44,16 @@ MODULE_LICENSE("GPL");

#define FAM15H_MIN_POWER_GROUPS 2

+#define MSR_F15H_CU_MAX_PWR_ACCUMULATOR 0xc001007b
+
struct fam15h_power_data {
struct pci_dev *pdev;
unsigned int tdp_to_watts;
unsigned int base_tdp;
unsigned int processor_pwr_watts;
unsigned int cpu_pwr_sample_ratio;
+ /* maximum accumulated power of a compute unit */
+ u64 max_cu_acc_power;
};

static ssize_t show_power(struct device *dev,
@@ -199,8 +204,8 @@ static int fam15h_power_resume(struct pci_dev *pdev)
#define fam15h_power_resume NULL
#endif

-static void fam15h_power_init_data(struct pci_dev *f4,
- struct fam15h_power_data *data)
+static int fam15h_power_init_data(struct pci_dev *f4,
+ struct fam15h_power_data *data)
{
u32 val, eax, ebx, ecx, edx;
u64 tmp;
@@ -228,7 +233,7 @@ static void fam15h_power_init_data(struct pci_dev *f4,

/* CPUID Fn8000_0007:EDX[12] indicates to support accumulated power */
if (!(edx & BIT(12)))
- return;
+ return 0;

/*
* detemine the ratio of the compute unit power accumulator
@@ -236,6 +241,15 @@ static void fam15h_power_init_data(struct pci_dev *f4,
* Fn8000_0007:ECX
*/
data->cpu_pwr_sample_ratio = ecx;
+
+ if (rdmsrl_safe(MSR_F15H_CU_MAX_PWR_ACCUMULATOR, &tmp)) {
+ pr_err("Failed to read max compute unit power accumulator MSR\n");
+ return -ENODEV;
+ }
+
+ data->max_cu_acc_power = tmp;
+
+ return 0;
}

static int fam15h_power_probe(struct pci_dev *pdev,
@@ -262,7 +276,9 @@ static int fam15h_power_probe(struct pci_dev *pdev,
if (fam15h_power_init_attrs(pdev))
return -ENOMEM;

- fam15h_power_init_data(pdev, data);
+ if (fam15h_power_init_data(pdev, data))
+ return -ENODEV;
+
data->pdev = pdev;

hwmon_dev = devm_hwmon_device_register_with_groups(dev, "fam15h_power",
--
1.9.1

2015-08-27 08:11:12

by Huang Rui

[permalink] [raw]
Subject: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

Add an accessor function amd_get_cores_per_cu() which returns the
number of cores per compute unit.

In a subsequent patch, we will use this function in fam15h_power
driver.

Signed-off-by: Huang Rui <[email protected]>
---
arch/x86/include/asm/processor.h | 1 +
arch/x86/kernel/cpu/amd.c | 19 +++++++++++++++++--
2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 19577dd..831ad682 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -810,6 +810,7 @@ static inline int mpx_disable_management(void)

extern u16 amd_get_nb_id(int cpu);
extern u32 amd_get_nodes_per_socket(void);
+extern u32 amd_get_cores_per_cu(void);

static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
{
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 51ad2af..8ab939a 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -26,6 +26,9 @@
*/
static u32 nodes_per_socket = 1;

+/* cores_per_cu: stores the number of cores per compute unit */
+static u32 cores_per_cu = 1;
+
static inline int rdmsrl_amd_safe(unsigned msr, unsigned long long *p)
{
u32 gprs[8] = { 0 };
@@ -298,7 +301,6 @@ static int nearby_node(int apicid)
#ifdef CONFIG_SMP
static void amd_get_topology(struct cpuinfo_x86 *c)
{
- u32 cores_per_cu = 1;
u8 node_id;
int cpu = smp_processor_id();

@@ -313,7 +315,6 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
/* get compute unit information */
smp_num_siblings = ((ebx >> 8) & 3) + 1;
c->compute_unit_id = ebx & 0xff;
- cores_per_cu += ((ebx >> 8) & 3);
} else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) {
u64 value;

@@ -379,6 +380,13 @@ u32 amd_get_nodes_per_socket(void)
}
EXPORT_SYMBOL_GPL(amd_get_nodes_per_socket);

+/* this function returns the number of cores per compute unit */
+u32 amd_get_cores_per_cu(void)
+{
+ return cores_per_cu;
+}
+EXPORT_SYMBOL_GPL(amd_get_cores_per_cu);
+
static void srat_detect_node(struct cpuinfo_x86 *c)
{
#ifdef CONFIG_NUMA
@@ -506,6 +514,13 @@ static void bsp_init_amd(struct cpuinfo_x86 *c)
/* A random value per boot for bit slice [12:upper_bit) */
va_align.bits = get_random_int() & va_align.mask;
}
+
+ if (cpu_has_topoext) {
+ u32 cpuid;
+
+ cpuid = cpuid_ebx(0x8000001e);
+ cores_per_cu += ((cpuid >> 8) & 3);
+ }
}

static void early_init_amd(struct cpuinfo_x86 *c)
--
1.9.1

2015-08-27 08:26:32

by Huang Rui

[permalink] [raw]
Subject: [PATCH 10/15] hwmon, fam15h_power: add compute unit accumulated power

This patch adds a member in fam15h_power_data which specifies the
compute unit accumulated power.

Signed-off-by: Huang Rui <[email protected]>
---
drivers/hwmon/fam15h_power.c | 31 +++++++++++++++++++++++++++----
1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
index fdfa18e..f5ff48f 100644
--- a/drivers/hwmon/fam15h_power.c
+++ b/drivers/hwmon/fam15h_power.c
@@ -43,7 +43,9 @@ MODULE_LICENSE("GPL");
#define REG_TDP_LIMIT3 0xe8

#define FAM15H_MIN_POWER_GROUPS 2
+#define MAX_CUS 8

+#define MSR_F15H_CU_PWR_ACCUMULATOR 0xc001007a
#define MSR_F15H_CU_MAX_PWR_ACCUMULATOR 0xc001007b

struct fam15h_power_data {
@@ -54,6 +56,8 @@ struct fam15h_power_data {
unsigned int cpu_pwr_sample_ratio;
/* maximum accumulated power of a compute unit */
u64 max_cu_acc_power;
+ /* accumulated power of the compute units */
+ u64 cu_acc_power[MAX_CUS];
};

static ssize_t show_power(struct device *dev,
@@ -207,6 +211,7 @@ static int fam15h_power_resume(struct pci_dev *pdev)
static int fam15h_power_init_data(struct pci_dev *f4,
struct fam15h_power_data *data)
{
+ int cu_num, cores_per_cu, cpu, cu;
u32 val, eax, ebx, ecx, edx;
u64 tmp;

@@ -249,6 +254,21 @@ static int fam15h_power_init_data(struct pci_dev *f4,

data->max_cu_acc_power = tmp;

+ cores_per_cu = amd_get_cores_per_cu();
+ cu_num = boot_cpu_data.x86_max_cores / cores_per_cu;
+
+ WARN_ON_ONCE(cu_num > MAX_CUS);
+
+ for (cpu = 0; cpu < cu_num * cores_per_cu; cpu += cores_per_cu) {
+ cu = cpu / cores_per_cu;
+ if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_CU_PWR_ACCUMULATOR,
+ &data->cu_acc_power[cu])) {
+ pr_err("Failed to read compute unit power accumulator MSR on core%d\n",
+ cpu);
+ return -ENODEV;
+ }
+ }
+
return 0;
}

@@ -258,6 +278,7 @@ static int fam15h_power_probe(struct pci_dev *pdev,
struct fam15h_power_data *data;
struct device *dev = &pdev->dev;
struct device *hwmon_dev;
+ int ret;

/*
* though we ignore every other northbridge, we still have to
@@ -273,11 +294,13 @@ static int fam15h_power_probe(struct pci_dev *pdev,
if (!data)
return -ENOMEM;

- if (fam15h_power_init_attrs(pdev))
- return -ENOMEM;
+ ret = fam15h_power_init_attrs(pdev);
+ if (ret)
+ return ret;

- if (fam15h_power_init_data(pdev, data))
- return -ENODEV;
+ ret = fam15h_power_init_data(pdev, data);
+ if (ret)
+ return ret;

data->pdev = pdev;

--
1.9.1

2015-08-27 08:11:32

by Huang Rui

[permalink] [raw]
Subject: [PATCH 11/15] hwmon, fam15h_power: add ptsc counter value for accumulated power

PTSC is the performance timestamp counter value in a cpu core and the
cores in one compute unit have the fixed frequency. So it picks up the
performance timestamp counter value of the first core per compute unit
to measure the interval for average power per compute unit.

Signed-off-by: Huang Rui <[email protected]>
---
arch/x86/include/asm/msr-index.h | 1 +
drivers/hwmon/fam15h_power.c | 9 +++++++++
2 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 9ebc3d0..5566360 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -293,6 +293,7 @@
#define MSR_F15H_PERF_CTR 0xc0010201
#define MSR_F15H_NB_PERF_CTL 0xc0010240
#define MSR_F15H_NB_PERF_CTR 0xc0010241
+#define MSR_F15H_PTSC 0xc0010280

/* Fam 10h MSRs */
#define MSR_FAM10H_MMIO_CONF_BASE 0xc0010058
diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
index f5ff48f..d529e4b 100644
--- a/drivers/hwmon/fam15h_power.c
+++ b/drivers/hwmon/fam15h_power.c
@@ -58,6 +58,8 @@ struct fam15h_power_data {
u64 max_cu_acc_power;
/* accumulated power of the compute units */
u64 cu_acc_power[MAX_CUS];
+ /* performance timestamp counter */
+ u64 cpu_sw_pwr_ptsc[MAX_CUS];
};

static ssize_t show_power(struct device *dev,
@@ -267,6 +269,13 @@ static int fam15h_power_init_data(struct pci_dev *f4,
cpu);
return -ENODEV;
}
+
+ if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_PTSC,
+ &data->cpu_sw_pwr_ptsc[cu])) {
+ pr_err("Failed to read performance timestamp counter MSR on core%d\n",
+ cpu);
+ return -ENODEV;
+ }
}

return 0;
--
1.9.1

2015-08-27 08:26:42

by Huang Rui

[permalink] [raw]
Subject: [PATCH 12/15] hwmon, fam15h_power: introduce a cpu accumulated power reporting algorithm

This patch introduces an algorithm that computes the average power by
reading a delta value of “core power accumulator” register during
measurement interval, and then dividing delta value by the length of
the time interval.

User is able to use power1_acc entry to measure the processor power
consumption and power1_acc just needs to be read twice with an needed
interval in-between.

A simple example:

$ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
$ sleep 10000s
$ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc

The result is current average processor power consumption in 10000
seconds. The unit of the result is uWatt.

Signed-off-by: Huang Rui <[email protected]>
---
drivers/hwmon/fam15h_power.c | 62 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)

diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
index d529e4b..3bab797 100644
--- a/drivers/hwmon/fam15h_power.c
+++ b/drivers/hwmon/fam15h_power.c
@@ -60,6 +60,7 @@ struct fam15h_power_data {
u64 cu_acc_power[MAX_CUS];
/* performance timestamp counter */
u64 cpu_sw_pwr_ptsc[MAX_CUS];
+ struct mutex acc_pwr_mutex;
};

static ssize_t show_power(struct device *dev,
@@ -121,17 +122,74 @@ static DEVICE_ATTR(power1_crit, S_IRUGO, show_power_crit, NULL);
static struct attribute_group fam15h_power_group;
__ATTRIBUTE_GROUPS(fam15h_power);

+static ssize_t show_power_acc(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ int cpu, cu, cu_num, cores_per_cu;
+ u64 curr_cu_acc_power[MAX_CUS],
+ curr_ptsc[MAX_CUS], jdelta[MAX_CUS];
+ u64 tdelta, avg_acc;
+ struct fam15h_power_data *data = dev_get_drvdata(dev);
+
+ cores_per_cu = amd_get_cores_per_cu();
+ cu_num = boot_cpu_data.x86_max_cores / cores_per_cu;
+
+ for (cpu = 0, avg_acc = 0; cpu < cu_num * cores_per_cu; cpu += cores_per_cu) {
+ cu = cpu / cores_per_cu;
+ if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_PTSC, &curr_ptsc[cu])) {
+ pr_err("Failed to read PTSC counter MSR on core%d\n",
+ cpu);
+ return 0;
+ }
+
+ if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_CU_PWR_ACCUMULATOR,
+ &curr_cu_acc_power[cu])) {
+ pr_err("Failed to read compute unit power accumulator MSR on core%d\n",
+ cpu);
+ return 0;
+ }
+
+ if (curr_cu_acc_power[cu] < data->cu_acc_power[cu]) {
+ jdelta[cu] = data->max_cu_acc_power + curr_cu_acc_power[cu];
+ jdelta[cu] -= data->cu_acc_power[cu];
+ } else {
+ jdelta[cu] = curr_cu_acc_power[cu] - data->cu_acc_power[cu];
+ }
+ tdelta = curr_ptsc[cu] - data->cpu_sw_pwr_ptsc[cu];
+ jdelta[cu] *= data->cpu_pwr_sample_ratio * 1000;
+ do_div(jdelta[cu], tdelta);
+
+ mutex_lock(&data->acc_pwr_mutex);
+ data->cu_acc_power[cu] = curr_cu_acc_power[cu];
+ data->cpu_sw_pwr_ptsc[cu] = curr_ptsc[cu];
+ mutex_unlock(&data->acc_pwr_mutex);
+
+ /* the unit is microWatt */
+ avg_acc += jdelta[cu];
+ }
+
+ return sprintf(buf, "%u\n", (unsigned int) avg_acc);
+}
+static DEVICE_ATTR(power1_acc, S_IRUGO, show_power_acc, NULL);
+
static int fam15h_power_init_attrs(struct pci_dev *pdev)
{
int n = FAM15H_MIN_POWER_GROUPS;
struct attribute **fam15h_power_attrs;
struct cpuinfo_x86 *c = &boot_cpu_data;
+ u32 cpuid;

if (c->x86 == 0x15 &&
((c->x86_model <= 0xf) ||
(c->x86_model >= 0x60 && c->x86_model <= 0x6f)))
n += 1;

+ cpuid = cpuid_edx(0x80000007);
+
+ /* check if processor supports accumulated power */
+ if (cpuid & BIT(12))
+ n += 1;
+
fam15h_power_attrs = devm_kcalloc(&pdev->dev, n,
sizeof(*fam15h_power_attrs),
GFP_KERNEL);
@@ -148,6 +206,9 @@ static int fam15h_power_init_attrs(struct pci_dev *pdev)
(c->x86_model >= 0x60 && c->x86_model <= 0x6f)))
fam15h_power_attrs[n++] = &dev_attr_power1_input.attr;

+ if (cpuid & BIT(12))
+ fam15h_power_attrs[n++] = &dev_attr_power1_acc.attr;
+
fam15h_power_group.attrs = fam15h_power_attrs;

return 0;
@@ -311,6 +372,7 @@ static int fam15h_power_probe(struct pci_dev *pdev,
if (ret)
return ret;

+ mutex_init(&data->acc_pwr_mutex);
data->pdev = pdev;

hwmon_dev = devm_hwmon_device_register_with_groups(dev, "fam15h_power",
--
1.9.1

2015-08-27 08:11:40

by Huang Rui

[permalink] [raw]
Subject: [PATCH 13/15] hwmon, fam15h_power: add documentation for previous TDP reporting

This patch adds description to explain the TDP reporting mechanism of
fam15h_power driver.

Signed-off-by: Huang Rui <[email protected]>
---
Documentation/hwmon/fam15h_power | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/Documentation/hwmon/fam15h_power b/Documentation/hwmon/fam15h_power
index e2b1b69..42bf04e 100644
--- a/Documentation/hwmon/fam15h_power
+++ b/Documentation/hwmon/fam15h_power
@@ -10,14 +10,22 @@ Supported chips:
Datasheets:
BIOS and Kernel Developer's Guide (BKDG) For AMD Family 15h Processors
BIOS and Kernel Developer's Guide (BKDG) For AMD Family 16h Processors
+ AMD64 Architecture Programmer's Manual Volume 2: System Programming

Author: Andreas Herrmann <[email protected]>

Description
-----------

+1) Processor TDP (Thermal design power)
+
+Given a fixed frequency and voltage, the power consumption of a
+processor varies based on the workload being executed. Derated power
+is the power consumed when running a specific application. Thermal
+design power (TDP) is an example of derated power.
+
This driver permits reading of registers providing power information
-of AMD Family 15h and 16h processors.
+of AMD Family 15h and 16h processors via TDP algorithm.

For AMD Family 15h and 16h processors the following power values can
be calculated using different processor northbridge function
--
1.9.1

2015-08-27 08:26:59

by Huang Rui

[permalink] [raw]
Subject: [PATCH 14/15] hwmon, fam15h_power: add documentation for accumulated power algorithm

This patch adds the description to explain the accumulated power
algorithm.

Signed-off-by: Huang Rui <[email protected]>
---
Documentation/hwmon/fam15h_power | 45 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)

diff --git a/Documentation/hwmon/fam15h_power b/Documentation/hwmon/fam15h_power
index 42bf04e..dc2bc69 100644
--- a/Documentation/hwmon/fam15h_power
+++ b/Documentation/hwmon/fam15h_power
@@ -45,3 +45,48 @@ This driver provides ProcessorPwrWatts and CurrPwrWatts:
On multi-node processors the calculated value is for the entire
package and not for a single node. Thus the driver creates sysfs
attributes only for internal node0 of a multi-node processor.
+
+2) Accumulated Power Mechanism
+
+This driver also introduces an algorithm that should be used to
+calculate the average power consumed by a processor during a
+measurement interval Tm. The feature of accumulated power mechanism is
+indicated by CPUID Fn8000_0007_EDX[12].
+
+* Tsample: compute unit power accumulator sample period
+* Tref: the PTSC counter period
+* PTSC: performance timestamp counter
+* N: the ratio of compute unit power accumulator sample period to the
+ PTSC period
+* Jmax: max compute unit accumulated power which is indicated by
+ MaxCpuSwPwrAcc MSR C001007b
+* Jx/Jy: compute unit accumulated power which is indicated by
+ CpuSwPwrAcc MSR C001007a
+* Tx/Ty: the value of performance timestamp counter which is indicated
+ by CU_PTSC MSR C0010280
+* PwrCPUave: CPU average power
+
+i. Determine the ratio of Tsample to Tref by executing CPUID Fn8000_0007.
+ N = value of CPUID Fn8000_0007_ECX[CpuPwrSampleTimeRatio[15:0]].
+
+ii. Read the full range of the cumulative energy value from the new
+MSR MaxCpuSwPwrAcc.
+ Jmax = value returned.
+iii. At time x, SW reads CpuSwPwrAcc MSR and samples the PTSC.
+ Jx = value read from CpuSwPwrAcc and Tx = value read from
+PTSC.
+
+iv. At time y, SW reads CpuSwPwrAcc MSR and samples the PTSC.
+ Jy = value read from CpuSwPwrAcc and Ty = value read from
+PTSC.
+
+v. Calculate the average power consumption for a compute unit over
+time period (y-x). Unit of result is uWatt.
+ if (Jy < Jx) // Rollover has occurred
+ Jdelta = (Jy + Jmax) - Jx
+ else
+ Jdelta = Jy - Jx
+ PwrCPUave = N * Jdelta * 1000 / (Ty - Tx)
+
+This driver provides PwrCPUave:
+* power1_acc (PwrCPUave)
--
1.9.1

2015-08-27 08:26:19

by Huang Rui

[permalink] [raw]
Subject: [PATCH 15/15] MAINTAINERS: change the maintainer of fam15h_power driver

Andreas Herrmann won't take the maintainer of fam15h_power driver. I
will take it and appreciate him for the great contributions on this
driver.

Signed-off-by: Huang Rui <[email protected]>
Cc: Andreas Herrmann <[email protected]>
---
MAINTAINERS | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 941d7b7..241d45e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -601,9 +601,9 @@ F: drivers/crypto/ccp/
F: include/linux/ccp.h

AMD FAM15H PROCESSOR POWER MONITORING DRIVER
-M: Andreas Herrmann <[email protected]>
+M: Huang Rui <[email protected]>
L: [email protected]
-S: Maintained
+S: Supported
F: Documentation/hwmon/fam15h_power
F: drivers/hwmon/fam15h_power.c

--
1.9.1

2015-08-27 14:35:15

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 01/15] hwmon, fam15h_power: add support for AMD Carrizo

On 08/27/2015 01:07 AM, Huang Rui wrote:
> AMD Carrizo(Fam15h, M60h) processors can report power1_crit
> (ProcessorPwrWatts) and power1_input (CurrPwrWatts) values.
> And this patch adds support for CZ.
>
> Signed-off-by: Huang Rui <[email protected]>

Applied to hwmon-next.

Thanks,
Guenter

2015-08-27 14:35:37

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 02/15] hwmon, fam15h_power: rename fam15h_power_is_internal_node0 function

On 08/27/2015 01:07 AM, Huang Rui wrote:
> We rename fam15h_power_is_internal_node0() function to
> should_load_on_this_node(), because it may not be node0 from KV and
> on, and they are single-node processors.
>
> Signed-off-by: Huang Rui <[email protected]>

Applied to hwmon-next.

Thanks,
Guenter

2015-08-27 14:46:24

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 03/15] hwmon, fam15h_power: refactor attributes for dynamically added

On 08/27/2015 01:07 AM, Huang Rui wrote:
> Attributes depend on the CPU model the driver gets loaded on.
> Therefore, add those attributes dynamically at init time. This is more
> flexible to control the different attributes on different platforms.
>
> Signed-off-by: Huang Rui <[email protected]>
> ---
> drivers/hwmon/fam15h_power.c | 49 +++++++++++++++++++++++++++-----------------
> 1 file changed, 30 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
> index 820adf1..65ffb06 100644
> --- a/drivers/hwmon/fam15h_power.c
> +++ b/drivers/hwmon/fam15h_power.c
> @@ -41,6 +41,8 @@ MODULE_LICENSE("GPL");
> #define REG_TDP_RUNNING_AVERAGE 0xe0
> #define REG_TDP_LIMIT3 0xe8
>
> +#define FAM15H_MIN_POWER_GROUPS 2

This should be something like FAM15H_MIN_NUM_ATTRS.
There is only one group with a variable number of attributes.

> +
> struct fam15h_power_data {
> struct pci_dev *pdev;
> unsigned int tdp_to_watts;
> @@ -93,29 +95,35 @@ static ssize_t show_power_crit(struct device *dev,
> }
> static DEVICE_ATTR(power1_crit, S_IRUGO, show_power_crit, NULL);
>
> -static umode_t fam15h_power_is_visible(struct kobject *kobj,
> - struct attribute *attr,
> - int index)
> +static struct attribute_group fam15h_power_group;
> +__ATTRIBUTE_GROUPS(fam15h_power);
> +
> +static int fam15h_power_init_attrs(struct pci_dev *pdev)
> {
> - /* power1_input is only reported for Fam15h, Models 00h-0fh */
> - if (attr == &dev_attr_power1_input.attr &&
> - (boot_cpu_data.x86 != 0x15 || boot_cpu_data.x86_model > 0xf))
> - return 0;
> + int n = FAM15H_MIN_POWER_GROUPS;
> + struct attribute **fam15h_power_attrs;
>
> - return attr->mode;
> -}
> + if (boot_cpu_data.x86 == 0x15 && boot_cpu_data.x86_model <= 0xf)
> + n += 1;
>
> -static struct attribute *fam15h_power_attrs[] = {
> - &dev_attr_power1_input.attr,
> - &dev_attr_power1_crit.attr,
> - NULL
> -};
> + fam15h_power_attrs = devm_kcalloc(&pdev->dev, n,
> + sizeof(*fam15h_power_attrs),
> + GFP_KERNEL);
>
> -static const struct attribute_group fam15h_power_group = {
> - .attrs = fam15h_power_attrs,
> - .is_visible = fam15h_power_is_visible,
> -};
> -__ATTRIBUTE_GROUPS(fam15h_power);
> + if (!fam15h_power_attrs) {
> + dev_err(&pdev->dev, "failed to alloc fam15h_power_attrs\n");

The infrastructure already dumps a message for memory allocation errors.
No need for another one.

> + return -ENOMEM;
> + }
> +
> + n = 0;
> + fam15h_power_attrs[n++] = &dev_attr_power1_crit.attr;
> + if (boot_cpu_data.x86 == 0x15 && boot_cpu_data.x86_model <= 0xf)
> + fam15h_power_attrs[n++] = &dev_attr_power1_input.attr;
> +
> + fam15h_power_group.attrs = fam15h_power_attrs;
> +
Assuming this will be called for each CPU in a multi-CPU system,
this will be overwritten each time a new CPU comes online.
fam15h_power_group and fam15h_power_groups probably need to be moved
into fam15h_power_data to avoid that. In essence, there should only
be read-only static variables. Everything else should be allocated.

> + return 0;
> +}
>
> static bool should_load_on_this_node(struct pci_dev *f4)
> {
> @@ -221,6 +229,9 @@ static int fam15h_power_probe(struct pci_dev *pdev,
> if (!data)
> return -ENOMEM;
>
> + if (fam15h_power_init_attrs(pdev))
> + return -ENOMEM;

This should return the error code from fam15h_power_init_attrs().

> +
> fam15h_power_init_data(pdev, data);
> data->pdev = pdev;
>
>

2015-08-27 14:48:17

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 04/15] hwmon, fam15h_power: update running_avg_capture bit field to 28

On 08/27/2015 01:07 AM, Huang Rui wrote:
> On Carrizo and later platforms, running_avg_capture bit field is
> extended to 4:31 (28 bits) from 4:25.
>
> Signed-off-by: Huang Rui <[email protected]>

Applied to hwmon-next.

Thanks,
Guenter

2015-08-27 14:50:33

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 05/15] hwmon, fam15h_power: enable power1_input on AMD Carrizo

On 08/27/2015 01:07 AM, Huang Rui wrote:
> This patch enables power1_input attribute for Carrizo platform.
>
> Signed-off-by: Huang Rui <[email protected]>

Needs rebase to updated #3, otherwise looks good.

Thanks,
Guenter

> ---
> drivers/hwmon/fam15h_power.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
> index 3e9b3b9..55b25ef 100644
> --- a/drivers/hwmon/fam15h_power.c
> +++ b/drivers/hwmon/fam15h_power.c
> @@ -113,8 +113,11 @@ static int fam15h_power_init_attrs(struct pci_dev *pdev)
> {
> int n = FAM15H_MIN_POWER_GROUPS;
> struct attribute **fam15h_power_attrs;
> + struct cpuinfo_x86 *c = &boot_cpu_data;
>
> - if (boot_cpu_data.x86 == 0x15 && boot_cpu_data.x86_model <= 0xf)
> + if (c->x86 == 0x15 &&
> + ((c->x86_model <= 0xf) ||
> + (c->x86_model >= 0x60 && c->x86_model <= 0x6f)))
> n += 1;
>
> fam15h_power_attrs = devm_kcalloc(&pdev->dev, n,
> @@ -128,7 +131,9 @@ static int fam15h_power_init_attrs(struct pci_dev *pdev)
>
> n = 0;
> fam15h_power_attrs[n++] = &dev_attr_power1_crit.attr;
> - if (boot_cpu_data.x86 == 0x15 && boot_cpu_data.x86_model <= 0xf)
> + if (c->x86 == 0x15 &&
> + ((c->x86_model <= 0xf) ||
> + (c->x86_model >= 0x60 && c->x86_model <= 0x6f)))
> fam15h_power_attrs[n++] = &dev_attr_power1_input.attr;
>
> fam15h_power_group.attrs = fam15h_power_attrs;
>

2015-08-27 14:52:02

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 06/15] hwmon, fam15h_power: add documentation for new processors support

On 08/27/2015 01:07 AM, Huang Rui wrote:
> This patch updates description of fam15h_power driver, its scope is
> extended to family 16h processsors.
>
> Signed-off-by: Huang Rui <[email protected]>

Applied to hwmon-next.

Thanks,
Guenter

2015-08-27 14:54:36

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 07/15] hwmon, fam15h_power: add ratio of Tsample to the PTSC period

On 08/27/2015 01:07 AM, Huang Rui wrote:
> This patch adds a member (cpu_pwr_sample_ratio) of fam15h_power_data,
> that represents the ratio of compute unit power accumulator sample
> period to the PTSC counter period.
>
> Tsample: compute unit power accumulator sample period
> Tref: the performance timestamp counter period
> PTSC: performance timestamp counter
>
> Signed-off-by: Huang Rui <[email protected]>

Applied to hwmon-next (after fixing spelling error 'detemine').

Thanks,
Guenter

> ---
> drivers/hwmon/fam15h_power.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
> index 55b25ef..d6efcf6 100644
> --- a/drivers/hwmon/fam15h_power.c
> +++ b/drivers/hwmon/fam15h_power.c
> @@ -48,6 +48,7 @@ struct fam15h_power_data {
> unsigned int tdp_to_watts;
> unsigned int base_tdp;
> unsigned int processor_pwr_watts;
> + unsigned int cpu_pwr_sample_ratio;
> };
>
> static ssize_t show_power(struct device *dev,
> @@ -201,7 +202,7 @@ static int fam15h_power_resume(struct pci_dev *pdev)
> static void fam15h_power_init_data(struct pci_dev *f4,
> struct fam15h_power_data *data)
> {
> - u32 val;
> + u32 val, eax, ebx, ecx, edx;
> u64 tmp;
>
> pci_read_config_dword(f4, REG_PROCESSOR_TDP, &val);
> @@ -222,6 +223,19 @@ static void fam15h_power_init_data(struct pci_dev *f4,
>
> /* convert to microWatt */
> data->processor_pwr_watts = (tmp * 15625) >> 10;
> +
> + cpuid(0x80000007, &eax, &ebx, &ecx, &edx);
> +
> + /* CPUID Fn8000_0007:EDX[12] indicates to support accumulated power */
> + if (!(edx & BIT(12)))
> + return;
> +
> + /*
> + * detemine the ratio of the compute unit power accumulator
> + * sample period to the PTSC counter period by executing CPUID
> + * Fn8000_0007:ECX
> + */
> + data->cpu_pwr_sample_ratio = ecx;
> }
>
> static int fam15h_power_probe(struct pci_dev *pdev,
>

2015-08-27 14:56:40

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 08/15] hwmon, fam15h_power: add max compute unit accumulated power

On 08/27/2015 01:07 AM, Huang Rui wrote:
> This patch adds a member in fam15h_power_data which specifies the
> maximum accumulated power in a compute unit.
>
> Signed-off-by: Huang Rui <[email protected]>
> ---
> drivers/hwmon/fam15h_power.c | 24 ++++++++++++++++++++----
> 1 file changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
> index d6efcf6..fdfa18e 100644
> --- a/drivers/hwmon/fam15h_power.c
> +++ b/drivers/hwmon/fam15h_power.c
> @@ -26,6 +26,7 @@
> #include <linux/pci.h>
> #include <linux/bitops.h>
> #include <asm/processor.h>
> +#include <asm/msr.h>
>
> MODULE_DESCRIPTION("AMD Family 15h CPU processor power monitor");
> MODULE_AUTHOR("Andreas Herrmann <[email protected]>");
> @@ -43,12 +44,16 @@ MODULE_LICENSE("GPL");
>
> #define FAM15H_MIN_POWER_GROUPS 2
>
> +#define MSR_F15H_CU_MAX_PWR_ACCUMULATOR 0xc001007b
> +
> struct fam15h_power_data {
> struct pci_dev *pdev;
> unsigned int tdp_to_watts;
> unsigned int base_tdp;
> unsigned int processor_pwr_watts;
> unsigned int cpu_pwr_sample_ratio;
> + /* maximum accumulated power of a compute unit */
> + u64 max_cu_acc_power;
> };
>
> static ssize_t show_power(struct device *dev,
> @@ -199,8 +204,8 @@ static int fam15h_power_resume(struct pci_dev *pdev)
> #define fam15h_power_resume NULL
> #endif
>
> -static void fam15h_power_init_data(struct pci_dev *f4,
> - struct fam15h_power_data *data)
> +static int fam15h_power_init_data(struct pci_dev *f4,
> + struct fam15h_power_data *data)
> {
> u32 val, eax, ebx, ecx, edx;
> u64 tmp;
> @@ -228,7 +233,7 @@ static void fam15h_power_init_data(struct pci_dev *f4,
>
> /* CPUID Fn8000_0007:EDX[12] indicates to support accumulated power */
> if (!(edx & BIT(12)))
> - return;
> + return 0;
>
> /*
> * detemine the ratio of the compute unit power accumulator
> @@ -236,6 +241,15 @@ static void fam15h_power_init_data(struct pci_dev *f4,
> * Fn8000_0007:ECX
> */
> data->cpu_pwr_sample_ratio = ecx;
> +
> + if (rdmsrl_safe(MSR_F15H_CU_MAX_PWR_ACCUMULATOR, &tmp)) {
> + pr_err("Failed to read max compute unit power accumulator MSR\n");
> + return -ENODEV;
> + }
> +
> + data->max_cu_acc_power = tmp;
> +
> + return 0;
> }
>
> static int fam15h_power_probe(struct pci_dev *pdev,
> @@ -262,7 +276,9 @@ static int fam15h_power_probe(struct pci_dev *pdev,
> if (fam15h_power_init_attrs(pdev))
> return -ENOMEM;
>
> - fam15h_power_init_data(pdev, data);
> + if (fam15h_power_init_data(pdev, data))
> + return -ENODEV;
> +
This should return the error code from fam15h_power_init_data().

Thanks,
Guenter

2015-08-27 17:27:55

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On Thu, Aug 27, 2015 at 04:07:40PM +0800, Huang Rui wrote:
> Add an accessor function amd_get_cores_per_cu() which returns the
> number of cores per compute unit.
>
> In a subsequent patch, we will use this function in fam15h_power
> driver.
>
> Signed-off-by: Huang Rui <[email protected]>
> ---
> arch/x86/include/asm/processor.h | 1 +
> arch/x86/kernel/cpu/amd.c | 19 +++++++++++++++++--
> 2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 19577dd..831ad682 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -810,6 +810,7 @@ static inline int mpx_disable_management(void)
>
> extern u16 amd_get_nb_id(int cpu);
> extern u32 amd_get_nodes_per_socket(void);
> +extern u32 amd_get_cores_per_cu(void);
>
> static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
> {
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index 51ad2af..8ab939a 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -26,6 +26,9 @@
> */
> static u32 nodes_per_socket = 1;
>
> +/* cores_per_cu: stores the number of cores per compute unit */
> +static u32 cores_per_cu = 1;
> +
Is this value going to be constant even if there are multiple CPUs
in the system ? In other words, if there are multiple CPUs, do
they always have to have the same number of cores per CU ?

Thanks,
Guenter

> static inline int rdmsrl_amd_safe(unsigned msr, unsigned long long *p)
> {
> u32 gprs[8] = { 0 };
> @@ -298,7 +301,6 @@ static int nearby_node(int apicid)
> #ifdef CONFIG_SMP
> static void amd_get_topology(struct cpuinfo_x86 *c)
> {
> - u32 cores_per_cu = 1;
> u8 node_id;
> int cpu = smp_processor_id();
>
> @@ -313,7 +315,6 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
> /* get compute unit information */
> smp_num_siblings = ((ebx >> 8) & 3) + 1;
> c->compute_unit_id = ebx & 0xff;
> - cores_per_cu += ((ebx >> 8) & 3);
> } else if (cpu_has(c, X86_FEATURE_NODEID_MSR)) {
> u64 value;
>
> @@ -379,6 +380,13 @@ u32 amd_get_nodes_per_socket(void)
> }
> EXPORT_SYMBOL_GPL(amd_get_nodes_per_socket);
>
> +/* this function returns the number of cores per compute unit */
> +u32 amd_get_cores_per_cu(void)
> +{
> + return cores_per_cu;
> +}
> +EXPORT_SYMBOL_GPL(amd_get_cores_per_cu);
> +
> static void srat_detect_node(struct cpuinfo_x86 *c)
> {
> #ifdef CONFIG_NUMA
> @@ -506,6 +514,13 @@ static void bsp_init_amd(struct cpuinfo_x86 *c)
> /* A random value per boot for bit slice [12:upper_bit) */
> va_align.bits = get_random_int() & va_align.mask;
> }
> +
> + if (cpu_has_topoext) {
> + u32 cpuid;
> +
> + cpuid = cpuid_ebx(0x8000001e);
> + cores_per_cu += ((cpuid >> 8) & 3);
> + }
> }
>
> static void early_init_amd(struct cpuinfo_x86 *c)
> --
> 1.9.1
>

2015-08-27 17:30:47

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 12/15] hwmon, fam15h_power: introduce a cpu accumulated power reporting algorithm

On Thu, Aug 27, 2015 at 04:07:43PM +0800, Huang Rui wrote:
> This patch introduces an algorithm that computes the average power by
> reading a delta value of “core power accumulator” register during
> measurement interval, and then dividing delta value by the length of
> the time interval.
>
> User is able to use power1_acc entry to measure the processor power
> consumption and power1_acc just needs to be read twice with an needed
> interval in-between.
>
> A simple example:
>
> $ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
> $ sleep 10000s
> $ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
>
> The result is current average processor power consumption in 10000
> seconds. The unit of the result is uWatt.
>
> Signed-off-by: Huang Rui <[email protected]>
> ---
> drivers/hwmon/fam15h_power.c | 62 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 62 insertions(+)
>
> diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
> index d529e4b..3bab797 100644
> --- a/drivers/hwmon/fam15h_power.c
> +++ b/drivers/hwmon/fam15h_power.c
> @@ -60,6 +60,7 @@ struct fam15h_power_data {
> u64 cu_acc_power[MAX_CUS];
> /* performance timestamp counter */
> u64 cpu_sw_pwr_ptsc[MAX_CUS];
> + struct mutex acc_pwr_mutex;
> };
>
> static ssize_t show_power(struct device *dev,
> @@ -121,17 +122,74 @@ static DEVICE_ATTR(power1_crit, S_IRUGO, show_power_crit, NULL);
> static struct attribute_group fam15h_power_group;
> __ATTRIBUTE_GROUPS(fam15h_power);
>
> +static ssize_t show_power_acc(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + int cpu, cu, cu_num, cores_per_cu;
> + u64 curr_cu_acc_power[MAX_CUS],
> + curr_ptsc[MAX_CUS], jdelta[MAX_CUS];
> + u64 tdelta, avg_acc;
> + struct fam15h_power_data *data = dev_get_drvdata(dev);
> +
> + cores_per_cu = amd_get_cores_per_cu();
> + cu_num = boot_cpu_data.x86_max_cores / cores_per_cu;
> +
> + for (cpu = 0, avg_acc = 0; cpu < cu_num * cores_per_cu; cpu += cores_per_cu) {
> + cu = cpu / cores_per_cu;
> + if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_PTSC, &curr_ptsc[cu])) {
> + pr_err("Failed to read PTSC counter MSR on core%d\n",
> + cpu);
> + return 0;
> + }
> +
> + if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_CU_PWR_ACCUMULATOR,
> + &curr_cu_acc_power[cu])) {
> + pr_err("Failed to read compute unit power accumulator MSR on core%d\n",
> + cpu);
> + return 0;
> + }
> +
> + if (curr_cu_acc_power[cu] < data->cu_acc_power[cu]) {
> + jdelta[cu] = data->max_cu_acc_power + curr_cu_acc_power[cu];
> + jdelta[cu] -= data->cu_acc_power[cu];
> + } else {
> + jdelta[cu] = curr_cu_acc_power[cu] - data->cu_acc_power[cu];
> + }
> + tdelta = curr_ptsc[cu] - data->cpu_sw_pwr_ptsc[cu];
> + jdelta[cu] *= data->cpu_pwr_sample_ratio * 1000;
> + do_div(jdelta[cu], tdelta);
> +
> + mutex_lock(&data->acc_pwr_mutex);
> + data->cu_acc_power[cu] = curr_cu_acc_power[cu];
> + data->cpu_sw_pwr_ptsc[cu] = curr_ptsc[cu];
> + mutex_unlock(&data->acc_pwr_mutex);
> +
> + /* the unit is microWatt */
> + avg_acc += jdelta[cu];
> + }
> +
> + return sprintf(buf, "%u\n", (unsigned int) avg_acc);
> +}
> +static DEVICE_ATTR(power1_acc, S_IRUGO, show_power_acc, NULL);

I am not really a friend of introducing a non-standard attribute.
Does the energy attribute not work here ?

Thanks,
Guenter

2015-08-28 06:48:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On Thu, Aug 27, 2015 at 04:07:40PM +0800, Huang Rui wrote:
> Add an accessor function amd_get_cores_per_cu() which returns the
> number of cores per compute unit.
>
> In a subsequent patch, we will use this function in fam15h_power
> driver.
>
> Signed-off-by: Huang Rui <[email protected]>
> ---
> arch/x86/include/asm/processor.h | 1 +
> arch/x86/kernel/cpu/amd.c | 19 +++++++++++++++++--
> 2 files changed, 18 insertions(+), 2 deletions(-)

Btw, this needs an ACK from a tip person if it goes through the hwmon
tree.

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-08-28 08:00:37

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On 08/27/2015 11:48 PM, Borislav Petkov wrote:
> On Thu, Aug 27, 2015 at 04:07:40PM +0800, Huang Rui wrote:
>> Add an accessor function amd_get_cores_per_cu() which returns the
>> number of cores per compute unit.
>>
>> In a subsequent patch, we will use this function in fam15h_power
>> driver.
>>
>> Signed-off-by: Huang Rui <[email protected]>
>> ---
>> arch/x86/include/asm/processor.h | 1 +
>> arch/x86/kernel/cpu/amd.c | 19 +++++++++++++++++--
>> 2 files changed, 18 insertions(+), 2 deletions(-)
>
> Btw, this needs an ACK from a tip person if it goes through the hwmon
> tree.
>
Yes, most definitely.

Guenter

2015-08-28 08:03:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 10/15] hwmon, fam15h_power: add compute unit accumulated power


* Huang Rui <[email protected]> wrote:

> This patch adds a member in fam15h_power_data which specifies the
> compute unit accumulated power.
>
> Signed-off-by: Huang Rui <[email protected]>
> ---
> drivers/hwmon/fam15h_power.c | 31 +++++++++++++++++++++++++++----
> 1 file changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
> index fdfa18e..f5ff48f 100644
> --- a/drivers/hwmon/fam15h_power.c
> +++ b/drivers/hwmon/fam15h_power.c
> @@ -43,7 +43,9 @@ MODULE_LICENSE("GPL");
> #define REG_TDP_LIMIT3 0xe8
>
> #define FAM15H_MIN_POWER_GROUPS 2
> +#define MAX_CUS 8
>
> +#define MSR_F15H_CU_PWR_ACCUMULATOR 0xc001007a
> #define MSR_F15H_CU_MAX_PWR_ACCUMULATOR 0xc001007b
>
> struct fam15h_power_data {
> @@ -54,6 +56,8 @@ struct fam15h_power_data {
> unsigned int cpu_pwr_sample_ratio;
> /* maximum accumulated power of a compute unit */
> u64 max_cu_acc_power;
> + /* accumulated power of the compute units */
> + u64 cu_acc_power[MAX_CUS];
> };
>
> static ssize_t show_power(struct device *dev,
> @@ -207,6 +211,7 @@ static int fam15h_power_resume(struct pci_dev *pdev)
> static int fam15h_power_init_data(struct pci_dev *f4,
> struct fam15h_power_data *data)
> {
> + int cu_num, cores_per_cu, cpu, cu;
> u32 val, eax, ebx, ecx, edx;
> u64 tmp;
>
> @@ -249,6 +254,21 @@ static int fam15h_power_init_data(struct pci_dev *f4,
>
> data->max_cu_acc_power = tmp;
>
> + cores_per_cu = amd_get_cores_per_cu();
> + cu_num = boot_cpu_data.x86_max_cores / cores_per_cu;
> +
> + WARN_ON_ONCE(cu_num > MAX_CUS);
> +
> + for (cpu = 0; cpu < cu_num * cores_per_cu; cpu += cores_per_cu) {

so 'cu_num * cores_per_cu' is really a roundabout way to say
boot_cpu_data.x86_max_cores?

> + cu = cpu / cores_per_cu;
> + if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_CU_PWR_ACCUMULATOR,
> + &data->cu_acc_power[cu])) {
> + pr_err("Failed to read compute unit power accumulator MSR on core%d\n",
> + cpu);

Please don't break printk lines mid-line - ignore checkpatch in this case.

Also, the message talks about 'core', while a CPU ID is printed.

Thanks,

Ingo

2015-08-28 08:04:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit


* Borislav Petkov <[email protected]> wrote:

> On Thu, Aug 27, 2015 at 04:07:40PM +0800, Huang Rui wrote:
> > Add an accessor function amd_get_cores_per_cu() which returns the
> > number of cores per compute unit.
> >
> > In a subsequent patch, we will use this function in fam15h_power
> > driver.
> >
> > Signed-off-by: Huang Rui <[email protected]>
> > ---
> > arch/x86/include/asm/processor.h | 1 +
> > arch/x86/kernel/cpu/amd.c | 19 +++++++++++++++++--
> > 2 files changed, 18 insertions(+), 2 deletions(-)
>
> Btw, this needs an ACK from a tip person if it goes through the hwmon
> tree.

Looks good to me in theory.

I suspect we might want to factor the 'compute unit' logic out a bit more if usage
becomes more widespread - but right now it's hwmon drivers only,right?

Thanks,

Ingo

2015-08-28 08:56:45

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On Fri, Aug 28, 2015 at 10:04:18AM +0200, Ingo Molnar wrote:
> Looks good to me in theory.

Thanks.

> I suspect we might want to factor the 'compute unit' logic out a bit
> more if usage becomes more widespread - but right now it's hwmon
> drivers only,right?

Yeah.

My angle is to extend stuff as we go and only when we need it.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-08-28 10:07:28

by Huang Rui

[permalink] [raw]
Subject: Re: [PATCH 03/15] hwmon, fam15h_power: refactor attributes for dynamically added

On Thu, Aug 27, 2015 at 07:46:18AM -0700, Guenter Roeck wrote:
> On 08/27/2015 01:07 AM, Huang Rui wrote:
> >
> >diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
> >index 820adf1..65ffb06 100644
> >--- a/drivers/hwmon/fam15h_power.c
> >+++ b/drivers/hwmon/fam15h_power.c
> >@@ -41,6 +41,8 @@ MODULE_LICENSE("GPL");
> > #define REG_TDP_RUNNING_AVERAGE 0xe0
> > #define REG_TDP_LIMIT3 0xe8
> >
> >+#define FAM15H_MIN_POWER_GROUPS 2
>
> This should be something like FAM15H_MIN_NUM_ATTRS.
> There is only one group with a variable number of attributes.
>

Right, it should be attribute. I will rename this macro.

> >+ n = 0;
> >+ fam15h_power_attrs[n++] = &dev_attr_power1_crit.attr;
> >+ if (boot_cpu_data.x86 == 0x15 && boot_cpu_data.x86_model <= 0xf)
> >+ fam15h_power_attrs[n++] = &dev_attr_power1_input.attr;
> >+
> >+ fam15h_power_group.attrs = fam15h_power_attrs;
> >+
> Assuming this will be called for each CPU in a multi-CPU system,
> this will be overwritten each time a new CPU comes online.
> fam15h_power_group and fam15h_power_groups probably need to be moved
> into fam15h_power_data to avoid that. In essence, there should only
> be read-only static variables. Everything else should be allocated.
>

Yes, setting fam15h_power_group as static and not const would cause to
overwrite in multi-CPU platforms. I don't have to set it as read-only
because I use dynamical attributes. :)

I will move them into fam15h_power_data at v2.

> >+ return 0;
> >+}
> >
> > static bool should_load_on_this_node(struct pci_dev *f4)
> > {
> >@@ -221,6 +229,9 @@ static int fam15h_power_probe(struct pci_dev *pdev,
> > if (!data)
> > return -ENOMEM;
> >
> >+ if (fam15h_power_init_attrs(pdev))
> >+ return -ENOMEM;
>
> This should return the error code from fam15h_power_init_attrs().
>

Actually, I do this at the following patch of this serial. But never
mind, I will update it here at v2.

Thanks,
Rui

2015-08-28 10:20:11

by Huang Rui

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On Fri, Aug 28, 2015 at 04:04:18PM +0800, Ingo Molnar wrote:
>
> * Borislav Petkov <[email protected]> wrote:
>
> > On Thu, Aug 27, 2015 at 04:07:40PM +0800, Huang Rui wrote:
> > > Add an accessor function amd_get_cores_per_cu() which returns the
> > > number of cores per compute unit.
> > >
> > > In a subsequent patch, we will use this function in fam15h_power
> > > driver.
> > >
> > > Signed-off-by: Huang Rui <[email protected]>
> > > ---
> > > arch/x86/include/asm/processor.h | 1 +
> > > arch/x86/kernel/cpu/amd.c | 19 +++++++++++++++++--
> > > 2 files changed, 18 insertions(+), 2 deletions(-)
> >
> > Btw, this needs an ACK from a tip person if it goes through the hwmon
> > tree.
>
> Looks good to me in theory.
>
> I suspect we might want to factor the 'compute unit' logic out a bit more if usage
> becomes more widespread - but right now it's hwmon drivers only,right?
>

Actually, the cores with the same compute unit share some resources
such as power domain, it might be useful in other parts not only for
hwmon. :)

Thanks,
Rui

2015-08-28 10:30:08

by Huang Rui

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On Thu, Aug 27, 2015 at 10:27:46AM -0700, Guenter Roeck wrote:
> On Thu, Aug 27, 2015 at 04:07:40PM +0800, Huang Rui wrote:
> > Add an accessor function amd_get_cores_per_cu() which returns the
> > number of cores per compute unit.
> >
> > In a subsequent patch, we will use this function in fam15h_power
> > driver.
> >
> > Signed-off-by: Huang Rui <[email protected]>
> > ---
> > arch/x86/include/asm/processor.h | 1 +
> > arch/x86/kernel/cpu/amd.c | 19 +++++++++++++++++--
> > 2 files changed, 18 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> > index 19577dd..831ad682 100644
> > --- a/arch/x86/include/asm/processor.h
> > +++ b/arch/x86/include/asm/processor.h
> > @@ -810,6 +810,7 @@ static inline int mpx_disable_management(void)
> >
> > extern u16 amd_get_nb_id(int cpu);
> > extern u32 amd_get_nodes_per_socket(void);
> > +extern u32 amd_get_cores_per_cu(void);
> >
> > static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
> > {
> > diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> > index 51ad2af..8ab939a 100644
> > --- a/arch/x86/kernel/cpu/amd.c
> > +++ b/arch/x86/kernel/cpu/amd.c
> > @@ -26,6 +26,9 @@
> > */
> > static u32 nodes_per_socket = 1;
> >
> > +/* cores_per_cu: stores the number of cores per compute unit */
> > +static u32 cores_per_cu = 1;
> > +
> Is this value going to be constant even if there are multiple CPUs
> in the system ? In other words, if there are multiple CPUs, do
> they always have to have the same number of cores per CU ?
>

Yes, the same type of processors have the fixed number of cores per
compute unit. Currently, the number is 2, but it might be more in
future.

Thanks,
Rui

2015-08-28 10:43:52

by Huang Rui

[permalink] [raw]
Subject: Re: [PATCH 10/15] hwmon, fam15h_power: add compute unit accumulated power

On Fri, Aug 28, 2015 at 10:03:35AM +0200, Ingo Molnar wrote:
>
> * Huang Rui <[email protected]> wrote:
>
> > @@ -249,6 +254,21 @@ static int fam15h_power_init_data(struct pci_dev *f4,
> >
> > data->max_cu_acc_power = tmp;
> >
> > + cores_per_cu = amd_get_cores_per_cu();
> > + cu_num = boot_cpu_data.x86_max_cores / cores_per_cu;
> > +
> > + WARN_ON_ONCE(cu_num > MAX_CUS);
> > +
> > + for (cpu = 0; cpu < cu_num * cores_per_cu; cpu += cores_per_cu) {
>
> so 'cu_num * cores_per_cu' is really a roundabout way to say
> boot_cpu_data.x86_max_cores?
>

Oh, yes. :)
I will update it at v2.

> > + cu = cpu / cores_per_cu;
> > + if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_CU_PWR_ACCUMULATOR,
> > + &data->cu_acc_power[cu])) {
> > + pr_err("Failed to read compute unit power accumulator MSR on core%d\n",
> > + cpu);
>
> Please don't break printk lines mid-line - ignore checkpatch in this case.
>

OK, I got it.

> Also, the message talks about 'core', while a CPU ID is printed.
>

Yes, but actually this value is for the compute unit which the core
belongs to.
E.X. the MSR_F15H_CU_PWR_ACCUMULATOR value is the same between core 0
and core 1. Because they belong to the same compute unit.

Thanks,
Rui

2015-08-28 10:47:15

by Huang Rui

[permalink] [raw]
Subject: Re: [PATCH 12/15] hwmon, fam15h_power: introduce a cpu accumulated power reporting algorithm

On Thu, Aug 27, 2015 at 10:30:43AM -0700, Guenter Roeck wrote:
> On Thu, Aug 27, 2015 at 04:07:43PM +0800, Huang Rui wrote:
> > This patch introduces an algorithm that computes the average power by
> > reading a delta value of “core power accumulator” register during
> > measurement interval, and then dividing delta value by the length of
> > the time interval.
> >
> > User is able to use power1_acc entry to measure the processor power
> > consumption and power1_acc just needs to be read twice with an needed
> > interval in-between.
> >
> > A simple example:
> >
> > $ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
> > $ sleep 10000s
> > $ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
> >
> > The result is current average processor power consumption in 10000
> > seconds. The unit of the result is uWatt.
> >
> > Signed-off-by: Huang Rui <[email protected]>
> > ---
> > drivers/hwmon/fam15h_power.c | 62 ++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 62 insertions(+)
> >
> > diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
> > index d529e4b..3bab797 100644
> > --- a/drivers/hwmon/fam15h_power.c
> > +++ b/drivers/hwmon/fam15h_power.c
> > @@ -60,6 +60,7 @@ struct fam15h_power_data {
> > u64 cu_acc_power[MAX_CUS];
> > /* performance timestamp counter */
> > u64 cpu_sw_pwr_ptsc[MAX_CUS];
> > + struct mutex acc_pwr_mutex;
> > };
> >
> > static ssize_t show_power(struct device *dev,
> > @@ -121,17 +122,74 @@ static DEVICE_ATTR(power1_crit, S_IRUGO, show_power_crit, NULL);
> > static struct attribute_group fam15h_power_group;
> > __ATTRIBUTE_GROUPS(fam15h_power);
> >
> > +static ssize_t show_power_acc(struct device *dev,
> > + struct device_attribute *attr, char *buf)
> > +{
> > + int cpu, cu, cu_num, cores_per_cu;
> > + u64 curr_cu_acc_power[MAX_CUS],
> > + curr_ptsc[MAX_CUS], jdelta[MAX_CUS];
> > + u64 tdelta, avg_acc;
> > + struct fam15h_power_data *data = dev_get_drvdata(dev);
> > +
> > + cores_per_cu = amd_get_cores_per_cu();
> > + cu_num = boot_cpu_data.x86_max_cores / cores_per_cu;
> > +
> > + for (cpu = 0, avg_acc = 0; cpu < cu_num * cores_per_cu; cpu += cores_per_cu) {
> > + cu = cpu / cores_per_cu;
> > + if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_PTSC, &curr_ptsc[cu])) {
> > + pr_err("Failed to read PTSC counter MSR on core%d\n",
> > + cpu);
> > + return 0;
> > + }
> > +
> > + if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_CU_PWR_ACCUMULATOR,
> > + &curr_cu_acc_power[cu])) {
> > + pr_err("Failed to read compute unit power accumulator MSR on core%d\n",
> > + cpu);
> > + return 0;
> > + }
> > +
> > + if (curr_cu_acc_power[cu] < data->cu_acc_power[cu]) {
> > + jdelta[cu] = data->max_cu_acc_power + curr_cu_acc_power[cu];
> > + jdelta[cu] -= data->cu_acc_power[cu];
> > + } else {
> > + jdelta[cu] = curr_cu_acc_power[cu] - data->cu_acc_power[cu];
> > + }
> > + tdelta = curr_ptsc[cu] - data->cpu_sw_pwr_ptsc[cu];
> > + jdelta[cu] *= data->cpu_pwr_sample_ratio * 1000;
> > + do_div(jdelta[cu], tdelta);
> > +
> > + mutex_lock(&data->acc_pwr_mutex);
> > + data->cu_acc_power[cu] = curr_cu_acc_power[cu];
> > + data->cpu_sw_pwr_ptsc[cu] = curr_ptsc[cu];
> > + mutex_unlock(&data->acc_pwr_mutex);
> > +
> > + /* the unit is microWatt */
> > + avg_acc += jdelta[cu];
> > + }
> > +
> > + return sprintf(buf, "%u\n", (unsigned int) avg_acc);
> > +}
> > +static DEVICE_ATTR(power1_acc, S_IRUGO, show_power_acc, NULL);
>
> I am not really a friend of introducing a non-standard attribute.
> Does the energy attribute not work here ?
>

You're right. Non-standard attribute might not be good. Could you
please give me some hints if I use "energy" instead?

Thanks,
Rui

2015-08-28 14:05:21

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 12/15] hwmon, fam15h_power: introduce a cpu accumulated power reporting algorithm

On 08/28/2015 03:45 AM, Huang Rui wrote:
> On Thu, Aug 27, 2015 at 10:30:43AM -0700, Guenter Roeck wrote:
>> On Thu, Aug 27, 2015 at 04:07:43PM +0800, Huang Rui wrote:
>>> This patch introduces an algorithm that computes the average power by
>>> reading a delta value of “core power accumulator” register during
>>> measurement interval, and then dividing delta value by the length of
>>> the time interval.
>>>
>>> User is able to use power1_acc entry to measure the processor power
>>> consumption and power1_acc just needs to be read twice with an needed
>>> interval in-between.
>>>
>>> A simple example:
>>>
>>> $ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
>>> $ sleep 10000s
>>> $ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
>>>
>>> The result is current average processor power consumption in 10000
>>> seconds. The unit of the result is uWatt.
>>>
>>> Signed-off-by: Huang Rui <[email protected]>
>>> ---
>>> drivers/hwmon/fam15h_power.c | 62 ++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 62 insertions(+)
>>>
>>> diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
>>> index d529e4b..3bab797 100644
>>> --- a/drivers/hwmon/fam15h_power.c
>>> +++ b/drivers/hwmon/fam15h_power.c
>>> @@ -60,6 +60,7 @@ struct fam15h_power_data {
>>> u64 cu_acc_power[MAX_CUS];
>>> /* performance timestamp counter */
>>> u64 cpu_sw_pwr_ptsc[MAX_CUS];
>>> + struct mutex acc_pwr_mutex;
>>> };
>>>
>>> static ssize_t show_power(struct device *dev,
>>> @@ -121,17 +122,74 @@ static DEVICE_ATTR(power1_crit, S_IRUGO, show_power_crit, NULL);
>>> static struct attribute_group fam15h_power_group;
>>> __ATTRIBUTE_GROUPS(fam15h_power);
>>>
>>> +static ssize_t show_power_acc(struct device *dev,
>>> + struct device_attribute *attr, char *buf)
>>> +{
>>> + int cpu, cu, cu_num, cores_per_cu;
>>> + u64 curr_cu_acc_power[MAX_CUS],
>>> + curr_ptsc[MAX_CUS], jdelta[MAX_CUS];
>>> + u64 tdelta, avg_acc;
>>> + struct fam15h_power_data *data = dev_get_drvdata(dev);
>>> +
>>> + cores_per_cu = amd_get_cores_per_cu();
>>> + cu_num = boot_cpu_data.x86_max_cores / cores_per_cu;
>>> +
>>> + for (cpu = 0, avg_acc = 0; cpu < cu_num * cores_per_cu; cpu += cores_per_cu) {
>>> + cu = cpu / cores_per_cu;
>>> + if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_PTSC, &curr_ptsc[cu])) {
>>> + pr_err("Failed to read PTSC counter MSR on core%d\n",
>>> + cpu);
>>> + return 0;
>>> + }
>>> +
>>> + if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_CU_PWR_ACCUMULATOR,
>>> + &curr_cu_acc_power[cu])) {
>>> + pr_err("Failed to read compute unit power accumulator MSR on core%d\n",
>>> + cpu);
>>> + return 0;
>>> + }
>>> +
>>> + if (curr_cu_acc_power[cu] < data->cu_acc_power[cu]) {
>>> + jdelta[cu] = data->max_cu_acc_power + curr_cu_acc_power[cu];
>>> + jdelta[cu] -= data->cu_acc_power[cu];
>>> + } else {
>>> + jdelta[cu] = curr_cu_acc_power[cu] - data->cu_acc_power[cu];
>>> + }
>>> + tdelta = curr_ptsc[cu] - data->cpu_sw_pwr_ptsc[cu];
>>> + jdelta[cu] *= data->cpu_pwr_sample_ratio * 1000;
>>> + do_div(jdelta[cu], tdelta);
>>> +
>>> + mutex_lock(&data->acc_pwr_mutex);
>>> + data->cu_acc_power[cu] = curr_cu_acc_power[cu];
>>> + data->cpu_sw_pwr_ptsc[cu] = curr_ptsc[cu];
>>> + mutex_unlock(&data->acc_pwr_mutex);
>>> +
>>> + /* the unit is microWatt */
>>> + avg_acc += jdelta[cu];
>>> + }
>>> +
>>> + return sprintf(buf, "%u\n", (unsigned int) avg_acc);
>>> +}
>>> +static DEVICE_ATTR(power1_acc, S_IRUGO, show_power_acc, NULL);
>>
>> I am not really a friend of introducing a non-standard attribute.
>> Does the energy attribute not work here ?
>>
>
> You're right. Non-standard attribute might not be good. Could you
> please give me some hints if I use "energy" instead?
>
1 Joule = 1 Watt-second.

Something else, though - did you make sure that your code doesn't overflow ?
Even though you calculate the average in an u64, you display it as unsigned.

100w * 10,000s = 1,000,000ws = 1,000,000,000,000 micro-watt-seconds, which is
a bit large for an unsigned.

Also, the values should not be reset after reading, but accumulate.

Also, I think your code may be vulnerable to overflows on the CPU register side.
How long does it take before the CPU counters overflow ?

Thanks,
Guenter

2015-08-29 09:19:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit


* Ingo Molnar <[email protected]> wrote:

>
> * Borislav Petkov <[email protected]> wrote:
>
> > On Thu, Aug 27, 2015 at 04:07:40PM +0800, Huang Rui wrote:
> > > Add an accessor function amd_get_cores_per_cu() which returns the
> > > number of cores per compute unit.
> > >
> > > In a subsequent patch, we will use this function in fam15h_power
> > > driver.
> > >
> > > Signed-off-by: Huang Rui <[email protected]>
> > > ---
> > > arch/x86/include/asm/processor.h | 1 +
> > > arch/x86/kernel/cpu/amd.c | 19 +++++++++++++++++--
> > > 2 files changed, 18 insertions(+), 2 deletions(-)
> >
> > Btw, this needs an ACK from a tip person if it goes through the hwmon
> > tree.
>
> Looks good to me in theory.
>
> I suspect we might want to factor the 'compute unit' logic out a bit more if usage
> becomes more widespread - but right now it's hwmon drivers only,right?

So let me withdraw my ack: the much more important question that I missed first
time around, why is this reporting feature living in hwmon, not in perf? We have
energy reporting facilities in perf that this should be synced to.

Thanks,

Ingo

2015-08-29 16:34:07

by Guenter Roeck

[permalink] [raw]
Subject: Re: [15/15] MAINTAINERS: change the maintainer of fam15h_power driver

On Thu, Aug 27, 2015 at 04:07:46PM +0800, Huang Rui wrote:
> Andreas Herrmann won't take the maintainer of fam15h_power driver. I
> will take it and appreciate him for the great contributions on this
> driver.
>
> Signed-off-by: Huang Rui <[email protected]>
> Cc: Andreas Herrmann <[email protected]>

Please add Andreas to CREDITS.

Andreas, can you Ack this change ?

Thanks,
Guenter

> ---
> MAINTAINERS | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 941d7b7..241d45e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -601,9 +601,9 @@ F: drivers/crypto/ccp/
> F: include/linux/ccp.h
>
> AMD FAM15H PROCESSOR POWER MONITORING DRIVER
> -M: Andreas Herrmann <[email protected]>
> +M: Huang Rui <[email protected]>
> L: [email protected]
> -S: Maintained
> +S: Supported
> F: Documentation/hwmon/fam15h_power
> F: drivers/hwmon/fam15h_power.c
>

2015-08-30 15:53:26

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On Sat, Aug 29, 2015 at 11:19:14AM +0200, Ingo Molnar wrote:
> So let me withdraw my ack: the much more important question that I
> missed first time around, why is this reporting feature living in
> hwmon, not in perf? We have energy reporting facilities in perf that
> this should be synced to.

Because there's already fam15h_power driver which is exactly for that.
Making it part of perf is then a question of cat-ting the same sysfs
file twice, at the beginning and at the end of the trace, which is
trivial.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-08-31 01:22:48

by Huang Rui

[permalink] [raw]
Subject: Re: [15/15] MAINTAINERS: change the maintainer of fam15h_power driver

On Sat, Aug 29, 2015 at 09:33:54AM -0700, Guenter Roeck wrote:
> On Thu, Aug 27, 2015 at 04:07:46PM +0800, Huang Rui wrote:
> > Andreas Herrmann won't take the maintainer of fam15h_power driver. I
> > will take it and appreciate him for the great contributions on this
> > driver.
> >
> > Signed-off-by: Huang Rui <[email protected]>
> > Cc: Andreas Herrmann <[email protected]>
>
> Please add Andreas to CREDITS.
>

OK.

N: Andreas Herrmann
E: [email protected]
E: [email protected]
D: Key developer of x86/AMD64
D: Author of AMD family 15h processor power monintoring driver
D: Maintainer of AMD Athlon 64 and Opteron processor frequency driver
S: Germany

Andreas, Boris, how about above description at CREDITS? Please let me
know if I something is wrong or missed.
BTW, could you give your snail-mail address? I just know Germany :)

Thanks,
Rui

> Andreas, can you Ack this change ?
>
> Thanks,
> Guenter
>
> > ---
> > MAINTAINERS | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 941d7b7..241d45e 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -601,9 +601,9 @@ F: drivers/crypto/ccp/
> > F: include/linux/ccp.h
> >
> > AMD FAM15H PROCESSOR POWER MONITORING DRIVER
> > -M: Andreas Herrmann <[email protected]>
> > +M: Huang Rui <[email protected]>
> > L: [email protected]
> > -S: Maintained
> > +S: Supported
> > F: Documentation/hwmon/fam15h_power
> > F: drivers/hwmon/fam15h_power.c
> >

2015-08-31 04:17:32

by Huang Rui

[permalink] [raw]
Subject: Re: [PATCH 12/15] hwmon, fam15h_power: introduce a cpu accumulated power reporting algorithm

On Fri, Aug 28, 2015 at 07:05:13AM -0700, Guenter Roeck wrote:
> On 08/28/2015 03:45 AM, Huang Rui wrote:
> >On Thu, Aug 27, 2015 at 10:30:43AM -0700, Guenter Roeck wrote:
> >>On Thu, Aug 27, 2015 at 04:07:43PM +0800, Huang Rui wrote:
> >>>This patch introduces an algorithm that computes the average power by
> >>>reading a delta value of “core power accumulator” register during
> >>>measurement interval, and then dividing delta value by the length of
> >>>the time interval.
> >>>
> >>>User is able to use power1_acc entry to measure the processor power
> >>>consumption and power1_acc just needs to be read twice with an needed
> >>>interval in-between.
> >>>
> >>>A simple example:
> >>>
> >>>$ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
> >>>$ sleep 10000s
> >>>$ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
> >>>
> >>>The result is current average processor power consumption in 10000
> >>>seconds. The unit of the result is uWatt.
> >>>
> >>>Signed-off-by: Huang Rui <[email protected]>
> >>>---
> >>> drivers/hwmon/fam15h_power.c | 62 ++++++++++++++++++++++++++++++++++++++++++++
> >>> 1 file changed, 62 insertions(+)
> >>>
> >>>diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
> >>>index d529e4b..3bab797 100644
> >>>--- a/drivers/hwmon/fam15h_power.c
> >>>+++ b/drivers/hwmon/fam15h_power.c
> >>>@@ -60,6 +60,7 @@ struct fam15h_power_data {
> >>> u64 cu_acc_power[MAX_CUS];
> >>> /* performance timestamp counter */
> >>> u64 cpu_sw_pwr_ptsc[MAX_CUS];
> >>>+ struct mutex acc_pwr_mutex;
> >>> };
> >>>
> >>> static ssize_t show_power(struct device *dev,
> >>>@@ -121,17 +122,74 @@ static DEVICE_ATTR(power1_crit, S_IRUGO, show_power_crit, NULL);
> >>> static struct attribute_group fam15h_power_group;
> >>> __ATTRIBUTE_GROUPS(fam15h_power);
> >>>
> >>>+static ssize_t show_power_acc(struct device *dev,
> >>>+ struct device_attribute *attr, char *buf)
> >>>+{
> >>>+ int cpu, cu, cu_num, cores_per_cu;
> >>>+ u64 curr_cu_acc_power[MAX_CUS],
> >>>+ curr_ptsc[MAX_CUS], jdelta[MAX_CUS];
> >>>+ u64 tdelta, avg_acc;
> >>>+ struct fam15h_power_data *data = dev_get_drvdata(dev);
> >>>+
> >>>+ cores_per_cu = amd_get_cores_per_cu();
> >>>+ cu_num = boot_cpu_data.x86_max_cores / cores_per_cu;
> >>>+
> >>>+ for (cpu = 0, avg_acc = 0; cpu < cu_num * cores_per_cu; cpu += cores_per_cu) {
> >>>+ cu = cpu / cores_per_cu;
> >>>+ if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_PTSC, &curr_ptsc[cu])) {
> >>>+ pr_err("Failed to read PTSC counter MSR on core%d\n",
> >>>+ cpu);
> >>>+ return 0;
> >>>+ }
> >>>+
> >>>+ if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_CU_PWR_ACCUMULATOR,
> >>>+ &curr_cu_acc_power[cu])) {
> >>>+ pr_err("Failed to read compute unit power accumulator MSR on core%d\n",
> >>>+ cpu);
> >>>+ return 0;
> >>>+ }
> >>>+
> >>>+ if (curr_cu_acc_power[cu] < data->cu_acc_power[cu]) {
> >>>+ jdelta[cu] = data->max_cu_acc_power + curr_cu_acc_power[cu];
> >>>+ jdelta[cu] -= data->cu_acc_power[cu];
> >>>+ } else {
> >>>+ jdelta[cu] = curr_cu_acc_power[cu] - data->cu_acc_power[cu];
> >>>+ }
> >>>+ tdelta = curr_ptsc[cu] - data->cpu_sw_pwr_ptsc[cu];
> >>>+ jdelta[cu] *= data->cpu_pwr_sample_ratio * 1000;
> >>>+ do_div(jdelta[cu], tdelta);
> >>>+
> >>>+ mutex_lock(&data->acc_pwr_mutex);
> >>>+ data->cu_acc_power[cu] = curr_cu_acc_power[cu];
> >>>+ data->cpu_sw_pwr_ptsc[cu] = curr_ptsc[cu];
> >>>+ mutex_unlock(&data->acc_pwr_mutex);
> >>>+
> >>>+ /* the unit is microWatt */
> >>>+ avg_acc += jdelta[cu];
> >>>+ }
> >>>+
> >>>+ return sprintf(buf, "%u\n", (unsigned int) avg_acc);
> >>>+}
> >>>+static DEVICE_ATTR(power1_acc, S_IRUGO, show_power_acc, NULL);
> >>
> >>I am not really a friend of introducing a non-standard attribute.
> >>Does the energy attribute not work here ?
> >>
> >
> >You're right. Non-standard attribute might not be good. Could you
> >please give me some hints if I use "energy" instead?
> >
> 1 Joule = 1 Watt-second.
>
> Something else, though - did you make sure that your code doesn't overflow ?
> Even though you calculate the average in an u64, you display it as unsigned.
>

Thanks to your reminder. It should not be overflow. The maximum power
consumption of processor (AMD CZ and future 15h) is about 15 Watts =
15,000,000 uWatts = 0xE4E1C0 uWatts, the size is 24 < 32 < 64 bits.

Actually, the unit of jdelta is not Joule. Because the tdelta is the
loops (cycles) that PTSC counter (the freqency is about 100 MHz)
counts not seconds.

So avg_acc is the average power consumption not the accumulated energy.

> 100w * 10,000s = 1,000,000ws = 1,000,000,000,000 micro-watt-seconds, which is
> a bit large for an unsigned.
>
> Also, the values should not be reset after reading, but accumulate.
>
> Also, I think your code may be vulnerable to overflows on the CPU register side.
> How long does it take before the CPU counters overflow ?
>

If I use "energy", 15w * 10,000s = 150,000,000,000 microWatt-seconds.
Yes, it's large for an unsigned, but suitable for u64.

The accumulated power of one compute unit is recorded at 64bit MSR.

Let me calculate the extreme case that how long does it take before
overflow:

Use power consumption 15w, max power 2^64 = 1.8 * 10^19
mWatt-ptsc_loops, and PTSC freqency 100 MHz:

1.8 * 10^19 = (15,000) * (Tmax/Tcycle)
1.8 * 10^19 = (15,000) * (Tmax * PTSC_Freq)
1.8 * 10^19 = (15,000) * (Tmax * 100,000,000)
Tmax = 1.2 * 10^7 seconds

Thanks
Rui

2015-08-31 04:30:37

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 12/15] hwmon, fam15h_power: introduce a cpu accumulated power reporting algorithm

On 08/30/2015 09:16 PM, Huang Rui wrote:
> On Fri, Aug 28, 2015 at 07:05:13AM -0700, Guenter Roeck wrote:
>> On 08/28/2015 03:45 AM, Huang Rui wrote:
>>> On Thu, Aug 27, 2015 at 10:30:43AM -0700, Guenter Roeck wrote:
>>>> On Thu, Aug 27, 2015 at 04:07:43PM +0800, Huang Rui wrote:
>>>>> This patch introduces an algorithm that computes the average power by
>>>>> reading a delta value of “core power accumulator” register during
>>>>> measurement interval, and then dividing delta value by the length of
>>>>> the time interval.
>>>>>
>>>>> User is able to use power1_acc entry to measure the processor power
>>>>> consumption and power1_acc just needs to be read twice with an needed
>>>>> interval in-between.
>>>>>
>>>>> A simple example:
>>>>>
>>>>> $ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
>>>>> $ sleep 10000s
>>>>> $ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
>>>>>
>>>>> The result is current average processor power consumption in 10000
>>>>> seconds. The unit of the result is uWatt.
>>>>>
>>>>> Signed-off-by: Huang Rui <[email protected]>
>>>>> ---
>>>>> drivers/hwmon/fam15h_power.c | 62 ++++++++++++++++++++++++++++++++++++++++++++
>>>>> 1 file changed, 62 insertions(+)
>>>>>
>>>>> diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
>>>>> index d529e4b..3bab797 100644
>>>>> --- a/drivers/hwmon/fam15h_power.c
>>>>> +++ b/drivers/hwmon/fam15h_power.c
>>>>> @@ -60,6 +60,7 @@ struct fam15h_power_data {
>>>>> u64 cu_acc_power[MAX_CUS];
>>>>> /* performance timestamp counter */
>>>>> u64 cpu_sw_pwr_ptsc[MAX_CUS];
>>>>> + struct mutex acc_pwr_mutex;
>>>>> };
>>>>>
>>>>> static ssize_t show_power(struct device *dev,
>>>>> @@ -121,17 +122,74 @@ static DEVICE_ATTR(power1_crit, S_IRUGO, show_power_crit, NULL);
>>>>> static struct attribute_group fam15h_power_group;
>>>>> __ATTRIBUTE_GROUPS(fam15h_power);
>>>>>
>>>>> +static ssize_t show_power_acc(struct device *dev,
>>>>> + struct device_attribute *attr, char *buf)
>>>>> +{
>>>>> + int cpu, cu, cu_num, cores_per_cu;
>>>>> + u64 curr_cu_acc_power[MAX_CUS],
>>>>> + curr_ptsc[MAX_CUS], jdelta[MAX_CUS];
>>>>> + u64 tdelta, avg_acc;
>>>>> + struct fam15h_power_data *data = dev_get_drvdata(dev);
>>>>> +
>>>>> + cores_per_cu = amd_get_cores_per_cu();
>>>>> + cu_num = boot_cpu_data.x86_max_cores / cores_per_cu;
>>>>> +
>>>>> + for (cpu = 0, avg_acc = 0; cpu < cu_num * cores_per_cu; cpu += cores_per_cu) {
>>>>> + cu = cpu / cores_per_cu;
>>>>> + if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_PTSC, &curr_ptsc[cu])) {
>>>>> + pr_err("Failed to read PTSC counter MSR on core%d\n",
>>>>> + cpu);
>>>>> + return 0;
>>>>> + }
>>>>> +
>>>>> + if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_CU_PWR_ACCUMULATOR,
>>>>> + &curr_cu_acc_power[cu])) {
>>>>> + pr_err("Failed to read compute unit power accumulator MSR on core%d\n",
>>>>> + cpu);
>>>>> + return 0;
>>>>> + }
>>>>> +
>>>>> + if (curr_cu_acc_power[cu] < data->cu_acc_power[cu]) {
>>>>> + jdelta[cu] = data->max_cu_acc_power + curr_cu_acc_power[cu];
>>>>> + jdelta[cu] -= data->cu_acc_power[cu];
>>>>> + } else {
>>>>> + jdelta[cu] = curr_cu_acc_power[cu] - data->cu_acc_power[cu];
>>>>> + }
>>>>> + tdelta = curr_ptsc[cu] - data->cpu_sw_pwr_ptsc[cu];
>>>>> + jdelta[cu] *= data->cpu_pwr_sample_ratio * 1000;
>>>>> + do_div(jdelta[cu], tdelta);
>>>>> +
>>>>> + mutex_lock(&data->acc_pwr_mutex);
>>>>> + data->cu_acc_power[cu] = curr_cu_acc_power[cu];
>>>>> + data->cpu_sw_pwr_ptsc[cu] = curr_ptsc[cu];
>>>>> + mutex_unlock(&data->acc_pwr_mutex);
>>>>> +
>>>>> + /* the unit is microWatt */
>>>>> + avg_acc += jdelta[cu];
>>>>> + }
>>>>> +
>>>>> + return sprintf(buf, "%u\n", (unsigned int) avg_acc);
>>>>> +}
>>>>> +static DEVICE_ATTR(power1_acc, S_IRUGO, show_power_acc, NULL);
>>>>
>>>> I am not really a friend of introducing a non-standard attribute.
>>>> Does the energy attribute not work here ?
>>>>
>>>
>>> You're right. Non-standard attribute might not be good. Could you
>>> please give me some hints if I use "energy" instead?
>>>
>> 1 Joule = 1 Watt-second.
>>
>> Something else, though - did you make sure that your code doesn't overflow ?
>> Even though you calculate the average in an u64, you display it as unsigned.
>>
>
> Thanks to your reminder. It should not be overflow. The maximum power
> consumption of processor (AMD CZ and future 15h) is about 15 Watts =
> 15,000,000 uWatts = 0xE4E1C0 uWatts, the size is 24 < 32 < 64 bits.
>
> Actually, the unit of jdelta is not Joule. Because the tdelta is the
> loops (cycles) that PTSC counter (the freqency is about 100 MHz)
> counts not seconds.
>
> So avg_acc is the average power consumption not the accumulated energy.
>

Would power1_average then be better suitable for the attribute ?
There is also power1_average_interval which could be used to make
the interval configurable.

Thanks,
Guenter

2015-08-31 08:39:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On Sun, Aug 30, 2015 at 05:53:22PM +0200, Borislav Petkov wrote:
> On Sat, Aug 29, 2015 at 11:19:14AM +0200, Ingo Molnar wrote:
> > So let me withdraw my ack: the much more important question that I
> > missed first time around, why is this reporting feature living in
> > hwmon, not in perf? We have energy reporting facilities in perf that
> > this should be synced to.
>
> Because there's already fam15h_power driver which is exactly for that.
> Making it part of perf is then a question of cat-ting the same sysfs
> file twice, at the beginning and at the end of the trace, which is
> trivial.

That don't make sense.

Looking at the BKDG Fam 15h 60h-6Fh these MSRs are per compute unit.
This means you can do much finer grained measurements than system wide
-- which is all hwmon seems capable of.

Not to mention the proposed code is horrible, who in their right mind
does two rdmsrl_safe_on_cpu() back to back.

2015-08-31 13:10:43

by Huang Rui

[permalink] [raw]
Subject: Re: [PATCH 12/15] hwmon, fam15h_power: introduce a cpu accumulated power reporting algorithm

On Sun, Aug 30, 2015 at 09:30:28PM -0700, Guenter Roeck wrote:
> On 08/30/2015 09:16 PM, Huang Rui wrote:
> >On Fri, Aug 28, 2015 at 07:05:13AM -0700, Guenter Roeck wrote:
> >>On 08/28/2015 03:45 AM, Huang Rui wrote:
> >>>On Thu, Aug 27, 2015 at 10:30:43AM -0700, Guenter Roeck wrote:
> >>>>On Thu, Aug 27, 2015 at 04:07:43PM +0800, Huang Rui wrote:
> >>>>>This patch introduces an algorithm that computes the average power by
> >>>>>reading a delta value of “core power accumulator” register during
> >>>>>measurement interval, and then dividing delta value by the length of
> >>>>>the time interval.
> >>>>>
> >>>>>User is able to use power1_acc entry to measure the processor power
> >>>>>consumption and power1_acc just needs to be read twice with an needed
> >>>>>interval in-between.
> >>>>>
> >>>>>A simple example:
> >>>>>
> >>>>>$ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
> >>>>>$ sleep 10000s
> >>>>>$ cat /sys/bus/pci/devices/0000\:00\:18.4/hwmon/hwmon0/power1_acc
> >>>>>
> >>>>>The result is current average processor power consumption in 10000
> >>>>>seconds. The unit of the result is uWatt.
> >>>>>
> >>>>>Signed-off-by: Huang Rui <[email protected]>
> >>>>>---
> >>>>> drivers/hwmon/fam15h_power.c | 62 ++++++++++++++++++++++++++++++++++++++++++++
> >>>>> 1 file changed, 62 insertions(+)
> >>>>>
> >>>>>diff --git a/drivers/hwmon/fam15h_power.c b/drivers/hwmon/fam15h_power.c
> >>>>>index d529e4b..3bab797 100644
> >>>>>--- a/drivers/hwmon/fam15h_power.c
> >>>>>+++ b/drivers/hwmon/fam15h_power.c
> >>>>>@@ -60,6 +60,7 @@ struct fam15h_power_data {
> >>>>> u64 cu_acc_power[MAX_CUS];
> >>>>> /* performance timestamp counter */
> >>>>> u64 cpu_sw_pwr_ptsc[MAX_CUS];
> >>>>>+ struct mutex acc_pwr_mutex;
> >>>>> };
> >>>>>
> >>>>> static ssize_t show_power(struct device *dev,
> >>>>>@@ -121,17 +122,74 @@ static DEVICE_ATTR(power1_crit, S_IRUGO, show_power_crit, NULL);
> >>>>> static struct attribute_group fam15h_power_group;
> >>>>> __ATTRIBUTE_GROUPS(fam15h_power);
> >>>>>
> >>>>>+static ssize_t show_power_acc(struct device *dev,
> >>>>>+ struct device_attribute *attr, char *buf)
> >>>>>+{
> >>>>>+ int cpu, cu, cu_num, cores_per_cu;
> >>>>>+ u64 curr_cu_acc_power[MAX_CUS],
> >>>>>+ curr_ptsc[MAX_CUS], jdelta[MAX_CUS];
> >>>>>+ u64 tdelta, avg_acc;
> >>>>>+ struct fam15h_power_data *data = dev_get_drvdata(dev);
> >>>>>+
> >>>>>+ cores_per_cu = amd_get_cores_per_cu();
> >>>>>+ cu_num = boot_cpu_data.x86_max_cores / cores_per_cu;
> >>>>>+
> >>>>>+ for (cpu = 0, avg_acc = 0; cpu < cu_num * cores_per_cu; cpu += cores_per_cu) {
> >>>>>+ cu = cpu / cores_per_cu;
> >>>>>+ if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_PTSC, &curr_ptsc[cu])) {
> >>>>>+ pr_err("Failed to read PTSC counter MSR on core%d\n",
> >>>>>+ cpu);
> >>>>>+ return 0;
> >>>>>+ }
> >>>>>+
> >>>>>+ if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_CU_PWR_ACCUMULATOR,
> >>>>>+ &curr_cu_acc_power[cu])) {
> >>>>>+ pr_err("Failed to read compute unit power accumulator MSR on core%d\n",
> >>>>>+ cpu);
> >>>>>+ return 0;
> >>>>>+ }
> >>>>>+
> >>>>>+ if (curr_cu_acc_power[cu] < data->cu_acc_power[cu]) {
> >>>>>+ jdelta[cu] = data->max_cu_acc_power + curr_cu_acc_power[cu];
> >>>>>+ jdelta[cu] -= data->cu_acc_power[cu];
> >>>>>+ } else {
> >>>>>+ jdelta[cu] = curr_cu_acc_power[cu] - data->cu_acc_power[cu];
> >>>>>+ }
> >>>>>+ tdelta = curr_ptsc[cu] - data->cpu_sw_pwr_ptsc[cu];
> >>>>>+ jdelta[cu] *= data->cpu_pwr_sample_ratio * 1000;
> >>>>>+ do_div(jdelta[cu], tdelta);
> >>>>>+
> >>>>>+ mutex_lock(&data->acc_pwr_mutex);
> >>>>>+ data->cu_acc_power[cu] = curr_cu_acc_power[cu];
> >>>>>+ data->cpu_sw_pwr_ptsc[cu] = curr_ptsc[cu];
> >>>>>+ mutex_unlock(&data->acc_pwr_mutex);
> >>>>>+
> >>>>>+ /* the unit is microWatt */
> >>>>>+ avg_acc += jdelta[cu];
> >>>>>+ }
> >>>>>+
> >>>>>+ return sprintf(buf, "%u\n", (unsigned int) avg_acc);
> >>>>>+}
> >>>>>+static DEVICE_ATTR(power1_acc, S_IRUGO, show_power_acc, NULL);
> >>>>
> >>>>I am not really a friend of introducing a non-standard attribute.
> >>>>Does the energy attribute not work here ?
> >>>>
> >>>
> >>>You're right. Non-standard attribute might not be good. Could you
> >>>please give me some hints if I use "energy" instead?
> >>>
> >>1 Joule = 1 Watt-second.
> >>
> >>Something else, though - did you make sure that your code doesn't overflow ?
> >>Even though you calculate the average in an u64, you display it as unsigned.
> >>
> >
> >Thanks to your reminder. It should not be overflow. The maximum power
> >consumption of processor (AMD CZ and future 15h) is about 15 Watts =
> >15,000,000 uWatts = 0xE4E1C0 uWatts, the size is 24 < 32 < 64 bits.
> >
> >Actually, the unit of jdelta is not Joule. Because the tdelta is the
> >loops (cycles) that PTSC counter (the freqency is about 100 MHz)
> >counts not seconds.
> >
> >So avg_acc is the average power consumption not the accumulated energy.
> >
>
> Would power1_average then be better suitable for the attribute ?
> There is also power1_average_interval which could be used to make
> the interval configurable.
>

power1_average and power1_average_interval looks better except we
might use a conversion that make interval from milliseconds to PTSC
cycles instead of "cat-ting" the sysfs file twice, right?

Thanks,
Rui

2015-08-31 13:25:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 12/15] hwmon, fam15h_power: introduce a cpu accumulated power reporting algorithm

On Thu, Aug 27, 2015 at 04:07:43PM +0800, Huang Rui wrote:
> +static ssize_t show_power_acc(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + int cpu, cu, cu_num, cores_per_cu;
> + u64 curr_cu_acc_power[MAX_CUS],
> + curr_ptsc[MAX_CUS], jdelta[MAX_CUS];
> + u64 tdelta, avg_acc;
> + struct fam15h_power_data *data = dev_get_drvdata(dev);
> +
> + cores_per_cu = amd_get_cores_per_cu();
> + cu_num = boot_cpu_data.x86_max_cores / cores_per_cu;
> +
> + for (cpu = 0, avg_acc = 0; cpu < cu_num * cores_per_cu; cpu += cores_per_cu) {
> + cu = cpu / cores_per_cu;
> + if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_PTSC, &curr_ptsc[cu])) {
> + pr_err("Failed to read PTSC counter MSR on core%d\n",
> + cpu);
> + return 0;
> + }
> +
> + if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_CU_PWR_ACCUMULATOR,
> + &curr_cu_acc_power[cu])) {
> + pr_err("Failed to read compute unit power accumulator MSR on core%d\n",
> + cpu);
> + return 0;
> + }
> +
> + if (curr_cu_acc_power[cu] < data->cu_acc_power[cu]) {
> + jdelta[cu] = data->max_cu_acc_power + curr_cu_acc_power[cu];
> + jdelta[cu] -= data->cu_acc_power[cu];
> + } else {
> + jdelta[cu] = curr_cu_acc_power[cu] - data->cu_acc_power[cu];
> + }
> + tdelta = curr_ptsc[cu] - data->cpu_sw_pwr_ptsc[cu];
> + jdelta[cu] *= data->cpu_pwr_sample_ratio * 1000;
> + do_div(jdelta[cu], tdelta);
> +
> + mutex_lock(&data->acc_pwr_mutex);
> + data->cu_acc_power[cu] = curr_cu_acc_power[cu];
> + data->cpu_sw_pwr_ptsc[cu] = curr_ptsc[cu];
> + mutex_unlock(&data->acc_pwr_mutex);
> +
> + /* the unit is microWatt */
> + avg_acc += jdelta[cu];
> + }
> +
> + return sprintf(buf, "%u\n", (unsigned int) avg_acc);
> +}
> +static DEVICE_ATTR(power1_acc, S_IRUGO, show_power_acc, NULL);

This is a world readable file that sprays IPIs across the entire system,
how is that a good idea?

2015-08-31 13:26:09

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On 08/31/2015 01:38 AM, Peter Zijlstra wrote:
> On Sun, Aug 30, 2015 at 05:53:22PM +0200, Borislav Petkov wrote:
>> On Sat, Aug 29, 2015 at 11:19:14AM +0200, Ingo Molnar wrote:
>>> So let me withdraw my ack: the much more important question that I
>>> missed first time around, why is this reporting feature living in
>>> hwmon, not in perf? We have energy reporting facilities in perf that
>>> this should be synced to.
>>
>> Because there's already fam15h_power driver which is exactly for that.
>> Making it part of perf is then a question of cat-ting the same sysfs
>> file twice, at the beginning and at the end of the trace, which is
>> trivial.
>
> That don't make sense.
>
> Looking at the BKDG Fam 15h 60h-6Fh these MSRs are per compute unit.
> This means you can do much finer grained measurements than system wide
> -- which is all hwmon seems capable of.
>

Is it ? Why ?

Guenter

2015-08-31 13:38:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On Mon, Aug 31, 2015 at 06:26:00AM -0700, Guenter Roeck wrote:
> On 08/31/2015 01:38 AM, Peter Zijlstra wrote:
> >On Sun, Aug 30, 2015 at 05:53:22PM +0200, Borislav Petkov wrote:
> >>On Sat, Aug 29, 2015 at 11:19:14AM +0200, Ingo Molnar wrote:
> >>>So let me withdraw my ack: the much more important question that I
> >>>missed first time around, why is this reporting feature living in
> >>>hwmon, not in perf? We have energy reporting facilities in perf that
> >>>this should be synced to.
> >>
> >>Because there's already fam15h_power driver which is exactly for that.
> >>Making it part of perf is then a question of cat-ting the same sysfs
> >>file twice, at the beginning and at the end of the trace, which is
> >>trivial.
> >
> >That don't make sense.
> >
> >Looking at the BKDG Fam 15h 60h-6Fh these MSRs are per compute unit.
> >This means you can do much finer grained measurements than system wide
> >-- which is all hwmon seems capable of.
> >
>
> Is it ? Why ?

Dunno, because there's that big old loop iterating all cpus?

2015-08-31 13:54:06

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On 08/31/2015 06:38 AM, Peter Zijlstra wrote:
> On Mon, Aug 31, 2015 at 06:26:00AM -0700, Guenter Roeck wrote:
>> On 08/31/2015 01:38 AM, Peter Zijlstra wrote:
>>> On Sun, Aug 30, 2015 at 05:53:22PM +0200, Borislav Petkov wrote:
>>>> On Sat, Aug 29, 2015 at 11:19:14AM +0200, Ingo Molnar wrote:
>>>>> So let me withdraw my ack: the much more important question that I
>>>>> missed first time around, why is this reporting feature living in
>>>>> hwmon, not in perf? We have energy reporting facilities in perf that
>>>>> this should be synced to.
>>>>
>>>> Because there's already fam15h_power driver which is exactly for that.
>>>> Making it part of perf is then a question of cat-ting the same sysfs
>>>> file twice, at the beginning and at the end of the trace, which is
>>>> trivial.
>>>
>>> That don't make sense.
>>>
>>> Looking at the BKDG Fam 15h 60h-6Fh these MSRs are per compute unit.
>>> This means you can do much finer grained measurements than system wide
>>> -- which is all hwmon seems capable of.
>>>
>>
>> Is it ? Why ?
>
> Dunno, because there's that big old loop iterating all cpus?
>

What does that have to do with 'hwmon' ? The current implementation in the
driver may not be a good idea, and maybe for good reasons; I can not
comment on that. However, you concluded from that implementation that
hwmon, the subsystem, would not be able to support 'much finer grained
measurements than system wide'. I would like to understand how you reached
that conclusion.

Thanks,
Guenter

2015-08-31 14:57:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On Mon, Aug 31, 2015 at 06:53:58AM -0700, Guenter Roeck wrote:

> What does that have to do with 'hwmon' ? The current implementation in the
> driver may not be a good idea, and maybe for good reasons; I can not
> comment on that. However, you concluded from that implementation that
> hwmon, the subsystem, would not be able to support 'much finer grained
> measurements than system wide'. I would like to understand how you reached
> that conclusion.

Purely going on what this driver does here. I assumed it was
representative of things, if not then good.

2015-08-31 14:59:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 12/15] hwmon, fam15h_power: introduce a cpu accumulated power reporting algorithm

On Mon, Aug 31, 2015 at 03:25:35PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 27, 2015 at 04:07:43PM +0800, Huang Rui wrote:
> > +static ssize_t show_power_acc(struct device *dev,
> > + struct device_attribute *attr, char *buf)
> > +{
> > + int cpu, cu, cu_num, cores_per_cu;
> > + u64 curr_cu_acc_power[MAX_CUS],
> > + curr_ptsc[MAX_CUS], jdelta[MAX_CUS];
> > + u64 tdelta, avg_acc;
> > + struct fam15h_power_data *data = dev_get_drvdata(dev);
> > +
> > + cores_per_cu = amd_get_cores_per_cu();
> > + cu_num = boot_cpu_data.x86_max_cores / cores_per_cu;
> > +
> > + for (cpu = 0, avg_acc = 0; cpu < cu_num * cores_per_cu; cpu += cores_per_cu) {
> > + cu = cpu / cores_per_cu;
> > + if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_PTSC, &curr_ptsc[cu])) {
> > + pr_err("Failed to read PTSC counter MSR on core%d\n",
> > + cpu);
> > + return 0;
> > + }
> > +
> > + if (rdmsrl_safe_on_cpu(cpu, MSR_F15H_CU_PWR_ACCUMULATOR,
> > + &curr_cu_acc_power[cu])) {
> > + pr_err("Failed to read compute unit power accumulator MSR on core%d\n",
> > + cpu);
> > + return 0;
> > + }
> > +
> > + if (curr_cu_acc_power[cu] < data->cu_acc_power[cu]) {
> > + jdelta[cu] = data->max_cu_acc_power + curr_cu_acc_power[cu];
> > + jdelta[cu] -= data->cu_acc_power[cu];
> > + } else {
> > + jdelta[cu] = curr_cu_acc_power[cu] - data->cu_acc_power[cu];
> > + }
> > + tdelta = curr_ptsc[cu] - data->cpu_sw_pwr_ptsc[cu];
> > + jdelta[cu] *= data->cpu_pwr_sample_ratio * 1000;
> > + do_div(jdelta[cu], tdelta);
> > +
> > + mutex_lock(&data->acc_pwr_mutex);
> > + data->cu_acc_power[cu] = curr_cu_acc_power[cu];
> > + data->cpu_sw_pwr_ptsc[cu] = curr_ptsc[cu];
> > + mutex_unlock(&data->acc_pwr_mutex);
> > +
> > + /* the unit is microWatt */
> > + avg_acc += jdelta[cu];
> > + }
> > +
> > + return sprintf(buf, "%u\n", (unsigned int) avg_acc);
> > +}
> > +static DEVICE_ATTR(power1_acc, S_IRUGO, show_power_acc, NULL);
>
> This is a world readable file that sprays IPIs across the entire system,
> how is that a good idea?

FWIW this driver is also hotplug challenged. Not to mention it relies on
cpu number layout in unhealthy ways.

2015-08-31 15:11:59

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On 08/31/2015 07:57 AM, Peter Zijlstra wrote:
> On Mon, Aug 31, 2015 at 06:53:58AM -0700, Guenter Roeck wrote:
>
>> What does that have to do with 'hwmon' ? The current implementation in the
>> driver may not be a good idea, and maybe for good reasons; I can not
>> comment on that. However, you concluded from that implementation that
>> hwmon, the subsystem, would not be able to support 'much finer grained
>> measurements than system wide'. I would like to understand how you reached
>> that conclusion.
>
> Purely going on what this driver does here. I assumed it was
> representative of things, if not then good.
>
Corollary: If a driver in subsystem X doesn't support interrupts,
it is a fair conclusion that subsystem X doesn't support interrupts.

The hwmon subsystem may have its limitations, and for sure would deserve
an overhaul, but at the same time people should refrain from bashing it
for no good reason.

Guenter

2015-08-31 15:19:48

by Andreas Herrmann

[permalink] [raw]
Subject: Re: [15/15] MAINTAINERS: change the maintainer of fam15h_power driver

On Sat, Aug 29, 2015 at 09:33:54AM -0700, Guenter Roeck wrote:
> On Thu, Aug 27, 2015 at 04:07:46PM +0800, Huang Rui wrote:
> > Andreas Herrmann won't take the maintainer of fam15h_power driver. I
> > will take it and appreciate him for the great contributions on this
> > driver.
> >
> > Signed-off-by: Huang Rui <[email protected]>
> > Cc: Andreas Herrmann <[email protected]>
>
> Please add Andreas to CREDITS.
>
> Andreas, can you Ack this change ?

Acked-by: Andreas Herrmann <[email protected]>

Thanks,
Andreas

> Thanks,
> Guenter
>
> > ---
> > MAINTAINERS | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 941d7b7..241d45e 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -601,9 +601,9 @@ F: drivers/crypto/ccp/
> > F: include/linux/ccp.h
> >
> > AMD FAM15H PROCESSOR POWER MONITORING DRIVER
> > -M: Andreas Herrmann <[email protected]>
> > +M: Huang Rui <[email protected]>
> > L: [email protected]
> > -S: Maintained
> > +S: Supported
> > F: Documentation/hwmon/fam15h_power
> > F: drivers/hwmon/fam15h_power.c
> >

2015-08-31 16:06:15

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On Mon, Aug 31, 2015 at 10:38:21AM +0200, Peter Zijlstra wrote:
> Looking at the BKDG Fam 15h 60h-6Fh these MSRs are per compute unit.
> This means you can do much finer grained measurements than system wide

Well, we can do finer-grained if needed. I'm all for everything which
has a good use case. The use case we had in mind here was the physical
processor power consumption for a time period.

> -- which is all hwmon seems capable of.

I guess we can do both - perf and hwmon. I don't see why not.

> Not to mention the proposed code is horrible, who in their right mind
> does two rdmsrl_safe_on_cpu() back to back.

That's a good point - I missed that during previous review. Rui, please
put the rdmsrl_safe_on_cpu() accesses in a separate function which you
run on a particular CPU, for your next version.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-08-31 16:19:28

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On 08/31/2015 09:06 AM, Borislav Petkov wrote:
> On Mon, Aug 31, 2015 at 10:38:21AM +0200, Peter Zijlstra wrote:
>> Looking at the BKDG Fam 15h 60h-6Fh these MSRs are per compute unit.
>> This means you can do much finer grained measurements than system wide
>
> Well, we can do finer-grained if needed. I'm all for everything which
> has a good use case. The use case we had in mind here was the physical
> processor power consumption for a time period.
>
>> -- which is all hwmon seems capable of.
>
> I guess we can do both - perf and hwmon. I don't see why not.
>
>> Not to mention the proposed code is horrible, who in their right mind
>> does two rdmsrl_safe_on_cpu() back to back.
>
> That's a good point - I missed that during previous review. Rui, please
> put the rdmsrl_safe_on_cpu() accesses in a separate function which you
> run on a particular CPU, for your next version.
>
... and maybe work with Peter to address the other hotplug related issues.

It might also be worthwhile thinking about per-CU attributes, if that
provides any value (Peter's comments suggested that this might be the case).

Thanks,
Guenter

2015-08-31 20:44:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On Mon, Aug 31, 2015 at 09:19:20AM -0700, Guenter Roeck wrote:
> On 08/31/2015 09:06 AM, Borislav Petkov wrote:

> >That's a good point - I missed that during previous review. Rui, please
> >put the rdmsrl_safe_on_cpu() accesses in a separate function which you
> >run on a particular CPU, for your next version.
> >
> ... and maybe work with Peter to address the other hotplug related issues.
>
> It might also be worthwhile thinking about per-CU attributes, if that
> provides any value (Peter's comments suggested that this might be the case).

Yeah, so it would allow measuring the power of a subset of compute
units. Typically only useful if you've partitioned your workload. But
since the hardware trivially supports it, its a waste to not expose it.

(Note that its not per-cpu, its per compute unit. What we do with perf
is export a cpumask)

My biggest problem is that all this is user readable and unthrottled. It
basically allows DoS (perf does not typically allow user access to CPU
wide resources).

Imagine joe user doing:

for ((i=0; i<1000; i++)); do
(while :; do cat /sys/foo/file > /dev/null ; done) &
done

Even when contained to a subset of CPUs, that will cause an IPI storm on
all (/2) CPUs, even if you've tried really hard to keep users away from
some of them (see the above partitioning) because you're running some
important RT workload or whatnot.


As to hotplug, if you unplug any of the even numbered CPUs the whole
thing bails and returns 0, even if the corresponding odd CPU of the
compute unit it still online and perfectly capable of accessing the MSR.


As to relying on CPU numbering, maybe I should go write an APICID -> cpu
number randomizer, just for kicks to see what else fails.

We have topology information and cpumasks aplenty for things like this.

2015-08-31 21:24:40

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 09/15] x86, amd: add accessor for number of cores per compute unit

On 08/31/2015 01:44 PM, Peter Zijlstra wrote:
> On Mon, Aug 31, 2015 at 09:19:20AM -0700, Guenter Roeck wrote:
>> On 08/31/2015 09:06 AM, Borislav Petkov wrote:
>
>>> That's a good point - I missed that during previous review. Rui, please
>>> put the rdmsrl_safe_on_cpu() accesses in a separate function which you
>>> run on a particular CPU, for your next version.
>>>
>> ... and maybe work with Peter to address the other hotplug related issues.
>>
>> It might also be worthwhile thinking about per-CU attributes, if that
>> provides any value (Peter's comments suggested that this might be the case).
>
> Yeah, so it would allow measuring the power of a subset of compute
> units. Typically only useful if you've partitioned your workload. But
> since the hardware trivially supports it, its a waste to not expose it.
>
> (Note that its not per-cpu, its per compute unit. What we do with perf
> is export a cpumask)
>
> My biggest problem is that all this is user readable and unthrottled. It
> basically allows DoS (perf does not typically allow user access to CPU
> wide resources).
>
> Imagine joe user doing:
>
> for ((i=0; i<1000; i++)); do
> (while :; do cat /sys/foo/file > /dev/null ; done) &
> done
>
> Even when contained to a subset of CPUs, that will cause an IPI storm on
> all (/2) CPUs, even if you've tried really hard to keep users away from
> some of them (see the above partitioning) because you're running some
> important RT workload or whatnot.
>

That is a matter of driver implementation. Very commonly hwmon drivers
implement value caching, where data from the device is only read at
minimum intervals, and the cached values are reported if the information
is polled too rapidly. For the most part, that is used with i2c temperature
sensors, which tend to update their readings only a few times per second
anyway, but the same mechanism can (and possibly should) be used in
situations like this.

At some point I would like to move the caching mechanism into the hwmon core,
but that is going to take a while.

>
> As to hotplug, if you unplug any of the even numbered CPUs the whole
> thing bails and returns 0, even if the corresponding odd CPU of the
> compute unit it still online and perfectly capable of accessing the MSR.
>

Ah yes, I recall there was a similar problem in the coretemp driver
some time ago.

Guenter