2009-11-20 08:15:24

by Clemens Ladisch

[permalink] [raw]
Subject: [PATCH] k10temp: temperature sensor for AMD Family 10h/11h CPUs

This adds a driver for the internal temperature sensor of AMD Family 10h
and 11h CPUs.

Signed-off-by: Clemens Ladisch <[email protected]>
---
Documentation/hwmon/k10temp | 54 +++++++++++++++++
drivers/hwmon/Kconfig | 12 +++
drivers/hwmon/Makefile | 1
drivers/hwmon/k10temp.c | 135 ++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 202 insertions(+)

--- /dev/null
+++ linux-2.6/Documentation/hwmon/k10temp
@@ -0,0 +1,54 @@
+Kernel driver k10temp
+=====================
+
+Supported chips:
+* AMD Family 10h processors:
+ Socket F: Quad-Core/Six-Core/Embedded AMD Opteron
+ Socket AM2+: Phenom II X3/X4
+ Socket AM3: Quad-Core Opteron, Athlon/Phenom II X2/X3/X4, Sempron II
+ mobile: Athlon II, Sempron, Turion II
+
+ Prefix: 'k10temp'
+ Addresses scanned: PCI space
+ Datasheets: see http://developer.amd.com/documentation/guides/
+ #31116: BIOS and Kernel Developer's Guide For AMD Family 10h Processors
+ #41322: Revision Guide for AMD Family 10h Processors
+
+* AMD Family 11h processors:
+ mobile: Athlon (X2), Sempron (X2), Turion X2 (Ultra)
+
+ Prefix: 'k10temp'
+ Addresses scanned: PCI space
+ Datasheets: see http://developer.amd.com/documentation/guides/
+ #41256: BIOS and Kernel Developer's Guide For AMD Family 11h Processors
+ #41788: Revision Guide for AMD Family 11h Processors
+
+Author: Clemens Ladisch <[email protected]>
+
+Description
+-----------
+
+This driver permits reading of the internal temperature sensor of AMD
+Family 10h and 11h processors.
+
+All these processors have a sensor, but on older revisions of Family 10h
+processors, the sensor returns inconsistent values (erratum 319). The driver
+refuses to load with these revisions (DR-BA, DR-B2, DR-B3: some Embedded
+Opterons on Socket F; and Quad-Core Opteron, Phenom Triple/Quad-Core, and
+Athon Dual-Core on Socket AM2+). All later revisions (RB-C2, BL-C2, DA-C2,
+RB-C3, HY-D0) work fine; see the list above.
+
+There is one temperature value, available as temp1_input in sysfs. It is
+measured in degrees Celsius with a resolution of 1/8th degree. Please note
+that it is defined as a relative value; to quote the AMD manual:
+
+ Tctl is the processor temperatur control value, used by the platform to
+ control cooling systems. Tctl is a non-physical temperature on an arbitrary
+ scale measured in degrees. It does _not_ represent an actual physical
+ temperature like die or case temperature. Instead, it specifies the
+ processor temperature relative to the point at which the system must supply
+ the maximum cooling for the processor's specified maximum case temperature
+ and maximum thermal power dissipation.
+
+The maximum value for Tctl is usually defined as 70 degrees, so, as a rule of
+thumb, this value should not exceed 60 degrees.
--- linux-2.6/drivers/hwmon/Kconfig
+++ linux-2.6/drivers/hwmon/Kconfig
@@ -222,6 +222,18 @@ config SENSORS_K8TEMP
This driver can also be built as a module. If so, the module
will be called k8temp.

+config SENSORS_K10TEMP
+ tristate "AMD Phenom/Sempron/Turion/Opteron temperature sensor"
+ depends on X86 && PCI
+ help
+ If you say yes here you get support for the temperature
+ sensor inside your CPU. Supported are later revisions of
+ the AMD Family 10h and all revisions of the AMD Family 11h
+ microarchitectures.
+
+ This driver can also be built as a module. If so, the module
+ will be called k10temp.
+
config SENSORS_AMS
tristate "Apple Motion Sensor driver"
depends on PPC_PMAC && !PPC64 && INPUT && ((ADB_PMU && I2C = y) || (ADB_PMU && !I2C) || I2C) && EXPERIMENTAL
--- linux-2.6/drivers/hwmon/Makefile
+++ linux-2.6/drivers/hwmon/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_SENSORS_IBMAEM) += ibmaem.o
obj-$(CONFIG_SENSORS_IBMPEX) += ibmpex.o
obj-$(CONFIG_SENSORS_IT87) += it87.o
obj-$(CONFIG_SENSORS_K8TEMP) += k8temp.o
+obj-$(CONFIG_SENSORS_K10TEMP) += k10temp.o
obj-$(CONFIG_SENSORS_LIS3LV02D) += lis3lv02d.o hp_accel.o
obj-$(CONFIG_SENSORS_LIS3_SPI) += lis3lv02d.o lis3lv02d_spi.o
obj-$(CONFIG_SENSORS_LM63) += lm63.o
--- /dev/null
+++ linux-2.6/drivers/hwmon/k10temp.c
@@ -0,0 +1,135 @@
+/*
+ * k10temp.c - AMD Family 10h/11h CPUs hardware monitoring
+ *
+ * Copyright (c) 2009 Clemens Ladisch <[email protected]>
+ *
+ *
+ * This driver is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This driver is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this driver; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/err.h>
+#include <linux/hwmon.h>
+#include <linux/hwmon-sysfs.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <asm/processor.h>
+
+MODULE_DESCRIPTION("AMD Family 10h/11h CPU core temperature monitor");
+MODULE_AUTHOR("Clemens Ladisch <[email protected]>");
+MODULE_LICENSE("GPL");
+
+#define REG_REPORTED_TEMPERATURE 0xa4
+#define CURTMP_TO_MILLIDEGREES(regval) (((regval) >> 21) * 125)
+
+static ssize_t show_temp(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ u32 regval;
+
+ pci_read_config_dword(to_pci_dev(dev),
+ REG_REPORTED_TEMPERATURE, &regval);
+ return sprintf(buf, "%u\n", CURTMP_TO_MILLIDEGREES(regval));
+}
+
+static ssize_t show_name(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "k10temp\n");
+}
+
+static SENSOR_DEVICE_ATTR(temp1_input, S_IRUGO, show_temp, NULL, 0);
+static DEVICE_ATTR(name, S_IRUGO, show_name, NULL);
+
+static bool __devinit has_erratum_319(void)
+{
+ /*
+ * Erratum 319: The thermal sensor of older Family 10h processors
+ * (B steppings) is unreliable.
+ */
+ return boot_cpu_data.x86 == 0x10 && boot_cpu_data.x86_model <= 2;
+}
+
+static int __devinit k10temp_probe(struct pci_dev *pdev,
+ const struct pci_device_id *id)
+{
+ struct device *hwmon_dev;
+ int err;
+
+ if (has_erratum_319()) {
+ dev_err(&pdev->dev,
+ "unreliable CPU thermal sensor; monitoring disabled\n");
+ err = -ENODEV;
+ goto exit;
+ }
+
+ err = device_create_file(&pdev->dev,
+ &sensor_dev_attr_temp1_input.dev_attr);
+ if (err)
+ goto exit;
+
+ err = device_create_file(&pdev->dev, &dev_attr_name);
+ if (err)
+ goto exit_remove1;
+
+ hwmon_dev = hwmon_device_register(&pdev->dev);
+ if (IS_ERR(hwmon_dev)) {
+ err = PTR_ERR(hwmon_dev);
+ goto exit_remove2;
+ }
+
+ dev_set_drvdata(&pdev->dev, hwmon_dev);
+ return 0;
+
+exit_remove2:
+ device_remove_file(&pdev->dev, &dev_attr_name);
+exit_remove1:
+ device_remove_file(&pdev->dev, &sensor_dev_attr_temp1_input.dev_attr);
+exit:
+ return err;
+}
+
+static void __devexit k10temp_remove(struct pci_dev *pdev)
+{
+ hwmon_device_unregister(dev_get_drvdata(&pdev->dev));
+ device_remove_file(&pdev->dev, &dev_attr_name);
+ device_remove_file(&pdev->dev, &sensor_dev_attr_temp1_input.dev_attr);
+ dev_set_drvdata(&pdev->dev, NULL);
+}
+
+static struct pci_device_id k10temp_id_table[] = {
+ { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_10H_NB_MISC) },
+ { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_11H_NB_MISC) },
+ {}
+};
+MODULE_DEVICE_TABLE(pci, k10temp_id_table);
+
+static struct pci_driver k10temp_driver = {
+ .name = "k10temp",
+ .id_table = k10temp_id_table,
+ .probe = k10temp_probe,
+ .remove = __devexit_p(k10temp_remove),
+};
+
+static int __init k10temp_init(void)
+{
+ return pci_register_driver(&k10temp_driver);
+}
+
+static void __exit k10temp_exit(void)
+{
+ pci_unregister_driver(&k10temp_driver);
+}
+
+module_init(k10temp_init)
+module_exit(k10temp_exit)


2009-11-20 10:52:55

by Serge Belyshev

[permalink] [raw]
Subject: Re: [PATCH] k10temp: temperature sensor for AMD Family 10h/11h CPUs


> +All these processors have a sensor, but on older revisions of Family 10h
> +processors, the sensor returns inconsistent values (erratum 319). The driver
> +refuses to load with these revisions (DR-BA, DR-B2, DR-B3: some Embedded
> +Opterons on Socket F; and Quad-Core Opteron, Phenom Triple/Quad-Core, and
> +Athon Dual-Core on Socket AM2+). All later revisions (RB-C2, BL-C2, DA-C2,
> +RB-C3, HY-D0) work fine; see the list above.

Please note that erratum actually states that the sensor only "may report
inconsistent values.", not that it is always broken. As evident by my
own experience (tested with a userspace application), it actually works
perfectly on all B3 stepping processors that I have.

> +static bool __devinit has_erratum_319(void)
> +{
> + /*
> + * Erratum 319: The thermal sensor of older Family 10h processors
> + * (B steppings) is unreliable.
> + */
> + return boot_cpu_data.x86 == 0x10 && boot_cpu_data.x86_model <= 2;
> +}
...
> + if (has_erratum_319()) {
> + dev_err(&pdev->dev,
> + "unreliable CPU thermal sensor; monitoring disabled\n");
> + err = -ENODEV;
> + goto exit;
> + }

So, please provide an alternative for those who have a working sensor on a
revision B processor and want to use it.

2009-11-20 10:44:48

by Jean Delvare

[permalink] [raw]
Subject: Re: [lm-sensors] [PATCH] k10temp: temperature sensor for AMD Family 10h/11h CPUs

On Fri, 20 Nov 2009 10:22:50 +0000, Serge Belyshev wrote:
>
> > +All these processors have a sensor, but on older revisions of Family 10h
> > +processors, the sensor returns inconsistent values (erratum 319). The driver
> > +refuses to load with these revisions (DR-BA, DR-B2, DR-B3: some Embedded
> > +Opterons on Socket F; and Quad-Core Opteron, Phenom Triple/Quad-Core, and
> > +Athon Dual-Core on Socket AM2+). All later revisions (RB-C2, BL-C2, DA-C2,
> > +RB-C3, HY-D0) work fine; see the list above.
>
> Please note that erratum actually states that the sensor only "may report
> inconsistent values.", not that it is always broken. As evident by my
> own experience (tested with a userspace application), it actually works
> perfectly on all B3 stepping processors that I have.

We don't care that 5% of the CPU have working sensors. 95% [1] of
theses CPUs have broken sensors and their users will ask us for help
and we are fed up with this. So all these CPUs are blacklisted, period.

> > +static bool __devinit has_erratum_319(void)
> > +{
> > + /*
> > + * Erratum 319: The thermal sensor of older Family 10h processors
> > + * (B steppings) is unreliable.
> > + */
> > + return boot_cpu_data.x86 == 0x10 && boot_cpu_data.x86_model <= 2;
> > +}
> ...
> > + if (has_erratum_319()) {
> > + dev_err(&pdev->dev,
> > + "unreliable CPU thermal sensor; monitoring disabled\n");
> > + err = -ENODEV;
> > + goto exit;
> > + }
>
> So, please provide an alternative for those who have a working sensor on a
> revision B processor and want to use it.

The alternative already exists: you can rebuild the driver yourself
without this check.

Or yet another alternative: become the maintainer of the hwmon
subsystem, and do lm-sensors user support for a couple years. Then you
will be allowed to decide what goes in.

And yet another one: instead of asking others to solve your very own
problem, why don't you try and solve it yourself? I'm sure Clemens
would welcome patches to his driver.

Thank you very much.

[1] Yes these numbers are totally made up. There is no reliable way to
tell a broken sensor from a working one anyway.

--
Jean Delvare

2009-11-20 10:47:52

by Clemens Ladisch

[permalink] [raw]
Subject: [PATCH v2] k10temp: temperature sensor for AMD Family 10h/11h CPUs

This adds a driver for the internal temperature sensor of AMD Family 10h
and 11h CPUs.

Signed-off-by: Clemens Ladisch <[email protected]>
---
v2: Erratum 319 now results in a warning, not an error, because it cannot
be reliably detected; Serge Belyshev reports that his CPU sensor works.

Documentation/hwmon/k10temp | 50 ++++++++++++++++
drivers/hwmon/Kconfig | 11 +++
drivers/hwmon/Makefile | 1
drivers/hwmon/k10temp.c | 132 ++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 194 insertions(+)

--- /dev/null
+++ linux-2.6/Documentation/hwmon/k10temp
@@ -0,0 +1,50 @@
+Kernel driver k10temp
+=====================
+
+Supported chips:
+* AMD Family 10h processors:
+ Socket F: Quad-Core/Six-Core/Embedded AMD Opteron
+ Socket AM2+: Quad-Core Opteron, Phenom (II) X3/X4
+ Socket AM3: Quad-Core Opteron, Athlon/Phenom II X2/X3/X4, Sempron II
+ mobile: Athlon II, Sempron, Turion II
+
+ Prefix: 'k10temp'
+ Addresses scanned: PCI space
+ Datasheets: see http://developer.amd.com/documentation/guides/
+ #31116: BIOS and Kernel Developer's Guide For AMD Family 10h Processors
+ #41322: Revision Guide for AMD Family 10h Processors
+
+* AMD Family 11h processors:
+ mobile: Athlon (X2), Sempron (X2), Turion X2 (Ultra)
+
+ Prefix: 'k10temp'
+ Addresses scanned: PCI space
+ Datasheets: see http://developer.amd.com/documentation/guides/
+ #41256: BIOS and Kernel Developer's Guide For AMD Family 11h Processors
+ #41788: Revision Guide for AMD Family 11h Processors
+
+Author: Clemens Ladisch <[email protected]>
+
+Description
+-----------
+
+This driver permits reading of the internal temperature sensor of AMD
+Family 10h and 11h processors.
+
+All these processors have a sensor, but on older revisions of Family 10h
+processors, the sensor may return inconsistent values (erratum 319).
+
+There is one temperature value, available as temp1_input in sysfs. It is
+measured in degrees Celsius with a resolution of 1/8th degree. Please note
+that it is defined as a relative value; to quote the AMD manual:
+
+ Tctl is the processor temperatur control value, used by the platform to
+ control cooling systems. Tctl is a non-physical temperature on an arbitrary
+ scale measured in degrees. It does _not_ represent an actual physical
+ temperature like die or case temperature. Instead, it specifies the
+ processor temperature relative to the point at which the system must supply
+ the maximum cooling for the processor's specified maximum case temperature
+ and maximum thermal power dissipation.
+
+The maximum value for Tctl is usually defined as 70 degrees, so, as a rule of
+thumb, this value should not exceed 60 degrees.
--- linux-2.6/drivers/hwmon/Kconfig
+++ linux-2.6/drivers/hwmon/Kconfig
@@ -222,6 +222,17 @@ config SENSORS_K8TEMP
This driver can also be built as a module. If so, the module
will be called k8temp.

+config SENSORS_K10TEMP
+ tristate "AMD Phenom/Sempron/Turion/Opteron temperature sensor"
+ depends on X86 && PCI
+ help
+ If you say yes here you get support for the temperature
+ sensor inside your CPU. Supported are all revisions of
+ the AMD Family 10h and 11h microarchitectures.
+
+ This driver can also be built as a module. If so, the module
+ will be called k10temp.
+
config SENSORS_AMS
tristate "Apple Motion Sensor driver"
depends on PPC_PMAC && !PPC64 && INPUT && ((ADB_PMU && I2C = y) || (ADB_PMU && !I2C) || I2C) && EXPERIMENTAL
--- linux-2.6/drivers/hwmon/Makefile
+++ linux-2.6/drivers/hwmon/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_SENSORS_IBMAEM) += ibmaem.o
obj-$(CONFIG_SENSORS_IBMPEX) += ibmpex.o
obj-$(CONFIG_SENSORS_IT87) += it87.o
obj-$(CONFIG_SENSORS_K8TEMP) += k8temp.o
+obj-$(CONFIG_SENSORS_K10TEMP) += k10temp.o
obj-$(CONFIG_SENSORS_LIS3LV02D) += lis3lv02d.o hp_accel.o
obj-$(CONFIG_SENSORS_LIS3_SPI) += lis3lv02d.o lis3lv02d_spi.o
obj-$(CONFIG_SENSORS_LM63) += lm63.o
--- /dev/null
+++ linux-2.6/drivers/hwmon/k10temp.c
@@ -0,0 +1,132 @@
+/*
+ * k10temp.c - AMD Family 10h/11h CPUs hardware monitoring
+ *
+ * Copyright (c) 2009 Clemens Ladisch <[email protected]>
+ *
+ *
+ * This driver is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This driver is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this driver; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/err.h>
+#include <linux/hwmon.h>
+#include <linux/hwmon-sysfs.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <asm/processor.h>
+
+MODULE_DESCRIPTION("AMD Family 10h/11h CPU core temperature monitor");
+MODULE_AUTHOR("Clemens Ladisch <[email protected]>");
+MODULE_LICENSE("GPL");
+
+#define REG_REPORTED_TEMPERATURE 0xa4
+#define CURTMP_TO_MILLIDEGREES(regval) (((regval) >> 21) * 125)
+
+static ssize_t show_temp(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ u32 regval;
+
+ pci_read_config_dword(to_pci_dev(dev),
+ REG_REPORTED_TEMPERATURE, &regval);
+ return sprintf(buf, "%u\n", CURTMP_TO_MILLIDEGREES(regval));
+}
+
+static ssize_t show_name(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "k10temp\n");
+}
+
+static SENSOR_DEVICE_ATTR(temp1_input, S_IRUGO, show_temp, NULL, 0);
+static DEVICE_ATTR(name, S_IRUGO, show_name, NULL);
+
+static bool __devinit has_erratum_319(void)
+{
+ /*
+ * Erratum 319: The thermal sensor of older Family 10h processors
+ * (B steppings) may be unreliable.
+ */
+ return boot_cpu_data.x86 == 0x10 && boot_cpu_data.x86_model <= 2;
+}
+
+static int __devinit k10temp_probe(struct pci_dev *pdev,
+ const struct pci_device_id *id)
+{
+ struct device *hwmon_dev;
+ int err;
+
+ err = device_create_file(&pdev->dev,
+ &sensor_dev_attr_temp1_input.dev_attr);
+ if (err)
+ goto exit;
+
+ err = device_create_file(&pdev->dev, &dev_attr_name);
+ if (err)
+ goto exit_remove1;
+
+ hwmon_dev = hwmon_device_register(&pdev->dev);
+ if (IS_ERR(hwmon_dev)) {
+ err = PTR_ERR(hwmon_dev);
+ goto exit_remove2;
+ }
+
+ if (has_erratum_319())
+ dev_warn(&pdev->dev, "CPU thermal sensor may report"
+ " inconsistent values; check erratum 319\n");
+
+ dev_set_drvdata(&pdev->dev, hwmon_dev);
+ return 0;
+
+exit_remove2:
+ device_remove_file(&pdev->dev, &dev_attr_name);
+exit_remove1:
+ device_remove_file(&pdev->dev, &sensor_dev_attr_temp1_input.dev_attr);
+exit:
+ return err;
+}
+
+static void __devexit k10temp_remove(struct pci_dev *pdev)
+{
+ hwmon_device_unregister(dev_get_drvdata(&pdev->dev));
+ device_remove_file(&pdev->dev, &dev_attr_name);
+ device_remove_file(&pdev->dev, &sensor_dev_attr_temp1_input.dev_attr);
+ dev_set_drvdata(&pdev->dev, NULL);
+}
+
+static struct pci_device_id k10temp_id_table[] = {
+ { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_10H_NB_MISC) },
+ { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_11H_NB_MISC) },
+ {}
+};
+MODULE_DEVICE_TABLE(pci, k10temp_id_table);
+
+static struct pci_driver k10temp_driver = {
+ .name = "k10temp",
+ .id_table = k10temp_id_table,
+ .probe = k10temp_probe,
+ .remove = __devexit_p(k10temp_remove),
+};
+
+static int __init k10temp_init(void)
+{
+ return pci_register_driver(&k10temp_driver);
+}
+
+static void __exit k10temp_exit(void)
+{
+ pci_unregister_driver(&k10temp_driver);
+}
+
+module_init(k10temp_init)
+module_exit(k10temp_exit)

2009-11-20 11:30:14

by Jean Delvare

[permalink] [raw]
Subject: Re: [lm-sensors] [PATCH v2] k10temp: temperature sensor for AMD Family 10h/11h CPUs

On Fri, 20 Nov 2009 11:47:51 +0100, Clemens Ladisch wrote:
> This adds a driver for the internal temperature sensor of AMD Family 10h
> and 11h CPUs.
>
> Signed-off-by: Clemens Ladisch <[email protected]>
> ---
> v2: Erratum 319 now results in a warning, not an error, because it cannot
> be reliably detected; Serge Belyshev reports that his CPU sensor works.

Nack. Unreliable sensors -> the default must be to NOT bind to these
CPUs. Feel free to provide a way to force the bind to happen (and still
print a big fat warning that this is a very bad idea), but do NOT make
it the default. Otherwise your driver will _never_ make it into the
kernel tree.

As a side note I'd be curious to see what Serge calls "working
sensors". We had several reports in the past of people who claimed that
their sensors were working. After looking at their numbers, I had doubt
this really was the case.

--
Jean Delvare

2009-11-20 11:56:49

by Clemens Ladisch

[permalink] [raw]
Subject: Re: [PATCH v2] k10temp: temperature sensor for AMD Family 10h/11h CPUs

Jean Delvare wrote:
> On Fri, 20 Nov 2009 11:47:51 +0100, Clemens Ladisch wrote:
> > v2: Erratum 319 now results in a warning, not an error, because it cannot
> > be reliably detected; Serge Belyshev reports that his CPU sensor works.
>
> Nack. Unreliable sensors -> the default must be to NOT bind to these
> CPUs. ... Otherwise your driver will _never_ make it into the
> kernel tree.

Then how did the k8temp driver make it into the kernel tree? :-)

> Feel free to provide a way to force the bind to happen (and still
> print a big fat warning that this is a very bad idea), but do NOT make
> it the default.

Okay, I'll add a force parameter.
I'd guess k8temp should get the same?


Best regards,
Clemens

2009-11-20 12:18:53

by Jean Delvare

[permalink] [raw]
Subject: Re: [PATCH v2] k10temp: temperature sensor for AMD Family 10h/11h CPUs

On Fri, 20 Nov 2009 12:56:50 +0100, Clemens Ladisch wrote:
> Jean Delvare wrote:
> > On Fri, 20 Nov 2009 11:47:51 +0100, Clemens Ladisch wrote:
> > > v2: Erratum 319 now results in a warning, not an error, because it cannot
> > > be reliably detected; Serge Belyshev reports that his CPU sensor works.
> >
> > Nack. Unreliable sensors -> the default must be to NOT bind to these
> > CPUs. ... Otherwise your driver will _never_ make it into the
> > kernel tree.
>
> Then how did the k8temp driver make it into the kernel tree? :-)

For the K8 family, the history is the other way around: early models
worked fine, while later models did not. So there was no objection to
the k8temp driver initially, there really was no reason for it.

It is much harder to remove a driver from the kernel tree than to add
it. Likewise, it is much harder to reject specific models if you have
been supporting them before, because some users may see it as a
regression. This is the reason why I would like us to be much more
cautious with the family 10h CPU support than we have been for the K8
family. Let us learn from our past errors.

> > Feel free to provide a way to force the bind to happen (and still
> > print a big fat warning that this is a very bad idea), but do NOT make
> > it the default.
>
> Okay, I'll add a force parameter.
> I'd guess k8temp should get the same?

It would make sense, yes, even though you will get complaints from some
users.

--
Jean Delvare

2009-11-23 07:45:59

by Clemens Ladisch

[permalink] [raw]
Subject: [PATCH v3] k10temp: temperature sensor for AMD Family 10h/11h CPUs

This adds a driver for the internal temperature sensor of AMD Family 10h
and 11h CPUs.

Signed-off-by: Clemens Ladisch <[email protected]>
---
v3: added 'force' parameter for CPUs with buggy sensor; more documentation

Documentation/hwmon/k10temp | 57 ++++++++++++
drivers/hwmon/Kconfig | 12 ++
drivers/hwmon/Makefile | 1
drivers/hwmon/k10temp.c | 142 ++++++++++++++++++++++++++++++++
4 files changed, 212 insertions(+)

--- /dev/null
+++ linux-2.6/Documentation/hwmon/k10temp
@@ -0,0 +1,57 @@
+Kernel driver k10temp
+=====================
+
+Supported chips:
+* AMD Family 10h processors:
+ Socket F: Quad-Core/Six-Core/Embedded Opteron
+ Socket AM2+: Opteron, Phenom (II) X3/X4
+ Socket AM3: Quad-Core Opteron, Athlon/Phenom II X2/X3/X4, Sempron II
+ Socket S1G3: Athlon II, Sempron, Turion II
+* AMD Family 11h processors:
+ Socket S1G2: Athlon (X2), Sempron (X2), Turion X2 (Ultra)
+
+ Prefix: 'k10temp'
+ Addresses scanned: PCI space
+ Datasheets:
+ BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h Processors:
+ http://support.amd.com/us/Processor_TechDocs/31116.pdf
+ BIOS and Kernel Developer's Guide (BKDG) for AMD Family 11h Processors:
+ http://support.amd.com/us/Processor_TechDocs/41256.pdf
+ Revision Guide for AMD Family 10h Processors:
+ http://support.amd.com/us/Processor_TechDocs/41322.pdf
+ Revision Guide for AMD Family 11h Processors:
+ http://support.amd.com/us/Processor_TechDocs/41788.pdf
+ AMD Family 11h Processor Power and Thermal Data Sheet for Notebooks:
+ http://support.amd.com/us/Processor_TechDocs/43373.pdf
+ AMD Family 10h Server and Workstation Processor Power and Thermal Data Sheet:
+ http://support.amd.com/us/Processor_TechDocs/43374.pdf
+ AMD Family 10h Desktop Processor Power and Thermal Data Sheet:
+ http://support.amd.com/us/Processor_TechDocs/43375.pdf
+
+Author: Clemens Ladisch <[email protected]>
+
+Description
+-----------
+
+This driver permits reading of the internal temperature sensor of AMD
+Family 10h and 11h processors.
+
+All these processors have a sensor, but on older revisions of Family 10h
+processors, the sensor may return inconsistent values (erratum 319). The
+driver will refuse to load on these revisions unless you specify the
+"force=1" module parameter.
+
+There is one temperature value, available as temp1_input in sysfs. It is
+measured in degrees Celsius with a resolution of 1/8th degree. Please note
+that it is defined as a relative value; to quote the AMD manual:
+
+ Tctl is the processor temperatur control value, used by the platform to
+ control cooling systems. Tctl is a non-physical temperature on an
+ arbitrary scale measured in degrees. It does _not_ represent an actual
+ physical temperature like die or case temperature. Instead, it specifies
+ the processor temperature relative to the point at which the system must
+ supply the maximum cooling for the processor's specified maximum case
+ temperature and maximum thermal power dissipation.
+
+The maximum value for Tctl is defined as 70 degrees, so, as a rule of thumb,
+this value should not exceed 60 degrees.
--- linux-2.6/drivers/hwmon/Kconfig
+++ linux-2.6/drivers/hwmon/Kconfig
@@ -222,6 +222,18 @@ config SENSORS_K8TEMP
This driver can also be built as a module. If so, the module
will be called k8temp.

+config SENSORS_K10TEMP
+ tristate "AMD Phenom/Sempron/Turion/Opteron temperature sensor"
+ depends on X86 && PCI
+ help
+ If you say yes here you get support for the temperature
+ sensor(s) inside your CPU. Supported are later revisions of
+ the AMD Family 10h and all revisions of the AMD Family 11h
+ microarchitectures.
+
+ This driver can also be built as a module. If so, the module
+ will be called k10temp.
+
config SENSORS_AMS
tristate "Apple Motion Sensor driver"
depends on PPC_PMAC && !PPC64 && INPUT && ((ADB_PMU && I2C = y) || (ADB_PMU && !I2C) || I2C) && EXPERIMENTAL
--- linux-2.6/drivers/hwmon/Makefile
+++ linux-2.6/drivers/hwmon/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_SENSORS_IBMAEM) += ibmaem.o
obj-$(CONFIG_SENSORS_IBMPEX) += ibmpex.o
obj-$(CONFIG_SENSORS_IT87) += it87.o
obj-$(CONFIG_SENSORS_K8TEMP) += k8temp.o
+obj-$(CONFIG_SENSORS_K10TEMP) += k10temp.o
obj-$(CONFIG_SENSORS_LIS3LV02D) += lis3lv02d.o hp_accel.o
obj-$(CONFIG_SENSORS_LIS3_SPI) += lis3lv02d.o lis3lv02d_spi.o
obj-$(CONFIG_SENSORS_LM63) += lm63.o
--- /dev/null
+++ linux-2.6/drivers/hwmon/k10temp.c
@@ -0,0 +1,142 @@
+/*
+ * k10temp.c - AMD Family 10h/11h processor hardware monitoring
+ *
+ * Copyright (c) 2009 Clemens Ladisch <[email protected]>
+ *
+ *
+ * This driver is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This driver is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this driver; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/err.h>
+#include <linux/hwmon.h>
+#include <linux/hwmon-sysfs.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <asm/processor.h>
+
+MODULE_DESCRIPTION("AMD Family 10h/11h CPU core temperature monitor");
+MODULE_AUTHOR("Clemens Ladisch <[email protected]>");
+MODULE_LICENSE("GPL");
+
+static bool force;
+module_param(force, bool, 0444);
+MODULE_PARM_DESC(force, "force loading on processors with erratum 319");
+
+#define REG_REPORTED_TEMPERATURE 0xa4
+#define CURTMP_TO_MILLIDEGREES(regval) (((regval) >> 21) * 125)
+
+static ssize_t show_temp(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ u32 regval;
+
+ pci_read_config_dword(to_pci_dev(dev),
+ REG_REPORTED_TEMPERATURE, &regval);
+ return sprintf(buf, "%u\n", CURTMP_TO_MILLIDEGREES(regval));
+}
+
+static ssize_t show_name(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "k10temp\n");
+}
+
+static SENSOR_DEVICE_ATTR(temp1_input, S_IRUGO, show_temp, NULL, 0);
+static DEVICE_ATTR(name, S_IRUGO, show_name, NULL);
+
+static bool __devinit has_erratum_319(void)
+{
+ /*
+ * Erratum 319: The thermal sensor of older Family 10h processors
+ * (B steppings) may be unreliable.
+ */
+ return boot_cpu_data.x86 == 0x10 && boot_cpu_data.x86_model <= 2;
+}
+
+static int __devinit k10temp_probe(struct pci_dev *pdev,
+ const struct pci_device_id *id)
+{
+ struct device *hwmon_dev;
+ int err;
+
+ if (has_erratum_319() && !force) {
+ dev_err(&pdev->dev,
+ "unreliable CPU thermal sensor; monitoring disabled\n");
+ err = -ENODEV;
+ goto exit;
+ }
+
+ err = device_create_file(&pdev->dev,
+ &sensor_dev_attr_temp1_input.dev_attr);
+ if (err)
+ goto exit;
+
+ err = device_create_file(&pdev->dev, &dev_attr_name);
+ if (err)
+ goto exit_remove1;
+
+ hwmon_dev = hwmon_device_register(&pdev->dev);
+ if (IS_ERR(hwmon_dev)) {
+ err = PTR_ERR(hwmon_dev);
+ goto exit_remove2;
+ }
+ dev_set_drvdata(&pdev->dev, hwmon_dev);
+
+ if (has_erratum_319() && force)
+ dev_warn(&pdev->dev,
+ "unreliable CPU thermal sensor; check erratum 319\n");
+ return 0;
+
+exit_remove2:
+ device_remove_file(&pdev->dev, &dev_attr_name);
+exit_remove1:
+ device_remove_file(&pdev->dev, &sensor_dev_attr_temp1_input.dev_attr);
+exit:
+ return err;
+}
+
+static void __devexit k10temp_remove(struct pci_dev *pdev)
+{
+ hwmon_device_unregister(dev_get_drvdata(&pdev->dev));
+ device_remove_file(&pdev->dev, &dev_attr_name);
+ device_remove_file(&pdev->dev, &sensor_dev_attr_temp1_input.dev_attr);
+ dev_set_drvdata(&pdev->dev, NULL);
+}
+
+static struct pci_device_id k10temp_id_table[] = {
+ { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_10H_NB_MISC) },
+ { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_11H_NB_MISC) },
+ {}
+};
+MODULE_DEVICE_TABLE(pci, k10temp_id_table);
+
+static struct pci_driver k10temp_driver = {
+ .name = "k10temp",
+ .id_table = k10temp_id_table,
+ .probe = k10temp_probe,
+ .remove = __devexit_p(k10temp_remove),
+};
+
+static int __init k10temp_init(void)
+{
+ return pci_register_driver(&k10temp_driver);
+}
+
+static void __exit k10temp_exit(void)
+{
+ pci_unregister_driver(&k10temp_driver);
+}
+
+module_init(k10temp_init)
+module_exit(k10temp_exit)

2009-11-23 13:51:54

by Jean Delvare

[permalink] [raw]
Subject: Re: [PATCH v3] k10temp: temperature sensor for AMD Family 10h/11h CPUs

Hi Clemens,

On Mon, 23 Nov 2009 08:45:58 +0100, Clemens Ladisch wrote:
> This adds a driver for the internal temperature sensor of AMD Family 10h
> and 11h CPUs.
>
> Signed-off-by: Clemens Ladisch <[email protected]>
> ---
> v3: added 'force' parameter for CPUs with buggy sensor; more documentation

Review:

>
> Documentation/hwmon/k10temp | 57 ++++++++++++
> drivers/hwmon/Kconfig | 12 ++
> drivers/hwmon/Makefile | 1
> drivers/hwmon/k10temp.c | 142 ++++++++++++++++++++++++++++++++
> 4 files changed, 212 insertions(+)

The name k10temp is a problem, as AMD insists that there is no such
things as K10 and K11, but instead "family 10h" and "family 11h"
processors.

>
> --- /dev/null
> +++ linux-2.6/Documentation/hwmon/k10temp
> @@ -0,0 +1,57 @@
> +Kernel driver k10temp
> +=====================
> +
> +Supported chips:
> +* AMD Family 10h processors:
> + Socket F: Quad-Core/Six-Core/Embedded Opteron
> + Socket AM2+: Opteron, Phenom (II) X3/X4
> + Socket AM3: Quad-Core Opteron, Athlon/Phenom II X2/X3/X4, Sempron II
> + Socket S1G3: Athlon II, Sempron, Turion II
> +* AMD Family 11h processors:
> + Socket S1G2: Athlon (X2), Sempron (X2), Turion X2 (Ultra)
> +
> + Prefix: 'k10temp'
> + Addresses scanned: PCI space
> + Datasheets:
> + BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h Processors:
> + http://support.amd.com/us/Processor_TechDocs/31116.pdf
> + BIOS and Kernel Developer's Guide (BKDG) for AMD Family 11h Processors:
> + http://support.amd.com/us/Processor_TechDocs/41256.pdf
> + Revision Guide for AMD Family 10h Processors:
> + http://support.amd.com/us/Processor_TechDocs/41322.pdf
> + Revision Guide for AMD Family 11h Processors:
> + http://support.amd.com/us/Processor_TechDocs/41788.pdf
> + AMD Family 11h Processor Power and Thermal Data Sheet for Notebooks:
> + http://support.amd.com/us/Processor_TechDocs/43373.pdf
> + AMD Family 10h Server and Workstation Processor Power and Thermal Data Sheet:
> + http://support.amd.com/us/Processor_TechDocs/43374.pdf
> + AMD Family 10h Desktop Processor Power and Thermal Data Sheet:
> + http://support.amd.com/us/Processor_TechDocs/43375.pdf
> +
> +Author: Clemens Ladisch <[email protected]>
> +
> +Description
> +-----------
> +
> +This driver permits reading of the internal temperature sensor of AMD
> +Family 10h and 11h processors.
> +
> +All these processors have a sensor, but on older revisions of Family 10h
> +processors, the sensor may return inconsistent values (erratum 319). The
> +driver will refuse to load on these revisions unless you specify the
> +"force=1" module parameter.
> +
> +There is one temperature value, available as temp1_input in sysfs. It is
> +measured in degrees Celsius with a resolution of 1/8th degree. Please note
> +that it is defined as a relative value; to quote the AMD manual:
> +
> + Tctl is the processor temperatur control value, used by the platform to

Spelling: temperature.

> + control cooling systems. Tctl is a non-physical temperature on an
> + arbitrary scale measured in degrees. It does _not_ represent an actual
> + physical temperature like die or case temperature. Instead, it specifies
> + the processor temperature relative to the point at which the system must
> + supply the maximum cooling for the processor's specified maximum case
> + temperature and maximum thermal power dissipation.
> +
> +The maximum value for Tctl is defined as 70 degrees, so, as a rule of thumb,
> +this value should not exceed 60 degrees.

Now I am puzzled. If the temperature value is on an arbitrary scale,
then the value returned by the driver is essentially fake? I can already
hear the users complain that your driver is reporting temperature
values which are different from their on-board hardware monitoring chip
and they want to know who is right. It will be the same mess as with
coretemp :(

Don't we have additional information about the actual maximum Tcase
value for the different supported models, as we have in coretemp? That
would be an acceptable workaround, I guess.

If not, then it might be the right time to introduce a new interface
for relative temperature values. This needs some work, as we must first
define the interface, then implement support in libsensors and sensors,
and other monitoring applications, and then convert the affected
drivers. But apparently we will have to, as major CPU makers are not
able to implement something as simple as an absolute temperature
sensor :(

> --- linux-2.6/drivers/hwmon/Kconfig
> +++ linux-2.6/drivers/hwmon/Kconfig
> @@ -222,6 +222,18 @@ config SENSORS_K8TEMP
> This driver can also be built as a module. If so, the module
> will be called k8temp.
>
> +config SENSORS_K10TEMP
> + tristate "AMD Phenom/Sempron/Turion/Opteron temperature sensor"
> + depends on X86 && PCI
> + help
> + If you say yes here you get support for the temperature
> + sensor(s) inside your CPU. Supported are later revisions of
> + the AMD Family 10h and all revisions of the AMD Family 11h
> + microarchitectures.
> +
> + This driver can also be built as a module. If so, the module
> + will be called k10temp.
> +
> config SENSORS_AMS
> tristate "Apple Motion Sensor driver"
> depends on PPC_PMAC && !PPC64 && INPUT && ((ADB_PMU && I2C = y) || (ADB_PMU && !I2C) || I2C) && EXPERIMENTAL
> --- linux-2.6/drivers/hwmon/Makefile
> +++ linux-2.6/drivers/hwmon/Makefile
> @@ -53,6 +53,7 @@ obj-$(CONFIG_SENSORS_IBMAEM) += ibmaem.o
> obj-$(CONFIG_SENSORS_IBMPEX) += ibmpex.o
> obj-$(CONFIG_SENSORS_IT87) += it87.o
> obj-$(CONFIG_SENSORS_K8TEMP) += k8temp.o
> +obj-$(CONFIG_SENSORS_K10TEMP) += k10temp.o
> obj-$(CONFIG_SENSORS_LIS3LV02D) += lis3lv02d.o hp_accel.o
> obj-$(CONFIG_SENSORS_LIS3_SPI) += lis3lv02d.o lis3lv02d_spi.o
> obj-$(CONFIG_SENSORS_LM63) += lm63.o
> --- /dev/null
> +++ linux-2.6/drivers/hwmon/k10temp.c
> @@ -0,0 +1,142 @@
> +/*
> + * k10temp.c - AMD Family 10h/11h processor hardware monitoring
> + *
> + * Copyright (c) 2009 Clemens Ladisch <[email protected]>
> + *
> + *
> + * This driver is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This driver is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> + * See the GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this driver; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <linux/err.h>
> +#include <linux/hwmon.h>
> +#include <linux/hwmon-sysfs.h>
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include <asm/processor.h>
> +
> +MODULE_DESCRIPTION("AMD Family 10h/11h CPU core temperature monitor");
> +MODULE_AUTHOR("Clemens Ladisch <[email protected]>");
> +MODULE_LICENSE("GPL");
> +
> +static bool force;
> +module_param(force, bool, 0444);
> +MODULE_PARM_DESC(force, "force loading on processors with erratum 319");
> +
> +#define REG_REPORTED_TEMPERATURE 0xa4
> +#define CURTMP_TO_MILLIDEGREES(regval) (((regval) >> 21) * 125)

I'd vastly prefer this to be an inline function. Or even not a function
at all, after all you're only using it once.

> +
> +static ssize_t show_temp(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + u32 regval;
> +
> + pci_read_config_dword(to_pci_dev(dev),
> + REG_REPORTED_TEMPERATURE, &regval);
> + return sprintf(buf, "%u\n", CURTMP_TO_MILLIDEGREES(regval));
> +}
> +
> +static ssize_t show_name(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "k10temp\n");
> +}
> +
> +static SENSOR_DEVICE_ATTR(temp1_input, S_IRUGO, show_temp, NULL, 0);

There's no point in using SENSOR_DEVICE_ATTR() if you don't make use of
the last parameter. DEVICE_ATTR is enough, and cheaper. And then you no
longer need <linux/hwmon-sysfs.h>.

> +static DEVICE_ATTR(name, S_IRUGO, show_name, NULL);
> +
> +static bool __devinit has_erratum_319(void)
> +{
> + /*
> + * Erratum 319: The thermal sensor of older Family 10h processors
> + * (B steppings) may be unreliable.
> + */
> + return boot_cpu_data.x86 == 0x10 && boot_cpu_data.x86_model <= 2;
> +}
> +
> +static int __devinit k10temp_probe(struct pci_dev *pdev,
> + const struct pci_device_id *id)
> +{
> + struct device *hwmon_dev;
> + int err;
> +
> + if (has_erratum_319() && !force) {
> + dev_err(&pdev->dev,
> + "unreliable CPU thermal sensor; monitoring disabled\n");
> + err = -ENODEV;
> + goto exit;
> + }
> +
> + err = device_create_file(&pdev->dev,
> + &sensor_dev_attr_temp1_input.dev_attr);
> + if (err)
> + goto exit;
> +
> + err = device_create_file(&pdev->dev, &dev_attr_name);
> + if (err)
> + goto exit_remove1;
> +
> + hwmon_dev = hwmon_device_register(&pdev->dev);
> + if (IS_ERR(hwmon_dev)) {
> + err = PTR_ERR(hwmon_dev);
> + goto exit_remove2;
> + }
> + dev_set_drvdata(&pdev->dev, hwmon_dev);
> +
> + if (has_erratum_319() && force)
> + dev_warn(&pdev->dev,
> + "unreliable CPU thermal sensor; check erratum 319\n");
> + return 0;
> +
> +exit_remove2:
> + device_remove_file(&pdev->dev, &dev_attr_name);
> +exit_remove1:
> + device_remove_file(&pdev->dev, &sensor_dev_attr_temp1_input.dev_attr);
> +exit:
> + return err;
> +}
> +
> +static void __devexit k10temp_remove(struct pci_dev *pdev)
> +{
> + hwmon_device_unregister(dev_get_drvdata(&pdev->dev));
> + device_remove_file(&pdev->dev, &dev_attr_name);
> + device_remove_file(&pdev->dev, &sensor_dev_attr_temp1_input.dev_attr);
> + dev_set_drvdata(&pdev->dev, NULL);
> +}
> +
> +static struct pci_device_id k10temp_id_table[] = {
> + { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_10H_NB_MISC) },
> + { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_11H_NB_MISC) },
> + {}
> +};
> +MODULE_DEVICE_TABLE(pci, k10temp_id_table);
> +
> +static struct pci_driver k10temp_driver = {
> + .name = "k10temp",
> + .id_table = k10temp_id_table,
> + .probe = k10temp_probe,
> + .remove = __devexit_p(k10temp_remove),
> +};
> +
> +static int __init k10temp_init(void)
> +{
> + return pci_register_driver(&k10temp_driver);
> +}
> +
> +static void __exit k10temp_exit(void)
> +{
> + pci_unregister_driver(&k10temp_driver);
> +}
> +
> +module_init(k10temp_init)
> +module_exit(k10temp_exit)

All the rest looks OK.

--
Jean Delvare

2009-11-23 15:29:24

by Clemens Ladisch

[permalink] [raw]
Subject: Re: [PATCH v3] k10temp: temperature sensor for AMD Family 10h/11h CPUs

Jean Delvare wrote:
> On Mon, 23 Nov 2009 08:45:58 +0100, Clemens Ladisch wrote:
>> Documentation/hwmon/k10temp | 57 ++++++++++++
>> drivers/hwmon/k10temp.c | 142 ++++++++++++++++++++++++++++++++
>
> The name k10temp is a problem, as AMD insists that there is no such
> things as K10 and K11, but instead "family 10h" and "family 11h"
> processors.

K10 was AMD's internal code name, and is widely used in practice.
I'd like to keep this name since it is consistent with the older
k8temp driver.

What name would you propose instead? "amdfam10temp"?

>> + control cooling systems. Tctl is a non-physical temperature on an
>> + arbitrary scale measured in degrees. It does _not_ represent an actual
>> + physical temperature like die or case temperature. Instead, it specifies
>> + the processor temperature relative to the point at which the system must
>> + supply the maximum cooling for the processor's specified maximum case
>> + temperature and maximum thermal power dissipation.
>> +
>> +The maximum value for Tctl is defined as 70 degrees, so, as a rule of thumb,
>> +this value should not exceed 60 degrees.
>
> Now I am puzzled. If the temperature value is on an arbitrary scale,
> then the value returned by the driver is essentially fake?

Yes (and it's near enough the absolute value to look plausible).

> Don't we have additional information about the actual maximum Tcase
> value for the different supported models, as we have in coretemp?

For AMD, Tcase is the physical temperature. Did you mean Tctl?

I'll add Tctl max (= "100% cooling, please") as temp1_max, and there's
a register that contains the Tctl value at which the processor starts
throttling, which could be exported as temp1_crit(_hyst), if I
understand the lm-sensors documentation correctly.

As for your other comments, I'll integrate them in the next version of
the patch.


> If not, then it might be the right time to introduce a new interface
> for relative temperature values. This needs some work, as we must first
> define the interface, then implement support in libsensors and sensors,
> and other monitoring applications, and then convert the affected
> drivers. But apparently we will have to, as major CPU makers are not
> able to implement something as simple as an absolute temperature
> sensor :(

There still is the built-in diode to be read by the motherboard, but the
internal sensor was never intended to be an absolute measurement but
just as a means for controlling the cooling.



Best regards,
Clemens

2009-11-23 19:05:25

by Jean Delvare

[permalink] [raw]
Subject: Re: [PATCH v3] k10temp: temperature sensor for AMD Family 10h/11h CPUs

On Mon, 23 Nov 2009 16:29:25 +0100, Clemens Ladisch wrote:
> Jean Delvare wrote:
> > On Mon, 23 Nov 2009 08:45:58 +0100, Clemens Ladisch wrote:
> >> Documentation/hwmon/k10temp | 57 ++++++++++++
> >> drivers/hwmon/k10temp.c | 142 ++++++++++++++++++++++++++++++++
> >
> > The name k10temp is a problem, as AMD insists that there is no such
> > things as K10 and K11, but instead "family 10h" and "family 11h"
> > processors.
>
> K10 was AMD's internal code name, and is widely used in practice.
> I'd like to keep this name since it is consistent with the older
> k8temp driver.
>
> What name would you propose instead? "amdfam10temp"?

Not very readable, I admit. "amd10temp" would do, I guess. But I agree
it doesn't matter that much, it's only a driver name after all.

> >> + control cooling systems. Tctl is a non-physical temperature on an
> >> + arbitrary scale measured in degrees. It does _not_ represent an actual
> >> + physical temperature like die or case temperature. Instead, it specifies
> >> + the processor temperature relative to the point at which the system must
> >> + supply the maximum cooling for the processor's specified maximum case
> >> + temperature and maximum thermal power dissipation.
> >> +
> >> +The maximum value for Tctl is defined as 70 degrees, so, as a rule of thumb,
> >> +this value should not exceed 60 degrees.
> >
> > Now I am puzzled. If the temperature value is on an arbitrary scale,
> > then the value returned by the driver is essentially fake?
>
> Yes (and it's near enough the absolute value to look plausible).

I don't know if it is a good or bad idea. Bad, I guess.

> > Don't we have additional information about the actual maximum Tcase
> > value for the different supported models, as we have in coretemp?
>
> For AMD, Tcase is the physical temperature. Did you mean Tctl?

I meant the physical temperature when Tctl = 70. In other words, the
offset between Tctl and the physical temperature.

> I'll add Tctl max (= "100% cooling, please") as temp1_max, and there's

Yes, good idea.

> a register that contains the Tctl value at which the processor starts
> throttling, which could be exported as temp1_crit(_hyst), if I
> understand the lm-sensors documentation correctly.

This is correct.

> As for your other comments, I'll integrate them in the next version of
> the patch.
>
> > If not, then it might be the right time to introduce a new interface
> > for relative temperature values. This needs some work, as we must first
> > define the interface, then implement support in libsensors and sensors,
> > and other monitoring applications, and then convert the affected
> > drivers. But apparently we will have to, as major CPU makers are not
> > able to implement something as simple as an absolute temperature
> > sensor :(
>
> There still is the built-in diode to be read by the motherboard, but the
> internal sensor was never intended to be an absolute measurement but
> just as a means for controlling the cooling.

Still we use it for that purpose at the moment. Maybe we simply should
not?

--
Jean Delvare

2009-11-24 08:43:28

by Clemens Ladisch

[permalink] [raw]
Subject: Re: [PATCH v3] k10temp: temperature sensor for AMD Family 10h/11h CPUs

Jean Delvare wrote:
> On Mon, 23 Nov 2009 16:29:25 +0100, Clemens Ladisch wrote:
> > Jean Delvare wrote:
> > > The name k10temp is a problem, as AMD insists that there is no such
> > > things as K10 and K11, but instead "family 10h" and "family 11h"
> > > processors.
> >
> > K10 was AMD's internal code name, and is widely used in practice.
> > I'd like to keep this name since it is consistent with the older
> > k8temp driver.
> >
> > What name would you propose instead? "amdfam10temp"?
>
> Not very readable, I admit. "amd10temp" would do, I guess. But I agree
> it doesn't matter that much, it's only a driver name after all.

In that case, I'll just keep it. :)

> > > Don't we have additional information about the actual maximum Tcase
> > > value for the different supported models, as we have in coretemp?
> >
> > For AMD, Tcase is the physical temperature. Did you mean Tctl?
>
> I meant the physical temperature when Tctl = 70. In other words, the
> offset between Tctl and the physical temperature.

The Power and Thermal datasheets have information like "Tcase Max:
55 °C to 71 °C". So this seems to be different for individual
processors.

It might be possible to get that information through SB-TSI, but AMD
tries to keep that specification secret.

> > There still is the built-in diode to be read by the motherboard, but the
> > internal sensor was never intended to be an absolute measurement but
> > just as a means for controlling the cooling.
>
> Still we use it for that purpose at the moment. Maybe we simply should
> not?

Well, the absolute measurements have essentially the same purpose, and
would not make much sense without comparing them to some absolute limit.

In any case, it might make more sense to show such values as something
like "20 °C below maximum".


Best regards,
Clemens

2009-11-24 08:43:39

by Clemens Ladisch

[permalink] [raw]
Subject: [PATCH v4] k10temp: temperature sensor for AMD Family 10h/11h CPUs

This adds a driver for the internal temperature sensor of AMD Family 10h
and 11h CPUs.

Signed-off-by: Clemens Ladisch <[email protected]>
---
v3: added 'force' parameter for CPUs with buggy sensor; more documentation
v4: added max/crit values, other changes suggested by Jean Delvare

Documentation/hwmon/k10temp | 60 +++++++++
drivers/hwmon/Kconfig | 12 +
drivers/hwmon/Makefile | 1
drivers/hwmon/k10temp.c | 197 ++++++++++++++++++++++++++++++++
4 files changed, 270 insertions(+)

--- /dev/null
+++ linux-2.6/Documentation/hwmon/k10temp
@@ -0,0 +1,60 @@
+Kernel driver k10temp
+=====================
+
+Supported chips:
+* AMD Family 10h processors:
+ Socket F: Quad-Core/Six-Core/Embedded Opteron
+ Socket AM2+: Opteron, Phenom (II) X3/X4
+ Socket AM3: Quad-Core Opteron, Athlon/Phenom II X2/X3/X4, Sempron II
+ Socket S1G3: Athlon II, Sempron, Turion II
+* AMD Family 11h processors:
+ Socket S1G2: Athlon (X2), Sempron (X2), Turion X2 (Ultra)
+
+ Prefix: 'k10temp'
+ Addresses scanned: PCI space
+ Datasheets:
+ BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h Processors:
+ http://support.amd.com/us/Processor_TechDocs/31116.pdf
+ BIOS and Kernel Developer's Guide (BKDG) for AMD Family 11h Processors:
+ http://support.amd.com/us/Processor_TechDocs/41256.pdf
+ Revision Guide for AMD Family 10h Processors:
+ http://support.amd.com/us/Processor_TechDocs/41322.pdf
+ Revision Guide for AMD Family 11h Processors:
+ http://support.amd.com/us/Processor_TechDocs/41788.pdf
+ AMD Family 11h Processor Power and Thermal Data Sheet for Notebooks:
+ http://support.amd.com/us/Processor_TechDocs/43373.pdf
+ AMD Family 10h Server and Workstation Processor Power and Thermal Data Sheet:
+ http://support.amd.com/us/Processor_TechDocs/43374.pdf
+ AMD Family 10h Desktop Processor Power and Thermal Data Sheet:
+ http://support.amd.com/us/Processor_TechDocs/43375.pdf
+
+Author: Clemens Ladisch <[email protected]>
+
+Description
+-----------
+
+This driver permits reading of the internal temperature sensor of AMD
+Family 10h and 11h processors.
+
+All these processors have a sensor, but on older revisions of Family 10h
+processors, the sensor may return inconsistent values (erratum 319). The
+driver will refuse to load on these revisions unless you specify the
+"force=1" module parameter.
+
+There is one temperature measurement value, available as temp1_input in
+sysfs. It is measured in degrees Celsius with a resolution of 1/8th degree.
+Please note that it is defined as a relative value; to quote the AMD manual:
+
+ Tctl is the processor temperature control value, used by the platform to
+ control cooling systems. Tctl is a non-physical temperature on an
+ arbitrary scale measured in degrees. It does _not_ represent an actual
+ physical temperature like die or case temperature. Instead, it specifies
+ the processor temperature relative to the point at which the system must
+ supply the maximum cooling for the processor's specified maximum case
+ temperature and maximum thermal power dissipation.
+
+The maximum value for Tctl is available in the file temp1_max.
+
+If the BIOS has enabled hardware temperature control, the threshold at
+which the processor will throttle itself to avoid damage is available in
+temp1_crit and temp1_crit_hyst.
--- linux-2.6/drivers/hwmon/Kconfig
+++ linux-2.6/drivers/hwmon/Kconfig
@@ -222,6 +222,18 @@ config SENSORS_K8TEMP
This driver can also be built as a module. If so, the module
will be called k8temp.

+config SENSORS_K10TEMP
+ tristate "AMD Phenom/Sempron/Turion/Opteron temperature sensor"
+ depends on X86 && PCI
+ help
+ If you say yes here you get support for the temperature
+ sensor(s) inside your CPU. Supported are later revisions of
+ the AMD Family 10h and all revisions of the AMD Family 11h
+ microarchitectures.
+
+ This driver can also be built as a module. If so, the module
+ will be called k10temp.
+
config SENSORS_AMS
tristate "Apple Motion Sensor driver"
depends on PPC_PMAC && !PPC64 && INPUT && ((ADB_PMU && I2C = y) || (ADB_PMU && !I2C) || I2C) && EXPERIMENTAL
--- linux-2.6/drivers/hwmon/Makefile
+++ linux-2.6/drivers/hwmon/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_SENSORS_IBMAEM) += ibmaem.o
obj-$(CONFIG_SENSORS_IBMPEX) += ibmpex.o
obj-$(CONFIG_SENSORS_IT87) += it87.o
obj-$(CONFIG_SENSORS_K8TEMP) += k8temp.o
+obj-$(CONFIG_SENSORS_K10TEMP) += k10temp.o
obj-$(CONFIG_SENSORS_LIS3LV02D) += lis3lv02d.o hp_accel.o
obj-$(CONFIG_SENSORS_LIS3_SPI) += lis3lv02d.o lis3lv02d_spi.o
obj-$(CONFIG_SENSORS_LM63) += lm63.o
--- /dev/null
+++ linux-2.6/drivers/hwmon/k10temp.c
@@ -0,0 +1,197 @@
+/*
+ * k10temp.c - AMD Family 10h/11h processor hardware monitoring
+ *
+ * Copyright (c) 2009 Clemens Ladisch <[email protected]>
+ *
+ *
+ * This driver is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This driver is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this driver; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/err.h>
+#include <linux/hwmon.h>
+#include <linux/hwmon-sysfs.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <asm/processor.h>
+
+MODULE_DESCRIPTION("AMD Family 10h/11h CPU core temperature monitor");
+MODULE_AUTHOR("Clemens Ladisch <[email protected]>");
+MODULE_LICENSE("GPL");
+
+static bool force;
+module_param(force, bool, 0444);
+MODULE_PARM_DESC(force, "force loading on processors with erratum 319");
+
+#define REG_HARDWARE_THERMAL_CONTROL 0x64
+#define HTC_ENABLE 0x00000001
+
+#define REG_REPORTED_TEMPERATURE 0xa4
+
+#define REG_NORTHBRIDGE_CAPABILITIES 0xe8
+#define NB_CAP_HTC 0x00000400
+
+static ssize_t show_temp(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ u32 regval;
+
+ pci_read_config_dword(to_pci_dev(dev),
+ REG_REPORTED_TEMPERATURE, &regval);
+ return sprintf(buf, "%u\n", (regval >> 21) * 125);
+}
+
+static ssize_t show_temp_max(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", 70 * 1000);
+}
+
+static ssize_t show_temp_crit(struct device *dev,
+ struct device_attribute *devattr, char *buf)
+{
+ struct sensor_device_attribute *attr = to_sensor_dev_attr(devattr);
+ int show_hyst = attr->index;
+ u32 regval;
+ int value;
+
+ pci_read_config_dword(to_pci_dev(dev),
+ REG_HARDWARE_THERMAL_CONTROL, &regval);
+ value = ((regval >> 16) & 0x7f) * 500 + 52000;
+ if (show_hyst)
+ value -= ((regval >> 24) & 0xf) * 500;
+ return sprintf(buf, "%d\n", value);
+}
+
+static ssize_t show_name(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "k10temp\n");
+}
+
+static DEVICE_ATTR(temp1_input, S_IRUGO, show_temp, NULL);
+static DEVICE_ATTR(temp1_max, S_IRUGO, show_temp_max, NULL);
+static SENSOR_DEVICE_ATTR(temp1_crit, S_IRUGO, show_temp_crit, NULL, 0);
+static SENSOR_DEVICE_ATTR(temp1_crit_hyst, S_IRUGO, show_temp_crit, NULL, 1);
+static DEVICE_ATTR(name, S_IRUGO, show_name, NULL);
+
+static bool __devinit has_erratum_319(void)
+{
+ /*
+ * Erratum 319: The thermal sensor of older Family 10h processors
+ * (B steppings) may be unreliable.
+ */
+ return boot_cpu_data.x86 == 0x10 && boot_cpu_data.x86_model <= 2;
+}
+
+static int __devinit k10temp_probe(struct pci_dev *pdev,
+ const struct pci_device_id *id)
+{
+ struct device *hwmon_dev;
+ u32 reg_caps, reg_htc;
+ int err;
+
+ if (has_erratum_319() && !force) {
+ dev_err(&pdev->dev,
+ "unreliable CPU thermal sensor; monitoring disabled\n");
+ err = -ENODEV;
+ goto exit;
+ }
+
+ err = device_create_file(&pdev->dev, &dev_attr_temp1_input);
+ if (err)
+ goto exit;
+ err = device_create_file(&pdev->dev, &dev_attr_temp1_max);
+ if (err)
+ goto exit_remove;
+
+ pci_read_config_dword(pdev, REG_NORTHBRIDGE_CAPABILITIES, &reg_caps);
+ pci_read_config_dword(pdev, REG_HARDWARE_THERMAL_CONTROL, &reg_htc);
+ if ((reg_caps & NB_CAP_HTC) && (reg_htc & HTC_ENABLE)) {
+ err = device_create_file(&pdev->dev,
+ &sensor_dev_attr_temp1_crit.dev_attr);
+ if (err)
+ goto exit_remove;
+ err = device_create_file(&pdev->dev,
+ &sensor_dev_attr_temp1_crit_hyst.dev_attr);
+ if (err)
+ goto exit_remove;
+ }
+
+ err = device_create_file(&pdev->dev, &dev_attr_name);
+ if (err)
+ goto exit_remove;
+
+ hwmon_dev = hwmon_device_register(&pdev->dev);
+ if (IS_ERR(hwmon_dev)) {
+ err = PTR_ERR(hwmon_dev);
+ goto exit_remove;
+ }
+ dev_set_drvdata(&pdev->dev, hwmon_dev);
+
+ if (has_erratum_319() && force)
+ dev_warn(&pdev->dev,
+ "unreliable CPU thermal sensor; check erratum 319\n");
+ return 0;
+
+exit_remove:
+ device_remove_file(&pdev->dev, &dev_attr_name);
+ device_remove_file(&pdev->dev, &dev_attr_temp1_input);
+ device_remove_file(&pdev->dev, &dev_attr_temp1_max);
+ device_remove_file(&pdev->dev,
+ &sensor_dev_attr_temp1_crit.dev_attr);
+ device_remove_file(&pdev->dev,
+ &sensor_dev_attr_temp1_crit_hyst.dev_attr);
+exit:
+ return err;
+}
+
+static void __devexit k10temp_remove(struct pci_dev *pdev)
+{
+ hwmon_device_unregister(dev_get_drvdata(&pdev->dev));
+ device_remove_file(&pdev->dev, &dev_attr_name);
+ device_remove_file(&pdev->dev, &dev_attr_temp1_input);
+ device_remove_file(&pdev->dev, &dev_attr_temp1_max);
+ device_remove_file(&pdev->dev,
+ &sensor_dev_attr_temp1_crit.dev_attr);
+ device_remove_file(&pdev->dev,
+ &sensor_dev_attr_temp1_crit_hyst.dev_attr);
+ dev_set_drvdata(&pdev->dev, NULL);
+}
+
+static struct pci_device_id k10temp_id_table[] = {
+ { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_10H_NB_MISC) },
+ { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_11H_NB_MISC) },
+ {}
+};
+MODULE_DEVICE_TABLE(pci, k10temp_id_table);
+
+static struct pci_driver k10temp_driver = {
+ .name = "k10temp",
+ .id_table = k10temp_id_table,
+ .probe = k10temp_probe,
+ .remove = __devexit_p(k10temp_remove),
+};
+
+static int __init k10temp_init(void)
+{
+ return pci_register_driver(&k10temp_driver);
+}
+
+static void __exit k10temp_exit(void)
+{
+ pci_unregister_driver(&k10temp_driver);
+}
+
+module_init(k10temp_init)
+module_exit(k10temp_exit)

2009-11-24 13:26:52

by Jean Delvare

[permalink] [raw]
Subject: Re: [PATCH v3] k10temp: temperature sensor for AMD Family 10h/11h CPUs

On Tue, 24 Nov 2009 09:43:29 +0100, Clemens Ladisch wrote:
> Jean Delvare wrote:
> > On Mon, 23 Nov 2009 16:29:25 +0100, Clemens Ladisch wrote:
> > > There still is the built-in diode to be read by the motherboard, but the
> > > internal sensor was never intended to be an absolute measurement but
> > > just as a means for controlling the cooling.
> >
> > Still we use it for that purpose at the moment. Maybe we simply should
> > not?
>
> Well, the absolute measurements have essentially the same purpose, and
> would not make much sense without comparing them to some absolute limit.

Of course. The problem is that users don't know that the temperature
isn't real, and then get puzzled when comparing with other temperature
sensors in their system, which _do_ report physical temperatures.

> In any case, it might make more sense to show such values as something
> like "20 ?C below maximum".

I think so, yes. Now the difficulty is to come up with a suitable sysfs
interface. Dropping the current interface altogether doesn't sound
right as it will take time before a new version of libsensors is
written and spread out and all applications add support for the new
interface. In the meantime, I guess we want users to still see the
approximate value.

So ideally we would come up with an interface that adds up to the one
we have currently. Future libsensors/applications could read the extra
information to display the value in a different format so that the
users see the difference.

An idea I have about this is adding a sysfs file temp#_relative, which
would contain the fake temperature value that is used as a reference
for the thermal sensor in question. In the case of k10temp, the value
would be 70000. So for example we would have:

temp1_input: 46000
temp1_relative: 70000

Old applications would display this as 46?C while new ones would
display "24?C below the limit". For coretemp this could be 85000 or
100000 (at least in the case where we don't know the limit for sure.)

If there are limits exported (temp1_max, temp1_crit etc.) the same
offset could be applied to them too.

This approach has the advantage of backwards compatibility. It may not
be considered flexible enough though... For example it does not support
sensors with totally arbitrary scales (where 1000 != 1?C.) I seem to
remember we've seen this in the past?

If others have ideas about how we can support this, I'm all ears.

--
Jean Delvare

2009-11-24 14:09:55

by Clemens Ladisch

[permalink] [raw]
Subject: Re: [PATCH v3] k10temp: temperature sensor for AMD Family 10h/11h CPUs

Jean Delvare wrote:
> On Tue, 24 Nov 2009 09:43:29 +0100, Clemens Ladisch wrote:
> > In any case, it might make more sense to show such values as something
> > like "20 °C below maximum".
>
> I think so, yes. Now the difficulty is to come up with a suitable sysfs
> interface. Dropping the current interface altogether doesn't sound
> right as it will take time before a new version of libsensors is
> written and spread out and all applications add support for the new
> interface. In the meantime, I guess we want users to still see the
> approximate value.

k10temp is a new driver, so no old systems would break if it introduced
a new class of measurements with names like, e.g., reltemp#_*.
We could add this in parallel with the old interface. If we wanted to.

> So ideally we would come up with an interface that adds up to the one
> we have currently. Future libsensors/applications could read the extra
> information to display the value in a different format so that the
> users see the difference.
>
> An idea I have about this is adding a sysfs file temp#_relative, which
> would contain the fake temperature value that is used as a reference
> for the thermal sensor in question. In the case of k10temp, the value
> would be 70000. So for example we would have:
>
> temp1_input: 46000
> temp1_relative: 70000
>
> Old applications would display this as 46°C while new ones would
> display "24°C below the limit".

In this case, temp1_relative is identical with temp1_max. In the
general case, there always must be some kind of limit (whether "max" or
"crit" or something else) against which the values are measured,
otherwise a relative value would not make sense.

This means that one of the already existing limit values must be the
reference base, so we'd need just a mechanism to specify which of them
is it, i.e., "temp1_relative_base: max". If we'd have
"temp1_relative: 70000", the application would have to search among the
limit values for one with the same value.

(It doesn't really matter which one is the base, as all values are
relative anyway, so we could just define that temp#_max is the base;
so we'd have "temp1_relative: true").

> It may not be considered flexible enough though... For example it does
> not support sensors with totally arbitrary scales (where 1000 != 1°C.)

When the scale differs but is _known_, the driver can just rescale its
internal register values to millidegrees.

When we have some scale like "0%...100%", that should probably be
exported as "pwm#_target" or something like that.


Best regards,
Clemens

2009-11-24 20:11:32

by Jean Delvare

[permalink] [raw]
Subject: Re: [PATCH v3] k10temp: temperature sensor for AMD Family 10h/11h CPUs

On Tue, 24 Nov 2009 15:09:57 +0100, Clemens Ladisch wrote:
> Jean Delvare wrote:
> > On Tue, 24 Nov 2009 09:43:29 +0100, Clemens Ladisch wrote:
> > > In any case, it might make more sense to show such values as something
> > > like "20 ?C below maximum".
> >
> > I think so, yes. Now the difficulty is to come up with a suitable sysfs
> > interface. Dropping the current interface altogether doesn't sound
> > right as it will take time before a new version of libsensors is
> > written and spread out and all applications add support for the new
> > interface. In the meantime, I guess we want users to still see the
> > approximate value.
>
> k10temp is a new driver, so no old systems would break if it introduced
> a new class of measurements with names like, e.g., reltemp#_*.

I wasn't necessarily thinking of k10temp breaking a working system,
more of k10temp just not working on current systems (if we don't opt
for parallel implementations.) And there's also the coretemp case,
where we report relative values in some cases; that case would fit in
what we're discussing now, so we would modify an already existing
driver.

> We could add this in parallel with the old interface. If we wanted to.

I think we want to, yes. I just don't know for how long.

> > So ideally we would come up with an interface that adds up to the one
> > we have currently. Future libsensors/applications could read the extra
> > information to display the value in a different format so that the
> > users see the difference.
> >
> > An idea I have about this is adding a sysfs file temp#_relative, which
> > would contain the fake temperature value that is used as a reference
> > for the thermal sensor in question. In the case of k10temp, the value
> > would be 70000. So for example we would have:
> >
> > temp1_input: 46000
> > temp1_relative: 70000
> >
> > Old applications would display this as 46?C while new ones would
> > display "24?C below the limit".
>
> In this case, temp1_relative is identical with temp1_max. In the
> general case, there always must be some kind of limit (whether "max" or
> "crit" or something else) against which the values are measured,
> otherwise a relative value would not make sense.

I would love this to be true, but I can't see any reason why it would
have to be. I can easily imagine a CPU specification reading: "This
register contains a temperature value on an arbitrary scale; higher
values mean higher temperatures." I'm not claiming it would be
particularly useful... but if a CPU maker ever does this, don't we want
to support that? (This is a real question, maybe we don't.)

I really didn't expect the implementations we've seen in Intel Core or
AMD family 10h processors. So I wouldn't be surprised if future
implementations are different again.

> This means that one of the already existing limit values must be the
> reference base, so we'd need just a mechanism to specify which of them
> is it, i.e., "temp1_relative_base: max". If we'd have
> "temp1_relative: 70000", the application would have to search among the
> limit values for one with the same value.

I fail to see why the application would care about this at all. When in
relative mode, all other values would be offset by the temp#_relative
value. But that value itself would not be displayed (it has no physical
value, otherwise we wouldn't be in absolute mode, would we?)

> (It doesn't really matter which one is the base, as all values are
> relative anyway, so we could just define that temp#_max is the base;
> so we'd have "temp1_relative: true").

This is taking flexibility away from us, for no benefit that I can see.
Am I missing something?

(Additionally it wouldn't fit in libsensors as it exists today. This
doesn't mean it can't be done, but the cost is higher, so it needs to
bring an significant improvement.)

> > It may not be considered flexible enough though... For example it does
> > not support sensors with totally arbitrary scales (where 1000 != 1?C.)
>
> When the scale differs but is _known_, the driver can just rescale its
> internal register values to millidegrees.

Correct, and actually many drivers do that already. But the scale could
also be totally unknown (as in my theoretical example above), only
bounded by the register size (which would lead to pretty bad temp#_min
and _max IMHO.)

> When we have some scale like "0%...100%", that should probably be
> exported as "pwm#_target" or something like that.

That would be difficult, as we generally don't know which pwm output,
if any, is related to a given CPU. ACPI may know more, but the ACPI
interface to cooling zones and temperatures is very limited IMHO, and
hard to connect to physical devices, I doubt we can get anything usable
out of it. And anyway, deciding that cooling must be somehow
proportional to temperature is pretty arbitrary and doesn't match what
I've seen people do so far.

--
Jean Delvare

2009-11-25 09:51:38

by Clemens Ladisch

[permalink] [raw]
Subject: Re: [PATCH v3] k10temp: temperature sensor for AMD Family 10h/11h CPUs

Jean Delvare wrote:
> On Tue, 24 Nov 2009 15:09:57 +0100, Clemens Ladisch wrote:
> > This means that one of the already existing limit values must be the
> > reference base, so we'd need just a mechanism to specify which of them
> > is it, i.e., "temp1_relative_base: max". If we'd have
> > "temp1_relative: 70000", the application would have to search among the
> > limit values for one with the same value.
>
> I fail to see why the application would care about this at all. When in
> relative mode, all other values would be offset by the temp#_relative
> value. But that value itself would not be displayed (it has no physical
> value, otherwise we wouldn't be in absolute mode, would we?)
> ...
>
> > temp1_relative: true
>
> This is taking flexibility away from us, for no benefit that I can see.
> Am I missing something?

The application has to display something like "24 °C below the limit",
so how should it know that the 70°C should be named "the limit"?

To use an example, my CPU has these entries like these:
temp1_input: 29875
temp1_max: 70000
temp1_crit: 95000
temp1_crit_hyst: 92500

How should these entries be displayed?
(we know that: "40.1 °C below limit", "limit", "25 °C above limit" etc.)

But what algorithm should the application (or libsensors) use to create
those labels? If we have "temp1_relative: 70000", then this happens to
be the "max" limit; but what if some CPU vendor decides to define, e.g.,
the value 0 as the "normal" operating temperatire, so that the entries
would look like this:
temp1_input: -1000
temp1_max: 40000
temp1_relative: 0
Should the values be labeled as "1 °C below normal" and "40 °C above
normal", and how should the application know that 0 is to be labeled
"normal"? It might make more sense to display the temperature just as
"41 °C below max", in which case the actual value of temp1_relative is
not used at all.

"Relative" means that any value is meaningful only in comparison with
other values/limits, so it does not make sense to declare one point on
the scale as base.

> Additionally it wouldn't fit in libsensors as it exists today.

Then the best bet would probably be an entry like temp#_unit, with
0 = absolute °C (default); 1 = relative °C or °K; other values
"unknown". Even if some silly scale is introduced later, applications
that read this entry then know that they must not display a unit like °C
for unknown unit specifications.


Best regards,
Clemens

2009-11-25 19:47:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4] k10temp: temperature sensor for AMD Family 10h/11h CPUs

On Tue, 24 Nov 2009 09:43:39 +0100
Clemens Ladisch <[email protected]> wrote:

> This adds a driver for the internal temperature sensor of AMD Family 10h
> and 11h CPUs.
>
> Signed-off-by: Clemens Ladisch <[email protected]>
> ---
> v3: added 'force' parameter for CPUs with buggy sensor; more documentation
> v4: added max/crit values, other changes suggested by Jean Delvare

Pierre's happy now?

>
> ...
>
> +#define REG_HARDWARE_THERMAL_CONTROL 0x64
> +#define HTC_ENABLE 0x00000001
> +
> +#define REG_REPORTED_TEMPERATURE 0xa4
> +
> +#define REG_NORTHBRIDGE_CAPABILITIES 0xe8
> +#define NB_CAP_HTC 0x00000400

These definitions relate to register addresses within the AMD CPUs,
yes? I wonder if these should be placed in some header file in
arch/x86/include/asm/ where other such things are described.

2009-11-26 07:46:43

by Clemens Ladisch

[permalink] [raw]
Subject: Re: [PATCH v4] k10temp: temperature sensor for AMD Family 10h/11h CPUs

Andrew Morton wrote:
> Clemens Ladisch <[email protected]> wrote:
> > This adds a driver for the internal temperature sensor of AMD Family 10h
> > and 11h CPUs.
> > ...
> > +#define REG_HARDWARE_THERMAL_CONTROL 0x64
> > +#define HTC_ENABLE 0x00000001
> > +
> > +#define REG_REPORTED_TEMPERATURE 0xa4
> > +
> > +#define REG_NORTHBRIDGE_CAPABILITIES 0xe8
> > +#define NB_CAP_HTC 0x00000400
>
> These definitions relate to register addresses within the AMD CPUs,
> yes?

Yes, in the PCI configuration space of the internal northbridge.

> I wonder if these should be placed in some header file in
> arch/x86/include/asm/ where other such things are described.

There is no such header. The files that also access PCI cfg registers:
arch/x86/kernel/k8.c
arch/x86/kernel/quirks.c
arch/x86/oprofile/op_model_amd.c
define their own private symbols or use magic numbers.


Best regards,
Clemens

2009-11-26 20:44:39

by Jean Delvare

[permalink] [raw]
Subject: Re: [PATCH v3] k10temp: temperature sensor for AMD Family 10h/11h CPUs

Hi Clemens,

On Wed, 25 Nov 2009 10:51:38 +0100, Clemens Ladisch wrote:
> Jean Delvare wrote:
> > On Tue, 24 Nov 2009 15:09:57 +0100, Clemens Ladisch wrote:
> > > This means that one of the already existing limit values must be the
> > > reference base, so we'd need just a mechanism to specify which of them
> > > is it, i.e., "temp1_relative_base: max". If we'd have
> > > "temp1_relative: 70000", the application would have to search among the
> > > limit values for one with the same value.
> >
> > I fail to see why the application would care about this at all. When in
> > relative mode, all other values would be offset by the temp#_relative
> > value. But that value itself would not be displayed (it has no physical
> > value, otherwise we wouldn't be in absolute mode, would we?)
> > ...
> >
> > > temp1_relative: true
> >
> > This is taking flexibility away from us, for no benefit that I can see.
> > Am I missing something?
>
> The application has to display something like "24 ?C below the limit",
> so how should it know that the 70?C should be named "the limit"?

OK, I get your point now. We have to think about how applications will
present the information to the user. Admittedly my proposal doesn't
address this part of the problem.

> To use an example, my CPU has these entries like these:
> temp1_input: 29875
> temp1_max: 70000
> temp1_crit: 95000
> temp1_crit_hyst: 92500
>
> How should these entries be displayed?
> (we know that: "40.1 ?C below limit", "limit", "25 ?C above limit" etc.)
>
> But what algorithm should the application (or libsensors) use to create
> those labels? If we have "temp1_relative: 70000", then this happens to
> be the "max" limit; but what if some CPU vendor decides to define, e.g.,
> the value 0 as the "normal" operating temperatire, so that the entries
> would look like this:
> temp1_input: -1000
> temp1_max: 40000
> temp1_relative: 0
> Should the values be labeled as "1 ?C below normal" and "40 ?C above
> normal", and how should the application know that 0 is to be labeled
> "normal"? It might make more sense to display the temperature just as
> "41 ?C below max", in which case the actual value of temp1_relative is
> not used at all.

Except that there may be no temp1_max, just a temperature value
relative to the "normal" operating point of the CPU. In that case we
can't fallback to the max limit.

Even your initial proposal doesn't work there yet: the hwmon interface
has no standard name for "normal operating temperature", so we can't put
that name in temp#_relative. And again there's the (potential) case
where we don't know what the reported temperature value is relative to.

I'm wondering if this would make sense to (ab)use the temp#_label
string for that. Or maybe create a new label (temp#_relative_label or
similar) but I'm not sure how we would integrate this into libsensors
and applications. In particular I am worried about translation issues
if we make the drivers too verbose.

> "Relative" means that any value is meaningful only in comparison with
> other values/limits, so it does not make sense to declare one point on
> the scale as base.

The _hardware_ does use one point of the scale as the base, not us. We
have to deal with what the hardware implements. If the base has a
meaning (normal operating temperature, or critical temperature, etc.)
we have to let the user know somehow.

> > Additionally it wouldn't fit in libsensors as it exists today.
>
> Then the best bet would probably be an entry like temp#_unit, with
> 0 = absolute ?C (default); 1 = relative ?C or ?K; other values
> "unknown". Even if some silly scale is introduced later, applications
> that read this entry then know that they must not display a unit like ?C
> for unknown unit specifications.

This could work, yes. Note that current drivers and libsensors don't
have/know about this file yet, and they generally use an absolute ?C
scale. So the absence of temp#_unit file would be interpreted exactly
as if the file was there and contained value 0.

(I'd rather name that file temp#_scale - but that's an implementation
detail.)

Thanks,
--
Jean Delvare

2009-11-27 13:03:28

by Clemens Ladisch

[permalink] [raw]
Subject: Re: [PATCH v3] k10temp: temperature sensor for AMD Family 10h/11h CPUs

Jean Delvare wrote:
> On Wed, 25 Nov 2009 10:51:38 +0100, Clemens Ladisch wrote:
> > temp1_input: -1000
> > temp1_max: 40000
> > temp1_relative: 0
> > Should the values be labeled as "1 °C below normal" and "40 °C above
> > normal", and how should the application know that 0 is to be labeled
> > "normal"? It might make more sense to display the temperature just as
> > "41 °C below max", in which case the actual value of temp1_relative is
> > not used at all.
>
> Except that there may be no temp1_max, just a temperature value
> relative to the "normal" operating point of the CPU. In that case we
> can't fallback to the max limit.
>
> Even your initial proposal doesn't work there yet: the hwmon interface
> has no standard name for "normal operating temperature", so we can't put
> that name in temp#_relative. [...]
> If the base has a meaning (normal operating temperature, or critical
> temperature, etc.) we have to let the user know somehow.

I chose that example because "normal" does not exist; and it's a bad
example because "normal" actually has a meaning.

Better take the AMD CPUs: The base of all relative values is zero (by
definition), _not_ 70000, and the meaning of that base is just "70 °C
below the temperature at which the processor wants 100% cooling". This
base value is meaningless for any monitoring purposes.

If any point on the scale has a meaning, it should be reported with some
temp#_whatever file. However, the base itself does not necessarily have
any meaning.

As long as we have some corresponding _max or _crit limit that can be
used for comparisons, we do not need a base value. Only if there is
no known predefined limit do we need a temp#_relative value.

> Or maybe create a new label (temp#_relative_label or similar) but I'm
> not sure how we would integrate this into libsensors and applications.
> In particular I am worried about translation issues if we make the
> drivers too verbose.

All known CPUs with relative temperature scale also have known _max
limits, and I don't think that a CPU with relative scale and both
unknown _max and _crit will ever be designed. In other words,
temp#_relative* is not needed at the moment. I think we should not
try to define how the semantics of such an unknown scale can be
described.

> > > Additionally it wouldn't fit in libsensors as it exists today.
> >
> > Then the best bet would probably be an entry like temp#_unit, with
> > 0 = absolute °C (default); 1 = relative °C or °K; other values
> > "unknown". Even if some silly scale is introduced later, applications
> > that read this entry then know that they must not display a unit like °C
> > for unknown unit specifications.
>
> This could work, yes. Note that current drivers and libsensors don't
> have/know about this file yet, and they generally use an absolute °C
> scale. So the absence of temp#_unit file would be interpreted exactly
> as if the file was there and contained value 0.
>
> (I'd rather name that file temp#_scale - but that's an implementation
> detail.)

Like this?

--- a/Documentation/hwmon/sysfs-interface
+++ b/Documentation/hwmon/sysfs-interface
@@ -314,6 +314,19 @@ temp_reset_history
Reset temp_lowest and temp_highest for all sensors
WO

+temp[1-*]_scale Temperature scale type.
+ Integer
+ RO
+ 0: millidegrees Celsius (default if no _scale entry)
+ 1: relative millidegrees Celsius; see below
+ 2: millivolts; see below
+ other values: unknown
+ When scale=1 (relative), the temperature value 0 does not
+ correspond to zero degrees Celsius but to some unknown
+ temperature. In this case, temperate values should not be
+ interpreted or displayed as absolute values and make sense
+ only when compared to other values of the same channel.
+
Some chips measure temperature using external thermistors and an ADC, and
report the temperature measurement as a voltage. Converting this voltage
back to a temperature (or the other way around for limits) requires


Hmm, which drivers use millivolt temperatures?


Best regards,
Clemens

2009-11-27 15:43:31

by Jean Delvare

[permalink] [raw]
Subject: Re: [PATCH v4] k10temp: temperature sensor for AMD Family 10h/11h CPUs

Hi Clemens,

On Tue, 24 Nov 2009 09:43:39 +0100, Clemens Ladisch wrote:
> This adds a driver for the internal temperature sensor of AMD Family 10h
> and 11h CPUs.
>
> Signed-off-by: Clemens Ladisch <[email protected]>
> ---
> v3: added 'force' parameter for CPUs with buggy sensor; more documentation
> v4: added max/crit values, other changes suggested by Jean Delvare
>
> Documentation/hwmon/k10temp | 60 +++++++++
> drivers/hwmon/Kconfig | 12 +
> drivers/hwmon/Makefile | 1
> drivers/hwmon/k10temp.c | 197 ++++++++++++++++++++++++++++++++
> 4 files changed, 270 insertions(+)

Looks alright to me.

Acked-by: Jean Delvare <[email protected]>

If this is OK with you, I can add this patch to my hwmon tree and
schedule it for merge in kernel 2.6.33.

--
Jean Delvare

2009-11-28 07:49:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4] k10temp: temperature sensor for AMD Family 10h/11h CPUs

On Fri, 27 Nov 2009 16:43:33 +0100 Jean Delvare <[email protected]> wrote:

> Hi Clemens,
>
> On Tue, 24 Nov 2009 09:43:39 +0100, Clemens Ladisch wrote:
> > This adds a driver for the internal temperature sensor of AMD Family 10h
> > and 11h CPUs.
> >
> > Signed-off-by: Clemens Ladisch <[email protected]>
> > ---
> > v3: added 'force' parameter for CPUs with buggy sensor; more documentation
> > v4: added max/crit values, other changes suggested by Jean Delvare
> >
> > Documentation/hwmon/k10temp | 60 +++++++++
> > drivers/hwmon/Kconfig | 12 +
> > drivers/hwmon/Makefile | 1
> > drivers/hwmon/k10temp.c | 197 ++++++++++++++++++++++++++++++++
> > 4 files changed, 270 insertions(+)
>
> Looks alright to me.
>
> Acked-by: Jean Delvare <[email protected]>
>
> If this is OK with you, I can add this patch to my hwmon tree and
> schedule it for merge in kernel 2.6.33.

Sounds good, thanks.

2010-01-10 14:45:56

by Jean Delvare

[permalink] [raw]
Subject: Re: [PATCH v3] k10temp: temperature sensor for AMD Family 10h/11h CPUs

Hi Clemens,

Sorry for the late answer...

On Fri, 27 Nov 2009 14:03:29 +0100, Clemens Ladisch wrote:
> Jean Delvare wrote:
> > On Wed, 25 Nov 2009 10:51:38 +0100, Clemens Ladisch wrote:
> > > temp1_input: -1000
> > > temp1_max: 40000
> > > temp1_relative: 0
> > > Should the values be labeled as "1 °C below normal" and "40 °C above
> > > normal", and how should the application know that 0 is to be labeled
> > > "normal"? It might make more sense to display the temperature just as
> > > "41 °C below max", in which case the actual value of temp1_relative is
> > > not used at all.
> >
> > Except that there may be no temp1_max, just a temperature value
> > relative to the "normal" operating point of the CPU. In that case we
> > can't fallback to the max limit.
> >
> > Even your initial proposal doesn't work there yet: the hwmon interface
> > has no standard name for "normal operating temperature", so we can't put
> > that name in temp#_relative. [...]
> > If the base has a meaning (normal operating temperature, or critical
> > temperature, etc.) we have to let the user know somehow.
>
> I chose that example because "normal" does not exist; and it's a bad
> example because "normal" actually has a meaning.

Indeed, we may want to add an standard name for this.

> Better take the AMD CPUs: The base of all relative values is zero (by
> definition), _not_ 70000, and the meaning of that base is just "70 °C
> below the temperature at which the processor wants 100% cooling". This
> base value is meaningless for any monitoring purposes.

Correct. That being said, nothing prevents us from changing the base to
whatever we want it to be, pretty much by definition of relative
temperature reports. For example we could report 0 for temp1_max and
(register value - 70000) for temp1_input.

> If any point on the scale has a meaning, it should be reported with some
> temp#_whatever file. However, the base itself does not necessarily have
> any meaning.
>
> As long as we have some corresponding _max or _crit limit that can be
> used for comparisons, we do not need a base value. Only if there is
> no known predefined limit do we need a temp#_relative value.

I agree (again, unless we always change the base to be one of these
meaningful points... but I'm not sure if we want to do this.)

> > Or maybe create a new label (temp#_relative_label or similar) but I'm
> > not sure how we would integrate this into libsensors and applications.
> > In particular I am worried about translation issues if we make the
> > drivers too verbose.
>
> All known CPUs with relative temperature scale also have known _max
> limits,

Not correct. For some variants of the Intel Core, we do not known the
max limit. I'm not even sure if the max limit value exists.

> and I don't think that a CPU with relative scale and both
> unknown _max and _crit will ever be designed. In other words,
> temp#_relative* is not needed at the moment. I think we should not
> try to define how the semantics of such an unknown scale can be
> described.

I agree that we shouldn't waste time designing an interface for
something that doesn't exist. All I'm saying is that, if we are adding
something to the interface today and it's not trivial, we should make
it reasonably flexible so that we don't have to do it all again next
year or so. Otherwise the interface could turn quite complex and
unappealing.

> > > > Additionally it wouldn't fit in libsensors as it exists today.
> > >
> > > Then the best bet would probably be an entry like temp#_unit, with
> > > 0 = absolute °C (default); 1 = relative °C or °K; other values
> > > "unknown". Even if some silly scale is introduced later, applications
> > > that read this entry then know that they must not display a unit like °C
> > > for unknown unit specifications.
> >
> > This could work, yes. Note that current drivers and libsensors don't
> > have/know about this file yet, and they generally use an absolute °C
> > scale. So the absence of temp#_unit file would be interpreted exactly
> > as if the file was there and contained value 0.
> >
> > (I'd rather name that file temp#_scale - but that's an implementation
> > detail.)
>
> Like this?
>
> --- a/Documentation/hwmon/sysfs-interface
> +++ b/Documentation/hwmon/sysfs-interface
> @@ -314,6 +314,19 @@ temp_reset_history
> Reset temp_lowest and temp_highest for all sensors
> WO
>
> +temp[1-*]_scale Temperature scale type.
> + Integer
> + RO
> + 0: millidegrees Celsius (default if no _scale entry)
> + 1: relative millidegrees Celsius; see below
> + 2: millivolts; see below
> + other values: unknown
> + When scale=1 (relative), the temperature value 0 does not
> + correspond to zero degrees Celsius but to some unknown
> + temperature. In this case, temperate values should not be
> + interpreted or displayed as absolute values and make sense
> + only when compared to other values of the same channel.
> +
> Some chips measure temperature using external thermistors and an ADC, and
> report the temperature measurement as a voltage. Converting this voltage
> back to a temperature (or the other way around for limits) requires

Maybe, yes. I am a little worried that older versions of libsensors
will ignore this attribute. The good thing about this is that users
will still get some value until they upgrade. The bad thing is that
they will not know that the value isn't absolute. They are likely to
get frightened by unexpected values and/or to complain to us about them.

I am wondering if a totally separate channel type wouldn't be
preferable. The pros and cons would be inverted of course: older
versions of libsensors would have zero support for that, and all
applications would have to be updated to support it, but at least the
meaning of the value would be totally clear. This would come at the
price of some code duplication both in libsensors and applications
though.

I guess it basically depends whether we want to consider a thermal
margin as a "temperature measurement except that it's relative" or as
something completely separate. Honestly, I've been thinking about this
for some time now and I simply don't know what we'd rather do :(

> Hmm, which drivers use millivolt temperatures?

vt1211, vt8231, pc87360. These devices use thermistors for temperature
measurements, they measure the voltage at the pin (exactly as for a
voltage input) and leave it to user-space to turn the voltage value
back into a temperature value. The exact conversion formula depends on
the thermistor model and the value of the resistor used for the second
part of the bridge.

Note that virtually every chip monitoring voltages can be used to
monitor temperature instead, using thermistors. And this was seen once
already:
http://article.gmane.org/gmane.linux.drivers.sensors/14564/
We do not have support for this at the moment though.

--
Jean Delvare

2010-01-15 10:06:46

by Clemens Ladisch

[permalink] [raw]
Subject: Re: [PATCH v3] k10temp: temperature sensor for AMD Family 10h/11h CPUs

Jean Delvare wrote:
> On Fri, 27 Nov 2009 14:03:29 +0100, Clemens Ladisch wrote:
> > +temp[1-*]_scale Temperature scale type.
> > + 0: millidegrees Celsius (default if no _scale entry)
> > + 1: relative millidegrees Celsius; see below
> > + 2: millivolts; see below
> > + other values: unknown
>
> Maybe, yes. I am a little worried that older versions of libsensors
> will ignore this attribute. The good thing about this is that users
> will still get some value until they upgrade. The bad thing is that
> they will not know that the value isn't absolute. They are likely to
> get frightened by unexpected values and/or to complain to us about them.
>
> I am wondering if a totally separate channel type wouldn't be
> preferable. The pros and cons would be inverted of course: older
> versions of libsensors would have zero support for that, and all
> applications would have to be updated to support it, but at least the
> meaning of the value would be totally clear. This would come at the
> price of some code duplication both in libsensors and applications
> though.
>
> I guess it basically depends whether we want to consider a thermal
> margin as a "temperature measurement except that it's relative" or as
> something completely separate.

It's different from the millidegree/millivolt issue; millivolts can be
transparently converted by libsensors, while relative values must be
handled/displayed differently by the application. So I think at least
in the libsensors API, relative values should be different.

(In any case, we should add temp#_scale at least for millivolts.)

> Honestly, I've been thinking about this
> for some time now and I simply don't know what we'd rather do :(

The sysfs interface is a somewhat internal interface; I think the main
question is whether old userspace (old libsensors or old apps using a
new libsensors) should be able to see relative values without knowing
that they are relative.

Coretemp and k10temp already exist and show relative values. If we
introduced a new channel for relative values now, we would still have
problems with those drivers, so it might already be too late to avoid
problems for old userspace.


Best regards,
Clemens

2010-01-15 13:31:46

by Jean Delvare

[permalink] [raw]
Subject: Re: [PATCH v3] k10temp: temperature sensor for AMD Family 10h/11h CPUs

Hi Clemens,

On Fri, 15 Jan 2010 10:57:58 +0100, Clemens Ladisch wrote:
> Jean Delvare wrote:
> > On Fri, 27 Nov 2009 14:03:29 +0100, Clemens Ladisch wrote:
> > > +temp[1-*]_scale Temperature scale type.
> > > + 0: millidegrees Celsius (default if no _scale entry)
> > > + 1: relative millidegrees Celsius; see below
> > > + 2: millivolts; see below
> > > + other values: unknown
> >
> > Maybe, yes. I am a little worried that older versions of libsensors
> > will ignore this attribute. The good thing about this is that users
> > will still get some value until they upgrade. The bad thing is that
> > they will not know that the value isn't absolute. They are likely to
> > get frightened by unexpected values and/or to complain to us about them.
> >
> > I am wondering if a totally separate channel type wouldn't be
> > preferable. The pros and cons would be inverted of course: older
> > versions of libsensors would have zero support for that, and all
> > applications would have to be updated to support it, but at least the
> > meaning of the value would be totally clear. This would come at the
> > price of some code duplication both in libsensors and applications
> > though.
> >
> > I guess it basically depends whether we want to consider a thermal
> > margin as a "temperature measurement except that it's relative" or as
> > something completely separate.
>
> It's different from the millidegree/millivolt issue; millivolts can be
> transparently converted by libsensors, while relative values must be
> handled/displayed differently by the application. So I think at least
> in the libsensors API, relative values should be different.
>
> (In any case, we should add temp#_scale at least for millivolts.)

I agree it would make sense. So far, libsensors worked without it,
thanks to a large default configuration file that included a default
voltage->temperature conversion for each affected chip. Now that we
have opted for a much smaller configuration file, completely weird
temperature values will be displayed by default. Ideally, libsensors
would refuse to display temperature values that are reported in mV in
the absence of a corresponding conversion formula in the configuration
file. I'm not sure how easy that would be to implement in libsensors
though.

> > Honestly, I've been thinking about this
> > for some time now and I simply don't know what we'd rather do :(
>
> The sysfs interface is a somewhat internal interface;

NO, it is not. I would love to consider it internal, and we certainly
discourage people from accessing the sysfs interface directly. However
some applications _do_ access sysfs directly (fancontrol and pwmconfig
come to mind but there are others.) On top of that, older and newer
versions of libsensors are being used. So we have to preserve
compatibility in every change we make to the sysfs interface. Thus you
can't claim it is internal.

> I think the main
> question is whether old userspace (old libsensors or old apps using a
> new libsensors) should be able to see relative values without knowing
> that they are relative.

Yes, this is the main question. There are pros and cons both ways, as I
explained before.

> Coretemp and k10temp already exist and show relative values.

Not really. There is a significant difference between showing absolute
values which may be incorrect because we are unsure of the base point
(coretemp), showing relative values which are arbitrarily offset to
look like absolute values (k10temp) and showing totally relative
values (thermal margins actually.) In the former two cases, there can
be some confusion for the user (for example when comparing two sensor
values in the same machine) but I don't expect the user to be
frightened. While showing "negative temperatures" would probably
frighten the users. Showing values which decrease when temperature
increases (which is still an option) would also be a lot more confusing
than what we have today.

As a matter of fact, we will have to handle the 3 cases I just
described differently, at least to some extent, for compatibility
reasons (see below.)

> If we
> introduced a new channel for relative values now, we would still have
> problems with those drivers, so it might already be too late to avoid
> problems for old userspace.

We can't change old user-space by definition. All we can do is think of
how it will react to sysfs interface changes.

One thing you must keep in mind is that trading one form of imperfection
for another is not considered OK. Users get used to the initial form of
imperfection and will still see the change as a regression (I do
software maintenance for a living if you couldn't tell.) This is
especially true when things have been the way they are for long (which
is the case of coretemp.) For k10temp we have a grace period because
the driver is still new and its users are few. But this won't last.

Concretely, this means that, just because coretemp doesn't always
report correct temperature values, we can't change it to exclusively
report raw thermal margins today. Whatever change we come up with, it
shouldn't alter what older versions of libsensors would report.

I am fine with your temp[1-*]_scale proposal, but this is only one
piece of the puzzle. We must also define what newer versions of
libsensors will do with the value in that file, both for older drivers
such as coretemp (for which we can't change the interface too much for
compatibility reasons) and for hypothetical new drivers reporting truly
relative temperatures (with k10temp being an intermediate case.)

I suspect we will have change the output of "sensors" too. So far we
did not print any category names, under the assumption that the unit
used for each value was sufficient. Now if °C can mean absolute
temperature or (relative) thermal margin, this becomes ambiguous.

I've created a new ticket on lm-sensors.org to track this issue:
http://www.lm-sensors.org/ticket/2376

(Sorry, the site is awfully slow these days, when responsive at all...)

I can create a wiki account for you if you want to keep participating
in the lm-sensors project.

--
Jean Delvare