2019-12-15 17:46:24

by Guenter Roeck

[permalink] [raw]
Subject: [PATCH v2 0/1] Summary: hwmon driver for temperature sensors on SATA drives

In the past, several attempts have been made to add support for reporting
SCSI/[S]ATA drive temperatures to the Linux kernel. This is desirable to
have a means to report drive temperatures to userspace without root
privileges and in a standard format, but also to be able to tie reported
temperatures with the thermal subsystem.

The most recent attempt was [1] by Linus Walleij. It went through a total
of seven iterations. At the end, it was rejected for a number of reasons;
see the provided link for details. This implementation resides in the
SCSI core. It originally resided in libata but was moved to SCSI per
maintainer request, where it was ultimately rejected.

The feedback on this approach suggests to use the SCSI Temperature log page
[0x0d] as means to access drive temperature information. It is unknown
if this is implemented in any real SCSI drive. The feedback also suggests to
obtain temperature from ATA drives, convert it into the SCSI temperature log
page in libata-scsi, and to use that information in a hardware monitoring
driver. The format and method to do this is documented in [3]. This is not
currently implemented in the Linux kernel.

An earlier submission of a driver to report SCSI/SATA drive temperatures
was made back in 2009 by Constantin Baranov [2]. This submission resides
in the hardware monitoring subsystem. It does not rely on changes in the
SCSI subsystem or in libata-scsi. Instead, it registers itself with the
SCSI subsystem using scsi_register_interface(). It was rejected primarily
because it executes ATA passthrough commands without verification that it
is actually connected to an ATA drive.

Both submissions use SMART attributes to read drive temperature information.
[1] also tries to identify temperature limits from those attributes.
Unfortunately, SMART attributes are not well defined, resulting in relative
complex code trying to identify the exact format of the reported data.

With the available information and feedback, we can make a number of
observations and conclusions.
a) Using available (S)ATA drive temperature information and convert it to
a SCSI log page is an interesting idea. On the downside, it would add a
substantial amount of complexity to libata-scsi. The code would either
have to be optional, or it would have to be built into the kernel even
if it is never used on a given system. Without access to SCSI drives
supporting this feature, it would be all but impossible to test the code
against such a drive. It would neither be possible to test correctness
of the code in libata-scsi nor in the driver using that information.
Overall it would be much easier and much less risky to implement such
code on the receiving side (ie in a driver reporting the temperatures)
instead of trying to convert the information from one format to another
first. In summary, it is neither practical nor feasible. On top of that,
there is no guarantee that code implementing this functionality would
ever be accepted into the kernel for this very reason.
b) The code needed to read and analyze SCSI temperature log pages is quite
complex (see smartmontools [5]). There is no existing support code
in the Linux kernel; such code would have to be written. This makes
the approach discussed in a) even more risky and less practical.
c) Overall, any attempt to report temperature information for anything
but SATA drives in the kernel is not practical due to the complexity
involved, and due to the inability to test the resulting code with
non-SATA drives.
d) Using SMART data for anything but basic temperature reporting is not
really feasible due to the lack of standardization. Any attempt to do
this would add a substantial amount of code, ambiguity, and risk.

This submission implements a driver to report the temperature of SATA
drives through the hardware monitoring subsystem. It is implemented as
stand-alone driver in the hardware monitoring subsystem. The driver uses
the mechanism from submission [1] to register with the SCSI subsystem.
By using this mechanism, changes in the SCSI or ATA subsystems are not
required. To reduce risk and complexity, it only instantiates after
reliably validating that it is connected to a SATA drive. It does not
attempt to report the temperature of non-SATA drives.

The driver uses the SCT Command Transport feature set as specified in
ATA8-ACS [4] to read and report the temperature as well as temperature
limits and lowest/highest temperature information (if available) for
SATA drives. If a drive does not support SCT Command Transport, the driver
attempts to access a limited set of well known SMART attributes to read
the drive temperature. In that case, only the current drive temperature
is reported.

---
v2: scsi_cmd variable is no longer static
Fixed drive name in Kconfig
Describe heuristics used to select SCT or SMART in commit message
Added Reviewed-by: from Linus Walleij

Note:
I thought about waiting for more feedback, but maybe improvements
can be made with follow-up patches.

---
References:
[1] https://patchwork.kernel.org/patch/10688021/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] http://www.t10.org/cgi-bin/ac.pl?t=f&f=sat5r02.pdf
Information technology - SCSI / ATA Translation - 5 (SAT-5),
section 10.3.8 (Temperature log page).
[4] http://www.t13.org/documents/uploadeddocuments/docs2008/d1699r6a-ata8-acs.pdf
ANS T13/1699-D "Information technology - AT Attachment 8 - ATA/ATAPI Command
Set (ATA8-ACS)"
[5] https://github.com/mirror/smartmontools.git


2019-12-15 17:46:27

by Guenter Roeck

[permalink] [raw]
Subject: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

Reading the hard drive temperature has been supported for years
by userspace tools such as smarttools or hddtemp. The downside of
such tools is that they need to run with super-user privilege, that
the temperatures are not reported by standard tools such as 'sensors'
or 'libsensors', and that drive temperatures are not available for use
in the kernel's thermal subsystem.

This driver solves this problem by adding support for reading the
temperature of SATA drives from the kernel using the hwmon API and
by adding a temperature zone for each drive.

With this driver, the hard disk temperature can be read using the
unprivileged 'sensors' application:

$ sensors satatemp-scsi-1-0
satatemp-scsi-1-0
Adapter: SCSI adapter
temp1: +23.0°C

or directly from sysfs:

$ grep . /sys/class/hwmon/hwmon9/{name,temp1_input}
/sys/class/hwmon/hwmon9/name:satatemp
/sys/class/hwmon/hwmon9/temp1_input:23000

If the drive supports SCT transport and reports temperature limits,
those are reported as well.

satatemp-scsi-0-0
Adapter: SCSI adapter
temp1: +27.0°C (low = +0.0°C, high = +60.0°C)
(crit low = -41.0°C, crit = +85.0°C)
(lowest = +23.0°C, highest = +34.0°C)

The driver attempts to use SCT Command Transport to read the drive
temperature. If the SCT Command Transport feature set is not available,
or if it does not report the drive temperature, drive temperatures may
be readable through SMART attributes. Since SMART attributes are not well
defined, this method is only used as fallback mechanism.

Cc: Chris Healy <[email protected]>
Cc: Linus Walleij <[email protected]>
Cc: Martin K. Petersen <[email protected]>
Cc: Bart Van Assche <[email protected]>
Reviewed-by: Linus Walleij <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
---
v2: scsi_cmd variable is no longer static
Fixed drive name in Kconfig
Describe heuristics used to select SCT or SMART in commit message
Added Reviewed-by: from Linus Walleij

Documentation/hwmon/index.rst | 1 +
Documentation/hwmon/satatemp.rst | 48 +++
drivers/hwmon/Kconfig | 10 +
drivers/hwmon/Makefile | 1 +
drivers/hwmon/satatemp.c | 575 +++++++++++++++++++++++++++++++
5 files changed, 635 insertions(+)
create mode 100644 Documentation/hwmon/satatemp.rst
create mode 100644 drivers/hwmon/satatemp.c

diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst
index 230ad59b462b..ecf1832dd013 100644
--- a/Documentation/hwmon/index.rst
+++ b/Documentation/hwmon/index.rst
@@ -133,6 +133,7 @@ Hardware Monitoring Kernel Drivers
pxe1610
pwm-fan
raspberrypi-hwmon
+ satatemp
sch5627
sch5636
scpi-hwmon
diff --git a/Documentation/hwmon/satatemp.rst b/Documentation/hwmon/satatemp.rst
new file mode 100644
index 000000000000..59b105f3c79a
--- /dev/null
+++ b/Documentation/hwmon/satatemp.rst
@@ -0,0 +1,48 @@
+Kernel driver satatemp
+======================
+
+
+References
+----------
+
+ANS T13/1699-D
+Information technology - AT Attachment 8 - ATA/ATAPI Command Set (ATA8-ACS)
+
+ANS Project T10/BSR INCITS 513
+Information technology - SCSI Primary Commands - 4 (SPC-4)
+
+ANS Project INCITS 557
+Information technology - SCSI / ATA Translation - 5 (SAT-5)
+
+
+Description
+-----------
+
+This driver supports reporting the temperature of SATA drives.
+If supported, it uses the SCT Command Transport feature to read
+the current drive temperature and, if available, temperature limits
+as well as historic minimum and maximum temperatures. If SCT Command
+Transport is not supported, the driver uses SMART attributes to read
+the drive temperature.
+
+
+Sysfs entries
+-------------
+
+Only the temp1_input attribute is always available. Other attributes are
+available only if reported by the drive. All temperatures are reported in
+milli-degrees Celsius.
+
+======================= =====================================================
+temp1_input Current drive temperature
+temp1_lcrit Minimum temperature limit. Operating the device below
+ this temperature may cause physical damage to the
+ device.
+temp1_min Minimum recommended continuous operating limit
+temp1_max Maximum recommended continuous operating temperature
+temp1_crit Maximum temperature limit. Operating the device above
+ this temperature may cause physical damage to the
+ device.
+temp1_lowest Minimum temperature seen this power cycle
+temp1_highest Maximum temperature seen this power cycle
+======================= =====================================================
diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig
index 13a6b4afb4b3..97acf4ebd5ca 100644
--- a/drivers/hwmon/Kconfig
+++ b/drivers/hwmon/Kconfig
@@ -1346,6 +1346,16 @@ config SENSORS_RASPBERRYPI_HWMON
This driver can also be built as a module. If so, the module
will be called raspberrypi-hwmon.

+config SENSORS_SATATEMP
+ tristate "SATA hard disk drives with temperature sensors"
+ depends on SCSI && ATA
+ help
+ If you say yes you get support for the temperature sensor on
+ SATA hard disk drives.
+
+ This driver can also be built as a module. If so, the module
+ will be called satatemp.
+
config SENSORS_SHT15
tristate "Sensiron humidity and temperature sensors. SHT15 and compat."
depends on GPIOLIB || COMPILE_TEST
diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile
index 40c036ea45e6..fe55b8f76af9 100644
--- a/drivers/hwmon/Makefile
+++ b/drivers/hwmon/Makefile
@@ -148,6 +148,7 @@ obj-$(CONFIG_SENSORS_S3C) += s3c-hwmon.o
obj-$(CONFIG_SENSORS_SCH56XX_COMMON)+= sch56xx-common.o
obj-$(CONFIG_SENSORS_SCH5627) += sch5627.o
obj-$(CONFIG_SENSORS_SCH5636) += sch5636.o
+obj-$(CONFIG_SENSORS_SATATEMP) += satatemp.o
obj-$(CONFIG_SENSORS_SHT15) += sht15.o
obj-$(CONFIG_SENSORS_SHT21) += sht21.o
obj-$(CONFIG_SENSORS_SHT3x) += sht3x.o
diff --git a/drivers/hwmon/satatemp.c b/drivers/hwmon/satatemp.c
new file mode 100644
index 000000000000..9608716d422c
--- /dev/null
+++ b/drivers/hwmon/satatemp.c
@@ -0,0 +1,575 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Hwmon client for SATA hard disk drives with temperature sensors
+ * Copyright (C) 2019 Zodiac Inflight Innovations
+ *
+ * With input from:
+ * Hwmon client for S.M.A.R.T. hard disk drives with temperature sensors.
+ * (C) 2018 Linus Walleij
+ *
+ * hwmon: Driver for SCSI/ATA temperature sensors
+ * by Constantin Baranov <[email protected]>, submitted September 2009
+ *
+ * The primary means to read hard drive temperatures and temperature limits
+ * is the SCT Command Transport feature set as specified in ATA8-ACS.
+ * It can be used to read the current drive temperature, temperature limits,
+ * and historic minimum and maximum temperatures. The SCT Command Transport
+ * feature set is documented in "AT Attachment 8 - ATA/ATAPI Command Set
+ * (ATA8-ACS)".
+ *
+ * If the SCT Command Transport feature set is not available, drive temperatures
+ * may be readable through SMART attributes. Since SMART attributes are not well
+ * defined, this method is only used as fallback mechanism.
+ *
+ * There are three SMART attributes which may report drive temperatures.
+ * Those are defined as follows (from
+ * http://www.cropel.com/library/smart-attribute-list.aspx).
+ *
+ * 190 Temperature Temperature, monitored by a sensor somewhere inside
+ * the drive. Raw value typicaly holds the actual
+ * temperature (hexadecimal) in its rightmost two digits.
+ *
+ * 194 Temperature Temperature, monitored by a sensor somewhere inside
+ * the drive. Raw value typicaly holds the actual
+ * temperature (hexadecimal) in its rightmost two digits.
+ *
+ * 231 Temperature Temperature, monitored by a sensor somewhere inside
+ * the drive. Raw value typicaly holds the actual
+ * temperature (hexadecimal) in its rightmost two digits.
+ *
+ * Wikipedia defines attributes a bit differently.
+ *
+ * 190 Temperature Value is equal to (100-temp. °C), allowing manufacturer
+ * Difference or to set a minimum threshold which corresponds to a
+ * Airflow maximum temperature. This also follows the convention of
+ * Temperature 100 being a best-case value and lower values being
+ * undesirable. However, some older drives may instead
+ * report raw Temperature (identical to 0xC2) or
+ * Temperature minus 50 here.
+ * 194 Temperature or Indicates the device temperature, if the appropriate
+ * Temperature sensor is fitted. Lowest byte of the raw value contains
+ * Celsius the exact temperature value (Celsius degrees).
+ * 231 Life Left Indicates the approximate SSD life left, in terms of
+ * (SSDs) or program/erase cycles or available reserved blocks.
+ * Temperature A normalized value of 100 represents a new drive, with
+ * a threshold value at 10 indicating a need for
+ * replacement. A value of 0 may mean that the drive is
+ * operating in read-only mode to allow data recovery.
+ * Previously (pre-2010) occasionally used for Drive
+ * Temperature (more typically reported at 0xC2).
+ *
+ * Common denominator is that the first raw byte reports the temperature
+ * in degrees C on almost all drives. Some drives may report a fractional
+ * temperature in the second raw byte.
+ *
+ * Known exceptions (from libatasmart):
+ * - SAMSUNG SV0412H and SAMSUNG SV1204H) report the temperature in 10th
+ * degrees C in the first two raw bytes.
+ * - A few Maxtor drives report an unknown or bad value in attribute 194.
+ * - Certain Apple SSD drives report an unknown value in attribute 190.
+ * Only certain firmware versions are affected.
+ *
+ * Those exceptions affect older ATA drives and are currently ignored.
+ * Also, the second raw byte (possibly reporting the fractional temperature)
+ * is currently ignored.
+ *
+ * Many drives also report temperature limits in additional SMART data raw
+ * bytes. The format of those is not well defined and varies widely.
+ * The driver does not currently attempt to report those limits.
+ *
+ * According to data in smartmontools, attribute 231 is rarely used to report
+ * drive temperatures. At the same time, several drives report SSD life left
+ * in attribute 231, but do not support temperature sensors. For this reason,
+ * attribute 231 is currently ignored.
+ *
+ * Following above definitions, temperatures are reported as follows.
+ * If SCT Command Transport is supported, it is used to read the
+ * temperature and, if available, temperature limits.
+ * - Otherwise, if SMART attribute 194 is supported, it is used to read
+ * the temperature.
+ * - Otherwise, if SMART attribute 190 is supported, it is used to read
+ * the temperature.
+ */
+
+#include <linux/ata.h>
+#include <linux/bits.h>
+#include <linux/device.h>
+#include <linux/hwmon.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_driver.h>
+#include <scsi/scsi_proto.h>
+
+struct satatemp_data {
+ struct list_head list; /* list of instantiated devices */
+ struct mutex lock; /* protect data buffer accesses */
+ struct scsi_device *sdev; /* SCSI device */
+ struct device *dev; /* instantiating device */
+ struct device *hwdev; /* hardware monitoring device */
+ u8 smartdata[ATA_SECT_SIZE]; /* local buffer */
+ int (*get_temp)(struct satatemp_data *st, u32 attr, long *val);
+ bool have_temp_lowest; /* lowest temp in SCT status */
+ bool have_temp_highest; /* highest temp in SCT status */
+ bool have_temp_min; /* have min temp */
+ bool have_temp_max; /* have max temp */
+ bool have_temp_lcrit; /* have lower critical limit */
+ bool have_temp_crit; /* have critical limit */
+ int temp_min; /* min temp */
+ int temp_max; /* max temp */
+ int temp_lcrit; /* lower critical limit */
+ int temp_crit; /* critical limit */
+};
+
+static LIST_HEAD(satatemp_devlist);
+
+#define ATA_MAX_SMART_ATTRS 30
+#define SMART_TEMP_PROP_190 190
+#define SMART_TEMP_PROP_194 194
+
+#define SCT_STATUS_REQ_ADDR 0xe0
+#define SCT_STATUS_VERSION_LOW 0 /* log byte offsets */
+#define SCT_STATUS_VERSION_HIGH 1
+#define SCT_STATUS_TEMP 200
+#define SCT_STATUS_TEMP_LOWEST 201
+#define SCT_STATUS_TEMP_HIGHEST 202
+#define SCT_READ_LOG_ADDR 0xe1
+#define SMART_READ_LOG 0xd5
+#define SMART_WRITE_LOG 0xd6
+
+#define INVALID_TEMP 0x80
+
+#define temp_is_valid(temp) ((temp) != INVALID_TEMP)
+#define temp_from_sct(temp) (((s8)(temp)) * 1000)
+
+static inline bool ata_id_smart_supported(u16 *id)
+{
+ return id[ATA_ID_COMMAND_SET_1] & BIT(0);
+}
+
+static inline bool ata_id_smart_enabled(u16 *id)
+{
+ return id[ATA_ID_CFS_ENABLE_1] & BIT(0);
+}
+
+static int satatemp_scsi_command(struct satatemp_data *st,
+ u8 ata_command, u8 feature,
+ u8 lba_low, u8 lba_mid, u8 lba_high)
+{
+ u8 scsi_cmd[MAX_COMMAND_SIZE];
+ int data_dir;
+
+ memset(scsi_cmd, 0, sizeof(scsi_cmd));
+ scsi_cmd[0] = ATA_16;
+ if (ata_command == ATA_CMD_SMART && feature == SMART_WRITE_LOG) {
+ scsi_cmd[1] = (5 << 1); /* PIO Data-out */
+ /*
+ * No off.line or cc, write to dev, block count in sector count
+ * field.
+ */
+ scsi_cmd[2] = 0x06;
+ data_dir = DMA_TO_DEVICE;
+ } else {
+ scsi_cmd[1] = (4 << 1); /* PIO Data-in */
+ /*
+ * No off.line or cc, read from dev, block count in sector count
+ * field.
+ */
+ scsi_cmd[2] = 0x0e;
+ data_dir = DMA_FROM_DEVICE;
+ }
+ scsi_cmd[4] = feature;
+ scsi_cmd[6] = 1; /* 1 sector */
+ scsi_cmd[8] = lba_low;
+ scsi_cmd[10] = lba_mid;
+ scsi_cmd[12] = lba_high;
+ scsi_cmd[14] = ata_command;
+
+ return scsi_execute_req(st->sdev, scsi_cmd, data_dir,
+ st->smartdata, ATA_SECT_SIZE, NULL, HZ, 5,
+ NULL);
+}
+
+static int satatemp_ata_command(struct satatemp_data *st, u8 feature, u8 select)
+{
+ return satatemp_scsi_command(st, ATA_CMD_SMART, feature, select,
+ ATA_SMART_LBAM_PASS, ATA_SMART_LBAH_PASS);
+}
+
+static int satatemp_get_smarttemp(struct satatemp_data *st, u32 attr,
+ long *temp)
+{
+ u8 *buf = st->smartdata;
+ bool have_temp = false;
+ u8 temp_raw;
+ u8 csum;
+ int err;
+ int i;
+
+ err = satatemp_ata_command(st, ATA_SMART_READ_VALUES, 0);
+ if (err)
+ return err;
+
+ /* Checksum the read value table */
+ csum = 0;
+ for (i = 0; i < ATA_SECT_SIZE; i++)
+ csum += buf[i];
+ if (csum) {
+ dev_dbg(&st->sdev->sdev_gendev,
+ "checksum error reading SMART values\n");
+ return -EIO;
+ }
+
+ for (i = 0; i < ATA_MAX_SMART_ATTRS; i++) {
+ u8 *attr = buf + i * 12;
+ int id = attr[2];
+
+ if (!id)
+ continue;
+
+ if (id == SMART_TEMP_PROP_190) {
+ temp_raw = attr[7];
+ have_temp = true;
+ }
+ if (id == SMART_TEMP_PROP_194) {
+ temp_raw = attr[7];
+ have_temp = true;
+ break;
+ }
+ }
+
+ if (have_temp) {
+ *temp = temp_raw * 1000;
+ return 0;
+ }
+
+ return -ENXIO;
+}
+
+static int satatemp_get_scttemp(struct satatemp_data *st, u32 attr, long *val)
+{
+ u8 *buf = st->smartdata;
+ int err;
+
+ err = satatemp_ata_command(st, SMART_READ_LOG, SCT_STATUS_REQ_ADDR);
+ if (err)
+ return err;
+ switch (attr) {
+ case hwmon_temp_input:
+ *val = temp_from_sct(buf[SCT_STATUS_TEMP]);
+ break;
+ case hwmon_temp_lowest:
+ *val = temp_from_sct(buf[SCT_STATUS_TEMP_LOWEST]);
+ break;
+ case hwmon_temp_highest:
+ *val = temp_from_sct(buf[SCT_STATUS_TEMP_HIGHEST]);
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ }
+ return err;
+}
+
+static int satatemp_identify(struct satatemp_data *st)
+{
+ struct scsi_device *sdev = st->sdev;
+ u8 *buf = st->smartdata;
+ bool is_ata, is_sata;
+ bool have_sct_data_table;
+ bool have_sct_temp;
+ bool have_smart;
+ bool have_sct;
+ u16 *ata_id;
+ u16 version;
+ long temp;
+ u8 *vpd;
+ int err;
+
+ /* bail out immediately if there is no inquiry data */
+ if (!sdev->inquiry || sdev->inquiry_len < 16)
+ return -ENODEV;
+
+ /*
+ * Inquiry data sanity checks (per SAT-5):
+ * - peripheral qualifier must be 0
+ * - peripheral device type must be 0x0 (Direct access block device)
+ * - SCSI Vendor ID is "ATA "
+ */
+ if (sdev->inquiry[0] ||
+ strncmp(&sdev->inquiry[8], "ATA ", 8))
+ return -ENODEV;
+
+ vpd = kzalloc(1024, GFP_KERNEL);
+ if (!vpd)
+ return -ENOMEM;
+
+ err = scsi_get_vpd_page(sdev, 0x89, vpd, 1024);
+ if (err) {
+ kfree(vpd);
+ return err;
+ }
+
+ /*
+ * More sanity checks.
+ * For VPD offsets and values see ANS Project INCITS 557,
+ * "Information technology - SCSI / ATA Translation - 5 (SAT-5)".
+ */
+ if (vpd[1] != 0x89 || vpd[2] != 0x02 || vpd[3] != 0x38 ||
+ vpd[36] != 0x34 || vpd[56] != ATA_CMD_ID_ATA) {
+ kfree(vpd);
+ return -ENODEV;
+ }
+ ata_id = (u16 *)&vpd[60];
+ is_ata = ata_id_is_ata(ata_id);
+ is_sata = ata_id_is_sata(ata_id);
+ have_sct = ata_id_sct_supported(ata_id);
+ have_sct_data_table = ata_id_sct_data_tables(ata_id);
+ have_smart = ata_id_smart_supported(ata_id) &&
+ ata_id_smart_enabled(ata_id);
+
+ kfree(vpd);
+
+ /* bail out if this is not a SATA device */
+ if (!is_ata || !is_sata)
+ return -ENODEV;
+ if (!have_sct)
+ goto skip_sct;
+
+ err = satatemp_ata_command(st, SMART_READ_LOG, SCT_STATUS_REQ_ADDR);
+ if (err)
+ goto skip_sct;
+
+ version = (buf[SCT_STATUS_VERSION_HIGH] << 8) |
+ buf[SCT_STATUS_VERSION_LOW];
+ if (version != 2 && version != 3)
+ goto skip_sct;
+
+ have_sct_temp = temp_is_valid(buf[SCT_STATUS_TEMP]);
+ if (!have_sct_temp)
+ goto skip_sct;
+
+ st->have_temp_lowest = temp_is_valid(buf[SCT_STATUS_TEMP_LOWEST]);
+ st->have_temp_highest = temp_is_valid(buf[SCT_STATUS_TEMP_HIGHEST]);
+
+ if (!have_sct_data_table)
+ goto skip_sct;
+
+ /* Request and read temperature history table */
+ memset(buf, '\0', sizeof(st->smartdata));
+ buf[0] = 5; /* data table command */
+ buf[2] = 1; /* read table */
+ buf[4] = 2; /* temperature history table */
+
+ err = satatemp_ata_command(st, SMART_WRITE_LOG, SCT_STATUS_REQ_ADDR);
+ if (err)
+ goto skip_sct_data;
+
+ err = satatemp_ata_command(st, SMART_READ_LOG, SCT_READ_LOG_ADDR);
+ if (err)
+ goto skip_sct_data;
+
+ /*
+ * Temperature limits per AT Attachment 8 -
+ * ATA/ATAPI Command Set (ATA8-ACS)
+ */
+ st->have_temp_max = temp_is_valid(buf[6]);
+ st->have_temp_crit = temp_is_valid(buf[7]);
+ st->have_temp_min = temp_is_valid(buf[8]);
+ st->have_temp_lcrit = temp_is_valid(buf[9]);
+
+ st->temp_max = temp_from_sct(buf[6]);
+ st->temp_crit = temp_from_sct(buf[7]);
+ st->temp_min = temp_from_sct(buf[8]);
+ st->temp_lcrit = temp_from_sct(buf[9]);
+
+skip_sct_data:
+ if (have_sct_temp) {
+ st->get_temp = satatemp_get_scttemp;
+ return 0;
+ }
+skip_sct:
+ if (!have_smart)
+ return -ENODEV;
+ st->get_temp = satatemp_get_smarttemp;
+ return satatemp_get_smarttemp(st, hwmon_temp_input, &temp);
+}
+
+static int satatemp_read(struct device *dev, enum hwmon_sensor_types type,
+ u32 attr, int channel, long *val)
+{
+ struct satatemp_data *st = dev_get_drvdata(dev);
+ int err = 0;
+
+ if (type != hwmon_temp)
+ return -EINVAL;
+
+ switch (attr) {
+ case hwmon_temp_input:
+ case hwmon_temp_lowest:
+ case hwmon_temp_highest:
+ mutex_lock(&st->lock);
+ err = st->get_temp(st, attr, val);
+ mutex_unlock(&st->lock);
+ break;
+ case hwmon_temp_lcrit:
+ *val = st->temp_lcrit;
+ break;
+ case hwmon_temp_min:
+ *val = st->temp_min;
+ break;
+ case hwmon_temp_max:
+ *val = st->temp_max;
+ break;
+ case hwmon_temp_crit:
+ *val = st->temp_crit;
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ }
+ return err;
+}
+
+static umode_t satatemp_is_visible(const void *data,
+ enum hwmon_sensor_types type,
+ u32 attr, int channel)
+{
+ const struct satatemp_data *st = data;
+
+ switch (type) {
+ case hwmon_temp:
+ switch (attr) {
+ case hwmon_temp_input:
+ return 0444;
+ case hwmon_temp_lowest:
+ if (st->have_temp_lowest)
+ return 0444;
+ break;
+ case hwmon_temp_highest:
+ if (st->have_temp_highest)
+ return 0444;
+ break;
+ case hwmon_temp_min:
+ if (st->have_temp_min)
+ return 0444;
+ break;
+ case hwmon_temp_max:
+ if (st->have_temp_max)
+ return 0444;
+ break;
+ case hwmon_temp_lcrit:
+ if (st->have_temp_lcrit)
+ return 0444;
+ break;
+ case hwmon_temp_crit:
+ if (st->have_temp_crit)
+ return 0444;
+ break;
+ default:
+ break;
+ }
+ break;
+ default:
+ break;
+ }
+ return 0;
+}
+
+static const struct hwmon_channel_info *satatemp_info[] = {
+ HWMON_CHANNEL_INFO(chip,
+ HWMON_C_REGISTER_TZ),
+ HWMON_CHANNEL_INFO(temp, HWMON_T_INPUT |
+ HWMON_T_LOWEST | HWMON_T_HIGHEST |
+ HWMON_T_MIN | HWMON_T_MAX |
+ HWMON_T_LCRIT | HWMON_T_CRIT),
+ NULL
+};
+
+static const struct hwmon_ops satatemp_ops = {
+ .is_visible = satatemp_is_visible,
+ .read = satatemp_read,
+};
+
+static const struct hwmon_chip_info satatemp_chip_info = {
+ .ops = &satatemp_ops,
+ .info = satatemp_info,
+};
+
+/*
+ * The device argument points to sdev->sdev_dev. Its parent is
+ * sdev->sdev_gendev, which we can use to get the scsi_device pointer.
+ */
+static int satatemp_add(struct device *dev, struct class_interface *intf)
+{
+ struct scsi_device *sdev = to_scsi_device(dev->parent);
+ struct satatemp_data *st;
+ int err;
+
+ st = kzalloc(sizeof(*st), GFP_KERNEL);
+ if (!st)
+ return -ENOMEM;
+
+ st->sdev = sdev;
+ st->dev = dev;
+ mutex_init(&st->lock);
+
+ if (satatemp_identify(st)) {
+ err = -ENODEV;
+ goto abort;
+ }
+
+ st->hwdev = hwmon_device_register_with_info(dev->parent, "satatemp",
+ st, &satatemp_chip_info,
+ NULL);
+ if (IS_ERR(st->hwdev)) {
+ err = PTR_ERR(st->hwdev);
+ goto abort;
+ }
+
+ list_add(&st->list, &satatemp_devlist);
+ return 0;
+
+abort:
+ kfree(st);
+ return err;
+}
+
+static void satatemp_remove(struct device *dev, struct class_interface *intf)
+{
+ struct satatemp_data *st, *tmp;
+
+ list_for_each_entry_safe(st, tmp, &satatemp_devlist, list) {
+ if (st->dev == dev) {
+ list_del(&st->list);
+ hwmon_device_unregister(st->hwdev);
+ kfree(st);
+ break;
+ }
+ }
+}
+
+static struct class_interface satatemp_interface = {
+ .add_dev = satatemp_add,
+ .remove_dev = satatemp_remove,
+};
+
+static int __init satatemp_init(void)
+{
+ return scsi_register_interface(&satatemp_interface);
+}
+
+static void __exit satatemp_exit(void)
+{
+ scsi_unregister_interface(&satatemp_interface);
+}
+
+module_init(satatemp_init);
+module_exit(satatemp_exit);
+
+MODULE_AUTHOR("Guenter Roeck <[email protected]>");
+MODULE_DESCRIPTION("ATA temperature monitor");
+MODULE_LICENSE("GPL");
--
2.17.1

2019-12-19 00:16:47

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives


Guenter,

> This driver solves this problem by adding support for reading the
> temperature of SATA drives from the kernel using the hwmon API and
> by adding a temperature zone for each drive.

My working tree is available here:

https://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git/log/?h=5.6/drivetemp

A few notes:

- Before applying your patch I did s/satatemp/drivetemp/

- I get a crash in the driver core during probe if the drivetemp module
is loaded prior to loading ahci or a SCSI HBA driver. This crash is
unrelated to my changes. Haven't had time to debug.

- I tweaked your ATA detection heuristics and now use the cached VPD
page 0x89 instead of fetching one from the device.

- I also added support for reading the temperature log page on SCSI
drives.

- Tested with a mixed bag of about 40 SCSI and SATA drives attached.

- I still think sensor naming needs work. How and where are the
"drivetemp-scsi-8-140" names generated?

I'll tinker some more but thought I'd share what I have for now.

--
Martin K. Petersen Oracle Linux Engineering

2019-12-19 00:34:28

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

On Wed, Dec 18, 2019 at 07:15:07PM -0500, Martin K. Petersen wrote:
>
> Guenter,
>
> > This driver solves this problem by adding support for reading the
> > temperature of SATA drives from the kernel using the hwmon API and
> > by adding a temperature zone for each drive.
>
> My working tree is available here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git/log/?h=5.6/drivetemp
>
> A few notes:
>
> - Before applying your patch I did s/satatemp/drivetemp/
>
> - I get a crash in the driver core during probe if the drivetemp module
> is loaded prior to loading ahci or a SCSI HBA driver. This crash is
> unrelated to my changes. Haven't had time to debug.
>
> - I tweaked your ATA detection heuristics and now use the cached VPD
> page 0x89 instead of fetching one from the device.
>
> - I also added support for reading the temperature log page on SCSI
> drives.
>
> - Tested with a mixed bag of about 40 SCSI and SATA drives attached.
>
> - I still think sensor naming needs work. How and where are the
> "drivetemp-scsi-8-140" names generated?
>
Quick one: In libsensors, outside the kernel. The naming is generic,
along the line of <driver name>-<bus name>-<bus index>-<slot>.

> I'll tinker some more but thought I'd share what I have for now.
>
Thanks for sharing. I'll be out on vacation until January 1. I'll look
at the code after I am back.

Guenter

> --
> Martin K. Petersen Oracle Linux Engineering

2019-12-19 07:39:44

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

Hi Martin,

On 12/18/19 4:15 PM, Martin K. Petersen wrote:
>
> Guenter,
>
>> This driver solves this problem by adding support for reading the
>> temperature of SATA drives from the kernel using the hwmon API and
>> by adding a temperature zone for each drive.
>
> My working tree is available here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git/log/?h=5.6/drivetemp
>

I had a quick look at the patch. Looks good overall. I think you should be able
to use the buffer allocated in struct drivetemp_data (named smartdata). Maybe
rename it to something more generic. If it needs to be dma-aligned, maybe we
can use kzalloc() to allocate it. Only reason I didn't use it for vpd was because
that needed 1024 bytes.

> A few notes:
>
> - Before applying your patch I did s/satatemp/drivetemp/
>
> - I get a crash in the driver core during probe if the drivetemp module
> is loaded prior to loading ahci or a SCSI HBA driver. This crash is
> unrelated to my changes. Haven't had time to debug.
>
Definitely something we'll need to look into. Do you have a traceback ?

Thanks,
Guenter

> - I tweaked your ATA detection heuristics and now use the cached VPD
> page 0x89 instead of fetching one from the device.
>
> - I also added support for reading the temperature log page on SCSI
> drives.
>
> - Tested with a mixed bag of about 40 SCSI and SATA drives attached.
>
> - I still think sensor naming needs work. How and where are the
> "drivetemp-scsi-8-140" names generated?
>
> I'll tinker some more but thought I'd share what I have for now.
>

2020-01-01 17:47:27

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

Hi Martin,

On 12/18/19 4:15 PM, Martin K. Petersen wrote:
>
> Guenter,
>
>> This driver solves this problem by adding support for reading the
>> temperature of SATA drives from the kernel using the hwmon API and
>> by adding a temperature zone for each drive.
>
> My working tree is available here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git/log/?h=5.6/drivetemp
>
> A few notes:
>
> - Before applying your patch I did s/satatemp/drivetemp/
>
> - I get a crash in the driver core during probe if the drivetemp module
> is loaded prior to loading ahci or a SCSI HBA driver. This crash is
> unrelated to my changes. Haven't had time to debug.
>

Any idea how I might be able to reproduce this ? So far I have been
unsuccessful.

Building drivetemp into the kernel, with ahci and everything SCSI
built as module, doesn't trigger the crash for me. This is with the
drivetemp patch (v3) as well as commit d188b0675b ("scsi: core: Add
sysfs attributes for VPD pages 0h and 89h") applied on top of v5.4.7.

Thanks,
Guenter

> - I tweaked your ATA detection heuristics and now use the cached VPD
> page 0x89 instead of fetching one from the device.
>
> - I also added support for reading the temperature log page on SCSI
> drives.
>
> - Tested with a mixed bag of about 40 SCSI and SATA drives attached.
>
> - I still think sensor naming needs work. How and where are the
> "drivetemp-scsi-8-140" names generated?
>
> I'll tinker some more but thought I'd share what I have for now.
>

2020-01-03 03:07:44

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives


Guenter,

>> - I get a crash in the driver core during probe if the drivetemp module
>> is loaded prior to loading ahci or a SCSI HBA driver. This crash is
>> unrelated to my changes. Haven't had time to debug.
>>
>
> Any idea how I might be able to reproduce this ? So far I have been
> unsuccessful.

I'm chipping away at a mountain of email. Will take a look tomorrow.

--
Martin K. Petersen Oracle Linux Engineering

2020-01-07 04:11:40

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives


Hi Guenter!

>> - I still think sensor naming needs work. How and where are the
>> "drivetemp-scsi-8-140" names generated?
>>
> Quick one: In libsensors, outside the kernel. The naming is generic,
> along the line of <driver name>-<bus name>-<bus index>-<slot>.

I understand that there are sensors that may not have an obvious
associated topology and therefore necessitate coming up with a suitable
naming or enumeration scheme. But in this case we already have a
well-defined SCSI device name. Any particular reason you don't shift the
chip.addr back and print the H:C:T:L format that you used as input?

However arcane H:C:T:L may seem, I think that predictable naming would
make things a lot easier for users that need to identify which device
matches which sensor...

--
Martin K. Petersen Oracle Linux Engineering

2020-01-07 13:02:00

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

On 1/6/20 8:10 PM, Martin K. Petersen wrote:
>
> Hi Guenter!
>
>>> - I still think sensor naming needs work. How and where are the
>>> "drivetemp-scsi-8-140" names generated?
>>>
>> Quick one: In libsensors, outside the kernel. The naming is generic,
>> along the line of <driver name>-<bus name>-<bus index>-<slot>.
>
> I understand that there are sensors that may not have an obvious
> associated topology and therefore necessitate coming up with a suitable
> naming or enumeration scheme. But in this case we already have a
> well-defined SCSI device name. Any particular reason you don't shift the
> chip.addr back and print the H:C:T:L format that you used as input?
>
> However arcane H:C:T:L may seem, I think that predictable naming would
> make things a lot easier for users that need to identify which device
> matches which sensor...
>

Not sure I understand. Do you mean to add "H:C:T:L" to "drivetemp" ?
That would make it something like "drivetemp:H:C:T:L-scsi-8-140".
Not sure if that is really useful, and it would at least be partially
redundant.

"scsi-8-140" is created by libsensors, so any change in that would
have to be made there, not in the kernel driver.

Guenter

2020-01-08 01:13:52

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives


Guenter,

> Any idea how I might be able to reproduce this ? So far I have been
> unsuccessful.
>
> Building drivetemp into the kernel, with ahci and everything SCSI
> built as module, doesn't trigger the crash for me. This is with the
> drivetemp patch (v3) as well as commit d188b0675b ("scsi: core: Add
> sysfs attributes for VPD pages 0h and 89h") applied on top of v5.4.7.

This is with 5.5-rc1. I'll try another kernel.

My repro is:

# modprobe drivetemp
# modprobe <any SCSI driver, including ahci>

--
Martin K. Petersen Oracle Linux Engineering

2020-01-08 01:31:28

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives


Guenter,

> "scsi-8-140" is created by libsensors, so any change in that would
> have to be made there, not in the kernel driver.

Yes. Something like the patch below which will produce actual SCSI
device instance names:

drivetemp-scsi-7:0:29:0
drivetemp-scsi-8:0:30:0
drivetemp-scsi-8:0:15:0
drivetemp-scsi-7:0:24:0

Instead of the current:

drivetemp-scsi-7-1d0
drivetemp-scsi-8-1e0
drivetemp-scsi-8-f0
drivetemp-scsi-7-180

Other question: Does hwmon have any notion of sensor topology? As I
mentioned earlier, SCSI installations typically rely on SAF-TE or SES
instead of the physical drive sensors. SES also includes monitoring of
fans, power supplies, etc. And more importantly, it provides a
representation of the location of a given component. E.g.: Tray number
#4, disk drive bay #5.

So it would be helpful if libsensors had a way to represent sensors in a
way that mimics the physical device layout reported by SES.

--
Martin K. Petersen Oracle Linux Engineering


diff --git a/lib/data.c b/lib/data.c
index c5aea42967a6..06cfa86f353b 100644
--- a/lib/data.c
+++ b/lib/data.c
@@ -202,8 +202,9 @@ int sensors_snprintf_chip_name(char *str, size_t size,
return snprintf(str, size, "%s-mdio-%x", chip->prefix,
chip->addr);
case SENSORS_BUS_TYPE_SCSI:
- return snprintf(str, size, "%s-scsi-%hd-%x", chip->prefix,
- chip->bus.nr, chip->addr);
+ return snprintf(str, size, "%s-scsi-%u:%u:%u:%lu", chip->prefix,
+ chip->bus.nr, chip->addr, chip->sub_addr,
+ chip->sub_sub_addr);
}

return -SENSORS_ERR_CHIP_NAME;
diff --git a/lib/sensors.h b/lib/sensors.h
index 94f6f23051d2..f468cccabc72 100644
--- a/lib/sensors.h
+++ b/lib/sensors.h
@@ -65,6 +65,8 @@ typedef struct sensors_chip_name {
char *prefix;
sensors_bus_id bus;
int addr;
+ unsigned int sub_addr;
+ unsigned long sub_sub_addr;
char *path;
} sensors_chip_name;

diff --git a/lib/sysfs.c b/lib/sysfs.c
index e63688b72aba..f76b4a99aa7d 100644
--- a/lib/sysfs.c
+++ b/lib/sysfs.c
@@ -620,6 +620,7 @@ static int classify_device(const char *dev_name,
sensors_chip_features *entry)
{
int domain, bus, slot, fn, vendor, product, id;
+ unsigned long lun;
char bus_path[NAME_MAX];
char *bus_attr;
int ret = 1;
@@ -687,11 +688,13 @@ static int classify_device(const char *dev_name,
entry->chip.bus.nr = 0;
} else
if (subsys && !strcmp(subsys, "scsi") &&
- sscanf(dev_name, "%d:%d:%d:%x", &domain, &bus, &slot, &fn) == 4) {
+ sscanf(dev_name, "%u:%u:%u:%lu", &domain, &bus, &id, &lun) == 4) {
/* adapter(host), channel(bus), id(target), lun */
- entry->chip.addr = (bus << 8) + (slot << 4) + fn;
- entry->chip.bus.type = SENSORS_BUS_TYPE_SCSI;
entry->chip.bus.nr = domain;
+ entry->chip.addr = bus;
+ entry->chip.sub_addr = id;
+ entry->chip.sub_sub_addr = lun;
+ entry->chip.bus.type = SENSORS_BUS_TYPE_SCSI;
} else {
/* Unknown device */
ret = 0;

2020-01-08 16:01:57

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

On Tue, Jan 07, 2020 at 08:12:06PM -0500, Martin K. Petersen wrote:
>
> Guenter,
>
> > Any idea how I might be able to reproduce this ? So far I have been
> > unsuccessful.
> >
> > Building drivetemp into the kernel, with ahci and everything SCSI
> > built as module, doesn't trigger the crash for me. This is with the
> > drivetemp patch (v3) as well as commit d188b0675b ("scsi: core: Add
> > sysfs attributes for VPD pages 0h and 89h") applied on top of v5.4.7.
>
> This is with 5.5-rc1. I'll try another kernel.
>
> My repro is:
>
> # modprobe drivetemp
> # modprobe <any SCSI driver, including ahci>
>
No luck on my side. Can you provide a traceback ? Maybe we can use it
to find out what is happening.

Thanks,
Guenter

2020-01-08 16:01:59

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

Hi Martin,

On Tue, Jan 07, 2020 at 08:29:40PM -0500, Martin K. Petersen wrote:
>
> Guenter,
>
> > "scsi-8-140" is created by libsensors, so any change in that would
> > have to be made there, not in the kernel driver.
>
> Yes. Something like the patch below which will produce actual SCSI
> device instance names:
>
> drivetemp-scsi-7:0:29:0
> drivetemp-scsi-8:0:30:0
> drivetemp-scsi-8:0:15:0
> drivetemp-scsi-7:0:24:0
>
> Instead of the current:
>
> drivetemp-scsi-7-1d0
> drivetemp-scsi-8-1e0
> drivetemp-scsi-8-f0
> drivetemp-scsi-7-180
>
> Other question: Does hwmon have any notion of sensor topology? As I
> mentioned earlier, SCSI installations typically rely on SAF-TE or SES
> instead of the physical drive sensors. SES also includes monitoring of
> fans, power supplies, etc. And more importantly, it provides a
> representation of the location of a given component. E.g.: Tray number
> #4, disk drive bay #5.
>

Please go ahead and submit the patch for libsensors.

> So it would be helpful if libsensors had a way to represent sensors in a
> way that mimics the physical device layout reported by SES.
>

I guess it would be up to you to come up with a proposal.

Thanks,
Guenter

2020-01-11 20:29:09

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

On 1/8/20 7:33 AM, Guenter Roeck wrote:
> On Tue, Jan 07, 2020 at 08:12:06PM -0500, Martin K. Petersen wrote:
>>
>> Guenter,
>>
>>> Any idea how I might be able to reproduce this ? So far I have been
>>> unsuccessful.
>>>
>>> Building drivetemp into the kernel, with ahci and everything SCSI
>>> built as module, doesn't trigger the crash for me. This is with the
>>> drivetemp patch (v3) as well as commit d188b0675b ("scsi: core: Add
>>> sysfs attributes for VPD pages 0h and 89h") applied on top of v5.4.7.
>>
>> This is with 5.5-rc1. I'll try another kernel.
>>
>> My repro is:
>>
>> # modprobe drivetemp
>> # modprobe <any SCSI driver, including ahci>
>>
> No luck on my side. Can you provide a traceback ? Maybe we can use it
> to find out what is happening.
>

I tried again, this time with v5.5-rc5. Loading and unloading ahci and
drivetemp in any order does not cause any problems for me.

At this point I don't know what else I could test. I went ahead and
applied the drivetemp patch to hwmon-next. Maybe we'll get some additional
test feedback this way.

Guenter

2020-01-12 11:20:31

by Gabriel C

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

Am Sa., 11. Jan. 2020 um 21:24 Uhr schrieb Guenter Roeck <[email protected]>:
>
> On 1/8/20 7:33 AM, Guenter Roeck wrote:
> > On Tue, Jan 07, 2020 at 08:12:06PM -0500, Martin K. Petersen wrote:
> >>
> >> Guenter,
> >>
> >>> Any idea how I might be able to reproduce this ? So far I have been
> >>> unsuccessful.
> >>>
> >>> Building drivetemp into the kernel, with ahci and everything SCSI
> >>> built as module, doesn't trigger the crash for me. This is with the
> >>> drivetemp patch (v3) as well as commit d188b0675b ("scsi: core: Add
> >>> sysfs attributes for VPD pages 0h and 89h") applied on top of v5.4.7.
> >>
> >> This is with 5.5-rc1. I'll try another kernel.
> >>
> >> My repro is:
> >>
> >> # modprobe drivetemp
> >> # modprobe <any SCSI driver, including ahci>
> >>
> > No luck on my side. Can you provide a traceback ? Maybe we can use it
> > to find out what is happening.
> >
>
> I tried again, this time with v5.5-rc5. Loading and unloading ahci and
> drivetemp in any order does not cause any problems for me.
>
> At this point I don't know what else I could test. I went ahead and
> applied the drivetemp patch to hwmon-next. Maybe we'll get some additional
> test feedback this way.

I've tested Linus git tree from right now + hwmon-next and I cannot
make it crash.
The driver seems to work fine here and temperature reportings are very accurate
on all HDDs on that box. ( 8 x Seagate IronWolf 2 TB (ST2000VN004) )

What I've noticed however is the nvme temperature low/high values on
the Sensors X are strange here.
I'm not sure it is a v5.5 issue or a hwmon-next one right now, I
didn't boot a vanilla v5.5-rc5 yet.

Both nvme's are Samsung SSD 960 EVO 250GB.

They look like this:

nvme-pci-1300
Adapter: PCI adapter
Composite: +27.9°C (low = -273.1°C, high = +76.8°C)
(crit = +78.8°C)
Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C)

nvme-pci-6100
Adapter: PCI adapter
Composite: +23.9°C (low = -273.1°C, high = +76.8°C)
(crit = +78.8°C)
Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C)

Best Regards,

Gabriel C.

2020-01-12 11:23:13

by Linus Walleij

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <[email protected]> wrote:

> What I've noticed however is the nvme temperature low/high values on
> the Sensors X are strange here.
(...)
> Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C)
> Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C)
(...)
> Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C)
> Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C)

That doesn't look strange to me. It seems like reasonable defaults
from the firmware if either it doesn't really log the min/max temperatures
or hasn't been through a cycle of updating these yet. Just set both
to absolute min/max temperatures possible.

Yours,
Linus Walleij

2020-01-12 12:07:26

by Gabriel C

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij
<[email protected]>:
>
> On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <[email protected]> wrote:
>
> > What I've noticed however is the nvme temperature low/high values on
> > the Sensors X are strange here.
> (...)
> > Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C)
> > Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C)
> (...)
> > Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C)
> > Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C)
>
> That doesn't look strange to me. It seems like reasonable defaults
> from the firmware if either it doesn't really log the min/max temperatures
> or hasn't been through a cycle of updating these yet. Just set both
> to absolute min/max temperatures possible.

Ok I'll check that.

Do you mean by setting the temperatures to use a lmsensors config?
Or is there a way to set these with a nvme command?

2020-01-12 12:09:23

by Linus Walleij

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

On Sun, Jan 12, 2020 at 1:03 PM Gabriel C <[email protected]> wrote:
> Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij
> <[email protected]>:
> >
> > On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <[email protected]> wrote:
> >
> > > What I've noticed however is the nvme temperature low/high values on
> > > the Sensors X are strange here.
> > (...)
> > > Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C)
> > > Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C)
> > (...)
> > > Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C)
> > > Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C)
> >
> > That doesn't look strange to me. It seems like reasonable defaults
> > from the firmware if either it doesn't really log the min/max temperatures
> > or hasn't been through a cycle of updating these yet. Just set both
> > to absolute min/max temperatures possible.
>
> Ok I'll check that.
>
> Do you mean by setting the temperatures to use a lmsensors config?
> Or is there a way to set these with a nvme command?

Not that I know of.

The min/max are the minumum and maximum temperatures the
device has experienced during this power-on cycle.

(If I understood things right!)

If the device firmware doesn't log that, or the firmware hasn't
ran through a log point, it makes sense to report absolute
min/max of the scales.

Yours,
Linus Walleij

2020-01-12 13:27:32

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

On 1/12/20 4:07 AM, Linus Walleij wrote:
> On Sun, Jan 12, 2020 at 1:03 PM Gabriel C <[email protected]> wrote:
>> Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij
>> <[email protected]>:
>>>
>>> On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <[email protected]> wrote:
>>>
>>>> What I've noticed however is the nvme temperature low/high values on
>>>> the Sensors X are strange here.
>>> (...)
>>>> Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C)
>>>> Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C)
>>> (...)
>>>> Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C)
>>>> Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C)
>>>
>>> That doesn't look strange to me. It seems like reasonable defaults
>>> from the firmware if either it doesn't really log the min/max temperatures
>>> or hasn't been through a cycle of updating these yet. Just set both
>>> to absolute min/max temperatures possible.
>>
>> Ok I'll check that.
>>
>> Do you mean by setting the temperatures to use a lmsensors config?
>> Or is there a way to set these with a nvme command?
>
> Not that I know of.
>
> The min/max are the minumum and maximum temperatures the
> device has experienced during this power-on cycle.
>

No, that would be lowest/highest. The above are (or should be) per-sensor
setpoints. The default for those is typically the absolute minimum /
maximum of the supported range.

Some SATA drives report the lowest/highest temperatures experienced
since power cycle, like here.

drivetemp-scsi-5-0
Adapter: SCSI adapter
temp1: +23.0°C (low = +0.0°C, high = +60.0°C)
(crit low = -41.0°C, crit = +85.0°C)
(lowest = +20.0°C, highest = +31.0°C)

Guenter

2020-01-12 13:47:51

by Gabriel C

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

Am So., 12. Jan. 2020 um 14:07 Uhr schrieb Guenter Roeck <[email protected]>:
>
> On 1/12/20 4:07 AM, Linus Walleij wrote:
> > On Sun, Jan 12, 2020 at 1:03 PM Gabriel C <[email protected]> wrote:
> >> Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij
> >> <[email protected]>:
> >>>
> >>> On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <[email protected]> wrote:
> >>>
> >>>> What I've noticed however is the nvme temperature low/high values on
> >>>> the Sensors X are strange here.
> >>> (...)
> >>>> Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C)
> >>>> Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C)
> >>> (...)
> >>>> Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C)
> >>>> Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C)
> >>>
> >>> That doesn't look strange to me. It seems like reasonable defaults
> >>> from the firmware if either it doesn't really log the min/max temperatures
> >>> or hasn't been through a cycle of updating these yet. Just set both
> >>> to absolute min/max temperatures possible.
> >>
> >> Ok I'll check that.
> >>
> >> Do you mean by setting the temperatures to use a lmsensors config?
> >> Or is there a way to set these with a nvme command?
> >
> > Not that I know of.
> >
> > The min/max are the minumum and maximum temperatures the
> > device has experienced during this power-on cycle.
> >
>
> No, that would be lowest/highest. The above are (or should be) per-sensor
> setpoints. The default for those is typically the absolute minimum /
> maximum of the supported range.
>
> Some SATA drives report the lowest/highest temperatures experienced
> since power cycle, like here.
>
> drivetemp-scsi-5-0
> Adapter: SCSI adapter
> temp1: +23.0°C (low = +0.0°C, high = +60.0°C)
> (crit low = -41.0°C, crit = +85.0°C)
> (lowest = +20.0°C, highest = +31.0°C)
>

The SATA temperatures are fine and reported like this here too, just
the nvme ones are strange.

drivetemp-scsi-4-0
Adapter: SCSI adapter
temp1: +28.0°C (low = +1.0°C, high = +61.0°C)
(crit low = +2.0°C, crit = +60.0°C)
(lowest = +16.0°C, highest = +31.0°C)

drivetemp-scsi-12-0
Adapter: SCSI adapter
temp1: +29.0°C (low = +1.0°C, high = +61.0°C)
(crit low = +2.0°C, crit = +60.0°C)
(lowest = +18.0°C, highest = +32.0°C)

and so on.

Btw, where I can find the code does these calculations?

Best regards,

Gabriel C.

2020-01-12 15:34:22

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

On 1/12/20 5:45 AM, Gabriel C wrote:
> Am So., 12. Jan. 2020 um 14:07 Uhr schrieb Guenter Roeck <[email protected]>:
>>
>> On 1/12/20 4:07 AM, Linus Walleij wrote:
>>> On Sun, Jan 12, 2020 at 1:03 PM Gabriel C <[email protected]> wrote:
>>>> Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij
>>>> <[email protected]>:
>>>>>
>>>>> On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <[email protected]> wrote:
>>>>>
>>>>>> What I've noticed however is the nvme temperature low/high values on
>>>>>> the Sensors X are strange here.
>>>>> (...)
>>>>>> Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C)
>>>>>> Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C)
>>>>> (...)
>>>>>> Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C)
>>>>>> Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C)
>>>>>
>>>>> That doesn't look strange to me. It seems like reasonable defaults
>>>>> from the firmware if either it doesn't really log the min/max temperatures
>>>>> or hasn't been through a cycle of updating these yet. Just set both
>>>>> to absolute min/max temperatures possible.
>>>>
>>>> Ok I'll check that.
>>>>
>>>> Do you mean by setting the temperatures to use a lmsensors config?
>>>> Or is there a way to set these with a nvme command?
>>>
>>> Not that I know of.
>>>
>>> The min/max are the minumum and maximum temperatures the
>>> device has experienced during this power-on cycle.
>>>
>>
>> No, that would be lowest/highest. The above are (or should be) per-sensor
>> setpoints. The default for those is typically the absolute minimum /
>> maximum of the supported range.
>>
>> Some SATA drives report the lowest/highest temperatures experienced
>> since power cycle, like here.
>>
>> drivetemp-scsi-5-0
>> Adapter: SCSI adapter
>> temp1: +23.0°C (low = +0.0°C, high = +60.0°C)
>> (crit low = -41.0°C, crit = +85.0°C)
>> (lowest = +20.0°C, highest = +31.0°C)
>>
>
> The SATA temperatures are fine and reported like this here too, just
> the nvme ones are strange.
>
> drivetemp-scsi-4-0
> Adapter: SCSI adapter
> temp1: +28.0°C (low = +1.0°C, high = +61.0°C)
> (crit low = +2.0°C, crit = +60.0°C)
> (lowest = +16.0°C, highest = +31.0°C)
>
> drivetemp-scsi-12-0
> Adapter: SCSI adapter
> temp1: +29.0°C (low = +1.0°C, high = +61.0°C)
> (crit low = +2.0°C, crit = +60.0°C)
> (lowest = +18.0°C, highest = +32.0°C)
>
> and so on.
>
> Btw, where I can find the code does these calculations?
>

Not sure if that is what you are looking for, but the nvme hardware
monitoring driver is at drivers/nvme/host/hwmon.c, the SATA hardware
monitoring driver is at drivers/hwmon/drivetemp.c.

The limits on nvme drives are configurable.

root@server:/sys/class/hwmon# sensors nvme-pci-0100
nvme-pci-0100
Adapter: PCI adapter
Composite: +40.9°C (low = -273.1°C, high = +84.8°C)
(crit = +84.8°C)
Sensor 1: +40.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +43.9°C (low = -273.1°C, high = +65261.8°C)

root@server:/sys/class/hwmon# echo 0 > hwmon1/temp2_min
root@server:/sys/class/hwmon# echo 100000 > hwmon1/temp2_max

root@server:/sys/class/hwmon# sensors nvme-pci-0100
nvme-pci-0100
Adapter: PCI adapter
Composite: +38.9°C (low = -273.1°C, high = +84.8°C)
(crit = +84.8°C)
Sensor 1: +38.9°C (low = -0.1°C, high = +99.8°C)
Sensor 2: +42.9°C (low = -273.1°C, high = +65261.8°C)

If you dislike the defaults, just configure whatever you think is
appropriate for your system.

Thanks,
Guenter

2020-01-12 18:40:31

by Gabriel C

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

Am So., 12. Jan. 2020 um 16:26 Uhr schrieb Guenter Roeck <[email protected]>:
>
> On 1/12/20 5:45 AM, Gabriel C wrote:
> > Am So., 12. Jan. 2020 um 14:07 Uhr schrieb Guenter Roeck <[email protected]>:
> >>
> >> On 1/12/20 4:07 AM, Linus Walleij wrote:
> >>> On Sun, Jan 12, 2020 at 1:03 PM Gabriel C <[email protected]> wrote:
> >>>> Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij
> >>>> <[email protected]>:
> >>>>>
> >>>>> On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <[email protected]> wrote:
> >>>>>
> >>>>>> What I've noticed however is the nvme temperature low/high values on
> >>>>>> the Sensors X are strange here.
> >>>>> (...)
> >>>>>> Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C)
> >>>>>> Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C)
> >>>>> (...)
> >>>>>> Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C)
> >>>>>> Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C)
> >>>>>
> >>>>> That doesn't look strange to me. It seems like reasonable defaults
> >>>>> from the firmware if either it doesn't really log the min/max temperatures
> >>>>> or hasn't been through a cycle of updating these yet. Just set both
> >>>>> to absolute min/max temperatures possible.
> >>>>
> >>>> Ok I'll check that.
> >>>>
> >>>> Do you mean by setting the temperatures to use a lmsensors config?
> >>>> Or is there a way to set these with a nvme command?
> >>>
> >>> Not that I know of.
> >>>
> >>> The min/max are the minumum and maximum temperatures the
> >>> device has experienced during this power-on cycle.
> >>>
> >>
> >> No, that would be lowest/highest. The above are (or should be) per-sensor
> >> setpoints. The default for those is typically the absolute minimum /
> >> maximum of the supported range.
> >>
> >> Some SATA drives report the lowest/highest temperatures experienced
> >> since power cycle, like here.
> >>
> >> drivetemp-scsi-5-0
> >> Adapter: SCSI adapter
> >> temp1: +23.0°C (low = +0.0°C, high = +60.0°C)
> >> (crit low = -41.0°C, crit = +85.0°C)
> >> (lowest = +20.0°C, highest = +31.0°C)
> >>
> >
> > The SATA temperatures are fine and reported like this here too, just
> > the nvme ones are strange.
> >
> > drivetemp-scsi-4-0
> > Adapter: SCSI adapter
> > temp1: +28.0°C (low = +1.0°C, high = +61.0°C)
> > (crit low = +2.0°C, crit = +60.0°C)
> > (lowest = +16.0°C, highest = +31.0°C)
> >
> > drivetemp-scsi-12-0
> > Adapter: SCSI adapter
> > temp1: +29.0°C (low = +1.0°C, high = +61.0°C)
> > (crit low = +2.0°C, crit = +60.0°C)
> > (lowest = +18.0°C, highest = +32.0°C)
> >
> > and so on.
> >
> > Btw, where I can find the code does these calculations?
> >
>
> Not sure if that is what you are looking for, but the nvme hardware
> monitoring driver is at drivers/nvme/host/hwmon.c, the SATA hardware
> monitoring driver is at drivers/hwmon/drivetemp.c.
>

I have a look thanks.

I'm using your v2 patch for the nvme part since you posted it on 5.4 kernels.
This is probably why I find the way the temperatures are now reported
very strange.

The ADATA XPG SX8200 Pro in my laptop seems to work better:

nvme-pci-0200
Adapter: PCI adapter
Composite: +37.9°C (low = -0.1°C, high = +74.8°C)
(crit = +79.8°C)

Low is 0° which is what the spec suggests.

> The limits on nvme drives are configurable.

Yes, I found this out already.

> root@server:/sys/class/hwmon# sensors nvme-pci-0100
> nvme-pci-0100
> Adapter: PCI adapter
> Composite: +40.9°C (low = -273.1°C, high = +84.8°C)
> (crit = +84.8°C)
> Sensor 1: +40.9°C (low = -273.1°C, high = +65261.8°C)
> Sensor 2: +43.9°C (low = -273.1°C, high = +65261.8°C)
>
> root@server:/sys/class/hwmon# echo 0 > hwmon1/temp2_min
> root@server:/sys/class/hwmon# echo 100000 > hwmon1/temp2_max

An lm-sensors configuration will work too.

> root@server:/sys/class/hwmon# sensors nvme-pci-0100
> nvme-pci-0100
> Adapter: PCI adapter
> Composite: +38.9°C (low = -273.1°C, high = +84.8°C)
> (crit = +84.8°C)
> Sensor 1: +38.9°C (low = -0.1°C, high = +99.8°C)
> Sensor 2: +42.9°C (low = -273.1°C, high = +65261.8°C)
>
> If you dislike the defaults, just configure whatever you think is
> appropriate for your system.

It's not about disliking the values. I want to find out if these Samsung models
don't support that, or it is a bug somewhere in writing/calculating the values.

In the case, Samsung and others don't support such a thing wouldn't be
better to just ignore
the bogus reading altogether?

2020-01-12 20:26:46

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

On 1/12/20 10:37 AM, Gabriel C wrote:
> Am So., 12. Jan. 2020 um 16:26 Uhr schrieb Guenter Roeck <[email protected]>:
>>
>> On 1/12/20 5:45 AM, Gabriel C wrote:
>>> Am So., 12. Jan. 2020 um 14:07 Uhr schrieb Guenter Roeck <[email protected]>:
>>>>
>>>> On 1/12/20 4:07 AM, Linus Walleij wrote:
>>>>> On Sun, Jan 12, 2020 at 1:03 PM Gabriel C <[email protected]> wrote:
>>>>>> Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij
>>>>>> <[email protected]>:
>>>>>>>
>>>>>>> On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <[email protected]> wrote:
>>>>>>>
>>>>>>>> What I've noticed however is the nvme temperature low/high values on
>>>>>>>> the Sensors X are strange here.
>>>>>>> (...)
>>>>>>>> Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C)
>>>>>>>> Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C)
>>>>>>> (...)
>>>>>>>> Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C)
>>>>>>>> Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C)
>>>>>>>
>>>>>>> That doesn't look strange to me. It seems like reasonable defaults
>>>>>>> from the firmware if either it doesn't really log the min/max temperatures
>>>>>>> or hasn't been through a cycle of updating these yet. Just set both
>>>>>>> to absolute min/max temperatures possible.
>>>>>>
>>>>>> Ok I'll check that.
>>>>>>
>>>>>> Do you mean by setting the temperatures to use a lmsensors config?
>>>>>> Or is there a way to set these with a nvme command?
>>>>>
>>>>> Not that I know of.
>>>>>
>>>>> The min/max are the minumum and maximum temperatures the
>>>>> device has experienced during this power-on cycle.
>>>>>
>>>>
>>>> No, that would be lowest/highest. The above are (or should be) per-sensor
>>>> setpoints. The default for those is typically the absolute minimum /
>>>> maximum of the supported range.
>>>>
>>>> Some SATA drives report the lowest/highest temperatures experienced
>>>> since power cycle, like here.
>>>>
>>>> drivetemp-scsi-5-0
>>>> Adapter: SCSI adapter
>>>> temp1: +23.0°C (low = +0.0°C, high = +60.0°C)
>>>> (crit low = -41.0°C, crit = +85.0°C)
>>>> (lowest = +20.0°C, highest = +31.0°C)
>>>>
>>>
>>> The SATA temperatures are fine and reported like this here too, just
>>> the nvme ones are strange.
>>>
>>> drivetemp-scsi-4-0
>>> Adapter: SCSI adapter
>>> temp1: +28.0°C (low = +1.0°C, high = +61.0°C)
>>> (crit low = +2.0°C, crit = +60.0°C)
>>> (lowest = +16.0°C, highest = +31.0°C)
>>>
>>> drivetemp-scsi-12-0
>>> Adapter: SCSI adapter
>>> temp1: +29.0°C (low = +1.0°C, high = +61.0°C)
>>> (crit low = +2.0°C, crit = +60.0°C)
>>> (lowest = +18.0°C, highest = +32.0°C)
>>>
>>> and so on.
>>>
>>> Btw, where I can find the code does these calculations?
>>>
>>
>> Not sure if that is what you are looking for, but the nvme hardware
>> monitoring driver is at drivers/nvme/host/hwmon.c, the SATA hardware
>> monitoring driver is at drivers/hwmon/drivetemp.c.
>>
>
> I have a look thanks.
>
> I'm using your v2 patch for the nvme part since you posted it on 5.4 kernels.
> This is probably why I find the way the temperatures are now reported
> very strange.
>
> The ADATA XPG SX8200 Pro in my laptop seems to work better:
>
> nvme-pci-0200
> Adapter: PCI adapter
> Composite: +37.9°C (low = -0.1°C, high = +74.8°C)
> (crit = +79.8°C)
>
> Low is 0° which is what the spec suggests.
>
>> The limits on nvme drives are configurable.
>
> Yes, I found this out already.
>
>> root@server:/sys/class/hwmon# sensors nvme-pci-0100
>> nvme-pci-0100
>> Adapter: PCI adapter
>> Composite: +40.9°C (low = -273.1°C, high = +84.8°C)
>> (crit = +84.8°C)
>> Sensor 1: +40.9°C (low = -273.1°C, high = +65261.8°C)
>> Sensor 2: +43.9°C (low = -273.1°C, high = +65261.8°C)
>>
>> root@server:/sys/class/hwmon# echo 0 > hwmon1/temp2_min
>> root@server:/sys/class/hwmon# echo 100000 > hwmon1/temp2_max
>
> An lm-sensors configuration will work too.
>
Sure, the above was just an example.

>> root@server:/sys/class/hwmon# sensors nvme-pci-0100
>> nvme-pci-0100
>> Adapter: PCI adapter
>> Composite: +38.9°C (low = -273.1°C, high = +84.8°C)
>> (crit = +84.8°C)
>> Sensor 1: +38.9°C (low = -0.1°C, high = +99.8°C)
>> Sensor 2: +42.9°C (low = -273.1°C, high = +65261.8°C)
>>
>> If you dislike the defaults, just configure whatever you think is
>> appropriate for your system.
>
> It's not about disliking the values. I want to find out if these Samsung models
> don't support that, or it is a bug somewhere in writing/calculating the values.
>
No, this is not a bug. It is perfectly valid for individual sensors to have
uninitialized limits. If I recall correctly, the NVME specification
specifically states that the default settings for individual sensors
shall be those values (0 and 65535 Kelvin, specifically).

And, yes, I would agree that is a bit odd that NVME drives report temperatures
in Kelvin, but such is the world.

> In the case, Samsung and others don't support such a thing wouldn't be
> better to just ignore
> the bogus reading altogether?

Again, you can set whatever limits you like. The default limits on many
hardware sensor chips have odd values. Just looking at my system:

nct6797-isa-0a20
Adapter: ISA adapter
in0: +0.48 V (min = +0.00 V, max = +1.74 V)
in1: +1.02 V (min = +0.00 V, max = +0.00 V) ALARM
in2: +3.39 V (min = +0.00 V, max = +0.00 V) ALARM
in3: +3.31 V (min = +0.00 V, max = +0.00 V) ALARM
in4: +1.00 V (min = +0.00 V, max = +0.00 V) ALARM
in5: +0.14 V (min = +0.00 V, max = +0.00 V) ALARM
in6: +0.82 V (min = +0.00 V, max = +0.00 V) ALARM
in7: +3.38 V (min = +0.00 V, max = +0.00 V) ALARM
in8: +3.26 V (min = +0.00 V, max = +0.00 V) ALARM
in9: +1.82 V (min = +0.00 V, max = +0.00 V) ALARM
in10: +0.00 V (min = +0.00 V, max = +0.00 V)
in11: +0.74 V (min = +0.00 V, max = +0.00 V) ALARM
in12: +1.20 V (min = +0.00 V, max = +0.00 V) ALARM
in13: +0.68 V (min = +0.00 V, max = +0.00 V) ALARM
in14: +1.50 V (min = +0.00 V, max = +0.00 V) ALARM


Are you suggesting that we should not support setting min/max values for
all drivers just because they are often not initialized to reasonable values
by default ?

Guenter

2020-01-12 22:45:40

by Gabriel C

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

Am So., 12. Jan. 2020 um 21:08 Uhr schrieb Guenter Roeck <[email protected]>:
>
> On 1/12/20 10:37 AM, Gabriel C wrote:
> > Am So., 12. Jan. 2020 um 16:26 Uhr schrieb Guenter Roeck <[email protected]>:
> >>
> >> On 1/12/20 5:45 AM, Gabriel C wrote:
> >>> Am So., 12. Jan. 2020 um 14:07 Uhr schrieb Guenter Roeck <[email protected]>:
> >>>>
> >>>> On 1/12/20 4:07 AM, Linus Walleij wrote:
> >>>>> On Sun, Jan 12, 2020 at 1:03 PM Gabriel C <[email protected]> wrote:
> >>>>>> Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij
> >>>>>> <[email protected]>:
> >>>>>>>
> >>>>>>> On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> What I've noticed however is the nvme temperature low/high values on
> >>>>>>>> the Sensors X are strange here.
> >>>>>>> (...)
> >>>>>>>> Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C)
> >>>>>>>> Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C)
> >>>>>>> (...)
> >>>>>>>> Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C)
> >>>>>>>> Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C)
> >>>>>>>
> >>>>>>> That doesn't look strange to me. It seems like reasonable defaults
> >>>>>>> from the firmware if either it doesn't really log the min/max temperatures
> >>>>>>> or hasn't been through a cycle of updating these yet. Just set both
> >>>>>>> to absolute min/max temperatures possible.
> >>>>>>
> >>>>>> Ok I'll check that.
> >>>>>>
> >>>>>> Do you mean by setting the temperatures to use a lmsensors config?
> >>>>>> Or is there a way to set these with a nvme command?
> >>>>>
> >>>>> Not that I know of.
> >>>>>
> >>>>> The min/max are the minumum and maximum temperatures the
> >>>>> device has experienced during this power-on cycle.
> >>>>>
> >>>>
> >>>> No, that would be lowest/highest. The above are (or should be) per-sensor
> >>>> setpoints. The default for those is typically the absolute minimum /
> >>>> maximum of the supported range.
> >>>>
> >>>> Some SATA drives report the lowest/highest temperatures experienced
> >>>> since power cycle, like here.
> >>>>
> >>>> drivetemp-scsi-5-0
> >>>> Adapter: SCSI adapter
> >>>> temp1: +23.0°C (low = +0.0°C, high = +60.0°C)
> >>>> (crit low = -41.0°C, crit = +85.0°C)
> >>>> (lowest = +20.0°C, highest = +31.0°C)
> >>>>
> >>>
> >>> The SATA temperatures are fine and reported like this here too, just
> >>> the nvme ones are strange.
> >>>
> >>> drivetemp-scsi-4-0
> >>> Adapter: SCSI adapter
> >>> temp1: +28.0°C (low = +1.0°C, high = +61.0°C)
> >>> (crit low = +2.0°C, crit = +60.0°C)
> >>> (lowest = +16.0°C, highest = +31.0°C)
> >>>
> >>> drivetemp-scsi-12-0
> >>> Adapter: SCSI adapter
> >>> temp1: +29.0°C (low = +1.0°C, high = +61.0°C)
> >>> (crit low = +2.0°C, crit = +60.0°C)
> >>> (lowest = +18.0°C, highest = +32.0°C)
> >>>
> >>> and so on.
> >>>
> >>> Btw, where I can find the code does these calculations?
> >>>
> >>
> >> Not sure if that is what you are looking for, but the nvme hardware
> >> monitoring driver is at drivers/nvme/host/hwmon.c, the SATA hardware
> >> monitoring driver is at drivers/hwmon/drivetemp.c.
> >>
> >
> > I have a look thanks.
> >
> > I'm using your v2 patch for the nvme part since you posted it on 5.4 kernels.
> > This is probably why I find the way the temperatures are now reported
> > very strange.
> >
> > The ADATA XPG SX8200 Pro in my laptop seems to work better:
> >
> > nvme-pci-0200
> > Adapter: PCI adapter
> > Composite: +37.9°C (low = -0.1°C, high = +74.8°C)
> > (crit = +79.8°C)
> >
> > Low is 0° which is what the spec suggests.
> >
> >> The limits on nvme drives are configurable.
> >
> > Yes, I found this out already.
> >
> >> root@server:/sys/class/hwmon# sensors nvme-pci-0100
> >> nvme-pci-0100
> >> Adapter: PCI adapter
> >> Composite: +40.9°C (low = -273.1°C, high = +84.8°C)
> >> (crit = +84.8°C)
> >> Sensor 1: +40.9°C (low = -273.1°C, high = +65261.8°C)
> >> Sensor 2: +43.9°C (low = -273.1°C, high = +65261.8°C)
> >>
> >> root@server:/sys/class/hwmon# echo 0 > hwmon1/temp2_min
> >> root@server:/sys/class/hwmon# echo 100000 > hwmon1/temp2_max
> >
> > An lm-sensors configuration will work too.
> >
> Sure, the above was just an example.
>
> >> root@server:/sys/class/hwmon# sensors nvme-pci-0100
> >> nvme-pci-0100
> >> Adapter: PCI adapter
> >> Composite: +38.9°C (low = -273.1°C, high = +84.8°C)
> >> (crit = +84.8°C)
> >> Sensor 1: +38.9°C (low = -0.1°C, high = +99.8°C)
> >> Sensor 2: +42.9°C (low = -273.1°C, high = +65261.8°C)
> >>
> >> If you dislike the defaults, just configure whatever you think is
> >> appropriate for your system.
> >
> > It's not about disliking the values. I want to find out if these Samsung models
> > don't support that, or it is a bug somewhere in writing/calculating the values.
> >
> No, this is not a bug. It is perfectly valid for individual sensors to have
> uninitialized limits. If I recall correctly, the NVME specification
> specifically states that the default settings for individual sensors
> shall be those values (0 and 65535 Kelvin, specifically).
>
> And, yes, I would agree that is a bit odd that NVME drives report temperatures
> in Kelvin, but such is the world.
>
> > In the case, Samsung and others don't support such a thing wouldn't be
> > better to just ignore
> > the bogus reading altogether?
>
> Again, you can set whatever limits you like. The default limits on many
> hardware sensor chips have odd values. Just looking at my system:
>
> nct6797-isa-0a20
> Adapter: ISA adapter
> in0: +0.48 V (min = +0.00 V, max = +1.74 V)
> in1: +1.02 V (min = +0.00 V, max = +0.00 V) ALARM
> in2: +3.39 V (min = +0.00 V, max = +0.00 V) ALARM
> in3: +3.31 V (min = +0.00 V, max = +0.00 V) ALARM
> in4: +1.00 V (min = +0.00 V, max = +0.00 V) ALARM
> in5: +0.14 V (min = +0.00 V, max = +0.00 V) ALARM
> in6: +0.82 V (min = +0.00 V, max = +0.00 V) ALARM
> in7: +3.38 V (min = +0.00 V, max = +0.00 V) ALARM
> in8: +3.26 V (min = +0.00 V, max = +0.00 V) ALARM
> in9: +1.82 V (min = +0.00 V, max = +0.00 V) ALARM
> in10: +0.00 V (min = +0.00 V, max = +0.00 V)
> in11: +0.74 V (min = +0.00 V, max = +0.00 V) ALARM
> in12: +1.20 V (min = +0.00 V, max = +0.00 V) ALARM
> in13: +0.68 V (min = +0.00 V, max = +0.00 V) ALARM
> in14: +1.50 V (min = +0.00 V, max = +0.00 V) ALARM
>
>
> Are you suggesting that we should not support setting min/max values for
> all drivers just because they are often not initialized to reasonable values
> by default ?

No, I'm not suggesting that. I'm aware of strange I/O monitoring chips values
and the lack of documentation, so in this case, something is better
than nothing.

In the nvme case, these are only 2 values who either are working/supported
by firmware or not, so I thought it would be reasonable to have
known-good values
instead of 65261.8°C, which will probably cause users to report that
as a bug a lot.

Can we at least have that documented and explain how the values can be
set/changed?

2020-01-14 03:19:55

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives


Hi Guenter!

> I tried again, this time with v5.5-rc5. Loading and unloading ahci and
> drivetemp in any order does not cause any problems for me.

I tried your hwmon-next branch and it still happens for me. Both in qemu
and on real hw. I'm really low on bandwidth the next couple of days.
Will try to look later this week unless you beat me to it. I get lots of
these warnings after modprobe drivetemp; modprobe ahci:

[ 1055.611922] WARNING: CPU: 3 PID: 3233 at drivers/base/dd.c:519 really_probe+0x436/0x4f0

A quick test forcing synchronous SCSI scanning made no difference.

--
Martin K. Petersen Oracle Linux Engineering

2020-01-14 05:22:01

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

Hi Martin,

On 1/13/20 7:03 PM, Martin K. Petersen wrote:
>
> Hi Guenter!
>
>> I tried again, this time with v5.5-rc5. Loading and unloading ahci and
>> drivetemp in any order does not cause any problems for me.
>
> I tried your hwmon-next branch and it still happens for me. Both in qemu
> and on real hw. I'm really low on bandwidth the next couple of days.
> Will try to look later this week unless you beat me to it. I get lots of
> these warnings after modprobe drivetemp; modprobe ahci:
>
> [ 1055.611922] WARNING: CPU: 3 PID: 3233 at drivers/base/dd.c:519 really_probe+0x436/0x4f0
>
> A quick test forcing synchronous SCSI scanning made no difference.
>

The hwmon-next branch is based on v5.5-rc1. It might be better to either
merge hwmon-next into mainline, or to apply the drivetemp patch to mainline,
and test the result. I have seen some (unrelated) weird tracebacks
in the driver core with v5.5-rc1, so that may not be the best baseline
for a test.

Thanks,
Guenter

2020-01-16 05:16:22

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives


Guenter,

> The hwmon-next branch is based on v5.5-rc1. It might be better to
> either merge hwmon-next into mainline, or to apply the drivetemp patch
> to mainline, and test the result. I have seen some (unrelated) weird
> tracebacks in the driver core with v5.5-rc1, so that may not be the
> best baseline for a test.

I'm afraid the warnings still happen with hwmon-next on top of
linus/master.

--
Martin K. Petersen Oracle Linux Engineering

2020-01-16 05:33:56

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

Hi Martin,

On 1/15/20 8:12 PM, Martin K. Petersen wrote:
>
> Guenter,
>
>> The hwmon-next branch is based on v5.5-rc1. It might be better to
>> either merge hwmon-next into mainline, or to apply the drivetemp patch
>> to mainline, and test the result. I have seen some (unrelated) weird
>> tracebacks in the driver core with v5.5-rc1, so that may not be the
>> best baseline for a test.
>
> I'm afraid the warnings still happen with hwmon-next on top of
> linus/master.
>

Can you possibly provide details, like the configuration you use for
qemu, the qemu command line, and the exact command sequence you use
in qemu to reproduce the problem ?

Thanks,
Guenter

2020-01-16 22:05:48

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

On Wed, Jan 15, 2020 at 11:12:23PM -0500, Martin K. Petersen wrote:
>
> Guenter,
>
> > The hwmon-next branch is based on v5.5-rc1. It might be better to
> > either merge hwmon-next into mainline, or to apply the drivetemp patch
> > to mainline, and test the result. I have seen some (unrelated) weird
> > tracebacks in the driver core with v5.5-rc1, so that may not be the
> > best baseline for a test.
>
> I'm afraid the warnings still happen with hwmon-next on top of
> linus/master.
>
Can you by any chance provide a full traceback ?

The warning you reported suggests that a devm_ function was called on a
device pointer prior to a device registration. I don't immediately see
how that happens. A full traceback might give us an idea.

I suspect that the underlying problem is in hwmon_device_register_with_info(),
which uses devm_ functions to allocate memory associated with the device
pointer passed to it, in this case the SCSI device. This is inherently
wrong (independent of this driver), since the lifetime of the hardware
monitoring device does not necessarily match the lifetime of its parent.
The impact here is that we may get a memory leak under some circumstances.
I'll have to fix that in the hardware monitoring core.

Either case, I would like to track down how the warning happens, so any
information you can provide that lets me reproduce the problem would be
very helpful.

Thanks,
Guenter

2020-01-17 03:57:40

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives


Guenter,

> Can you by any chance provide a full traceback ?

My test machines are tied up with something else right now. This is from
a few days ago (pristine hwmon-next, I believe):

[ 1055.611912] ------------[ cut here ]------------
[ 1055.611922] WARNING: CPU: 3 PID: 3233 at drivers/base/dd.c:519 really_probe+0x436/0x4f0
[ 1055.611925] Modules linked in: sd_mod sg ahci libahci libata drivetemp scsi_mod crc32c_intel igb i2c_algo_bit i2c_core dca hwmon ipv6 nf_defrag_ipv6 crc_ccitt
[ 1055.611955] CPU: 3 PID: 3233 Comm: kworker/u17:1 Tainted: G W 5.5.0-rc1+ #21
[ 1055.611965] Workqueue: events_unbound async_run_entry_fn
[ 1055.611973] RIP: 0010:really_probe+0x436/0x4f0
[ 1055.611979] Code: c7 30 69 f8 82 e8 ba 94 e5 ff e9 60 ff ff ff 48 8d 7b 38 e8 cc d9 b4 ff 48 8b 43 38 48 85 c0 0f 85 41 fd ff ff e9 4f fd ff ff <0f> 0b e9 66 fc ff ff 48 8d 7d 50 e8 aa d9 b4 ff 4c 8b 6d 50 4d 85
[ 1055.611983] RSP: 0018:ffff8881edb77c98 EFLAGS: 00010287
[ 1055.611989] RAX: ffff8881e1f8fb80 RBX: ffffffffa033a000 RCX: ffffffff8182e583
[ 1055.611993] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: ffff8881dec506a8
[ 1055.611997] RBP: ffff8881dec50238 R08: 0000000000000001 R09: fffffbfff09629ed
[ 1055.612000] R10: fffffbfff09629ec R11: 0000000000000003 R12: 0000000000000000
[ 1055.612004] R13: ffff8881dec506a8 R14: ffffffff8182eca0 R15: 000000000000000b
[ 1055.612009] FS: 0000000000000000(0000) GS:ffff8881f8900000(0000) knlGS:0000000000000000
[ 1055.612013] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1055.612017] CR2: 00007f957884a000 CR3: 00000001df5ec003 CR4: 00000000000606e0
[ 1055.612020] Call Trace:
[ 1055.612038] ? driver_probe_device+0x170/0x170
[ 1055.612045] driver_probe_device+0x82/0x170
[ 1055.612058] ? driver_probe_device+0x170/0x170
[ 1055.612064] __driver_attach_async_helper+0xa3/0xe0
[ 1055.612076] async_run_entry_fn+0x68/0x2a0
[ 1055.612094] process_one_work+0x4df/0x990
[ 1055.612121] ? pwq_dec_nr_in_flight+0x110/0x110
[ 1055.612127] ? do_raw_spin_lock+0x113/0x1d0
[ 1055.612161] worker_thread+0x78/0x5c0
[ 1055.612190] ? process_one_work+0x990/0x990
[ 1055.612195] kthread+0x1be/0x1e0
[ 1055.612202] ? kthread_create_worker_on_cpu+0xd0/0xd0
[ 1055.612215] ret_from_fork+0x3a/0x50
[ 1055.612251] irq event stamp: 3512
[ 1055.612259] hardirqs last enabled at (3511): [<ffffffff81d2b874>] _raw_spin_unlock_irq+0x24/0x30
[ 1055.612265] hardirqs last disabled at (3512): [<ffffffff810029c9>] trace_hardirqs_off_thunk+0x1a/0x1c
[ 1055.612272] softirqs last enabled at (3500): [<ffffffff820003a5>] __do_softirq+0x3a5/0x5a8
[ 1055.612281] softirqs last disabled at (3489): [<ffffffff810cec7b>] irq_exit+0xfb/0x100
[ 1055.612284] ---[ end trace f0a8dd9a37bea031 ]---

> Either case, I would like to track down how the warning happens, so any
> information you can provide that lets me reproduce the problem would be
> very helpful.

The three systems that exhibit the problem are stock (2010/2012/2014
vintage) x86_64 servers with onboard AHCI and a variety of 4-6 SATA
drives each.

For the qemu test I didn't have ahci configured but I had my SCSI temp
patch on top of yours and ran modprobe drivetemp; modprobe scsi_debug to
trigger the warnings.

--
Martin K. Petersen Oracle Linux Engineering

2020-01-17 05:34:15

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v2] hwmon: Driver for temperature sensors on SATA drives

On 1/16/20 5:43 PM, Martin K. Petersen wrote:
>
> Guenter,
>
>> Can you by any chance provide a full traceback ?
>
> My test machines are tied up with something else right now. This is from
> a few days ago (pristine hwmon-next, I believe):
>
> [ 1055.611912] ------------[ cut here ]------------
> [ 1055.611922] WARNING: CPU: 3 PID: 3233 at drivers/base/dd.c:519 really_probe+0x436/0x4f0
> [ 1055.611925] Modules linked in: sd_mod sg ahci libahci libata drivetemp scsi_mod crc32c_intel igb i2c_algo_bit i2c_core dca hwmon ipv6 nf_defrag_ipv6 crc_ccitt
> [ 1055.611955] CPU: 3 PID: 3233 Comm: kworker/u17:1 Tainted: G W 5.5.0-rc1+ #21
> [ 1055.611965] Workqueue: events_unbound async_run_entry_fn
> [ 1055.611973] RIP: 0010:really_probe+0x436/0x4f0
> [ 1055.611979] Code: c7 30 69 f8 82 e8 ba 94 e5 ff e9 60 ff ff ff 48 8d 7b 38 e8 cc d9 b4 ff 48 8b 43 38 48 85 c0 0f 85 41 fd ff ff e9 4f fd ff ff <0f> 0b e9 66 fc ff ff 48 8d 7d 50 e8 aa d9 b4 ff 4c 8b 6d 50 4d 85
> [ 1055.611983] RSP: 0018:ffff8881edb77c98 EFLAGS: 00010287
> [ 1055.611989] RAX: ffff8881e1f8fb80 RBX: ffffffffa033a000 RCX: ffffffff8182e583
> [ 1055.611993] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: ffff8881dec506a8
> [ 1055.611997] RBP: ffff8881dec50238 R08: 0000000000000001 R09: fffffbfff09629ed
> [ 1055.612000] R10: fffffbfff09629ec R11: 0000000000000003 R12: 0000000000000000
> [ 1055.612004] R13: ffff8881dec506a8 R14: ffffffff8182eca0 R15: 000000000000000b
> [ 1055.612009] FS: 0000000000000000(0000) GS:ffff8881f8900000(0000) knlGS:0000000000000000
> [ 1055.612013] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1055.612017] CR2: 00007f957884a000 CR3: 00000001df5ec003 CR4: 00000000000606e0
> [ 1055.612020] Call Trace:
> [ 1055.612038] ? driver_probe_device+0x170/0x170
> [ 1055.612045] driver_probe_device+0x82/0x170
> [ 1055.612058] ? driver_probe_device+0x170/0x170
> [ 1055.612064] __driver_attach_async_helper+0xa3/0xe0
> [ 1055.612076] async_run_entry_fn+0x68/0x2a0
> [ 1055.612094] process_one_work+0x4df/0x990
> [ 1055.612121] ? pwq_dec_nr_in_flight+0x110/0x110
> [ 1055.612127] ? do_raw_spin_lock+0x113/0x1d0
> [ 1055.612161] worker_thread+0x78/0x5c0
> [ 1055.612190] ? process_one_work+0x990/0x990
> [ 1055.612195] kthread+0x1be/0x1e0
> [ 1055.612202] ? kthread_create_worker_on_cpu+0xd0/0xd0
> [ 1055.612215] ret_from_fork+0x3a/0x50
> [ 1055.612251] irq event stamp: 3512
> [ 1055.612259] hardirqs last enabled at (3511): [<ffffffff81d2b874>] _raw_spin_unlock_irq+0x24/0x30
> [ 1055.612265] hardirqs last disabled at (3512): [<ffffffff810029c9>] trace_hardirqs_off_thunk+0x1a/0x1c
> [ 1055.612272] softirqs last enabled at (3500): [<ffffffff820003a5>] __do_softirq+0x3a5/0x5a8
> [ 1055.612281] softirqs last disabled at (3489): [<ffffffff810cec7b>] irq_exit+0xfb/0x100
> [ 1055.612284] ---[ end trace f0a8dd9a37bea031 ]---
>
>> Either case, I would like to track down how the warning happens, so any
>> information you can provide that lets me reproduce the problem would be
>> very helpful.
>
> The three systems that exhibit the problem are stock (2010/2012/2014
> vintage) x86_64 servers with onboard AHCI and a variety of 4-6 SATA
> drives each.
>
> For the qemu test I didn't have ahci configured but I had my SCSI temp
> patch on top of yours and ran modprobe drivetemp; modprobe scsi_debug to
> trigger the warnings.
>

Interesting. Looks like your system performs asynchronous probing.
No idea how that can result in that kind of problem, but who knows.
Can you send me the qemu command line ?

Thanks,
Guenter