2023-05-16 20:41:55

by Yazen Ghannam

[permalink] [raw]
Subject: [PATCH 0/6] Enhance AMD SMN Error Checking

Hi all,

This set implements more robust error checking for AMD System Management
Network (SMN) accesses.

This set is a follow up to this discussion:
https://lore.kernel.org/lkml/[email protected]/

Patches 1-3:
- Pre-patches in AMD64 EDAC and K10Temp modules.
- Required in order to avoid build warnings with the
introduction of the __must_check attribute in patch 4.
Patch 4:
- Introduces __must_check attribute for SMN access functions.
- Handles "PCI Error Response" behavior for SMN reads.
Patches 5-6:
- Optional cleanup patches in k10temp.
- Not required for the SMN access issue, but I thought they may
be good to do.

I've included x86 platform driver folks for awareness, since there are
some AMD SMN users there.

Thanks,
Yazen

Yazen Ghannam (6):
EDAC/amd64: Remove unused register accesses
EDAC/amd64: Check return value of amd_smn_read()
hwmon: (k10temp) Check return value of amd_smn_read()
x86/amd_nb: Enhance SMN access error checking
hwmon: (k10temp) Define helper function to read CCD temp
hwmon: (k10temp) Reduce k10temp_get_ccd_support() parameters

arch/x86/include/asm/amd_nb.h | 4 +--
arch/x86/kernel/amd_nb.c | 46 +++++++++++++++++++++++++----
drivers/edac/amd64_edac.c | 55 ++++++++++++++++++-----------------
drivers/edac/amd64_edac.h | 4 ---
drivers/hwmon/k10temp.c | 45 ++++++++++++++++++----------
5 files changed, 100 insertions(+), 54 deletions(-)

--
2.34.1



2023-05-16 20:44:05

by Yazen Ghannam

[permalink] [raw]
Subject: [PATCH 3/6] hwmon: (k10temp) Check return value of amd_smn_read()

Check the return value of amd_smn_read() before saving a value. This
ensures invalid values aren't saved or used.

There are three cases here with slightly different behavior.

1) read_tempreg_nb_zen():
This is a function pointer which does not include a return code.
In this case, set the register value to 0 on failure. This
enforces Read-as-Zero behavior.

2) k10temp_read_temp():
This function does have return codes, so return -EINVAL on a
failed register read. Continued operation is not necessary,
since there is no valid data from the register. Furthermore, if
the register value was set to 0, then the following operation
would underflow.

3) k10temp_get_ccd_support():
This function reads the same register from multiple CCD
instances in a loop. And a bitmask is formed if a specific bit
is set in each register instance. The loop should continue on a
failed register read, skipping the bit check.

Furthermore, the __must_check attribute will be added to amd_smn_read().
Therefore, this change is required to avoid compile-time warnings.

Signed-off-by: Yazen Ghannam <[email protected]>
Cc: [email protected]
---
drivers/hwmon/k10temp.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c
index 7b177b9fbb09..6ea1fa62b7c1 100644
--- a/drivers/hwmon/k10temp.c
+++ b/drivers/hwmon/k10temp.c
@@ -145,8 +145,9 @@ static void read_tempreg_nb_f15(struct pci_dev *pdev, u32 *regval)

static void read_tempreg_nb_zen(struct pci_dev *pdev, u32 *regval)
{
- amd_smn_read(amd_pci_dev_to_node_id(pdev),
- ZEN_REPORTED_TEMP_CTRL_BASE, regval);
+ if (amd_smn_read(amd_pci_dev_to_node_id(pdev),
+ ZEN_REPORTED_TEMP_CTRL_BASE, regval))
+ *regval = 0;
}

static long get_raw_temp(struct k10temp_data *data)
@@ -213,9 +214,11 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
*val = 0;
break;
case 2 ... 13: /* Tccd{1-12} */
- amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
- ZEN_CCD_TEMP(data->ccd_offset, channel - 2),
- &regval);
+ if (amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
+ ZEN_CCD_TEMP(data->ccd_offset, channel - 2),
+ &regval))
+ return -EINVAL;
+
*val = (regval & ZEN_CCD_TEMP_MASK) * 125 - 49000;
break;
default:
@@ -373,8 +376,10 @@ static void k10temp_get_ccd_support(struct pci_dev *pdev,
int i;

for (i = 0; i < limit; i++) {
- amd_smn_read(amd_pci_dev_to_node_id(pdev),
- ZEN_CCD_TEMP(data->ccd_offset, i), &regval);
+ if (amd_smn_read(amd_pci_dev_to_node_id(pdev),
+ ZEN_CCD_TEMP(data->ccd_offset, i), &regval))
+ continue;
+
if (regval & ZEN_CCD_TEMP_VALID)
data->show_temp |= BIT(TCCD_BIT(i));
}
--
2.34.1


2023-05-16 20:45:00

by Yazen Ghannam

[permalink] [raw]
Subject: [PATCH 2/6] EDAC/amd64: Check return value of amd_smn_read()

Check the return value of amd_smn_read() before saving a value. This
ensures invalid values aren't saved. The struct umc instance is
initialized to 0 during memory allocation. Therefore, a bad read will
keep the value as 0 providing the expected Read-as-Zero behavior.

Furthermore, the __must_check attribute will be added to amd_smn_read().
Therefore, this change is required to avoid compile-time warnings.

Signed-off-by: Yazen Ghannam <[email protected]>
Cc: [email protected]
---
drivers/edac/amd64_edac.c | 38 ++++++++++++++++++++++++++++----------
1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index fda6537c80be..e9aa54e42edc 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -1626,6 +1626,7 @@ static void umc_read_base_mask(struct amd64_pvt *pvt)
u32 *base, *base_sec;
u32 *mask, *mask_sec;
int cs, umc;
+ u32 tmp;

for_each_umc(umc) {
umc_base_reg = get_umc_base(umc) + UMCCH_BASE_ADDR;
@@ -1638,13 +1639,17 @@ static void umc_read_base_mask(struct amd64_pvt *pvt)
base_reg = umc_base_reg + (cs * 4);
base_reg_sec = umc_base_reg_sec + (cs * 4);

- if (!amd_smn_read(pvt->mc_node_id, base_reg, base))
+ if (!amd_smn_read(pvt->mc_node_id, base_reg, &tmp)) {
+ *base = tmp;
edac_dbg(0, " DCSB%d[%d]=0x%08x reg: 0x%x\n",
umc, cs, *base, base_reg);
+ }

- if (!amd_smn_read(pvt->mc_node_id, base_reg_sec, base_sec))
+ if (!amd_smn_read(pvt->mc_node_id, base_reg_sec, &tmp)) {
+ *base_sec = tmp;
edac_dbg(0, " DCSB_SEC%d[%d]=0x%08x reg: 0x%x\n",
umc, cs, *base_sec, base_reg_sec);
+ }
}

umc_mask_reg = get_umc_base(umc) + UMCCH_ADDR_MASK;
@@ -1657,13 +1662,17 @@ static void umc_read_base_mask(struct amd64_pvt *pvt)
mask_reg = umc_mask_reg + (cs * 4);
mask_reg_sec = umc_mask_reg_sec + (cs * 4);

- if (!amd_smn_read(pvt->mc_node_id, mask_reg, mask))
+ if (!amd_smn_read(pvt->mc_node_id, mask_reg, &tmp)) {
+ *mask = tmp;
edac_dbg(0, " DCSM%d[%d]=0x%08x reg: 0x%x\n",
umc, cs, *mask, mask_reg);
+ }

- if (!amd_smn_read(pvt->mc_node_id, mask_reg_sec, mask_sec))
+ if (!amd_smn_read(pvt->mc_node_id, mask_reg_sec, &tmp)) {
+ *mask_sec = tmp;
edac_dbg(0, " DCSM_SEC%d[%d]=0x%08x reg: 0x%x\n",
umc, cs, *mask_sec, mask_reg_sec);
+ }
}
}
}
@@ -3074,7 +3083,7 @@ static void umc_read_mc_regs(struct amd64_pvt *pvt)
{
u8 nid = pvt->mc_node_id;
struct amd64_umc *umc;
- u32 i, umc_base;
+ u32 i, tmp, umc_base;

/* Read registers from each UMC */
for_each_umc(i) {
@@ -3082,11 +3091,20 @@ static void umc_read_mc_regs(struct amd64_pvt *pvt)
umc_base = get_umc_base(i);
umc = &pvt->umc[i];

- amd_smn_read(nid, umc_base + get_umc_reg(pvt, UMCCH_DIMM_CFG), &umc->dimm_cfg);
- amd_smn_read(nid, umc_base + UMCCH_UMC_CFG, &umc->umc_cfg);
- amd_smn_read(nid, umc_base + UMCCH_SDP_CTRL, &umc->sdp_ctrl);
- amd_smn_read(nid, umc_base + UMCCH_ECC_CTRL, &umc->ecc_ctrl);
- amd_smn_read(nid, umc_base + UMCCH_UMC_CAP_HI, &umc->umc_cap_hi);
+ if (!amd_smn_read(nid, umc_base + get_umc_reg(pvt, UMCCH_DIMM_CFG), &tmp))
+ umc->dimm_cfg = tmp;
+
+ if (!amd_smn_read(nid, umc_base + UMCCH_UMC_CFG, &tmp))
+ umc->umc_cfg = tmp;
+
+ if (!amd_smn_read(nid, umc_base + UMCCH_SDP_CTRL, &tmp))
+ umc->sdp_ctrl = tmp;
+
+ if (!amd_smn_read(nid, umc_base + UMCCH_ECC_CTRL, &tmp))
+ umc->ecc_ctrl = tmp;
+
+ if (!amd_smn_read(nid, umc_base + UMCCH_UMC_CAP_HI, &tmp))
+ umc->umc_cap_hi = tmp;
}
}

--
2.34.1


2023-05-16 20:45:16

by Yazen Ghannam

[permalink] [raw]
Subject: [PATCH 4/6] x86/amd_nb: Enhance SMN access error checking

AMD Zen-based systems use a System Management Network (SMN) that
provides access to implementation-specific registers.

SMN accesses are done indirectly through an index/data pair in PCI
config space. The PCI config access may fail and return an error code.
This would prevent the "read" value from being updated, and it would
give an indication to the caller that the read or write operation failed.

However for reads, the PCI config access may succeed, but the return
value may be invalid. This is in similar fashion to PCI bad reads, i.e.
return all bits set.

Most systems will return 0 for SMN addresses that are not accessible.
This is in line with AMD convention that unavailable registers are
Read-as-Zero/Writes-Ignored.

However, some systems will return a "PCI Error Response" instead. This
value, along with an error code of 0 from the PCI config access, will
confuse callers of the amd_smn_read() function.

Check for this condition and set a proper error code for SMN reads.

Furthermore, require error checking for callers of amd_smn_read() and
amd_smn_write(). This is needed because many error conditions cannot
be checked by these functions.

Also, drop the extern keyword as it's not needed. And remove a warning
that will not be trigger in many cases.

Fixes: ddfe43cdc0da ("x86/amd_nb: Add SMN and Indirect Data Fabric access for AMD Fam17h")
Signed-off-by: Yazen Ghannam <[email protected]>
Cc: [email protected]
---
arch/x86/include/asm/amd_nb.h | 4 +--
arch/x86/kernel/amd_nb.c | 46 ++++++++++++++++++++++++++++++-----
2 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/amd_nb.h b/arch/x86/include/asm/amd_nb.h
index ed0eaf65c437..e6fe405aa567 100644
--- a/arch/x86/include/asm/amd_nb.h
+++ b/arch/x86/include/asm/amd_nb.h
@@ -21,8 +21,8 @@ extern int amd_numa_init(void);
extern int amd_get_subcaches(int);
extern int amd_set_subcaches(int, unsigned long);

-extern int amd_smn_read(u16 node, u32 address, u32 *value);
-extern int amd_smn_write(u16 node, u32 address, u32 value);
+int __must_check amd_smn_read(u16 node, u32 address, u32 *value);
+int __must_check amd_smn_write(u16 node, u32 address, u32 value);

struct amd_l3_cache {
unsigned indices;
diff --git a/arch/x86/kernel/amd_nb.c b/arch/x86/kernel/amd_nb.c
index 7e331e8f3692..c37754354cd8 100644
--- a/arch/x86/kernel/amd_nb.c
+++ b/arch/x86/kernel/amd_nb.c
@@ -157,6 +157,38 @@ static struct pci_dev *next_northbridge(struct pci_dev *dev,
return dev;
}

+/*
+ * SMN accesses may fail in ways that are difficult to detect here in the called
+ * functions smn_read() and smn_write(). Therefore, callers of these functions
+ * must do their own checking based on what behavior they expect.
+ *
+ * For SMN reads, the returned SMN value may be zero if the register is
+ * Read-as-Zero . Or it may be a "PCI Error Response", e.g. all 0xFFs. The "PCI
+ * Error Response" can be checked here, and a proper error code can be returned.
+ * But the Read-as-Zero response cannot be verified here. A value of 0 may be
+ * correct in some cases, so callers must check that this correct is for the
+ * register/fields they need.
+ *
+ * For SMN writes, success can be determined through a "write and read back"
+ * procedure. However, this is not robust when done here.
+ *
+ * Possible issues:
+ * 1) Bits that are "Write-1-to-Clear". In this case, the read value should
+ * *not* match the write value.
+ * 2) Bits that are "Read-as-Zero"/"Writes-Ignored". This information cannot be
+ * known here.
+ * 3) Bits that are "Reserved / Set to 1". Ditto above.
+ *
+ * Callers of amd_smn_write() should do the "write and read back" check themselves,
+ * if needed.
+ *
+ * For #1, they can see if their target bits got cleared.
+ *
+ * For #2 and #3, they can check if their target bits got set as intended.
+ *
+ * This matches what is done for rdmsr/wrmsr. As long as there's no #GP, then
+ * the operation is considered a success, and the caller does their own checking.
+ */
static int __amd_smn_rw(u16 node, u32 address, u32 *value, bool write)
{
struct pci_dev *root;
@@ -179,9 +211,6 @@ static int __amd_smn_rw(u16 node, u32 address, u32 *value, bool write)

err = (write ? pci_write_config_dword(root, 0x64, *value)
: pci_read_config_dword(root, 0x64, value));
- if (err)
- pr_warn("Error %s SMN address 0x%x.\n",
- (write ? "writing to" : "reading from"), address);

out_unlock:
mutex_unlock(&smn_mutex);
@@ -190,13 +219,18 @@ static int __amd_smn_rw(u16 node, u32 address, u32 *value, bool write)
return err;
}

-int amd_smn_read(u16 node, u32 address, u32 *value)
+int __must_check amd_smn_read(u16 node, u32 address, u32 *value)
{
- return __amd_smn_rw(node, address, value, false);
+ int err = __amd_smn_rw(node, address, value, false);
+
+ if (PCI_POSSIBLE_ERROR(*value))
+ err = -ENODEV;
+
+ return err;
}
EXPORT_SYMBOL_GPL(amd_smn_read);

-int amd_smn_write(u16 node, u32 address, u32 value)
+int __must_check amd_smn_write(u16 node, u32 address, u32 value)
{
return __amd_smn_rw(node, address, &value, true);
}
--
2.34.1


2023-05-17 12:56:53

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH 3/6] hwmon: (k10temp) Check return value of amd_smn_read()

On Tue, May 16, 2023 at 03:24:27PM -0500, Yazen Ghannam wrote:
> Check the return value of amd_smn_read() before saving a value. This
> ensures invalid values aren't saved or used.
>
> There are three cases here with slightly different behavior.
>
> 1) read_tempreg_nb_zen():
> This is a function pointer which does not include a return code.
> In this case, set the register value to 0 on failure. This
> enforces Read-as-Zero behavior.
>
> 2) k10temp_read_temp():
> This function does have return codes, so return -EINVAL on a
> failed register read. Continued operation is not necessary,
> since there is no valid data from the register. Furthermore, if
> the register value was set to 0, then the following operation
> would underflow.
>
> 3) k10temp_get_ccd_support():
> This function reads the same register from multiple CCD
> instances in a loop. And a bitmask is formed if a specific bit
> is set in each register instance. The loop should continue on a
> failed register read, skipping the bit check.
>
> Furthermore, the __must_check attribute will be added to amd_smn_read().
> Therefore, this change is required to avoid compile-time warnings.
>
> Signed-off-by: Yazen Ghannam <[email protected]>
> Cc: [email protected]
> ---
> drivers/hwmon/k10temp.c | 19 ++++++++++++-------
> 1 file changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c
> index 7b177b9fbb09..6ea1fa62b7c1 100644
> --- a/drivers/hwmon/k10temp.c
> +++ b/drivers/hwmon/k10temp.c
> @@ -145,8 +145,9 @@ static void read_tempreg_nb_f15(struct pci_dev *pdev, u32 *regval)
>
> static void read_tempreg_nb_zen(struct pci_dev *pdev, u32 *regval)
> {
> - amd_smn_read(amd_pci_dev_to_node_id(pdev),
> - ZEN_REPORTED_TEMP_CTRL_BASE, regval);
> + if (amd_smn_read(amd_pci_dev_to_node_id(pdev),
> + ZEN_REPORTED_TEMP_CTRL_BASE, regval))
> + *regval = 0;
> }
>
> static long get_raw_temp(struct k10temp_data *data)
> @@ -213,9 +214,11 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
> *val = 0;
> break;
> case 2 ... 13: /* Tccd{1-12} */
> - amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
> - ZEN_CCD_TEMP(data->ccd_offset, channel - 2),
> - &regval);
> + if (amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
> + ZEN_CCD_TEMP(data->ccd_offset, channel - 2),
> + &regval))
> + return -EINVAL;
> +

-EINVAL: Invalid Argument, supposed to be used for bad user input.
I don't see how that would apply here. amd_smn_read() returns
a valid error code. This error core should be returned to the caller,
or there needs to be an explanation why this is not appropriate.

> *val = (regval & ZEN_CCD_TEMP_MASK) * 125 - 49000;
> break;
> default:
> @@ -373,8 +376,10 @@ static void k10temp_get_ccd_support(struct pci_dev *pdev,
> int i;
>
> for (i = 0; i < limit; i++) {
> - amd_smn_read(amd_pci_dev_to_node_id(pdev),
> - ZEN_CCD_TEMP(data->ccd_offset, i), &regval);
> + if (amd_smn_read(amd_pci_dev_to_node_id(pdev),
> + ZEN_CCD_TEMP(data->ccd_offset, i), &regval))
> + continue;
> +
The reason for ignoring the error should be explained here.

> if (regval & ZEN_CCD_TEMP_VALID)
> data->show_temp |= BIT(TCCD_BIT(i));
> }
> --
> 2.34.1
>

2023-05-17 14:17:44

by Yazen Ghannam

[permalink] [raw]
Subject: Re: [PATCH 3/6] hwmon: (k10temp) Check return value of amd_smn_read()

On 5/17/23 8:25 AM, Guenter Roeck wrote:

[...]

>> static long get_raw_temp(struct k10temp_data *data)
>> @@ -213,9 +214,11 @@ static int k10temp_read_temp(struct device *dev, u32 attr, int channel,
>> *val = 0;
>> break;
>> case 2 ... 13: /* Tccd{1-12} */
>> - amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
>> - ZEN_CCD_TEMP(data->ccd_offset, channel - 2),
>> - &regval);
>> + if (amd_smn_read(amd_pci_dev_to_node_id(data->pdev),
>> + ZEN_CCD_TEMP(data->ccd_offset, channel - 2),
>> + &regval))
>> + return -EINVAL;
>> +
>
> -EINVAL: Invalid Argument, supposed to be used for bad user input.
> I don't see how that would apply here. amd_smn_read() returns
> a valid error code. This error core should be returned to the caller,
> or there needs to be an explanation why this is not appropriate.
>

Understood. Will change it to return the amd_smn_read() error code.

>> *val = (regval & ZEN_CCD_TEMP_MASK) * 125 - 49000;
>> break;
>> default:
>> @@ -373,8 +376,10 @@ static void k10temp_get_ccd_support(struct pci_dev *pdev,
>> int i;
>>
>> for (i = 0; i < limit; i++) {
>> - amd_smn_read(amd_pci_dev_to_node_id(pdev),
>> - ZEN_CCD_TEMP(data->ccd_offset, i), &regval);
>> + if (amd_smn_read(amd_pci_dev_to_node_id(pdev),
>> + ZEN_CCD_TEMP(data->ccd_offset, i), &regval))
>> + continue;
>> +
> The reason for ignoring the error should be explained here.
>

Sure thing. I'll add a code comment above.

Thanks,
Yazen