LinuxLists.cc - [PATCH v3 0/8] AMD64 EDAC fixes

2019-08-22 00:12:10

Subject: [PATCH v3 0/8] AMD64 EDAC fixes

From: Yazen Ghannam <[email protected]>

Hi Boris,

This set contains a few fixes for some changes merged in v5.2. There
are also a couple of fixes for older issues. In addition, there are a
couple of patches to add support for Asymmetric Dual-Rank DIMMs.

I don't have the failing config readily available that you used, but I
believe I found the issue. Please let me know how it goes.

I've also added RFC patches to avoid the "ECC disabled" message for
nodes without memory. I haven't fully tested these, but I wanted to get
your thoughts. Here's an earlier discussion:
https://lkml.kernel.org/r/[email protected]

Thanks,
Yazen

Link:
https://lkml.kernel.org/r/[email protected]

v2->v3:
* Drop Fixes: tags in patch 1.
* Add detection of x8 DRAM devices in patches 2 and 3.
* Fix Chip Select size printing in patch 4.
* Added RFC patches to avoid "ECC disabled" message for nodes without memory.

v1->v2:
* Squash patches 1 and 2 together.

Yazen Ghannam (10):
EDAC/amd64: Support more than two controllers for chip selects
handling
EDAC/amd64: Recognize DRAM device type with EDAC_CTL_CAP
EDAC/amd64: Initialize DIMM info for systems with more than two
channels
EDAC/amd64: Find Chip Select memory size using Address Mask
EDAC/amd64: Decode syndrome before translating address
EDAC/amd64: Cache secondary Chip Select registers
EDAC/amd64: Support Asymmetric Dual-Rank DIMMs
EDAC/amd64: Gather hardware information early
EDAC/amd64: Use cached data when checking for ECC
EDAC/amd64: Check for memory before fully initializing an instance

drivers/edac/amd64_edac.c | 371 +++++++++++++++++++++++++-------------
drivers/edac/amd64_edac.h | 9 +-
2 files changed, 251 insertions(+), 129 deletions(-)

--
2.17.1

2019-08-22 00:12:19

by Yazen Ghannam

[permalink] [raw]

Subject: [RFC PATCH v3 09/10] EDAC/amd64: Use cached data when checking for ECC

From: Yazen Ghannam <[email protected]>

...now that the data is available earlier.

Signed-off-by: Yazen Ghannam <[email protected]>
---
drivers/edac/amd64_edac.c | 20 ++++++++------------
1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 84832771dec0..c1cb0234f085 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3183,31 +3183,27 @@ static const char *ecc_msg =
"'ecc_enable_override'.\n"
" (Note that use of the override may cause unknown side effects.)\n";

-static bool ecc_enabled(struct pci_dev *F3, u16 nid)
+static bool ecc_enabled(struct amd64_pvt *pvt)
{
+ u16 nid = pvt->mc_node_id;
bool nb_mce_en = false;
u8 ecc_en = 0, i;
u32 value;

if (boot_cpu_data.x86 >= 0x17) {
u8 umc_en_mask = 0, ecc_en_mask = 0;
+ struct amd64_umc *umc;

for_each_umc(i) {
- u32 base = get_umc_base(i);
+ umc = &pvt->umc[i];

/* Only check enabled UMCs. */
- if (amd_smn_read(nid, base + UMCCH_SDP_CTRL, &value))
- continue;
-
- if (!(value & UMC_SDP_INIT))
+ if (!(umc->sdp_ctrl & UMC_SDP_INIT))
continue;

umc_en_mask |= BIT(i);

- if (amd_smn_read(nid, base + UMCCH_UMC_CAP_HI, &value))
- continue;
-
- if (value & UMC_ECC_ENABLED)
+ if (umc->umc_cap_hi & UMC_ECC_ENABLED)
ecc_en_mask |= BIT(i);
}

@@ -3220,7 +3216,7 @@ static bool ecc_enabled(struct pci_dev *F3, u16 nid)
/* Assume UMC MCA banks are enabled. */
nb_mce_en = true;
} else {
- amd64_read_pci_cfg(F3, NBCFG, &value);
+ amd64_read_pci_cfg(pvt->F3, NBCFG, &value);

ecc_en = !!(value & NBCFG_ECC_ENABLE);

@@ -3539,7 +3535,7 @@ static int probe_one_instance(unsigned int nid)
if (ret < 0)
goto err_enable;

- if (!ecc_enabled(F3, nid)) {
+ if (!ecc_enabled(pvt)) {
ret = 0;

if (!ecc_enable_override)
--
2.17.1

2019-08-22 00:13:11

by Yazen Ghannam

[permalink] [raw]

Subject: [RFC PATCH v3 10/10] EDAC/amd64: Check for memory before fully initializing an instance

2019-08-22 02:20:03

by Yazen Ghannam

[permalink] [raw]

Subject: [PATCH v3 3/8] EDAC/amd64: Initialize DIMM info for systems with more than two channels

From: Yazen Ghannam <[email protected]>

Currently, the DIMM info for AMD Family 17h systems is initialized in
init_csrows(). This function is shared with legacy systems, and it has a
limit of two channel support.

This prevents initialization of the DIMM info for a number of ranks, so
there will be missing ranks in the EDAC sysfs.

Create a new init_csrows_df() for Family17h+ and revert init_csrows()
back to pre-Family17h support.

Loop over all channels in the new function in order to support systems
with more than two channels.

Signed-off-by: Yazen Ghannam <[email protected]>
---
Link:
https://lkml.kernel.org/r/[email protected]

v2->v3:
* Drop Fixes: tag.
* Add x8 DRAM device case.

v1->v2:
* No change.

drivers/edac/amd64_edac.c | 66 ++++++++++++++++++++++++++++++---------
1 file changed, 52 insertions(+), 14 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 0e8b2137edbb..001dc85122e9 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2837,6 +2837,49 @@ static u32 get_csrow_nr_pages(struct amd64_pvt *pvt, u8 dct, int csrow_nr_orig)
return nr_pages;
}

+static int init_csrows_df(struct mem_ctl_info *mci)
+{
+ struct amd64_pvt *pvt = mci->pvt_info;
+ enum edac_type edac_mode = EDAC_NONE;
+ enum dev_type dev_type = DEV_UNKNOWN;
+ struct dimm_info *dimm;
+ int empty = 1;
+ u8 umc, cs;
+
+ if (mci->edac_ctl_cap & EDAC_FLAG_S16ECD16ED) {
+ edac_mode = EDAC_S16ECD16ED;
+ dev_type = DEV_X16;
+ } else if (mci->edac_ctl_cap & EDAC_FLAG_S8ECD8ED) {
+ edac_mode = EDAC_S8ECD8ED;
+ dev_type = DEV_X8;
+ } else if (mci->edac_ctl_cap & EDAC_FLAG_S4ECD4ED) {
+ edac_mode = EDAC_S4ECD4ED;
+ dev_type = DEV_X4;
+ } else if (mci->edac_ctl_cap & EDAC_FLAG_SECDED) {
+ edac_mode = EDAC_SECDED;
+ }
+
+ for_each_umc(umc) {
+ for_each_chip_select(cs, umc, pvt) {
+ if (!csrow_enabled(cs, umc, pvt))
+ continue;
+
+ empty = 0;
+ dimm = mci->csrows[cs]->channels[umc]->dimm;
+
+ edac_dbg(1, "MC node: %d, csrow: %d\n",
+ pvt->mc_node_id, cs);
+
+ dimm->nr_pages = get_csrow_nr_pages(pvt, umc, cs);
+ dimm->mtype = pvt->dram_type;
+ dimm->edac_mode = edac_mode;
+ dimm->dtype = dev_type;
+ }
+ }
+
+ return empty;
+}
+
/*
* Initialize the array of csrow attribute instances, based on the values
* from pci config hardware registers.
@@ -2851,15 +2894,16 @@ static int init_csrows(struct mem_ctl_info *mci)
int nr_pages = 0;
u32 val;

- if (!pvt->umc) {
- amd64_read_pci_cfg(pvt->F3, NBCFG, &val);
+ if (pvt->umc)
+ return init_csrows_df(mci);
+
+ amd64_read_pci_cfg(pvt->F3, NBCFG, &val);

- pvt->nbcfg = val;
+ pvt->nbcfg = val;

- edac_dbg(0, "node %d, NBCFG=0x%08x[ChipKillEccCap: %d|DramEccEn: %d]\n",
- pvt->mc_node_id, val,
- !!(val & NBCFG_CHIPKILL), !!(val & NBCFG_ECC_ENABLE));
- }
+ edac_dbg(0, "node %d, NBCFG=0x%08x[ChipKillEccCap: %d|DramEccEn: %d]\n",
+ pvt->mc_node_id, val,
+ !!(val & NBCFG_CHIPKILL), !!(val & NBCFG_ECC_ENABLE));

/*
* We iterate over DCT0 here but we look at DCT1 in parallel, if needed.
@@ -2896,13 +2940,7 @@ static int init_csrows(struct mem_ctl_info *mci)
edac_dbg(1, "Total csrow%d pages: %u\n", i, nr_pages);

/* Determine DIMM ECC mode: */
- if (pvt->umc) {
- if (mci->edac_ctl_cap & EDAC_FLAG_S4ECD4ED)
- edac_mode = EDAC_S4ECD4ED;
- else if (mci->edac_ctl_cap & EDAC_FLAG_SECDED)
- edac_mode = EDAC_SECDED;
-
- } else if (pvt->nbcfg & NBCFG_ECC_ENABLE) {
+ if (pvt->nbcfg & NBCFG_ECC_ENABLE) {
edac_mode = (pvt->nbcfg & NBCFG_CHIPKILL)
? EDAC_S4ECD4ED
: EDAC_SECDED;
--
2.17.1

2019-08-22 02:20:43

by Yazen Ghannam

[permalink] [raw]

Subject: [PATCH v3 4/8] EDAC/amd64: Find Chip Select memory size using Address Mask

From: Yazen Ghannam <[email protected]>

Chip Select memory size reporting on AMD Family 17h was recently fixed
in order to account for interleaving. However, the current method is not
robust.

The Chip Select Address Mask can be used to find the memory size. There
are a couple of cases.

1) For single-rank and dual-rank non-interleaved, use the address mask
plus 1 as the size.

2) For dual-rank interleaved, do #1 but "de-interleave" the address mask
first.

Always "de-interleave" the address mask in order to simplify the code
flow. Bit mask manipulation is necessary to check for interleaving, so
just go ahead and do the de-interleaving. In the non-interleaved case,
the original and de-interleaved address masks will be the same.

To de-interleave the mask, count the number of zero bits in the middle
of the mask and swap them with the most significant bits.

For example,
Original=0xFFFF9FE, De-interleaved=0x3FFFFFE

Signed-off-by: Yazen Ghannam <[email protected]>
---
Link:
https://lkml.kernel.org/r/[email protected]

v2->v3:
* Drop Fixes: tag.
* Add checks to only return CS size for enabled CSes.

v1->v2:
* No change.

drivers/edac/amd64_edac.c | 114 +++++++++++++++++++++++---------------
1 file changed, 70 insertions(+), 44 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 001dc85122e9..c4f2d7c59b04 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -788,51 +788,39 @@ static void debug_dump_dramcfg_low(struct amd64_pvt *pvt, u32 dclr, int chan)
(dclr & BIT(15)) ? "yes" : "no");
}

-/*
- * The Address Mask should be a contiguous set of bits in the non-interleaved
- * case. So to check for CS interleaving, find the most- and least-significant
- * bits of the mask, generate a contiguous bitmask, and compare the two.
- */
-static bool f17_cs_interleaved(struct amd64_pvt *pvt, u8 ctrl, int cs)
+#define CS_EVEN_PRIMARY BIT(0)
+#define CS_ODD_PRIMARY BIT(1)
+
+#define CS_EVEN CS_EVEN_PRIMARY
+#define CS_ODD CS_ODD_PRIMARY
+
+static int f17_get_cs_mode(int dimm, u8 ctrl, struct amd64_pvt *pvt)
{
- u32 mask = pvt->csels[ctrl].csmasks[cs >> 1];
- u32 msb = fls(mask) - 1, lsb = ffs(mask) - 1;
- u32 test_mask = GENMASK(msb, lsb);
+ int cs_mode = 0;

- edac_dbg(1, "mask=0x%08x test_mask=0x%08x\n", mask, test_mask);
+ if (csrow_enabled(2 * dimm, ctrl, pvt))
+ cs_mode |= CS_EVEN_PRIMARY;

- return mask ^ test_mask;
+ if (csrow_enabled(2 * dimm + 1, ctrl, pvt))
+ cs_mode |= CS_ODD_PRIMARY;
+
+ return cs_mode;
}

static void debug_display_dimm_sizes_df(struct amd64_pvt *pvt, u8 ctrl)
{
- int dimm, size0, size1, cs0, cs1;
+ int dimm, size0, size1, cs0, cs1, cs_mode;

edac_printk(KERN_DEBUG, EDAC_MC, "UMC%d chip selects:\n", ctrl);

for (dimm = 0; dimm < 2; dimm++) {
- size0 = 0;
cs0 = dimm * 2;
-
- if (csrow_enabled(cs0, ctrl, pvt))
- size0 = pvt->ops->dbam_to_cs(pvt, ctrl, 0, cs0);
-
- size1 = 0;
cs1 = dimm * 2 + 1;

- if (csrow_enabled(cs1, ctrl, pvt)) {
- /*
- * CS interleaving is only supported if both CSes have
- * the same amount of memory. Because they are
- * interleaved, it will look like both CSes have the
- * full amount of memory. Save the size for both as
- * half the amount we found on CS0, if interleaved.
- */
- if (f17_cs_interleaved(pvt, ctrl, cs1))
- size1 = size0 = (size0 >> 1);
- else
- size1 = pvt->ops->dbam_to_cs(pvt, ctrl, 0, cs1);
- }
+ cs_mode = f17_get_cs_mode(dimm, ctrl, pvt);
+
+ size0 = pvt->ops->dbam_to_cs(pvt, ctrl, cs_mode, cs0);
+ size1 = pvt->ops->dbam_to_cs(pvt, ctrl, cs_mode, cs1);

amd64_info(EDAC_MC ": %d: %5dMB %d: %5dMB\n",
cs0, size0,
@@ -1569,18 +1557,54 @@ static int f16_dbam_to_chip_select(struct amd64_pvt *pvt, u8 dct,
return ddr3_cs_size(cs_mode, false);
}

-static int f17_base_addr_to_cs_size(struct amd64_pvt *pvt, u8 umc,
+static int f17_addr_mask_to_cs_size(struct amd64_pvt *pvt, u8 umc,
unsigned int cs_mode, int csrow_nr)
{
- u32 base_addr = pvt->csels[umc].csbases[csrow_nr];
+ u32 addr_mask_orig, addr_mask_deinterleaved;
+ u32 msb, weight, num_zero_bits;
+ int dimm, size = 0;

- /* Each mask is used for every two base addresses. */
- u32 addr_mask = pvt->csels[umc].csmasks[csrow_nr >> 1];
+ /* No Chip Selects are enabled. */
+ if (!cs_mode)
+ return size;

- /* Register [31:1] = Address [39:9]. Size is in kBs here. */
- u32 size = ((addr_mask >> 1) - (base_addr >> 1) + 1) >> 1;
+ /* Requested size of an even CS but none are enabled. */
+ if (!(cs_mode & CS_EVEN) && !(csrow_nr & 1))
+ return size;

- edac_dbg(1, "BaseAddr: 0x%x, AddrMask: 0x%x\n", base_addr, addr_mask);
+ /* Requested size of an odd CS but none are enabled. */
+ if (!(cs_mode & CS_ODD) && (csrow_nr & 1))
+ return size;
+
+ /*
+ * There is one mask per DIMM, and two Chip Selects per DIMM.
+ * CS0 and CS1 -> DIMM0
+ * CS2 and CS3 -> DIMM1
+ */
+ dimm = csrow_nr >> 1;
+
+ addr_mask_orig = pvt->csels[umc].csmasks[dimm];
+
+ /*
+ * The number of zero bits in the mask is equal to the number of bits
+ * in a full mask minus the number of bits in the current mask.
+ *
+ * The MSB is the number of bits in the full mask because BIT[0] is
+ * always 0.
+ */
+ msb = fls(addr_mask_orig) - 1;
+ weight = hweight_long(addr_mask_orig);
+ num_zero_bits = msb - weight;
+
+ /* Take the number of zero bits off from the top of the mask. */
+ addr_mask_deinterleaved = GENMASK_ULL(msb - num_zero_bits, 1);
+
+ edac_dbg(1, "CS%d DIMM%d AddrMasks:\n", csrow_nr, dimm);
+ edac_dbg(1, " Original AddrMask: 0x%x\n", addr_mask_orig);
+ edac_dbg(1, " Deinterleaved AddrMask: 0x%x\n", addr_mask_deinterleaved);
+
+ /* Register [31:1] = Address [39:9]. Size is in kBs here. */
+ size = (addr_mask_deinterleaved >> 2) + 1;

/* Return size in MBs. */
return size >> 10;
@@ -2245,7 +2269,7 @@ static struct amd64_family_type family_types[] = {
.f6_id = PCI_DEVICE_ID_AMD_17H_DF_F6,
.ops = {
.early_channel_count = f17_early_channel_count,
- .dbam_to_cs = f17_base_addr_to_cs_size,
+ .dbam_to_cs = f17_addr_mask_to_cs_size,
}
},
[F17_M10H_CPUS] = {
@@ -2254,7 +2278,7 @@ static struct amd64_family_type family_types[] = {
.f6_id = PCI_DEVICE_ID_AMD_17H_M10H_DF_F6,
.ops = {
.early_channel_count = f17_early_channel_count,
- .dbam_to_cs = f17_base_addr_to_cs_size,
+ .dbam_to_cs = f17_addr_mask_to_cs_size,
}
},
[F17_M30H_CPUS] = {
@@ -2263,7 +2287,7 @@ static struct amd64_family_type family_types[] = {
.f6_id = PCI_DEVICE_ID_AMD_17H_M30H_DF_F6,
.ops = {
.early_channel_count = f17_early_channel_count,
- .dbam_to_cs = f17_base_addr_to_cs_size,
+ .dbam_to_cs = f17_addr_mask_to_cs_size,
}
},
};
@@ -2822,10 +2846,12 @@ static u32 get_csrow_nr_pages(struct amd64_pvt *pvt, u8 dct, int csrow_nr_orig)
int csrow_nr = csrow_nr_orig;
u32 cs_mode, nr_pages;

- if (!pvt->umc)
+ if (!pvt->umc) {
csrow_nr >>= 1;
-
- cs_mode = DBAM_DIMM(csrow_nr, dbam);
+ cs_mode = DBAM_DIMM(csrow_nr, dbam);
+ } else {
+ cs_mode = f17_get_cs_mode(csrow_nr >> 1, dct, pvt);
+ }

nr_pages = pvt->ops->dbam_to_cs(pvt, dct, cs_mode, csrow_nr);
nr_pages <<= 20 - PAGE_SHIFT;
--
2.17.1

2019-08-22 02:21:36

by Yazen Ghannam

[permalink] [raw]

Subject: [PATCH v3 6/8] EDAC/amd64: Cache secondary Chip Select registers

From: Yazen Ghannam <[email protected]>

AMD Family 17h systems have a set of secondary Chip Select Base
Addresses and Address Masks. These do not represent unique Chip
Selects, rather they are used in conjunction with the primary
Chip Select registers in certain use cases.

Cache these secondary Chip Select registers for future use.

Signed-off-by: Yazen Ghannam <[email protected]>
---
Link:
https://lkml.kernel.org/r/[email protected]

v2->v3:
* No change.

v1->v2:
* No change.

drivers/edac/amd64_edac.c | 23 ++++++++++++++++++++---
drivers/edac/amd64_edac.h | 4 ++++
2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index de141de7b2e5..26ce48fcaf00 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -946,34 +946,51 @@ static void prep_chip_selects(struct amd64_pvt *pvt)

static void read_umc_base_mask(struct amd64_pvt *pvt)
{
- u32 umc_base_reg, umc_mask_reg;
- u32 base_reg, mask_reg;
- u32 *base, *mask;
+ u32 umc_base_reg, umc_base_reg_sec;
+ u32 umc_mask_reg, umc_mask_reg_sec;
+ u32 base_reg, base_reg_sec;
+ u32 mask_reg, mask_reg_sec;
+ u32 *base, *base_sec;
+ u32 *mask, *mask_sec;
int cs, umc;

for_each_umc(umc) {
umc_base_reg = get_umc_base(umc) + UMCCH_BASE_ADDR;
+ umc_base_reg_sec = get_umc_base(umc) + UMCCH_BASE_ADDR_SEC;

for_each_chip_select(cs, umc, pvt) {
base = &pvt->csels[umc].csbases[cs];
+ base_sec = &pvt->csels[umc].csbases_sec[cs];

base_reg = umc_base_reg + (cs * 4);
+ base_reg_sec = umc_base_reg_sec + (cs * 4);

if (!amd_smn_read(pvt->mc_node_id, base_reg, base))
edac_dbg(0, " DCSB%d[%d]=0x%08x reg: 0x%x\n",
umc, cs, *base, base_reg);
+
+ if (!amd_smn_read(pvt->mc_node_id, base_reg_sec, base_sec))
+ edac_dbg(0, " DCSB_SEC%d[%d]=0x%08x reg: 0x%x\n",
+ umc, cs, *base_sec, base_reg_sec);
}

umc_mask_reg = get_umc_base(umc) + UMCCH_ADDR_MASK;
+ umc_mask_reg_sec = get_umc_base(umc) + UMCCH_ADDR_MASK_SEC;

for_each_chip_select_mask(cs, umc, pvt) {
mask = &pvt->csels[umc].csmasks[cs];
+ mask_sec = &pvt->csels[umc].csmasks_sec[cs];

mask_reg = umc_mask_reg + (cs * 4);
+ mask_reg_sec = umc_mask_reg_sec + (cs * 4);

if (!amd_smn_read(pvt->mc_node_id, mask_reg, mask))
edac_dbg(0, " DCSM%d[%d]=0x%08x reg: 0x%x\n",
umc, cs, *mask, mask_reg);
+
+ if (!amd_smn_read(pvt->mc_node_id, mask_reg_sec, mask_sec))
+ edac_dbg(0, " DCSM_SEC%d[%d]=0x%08x reg: 0x%x\n",
+ umc, cs, *mask_sec, mask_reg_sec);
}
}
}
diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
index 4dce6a2ac75f..68f12de6e654 100644
--- a/drivers/edac/amd64_edac.h
+++ b/drivers/edac/amd64_edac.h
@@ -259,7 +259,9 @@

/* UMC CH register offsets */
#define UMCCH_BASE_ADDR 0x0
+#define UMCCH_BASE_ADDR_SEC 0x10
#define UMCCH_ADDR_MASK 0x20
+#define UMCCH_ADDR_MASK_SEC 0x28
#define UMCCH_ADDR_CFG 0x30
#define UMCCH_DIMM_CFG 0x80
#define UMCCH_UMC_CFG 0x100
@@ -312,9 +314,11 @@ struct dram_range {
/* A DCT chip selects collection */
struct chip_select {
u32 csbases[NUM_CHIPSELECTS];
+ u32 csbases_sec[NUM_CHIPSELECTS];
u8 b_cnt;

u32 csmasks[NUM_CHIPSELECTS];
+ u32 csmasks_sec[NUM_CHIPSELECTS];
u8 m_cnt;
};

--
2.17.1

2019-08-22 02:22:39

by Yazen Ghannam

[permalink] [raw]

Subject: [PATCH v3 2/8] EDAC/amd64: Recognize DRAM device type with EDAC_CTL_CAP

From: Yazen Ghannam <[email protected]>

AMD Family 17h systems support x4 and x16 DRAM devices. However, the
device type is not checked when setting EDAC_CTL_CAP.

Set the appropriate EDAC_CTL_CAP flag based on the device type.

Default to x8 DRAM device when neither the x4 or x16 bits are set.

Fixes: 2d09d8f301f5 ("EDAC, amd64: Determine EDAC MC capabilities on Fam17h")
Signed-off-by: Yazen Ghannam <[email protected]>
---
Link:
https://lkml.kernel.org/r/[email protected]

v2->v3:
* Add case for x8 DRAM devices.

v1->v2:
* No change.

drivers/edac/amd64_edac.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index dd60cf5a3d96..0e8b2137edbb 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3150,12 +3150,15 @@ static bool ecc_enabled(struct pci_dev *F3, u16 nid)
static inline void
f17h_determine_edac_ctl_cap(struct mem_ctl_info *mci, struct amd64_pvt *pvt)
{
- u8 i, ecc_en = 1, cpk_en = 1;
+ u8 i, ecc_en = 1, cpk_en = 1, dev_x4 = 1, dev_x16 = 1;

for_each_umc(i) {
if (pvt->umc[i].sdp_ctrl & UMC_SDP_INIT) {
ecc_en &= !!(pvt->umc[i].umc_cap_hi & UMC_ECC_ENABLED);
cpk_en &= !!(pvt->umc[i].umc_cap_hi & UMC_ECC_CHIPKILL_CAP);
+
+ dev_x4 &= !!(pvt->umc[i].dimm_cfg & BIT(6));
+ dev_x16 &= !!(pvt->umc[i].dimm_cfg & BIT(7));
}
}

@@ -3163,8 +3166,14 @@ f17h_determine_edac_ctl_cap(struct mem_ctl_info *mci, struct amd64_pvt *pvt)
if (ecc_en) {
mci->edac_ctl_cap |= EDAC_FLAG_SECDED;

- if (cpk_en)
- mci->edac_ctl_cap |= EDAC_FLAG_S4ECD4ED;
+ if (cpk_en) {
+ if (dev_x4)
+ mci->edac_ctl_cap |= EDAC_FLAG_S4ECD4ED;
+ else if (dev_x16)
+ mci->edac_ctl_cap |= EDAC_FLAG_S16ECD16ED;
+ else
+ mci->edac_ctl_cap |= EDAC_FLAG_S8ECD8ED;
+ }
}
}

--
2.17.1

2019-08-22 05:09:38

by Adam Borowski

[permalink] [raw]

Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes

On Wed, Aug 21, 2019 at 11:59:53PM +0000, Ghannam, Yazen wrote:
> I've also added RFC patches to avoid the "ECC disabled" message for
> nodes without memory. I haven't fully tested these, but I wanted to get
> your thoughts. Here's an earlier discussion:
> https://lkml.kernel.org/r/[email protected]

While you're editing that code, could you please also cut the spam if ECC is
actually disabled? For example, a 2990WX with non-ECC RAM gets 1024 lines;
64 copies of:

[ 8.186164] EDAC amd64: Node 0: DRAM ECC disabled.
[ 8.188364] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
[ 8.194762] EDAC amd64: Node 1: DRAM ECC disabled.
[ 8.196307] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
[ 8.199840] EDAC amd64: Node 2: DRAM ECC disabled.
[ 8.200963] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
[ 8.204326] EDAC amd64: Node 3: DRAM ECC disabled.
[ 8.205436] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)

Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋ The root of a real enemy is an imaginary friend.
⠈⠳⣄⠀⠀⠀⠀

2019-08-22 09:42:03

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes

On Thu, Aug 22, 2019 at 02:50:20AM +0200, Adam Borowski wrote:
> While you're editing that code, could you please also cut the spam if ECC is
> actually disabled? For example, a 2990WX with non-ECC RAM gets 1024 lines;

Patch is in there. I'll give you extra points if you spot it.

> Meow!

Wuff!

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2019-08-22 13:10:09

by Adam Borowski

[permalink] [raw]

Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes

On Thu, Aug 22, 2019 at 10:35:48AM +0200, Borislav Petkov wrote:
> On Thu, Aug 22, 2019 at 02:50:20AM +0200, Adam Borowski wrote:
> > While you're editing that code, could you please also cut the spam if ECC is
> > actually disabled? For example, a 2990WX with non-ECC RAM gets 1024 lines;
>
> Patch is in there. I'll give you extra points if you spot it.

Yeah, some of messages are no longer emitted for memory-less nodes (NUMA 1
and 3). Your patch set also overhauls the messages.

But, the amount of redundant messages I'm complaining about has actually
increased:

dmesg|grep EDAC|cut -c 16-|sort|uniq -c
256 EDAC MC: UMC0 chip selects:
256 EDAC MC: UMC1 chip selects:
1 EDAC MC: Ver: 3.0.0
128 EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
^ three lines each
64 EDAC amd64: F17h detected (node 0).
64 EDAC amd64: F17h detected (node 1).
64 EDAC amd64: F17h detected (node 2).
64 EDAC amd64: F17h detected (node 3).
512 EDAC amd64: MC: 0: 0MB 1: 0MB
256 EDAC amd64: MC: 2: 0MB 3: 0MB
256 EDAC amd64: MC: 2: 8192MB 3: 0MB
64 EDAC amd64: Node 0: DRAM ECC disabled.
64 EDAC amd64: Node 2: DRAM ECC disabled.
256 EDAC amd64: using x4 syndromes.

(Full dmesg at http://ix.io/1T1o)

While on 5.3-rc5 without the patchset I get:

1 EDAC MC: Ver: 3.0.0
256 EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
^ three lines each
64 EDAC amd64: Node 0: DRAM ECC disabled.
64 EDAC amd64: Node 1: DRAM ECC disabled.
64 EDAC amd64: Node 2: DRAM ECC disabled.
64 EDAC amd64: Node 3: DRAM ECC disabled.

So I wonder if we could deduplicate those.

Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄⠀⠀⠀⠀ A master species delegates.

2019-08-22 13:10:28

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes

On Thu, Aug 22, 2019 at 11:46:07AM +0200, Adam Borowski wrote:
> Yeah, some of messages are no longer emitted for memory-less nodes (NUMA 1
> and 3). Your patch set also overhauls the messages.

Not my patchset - Yazen's.

> But, the amount of redundant messages I'm complaining about has actually
> increased:

He's working on that - that still needs some love.

Thanks for testing.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2019-08-23 07:23:55

by Yazen Ghannam

[permalink] [raw]

Subject: [RFC PATCH v2] EDAC/amd64: Check for memory before fully initializing an instance

From: Yazen Ghannam <[email protected]>

Return early before checking for ECC if the node does not have any
populated memory.

Free any cached hardware data before returning. Also, return 0 in this
case since this is not a failure. Other nodes may have memory and the
module should attempt to load an instance for them.

Move printing of hardware information to after the instance is
initialized, so that the information is only printed for nodes with
memory.

Signed-off-by: Yazen Ghannam <[email protected]>
---
Link:
https://lkml.kernel.org/r/[email protected]

v1->v2:
* Moved hardware info printing to after instance is initialized.
* Added message for when instance has no memory.

drivers/edac/amd64_edac.c | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index c1cb0234f085..3f0fe6ed1fa3 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2831,8 +2831,6 @@ static void read_mc_regs(struct amd64_pvt *pvt)
edac_dbg(1, " DIMM type: %s\n", edac_mem_types[pvt->dram_type]);

determine_ecc_sym_sz(pvt);
-
- dump_misc_regs(pvt);
}

/*
@@ -3505,6 +3503,23 @@ static int init_one_instance(struct amd64_pvt *pvt,
return ret;
}

+static bool instance_has_memory(struct amd64_pvt *pvt)
+{
+ bool cs_enabled = false;
+ int num_channels = 2;
+ int cs = 0, dct = 0;
+
+ if (pvt->umc)
+ num_channels = num_umcs;
+
+ for (dct = 0; dct < num_channels; dct++) {
+ for_each_chip_select(cs, dct, pvt)
+ cs_enabled |= csrow_enabled(cs, dct, pvt);
+ }
+
+ return cs_enabled;
+}
+
static int probe_one_instance(unsigned int nid)
{
struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
@@ -3535,6 +3550,12 @@ static int probe_one_instance(unsigned int nid)
if (ret < 0)
goto err_enable;

+ ret = 0;
+ if (!instance_has_memory(pvt)) {
+ amd64_warn("Node %d: DRAM ECC disabled. No DIMMs detected.\n", nid);
+ goto err_enable;
+ }
+
if (!ecc_enabled(pvt)) {
ret = 0;

@@ -3561,6 +3582,8 @@ static int probe_one_instance(unsigned int nid)
goto err_enable;
}

+ dump_misc_regs(pvt);
+
return ret;

err_enable:
--
2.17.1

2019-08-23 07:24:36

by Yazen Ghannam

[permalink] [raw]

Subject: RE: [PATCH v3 0/8] AMD64 EDAC fixes

> -----Original Message-----
> From: [email protected] <[email protected]> On Behalf Of Adam Borowski
> Sent: Wednesday, August 21, 2019 7:50 PM
> To: Ghannam, Yazen <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes
>
> On Wed, Aug 21, 2019 at 11:59:53PM +0000, Ghannam, Yazen wrote:
> > I've also added RFC patches to avoid the "ECC disabled" message for
> > nodes without memory. I haven't fully tested these, but I wanted to get
> > your thoughts. Here's an earlier discussion:
> > https://lkml.kernel.org/r/[email protected]
>
> While you're editing that code, could you please also cut the spam if ECC is
> actually disabled? For example, a 2990WX with non-ECC RAM gets 1024 lines;
> 64 copies of:
>
> [ 8.186164] EDAC amd64: Node 0: DRAM ECC disabled.
> [ 8.188364] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
> Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
> (Note that use of the override may cause unknown side effects.)
> [ 8.194762] EDAC amd64: Node 1: DRAM ECC disabled.
> [ 8.196307] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
> Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
> (Note that use of the override may cause unknown side effects.)
> [ 8.199840] EDAC amd64: Node 2: DRAM ECC disabled.
> [ 8.200963] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
> Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
> (Note that use of the override may cause unknown side effects.)
> [ 8.204326] EDAC amd64: Node 3: DRAM ECC disabled.
> [ 8.205436] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
> Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
> (Note that use of the override may cause unknown side effects.)
>

I wonder if the module is being loaded multiple times. I'll try to reproduce this and track it down.

Thanks for testing and reporting this issue.

-Yazen

2019-08-23 12:38:22

by Yazen Ghannam

[permalink] [raw]

Subject: [RFC PATCH v2] EDAC/amd64: Check for memory before fully initializing an instance

2019-08-23 23:30:31

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes

On Fri, Aug 23, 2019 at 03:28:59PM +0000, Ghannam, Yazen wrote:
> Boris, Do you think it'd be appropriate to change the return values
> for some cases?
>
> For example, ECC disabled is a hardware configuration. This doesn't
> mean that the module failed any operations in this case.
>
> In other words, the module checks for a feature. If the feature is not
> present, then return without failure (and maybe give a message).

That makes sense but AFAICT if probe_one_instance() sees that ECC is not
enabled, it returns 0.

The "if (!edac_has_mcs())" check later is to verify that at least once
instance was loaded successfully and, if not, then return an error.

So where does it return failure?

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2019-08-23 23:31:07

by Yazen Ghannam

[permalink] [raw]

Subject: RE: [PATCH v3 0/8] AMD64 EDAC fixes

> -----Original Message-----
> From: [email protected] <[email protected]> On Behalf Of Ghannam, Yazen
> Sent: Thursday, August 22, 2019 1:54 PM
> To: Adam Borowski <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]
> Subject: RE: [PATCH v3 0/8] AMD64 EDAC fixes
>
...
> I wonder if the module is being loaded multiple times. I'll try to reproduce this and track it down.
>

I was able to reproduce a similar failure. I do see that the module is being loaded multiple times on failure.

Here's a call trace from one dump_stack() output:
[ +0.004964] CPU: 132 PID: 2680 Comm: systemd-udevd Not tainted 4.20.0-edac-debug+ #36
[ +0.009802] Call Trace:
[ +0.002727] dump_stack+0x63/0x85
[ +0.003696] amd64_edac_init+0x2163/0x3000 [amd64_edac_mod]
[ +0.006216] ? __wake_up+0x13/0x20
[ +0.003790] ? 0xffffffffc120d000
[ +0.003694] do_one_initcall+0x4a/0x1c9
[ +0.004277] ? _cond_resched+0x19/0x40
[ +0.004178] ? kmem_cache_alloc_trace+0x15c/0x1d0
[ +0.005244] do_init_module+0x5f/0x216
[ +0.004180] load_module+0x21d5/0x2ac0
[ +0.004179] ? wait_woken+0x80/0x80
[ +0.003889] __do_sys_finit_module+0xfc/0x120
[ +0.004858] ? __do_sys_finit_module+0xfc/0x120
[ +0.005052] __x64_sys_finit_module+0x1a/0x20
[ +0.004857] do_syscall_64+0x5a/0x120
[ +0.004081] entry_SYSCALL_64_after_hwframe+0x44/0xa9

So it seems that userspace (systemd-udevd) keeps trying to load the module. I'm not sure how to prevent this from within the module.

Boris,
Do you think it'd be appropriate to change the return values for some cases?

For example, ECC disabled is a hardware configuration. This doesn't mean that the module failed any operations in this case.

In other words, the module checks for a feature. If the feature is not present, then return without failure (and maybe give a message).

Thanks,
Yazen

2019-08-26 15:40:42

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes

On Mon, Aug 26, 2019 at 02:19:18PM +0000, Ghannam, Yazen wrote:
> I was tracking down the failure with ECC disabled, and that seems to be it.
>
> So I think we should return 0 "if (!edac_has_mcs())", because we'd only get
> there if ECC is disabled on all nodes and there wasn't some other initialization
> error.
>
> I'll send a patch for this soon.
>
> Adam, would you mind testing this patch?

You can't return 0 when ECC is disabled on all nodes because then the
driver remains loaded without driving anything. That silly userspace
needs to understand that ENODEV means "stop trying to load this driver".

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2019-08-26 15:42:12

by Yazen Ghannam

[permalink] [raw]

Subject: RE: [PATCH v3 0/8] AMD64 EDAC fixes

> -----Original Message-----
> From: [email protected] <[email protected]> On Behalf Of Borislav Petkov
> Sent: Monday, August 26, 2019 9:59 AM
> To: Ghannam, Yazen <[email protected]>
> Cc: Adam Borowski <[email protected]>; [email protected]; [email protected]
> Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes
>
> On Mon, Aug 26, 2019 at 02:19:18PM +0000, Ghannam, Yazen wrote:
> > I was tracking down the failure with ECC disabled, and that seems to be it.
> >
> > So I think we should return 0 "if (!edac_has_mcs())", because we'd only get
> > there if ECC is disabled on all nodes and there wasn't some other initialization
> > error.
> >
> > I'll send a patch for this soon.
> >
> > Adam, would you mind testing this patch?
>
> You can't return 0 when ECC is disabled on all nodes because then the
> driver remains loaded without driving anything. That silly userspace
> needs to understand that ENODEV means "stop trying to load this driver".
>

Yes, you're right.

I'll try and track down the interaction here between userspace and the module.
Please let me know if you have any suggestions.

Thanks,
Yazen

2019-08-26 16:24:25

by Yazen Ghannam

[permalink] [raw]

Subject: RE: [PATCH v3 0/8] AMD64 EDAC fixes

> -----Original Message-----
> From: [email protected] <[email protected]> On Behalf Of Borislav Petkov
> Sent: Friday, August 23, 2019 10:38 AM
> To: Ghannam, Yazen <[email protected]>
> Cc: Adam Borowski <[email protected]>; [email protected]; [email protected]
> Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes
>
> On Fri, Aug 23, 2019 at 03:28:59PM +0000, Ghannam, Yazen wrote:
> > Boris, Do you think it'd be appropriate to change the return values
> > for some cases?
> >
> > For example, ECC disabled is a hardware configuration. This doesn't
> > mean that the module failed any operations in this case.
> >
> > In other words, the module checks for a feature. If the feature is not
> > present, then return without failure (and maybe give a message).
>
> That makes sense but AFAICT if probe_one_instance() sees that ECC is not
> enabled, it returns 0.
>
> The "if (!edac_has_mcs())" check later is to verify that at least once
> instance was loaded successfully and, if not, then return an error.
>
> So where does it return failure?
>

I was tracking down the failure with ECC disabled, and that seems to be it.

So I think we should return 0 "if (!edac_has_mcs())", because we'd only get
there if ECC is disabled on all nodes and there wasn't some other initialization
error.

I'll send a patch for this soon.

Adam, would you mind testing this patch?

Thanks,
Yazen