2019-10-19 08:35:02

by Yazen Ghannam

[permalink] [raw]
Subject: [PATCH 0/6] AMD64 EDAC: Check for nodes without memory, etc.

From: Yazen Ghannam <[email protected]>

Hi Boris,

This set contains the next revision of the RFC patches I included with
the last AMD64 EDAC updates. I dropped the RFC tags, and I added a
couple of new patches.

Most of these patches address the issue where the module check and
complains about DRAM ECC on nodes without memory.

Patch 3 is new and came out of looking at the family type structs and
the boot flow.

Patch 6 fixes the "grain not set" warning that was recently introduced.

Thanks,
Yazen

Links:
https://lkml.kernel.org/r/[email protected]
https://lkml.kernel.org/r/[email protected]
https://lkml.kernel.org/r/[email protected]

Yazen Ghannam (6):
EDAC/amd64: Make struct amd64_family_type global
EDAC/amd64: Gather hardware information early
EDAC/amd64: Save max number of controllers to family type
EDAC/amd64: Use cached data when checking for ECC
EDAC/amd64: Check for memory before fully initializing an instance
EDAC/amd64: Set grain per DIMM

drivers/edac/amd64_edac.c | 174 ++++++++++++++++++++------------------
drivers/edac/amd64_edac.h | 1 +
2 files changed, 94 insertions(+), 81 deletions(-)

--
2.17.1


2019-10-19 08:35:08

by Yazen Ghannam

[permalink] [raw]
Subject: [PATCH 3/6] EDAC/amd64: Save max number of controllers to family type

From: Yazen Ghannam <[email protected]>

The maximum number of memory controllers is fixed within a family/model
group. In most cases, this has been fixed at 2, but some systems may
have up to 8.

The struct amd64_family_type already contains family/model-specific
information, and this can be used rather than adding model checks to
various functions.

Create a new field in struct amd64_family_type for max_num_controllers.
Set this when setting other family type information, and use this when
needing the maximum number of memory controllers possible for a system.

Signed-off-by: Yazen Ghannam <[email protected]>
---
Link:
https://lkml.kernel.org/r/[email protected]

rfc -> v1:
* New patch.
* Idea came up from Boris' comment about compute_num_umcs().

drivers/edac/amd64_edac.c | 45 +++++++++++++--------------------------
drivers/edac/amd64_edac.h | 1 +
2 files changed, 16 insertions(+), 30 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 4410da7c3a25..0fde5ad2fdcd 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -21,9 +21,6 @@ static struct amd64_family_type *fam_type;
/* Per-node stuff */
static struct ecc_settings **ecc_stngs;

-/* Number of Unified Memory Controllers */
-static u8 num_umcs;
-
/*
* Valid scrub rates for the K8 hardware memory scrubber. We map the scrubbing
* bandwidth to a valid bit pattern. The 'set' operation finds the 'matching-
@@ -456,7 +453,7 @@ static void get_cs_base_and_mask(struct amd64_pvt *pvt, int csrow, u8 dct,
for (i = 0; i < pvt->csels[dct].m_cnt; i++)

#define for_each_umc(i) \
- for (i = 0; i < num_umcs; i++)
+ for (i = 0; i < fam_type->max_num_controllers; i++)

/*
* @input_addr is an InputAddr associated with the node given by mci. Return the
@@ -2226,6 +2223,7 @@ static struct amd64_family_type family_types[] = {
.ctl_name = "K8",
.f1_id = PCI_DEVICE_ID_AMD_K8_NB_ADDRMAP,
.f2_id = PCI_DEVICE_ID_AMD_K8_NB_MEMCTL,
+ .max_num_controllers = 2,
.ops = {
.early_channel_count = k8_early_channel_count,
.map_sysaddr_to_csrow = k8_map_sysaddr_to_csrow,
@@ -2236,6 +2234,7 @@ static struct amd64_family_type family_types[] = {
.ctl_name = "F10h",
.f1_id = PCI_DEVICE_ID_AMD_10H_NB_MAP,
.f2_id = PCI_DEVICE_ID_AMD_10H_NB_DRAM,
+ .max_num_controllers = 2,
.ops = {
.early_channel_count = f1x_early_channel_count,
.map_sysaddr_to_csrow = f1x_map_sysaddr_to_csrow,
@@ -2246,6 +2245,7 @@ static struct amd64_family_type family_types[] = {
.ctl_name = "F15h",
.f1_id = PCI_DEVICE_ID_AMD_15H_NB_F1,
.f2_id = PCI_DEVICE_ID_AMD_15H_NB_F2,
+ .max_num_controllers = 2,
.ops = {
.early_channel_count = f1x_early_channel_count,
.map_sysaddr_to_csrow = f1x_map_sysaddr_to_csrow,
@@ -2256,6 +2256,7 @@ static struct amd64_family_type family_types[] = {
.ctl_name = "F15h_M30h",
.f1_id = PCI_DEVICE_ID_AMD_15H_M30H_NB_F1,
.f2_id = PCI_DEVICE_ID_AMD_15H_M30H_NB_F2,
+ .max_num_controllers = 2,
.ops = {
.early_channel_count = f1x_early_channel_count,
.map_sysaddr_to_csrow = f1x_map_sysaddr_to_csrow,
@@ -2266,6 +2267,7 @@ static struct amd64_family_type family_types[] = {
.ctl_name = "F15h_M60h",
.f1_id = PCI_DEVICE_ID_AMD_15H_M60H_NB_F1,
.f2_id = PCI_DEVICE_ID_AMD_15H_M60H_NB_F2,
+ .max_num_controllers = 2,
.ops = {
.early_channel_count = f1x_early_channel_count,
.map_sysaddr_to_csrow = f1x_map_sysaddr_to_csrow,
@@ -2276,6 +2278,7 @@ static struct amd64_family_type family_types[] = {
.ctl_name = "F16h",
.f1_id = PCI_DEVICE_ID_AMD_16H_NB_F1,
.f2_id = PCI_DEVICE_ID_AMD_16H_NB_F2,
+ .max_num_controllers = 2,
.ops = {
.early_channel_count = f1x_early_channel_count,
.map_sysaddr_to_csrow = f1x_map_sysaddr_to_csrow,
@@ -2286,6 +2289,7 @@ static struct amd64_family_type family_types[] = {
.ctl_name = "F16h_M30h",
.f1_id = PCI_DEVICE_ID_AMD_16H_M30H_NB_F1,
.f2_id = PCI_DEVICE_ID_AMD_16H_M30H_NB_F2,
+ .max_num_controllers = 2,
.ops = {
.early_channel_count = f1x_early_channel_count,
.map_sysaddr_to_csrow = f1x_map_sysaddr_to_csrow,
@@ -2296,6 +2300,7 @@ static struct amd64_family_type family_types[] = {
.ctl_name = "F17h",
.f0_id = PCI_DEVICE_ID_AMD_17H_DF_F0,
.f6_id = PCI_DEVICE_ID_AMD_17H_DF_F6,
+ .max_num_controllers = 2,
.ops = {
.early_channel_count = f17_early_channel_count,
.dbam_to_cs = f17_addr_mask_to_cs_size,
@@ -2305,6 +2310,7 @@ static struct amd64_family_type family_types[] = {
.ctl_name = "F17h_M10h",
.f0_id = PCI_DEVICE_ID_AMD_17H_M10H_DF_F0,
.f6_id = PCI_DEVICE_ID_AMD_17H_M10H_DF_F6,
+ .max_num_controllers = 2,
.ops = {
.early_channel_count = f17_early_channel_count,
.dbam_to_cs = f17_addr_mask_to_cs_size,
@@ -2314,6 +2320,7 @@ static struct amd64_family_type family_types[] = {
.ctl_name = "F17h_M30h",
.f0_id = PCI_DEVICE_ID_AMD_17H_M30H_DF_F0,
.f6_id = PCI_DEVICE_ID_AMD_17H_M30H_DF_F6,
+ .max_num_controllers = 8,
.ops = {
.early_channel_count = f17_early_channel_count,
.dbam_to_cs = f17_addr_mask_to_cs_size,
@@ -2323,6 +2330,7 @@ static struct amd64_family_type family_types[] = {
.ctl_name = "F17h_M70h",
.f0_id = PCI_DEVICE_ID_AMD_17H_M70H_DF_F0,
.f6_id = PCI_DEVICE_ID_AMD_17H_M70H_DF_F6,
+ .max_num_controllers = 2,
.ops = {
.early_channel_count = f17_early_channel_count,
.dbam_to_cs = f17_addr_mask_to_cs_size,
@@ -3400,29 +3408,14 @@ static const struct attribute_group *amd64_edac_attr_groups[] = {
NULL
};

-/* Set the number of Unified Memory Controllers in the system. */
-static void compute_num_umcs(void)
-{
- u8 model = boot_cpu_data.x86_model;
-
- if (boot_cpu_data.x86 < 0x17)
- return;
-
- if (model >= 0x30 && model <= 0x3f)
- num_umcs = 8;
- else
- num_umcs = 2;
-
- edac_dbg(1, "Number of UMCs: %x", num_umcs);
-}
-
static int get_hardware_info(struct amd64_pvt *pvt)
{
u16 pci_id1, pci_id2;
int ret = -EINVAL;

if (pvt->fam >= 0x17) {
- pvt->umc = kcalloc(num_umcs, sizeof(struct amd64_umc), GFP_KERNEL);
+ pvt->umc = kcalloc(fam_type->max_num_controllers,
+ sizeof(struct amd64_umc), GFP_KERNEL);
if (!pvt->umc) {
ret = -ENOMEM;
goto err_ret;
@@ -3476,14 +3469,8 @@ static int init_one_instance(struct amd64_pvt *pvt)
* Always allocate two channels since we can have setups with DIMMs on
* only one channel. Also, this simplifies handling later for the price
* of a couple of KBs tops.
- *
- * On Fam17h+, the number of controllers may be greater than two. So set
- * the size equal to the maximum number of UMCs.
*/
- if (pvt->fam >= 0x17)
- layers[1].size = num_umcs;
- else
- layers[1].size = 2;
+ layers[1].size = fam_type->max_num_controllers;
layers[1].is_virt_csrow = false;

mci = edac_mc_alloc(pvt->mc_node_id, ARRAY_SIZE(layers), layers, 0);
@@ -3678,8 +3665,6 @@ static int __init amd64_edac_init(void)
if (!msrs)
goto err_free;

- compute_num_umcs();
-
for (i = 0; i < amd_nb_num(); i++) {
err = probe_one_instance(i);
if (err) {
diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
index 8c3cda81e619..0d5a9bc4d6de 100644
--- a/drivers/edac/amd64_edac.h
+++ b/drivers/edac/amd64_edac.h
@@ -479,6 +479,7 @@ struct low_ops {
struct amd64_family_type {
const char *ctl_name;
u16 f0_id, f1_id, f2_id, f6_id;
+ u8 max_num_controllers;
struct low_ops ops;
};

--
2.17.1

2019-10-21 14:41:21

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 3/6] EDAC/amd64: Save max number of controllers to family type

On Fri, Oct 18, 2019 at 03:31:26PM +0000, Ghannam, Yazen wrote:
> From: Yazen Ghannam <[email protected]>
>
> The maximum number of memory controllers is fixed within a family/model
> group. In most cases, this has been fixed at 2, but some systems may
> have up to 8.
>
> The struct amd64_family_type already contains family/model-specific
> information, and this can be used rather than adding model checks to
> various functions.
>
> Create a new field in struct amd64_family_type for max_num_controllers.
> Set this when setting other family type information, and use this when
> needing the maximum number of memory controllers possible for a system.
>
> Signed-off-by: Yazen Ghannam <[email protected]>
> ---
> Link:
> https://lkml.kernel.org/r/[email protected]

...

> diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
> index 8c3cda81e619..0d5a9bc4d6de 100644
> --- a/drivers/edac/amd64_edac.h
> +++ b/drivers/edac/amd64_edac.h
> @@ -479,6 +479,7 @@ struct low_ops {
> struct amd64_family_type {
> const char *ctl_name;
> u16 f0_id, f1_id, f2_id, f6_id;
> + u8 max_num_controllers;

Sure but call that max_mcs or so, so that the code which mentions it,
doesn't stick out too much. You can put a comment above it here to
explain what it is:

/* Maximum number of memory controllers per node */
u8 max_mcs;

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2019-10-21 14:49:08

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 0/6] AMD64 EDAC: Check for nodes without memory, etc.

On Fri, Oct 18, 2019 at 03:31:25PM +0000, Ghannam, Yazen wrote:
> From: Yazen Ghannam <[email protected]>
>
> Hi Boris,
>
> This set contains the next revision of the RFC patches I included with
> the last AMD64 EDAC updates. I dropped the RFC tags, and I added a
> couple of new patches.

Yah, looks pretty much good, modulo the minor things I commented on
earlier.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2019-10-21 16:42:04

by Yazen Ghannam

[permalink] [raw]
Subject: RE: [PATCH 0/6] AMD64 EDAC: Check for nodes without memory, etc.

> -----Original Message-----
> From: [email protected] <[email protected]> On Behalf Of Borislav Petkov
> Sent: Monday, October 21, 2019 10:48 AM
> To: Ghannam, Yazen <[email protected]>
> Cc: [email protected]; [email protected]
> Subject: Re: [PATCH 0/6] AMD64 EDAC: Check for nodes without memory, etc.
>
> On Fri, Oct 18, 2019 at 03:31:25PM +0000, Ghannam, Yazen wrote:
> > From: Yazen Ghannam <[email protected]>
> >
> > Hi Boris,
> >
> > This set contains the next revision of the RFC patches I included with
> > the last AMD64 EDAC updates. I dropped the RFC tags, and I added a
> > couple of new patches.
>
> Yah, looks pretty much good, modulo the minor things I commented on
> earlier.
>

Thank you. I'll send another revision soon.

-Yazen