2024-03-01 14:38:26

by Yazen Ghannam

Subject: [PATCH v2 0/3] FMPM Debug Updates

Hi all,

This set adds two pieces of debug functionality.
1) Saving the system physical address of a recorded error.
2) Printing record entries through a debugfs file.

I'd like to include Murali, Naveen, and Sathya as co-developers, since
this is based on their previous work from here:
https://lore.kernel.org/r/[email protected]

v1 Link:
https://lore.kernel.org/r/[email protected]

v1->v2:
* Patch 1 replaced with suggested patch from Boris.
* Patch 2 update variable names and some code flow.
* Patch 3 rebase on changes from 1 and 2.

Thanks,
Yazen

Borislav Petkov (AMD) (1):
RAS: Export helper to get ras_debugfs_dir

Yazen Ghannam (2):
RAS/AMD/FMPM: Save SPA values
RAS/AMD/FMPM: Add debugfs interface to print record entries

drivers/ras/amd/fmpm.c | 199 +++++++++++++++++++++++++++++++++++++++++
drivers/ras/cec.c | 10 ++-
drivers/ras/debugfs.c | 8 +-
drivers/ras/debugfs.h | 2 +-
4 files changed, 215 insertions(+), 4 deletions(-)


base-commit: 3513ecaa685c6627a943b1f610421754734301fa
--
2.34.1



2024-03-01 14:38:31

by Yazen Ghannam

Subject: [PATCH v2 1/3] RAS: Export helper to get ras_debugfs_dir

From: "Borislav Petkov (AMD)" <[email protected]>

..so that RAS modules can use it.

Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Signed-off-by: Yazen Ghannam <[email protected]>
---
Link:
https://lore.kernel.org/r/[email protected]

v1->v2:
* Replace with patch from Boris to export a function.
* Added commit message and authorship to patch from Boris.

drivers/ras/cec.c | 10 ++++++++--
drivers/ras/debugfs.c | 8 +++++++-
drivers/ras/debugfs.h | 2 +-
3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 321af498ee11..e440b15fbabc 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -480,9 +480,15 @@ DEFINE_SHOW_ATTRIBUTE(array);

 static int __init create_debugfs_nodes(void)
 {
-	struct dentry *d, *pfn, *decay, *count, *array;
+	struct dentry *d, *pfn, *decay, *count, *array, *dfs;

-	d = debugfs_create_dir("cec", ras_debugfs_dir);
+	dfs = ras_get_debugfs_root();
+	if (!dfs) {
+		pr_warn("Error getting RAS debugfs root!\n");
+		return -1;
+	}
+
+	d = debugfs_create_dir("cec", dfs);
 	if (!d) {
 		pr_warn("Error creating cec debugfs node!\n");
 		return -1;
diff --git a/drivers/ras/debugfs.c b/drivers/ras/debugfs.c
index ffb973c328e3..42afd3de68b2 100644
--- a/drivers/ras/debugfs.c
+++ b/drivers/ras/debugfs.c
@@ -3,10 +3,16 @@
 #include <linux/ras.h>
 #include "debugfs.h"

-struct dentry *ras_debugfs_dir;
+static struct dentry *ras_debugfs_dir;

 static atomic_t trace_count = ATOMIC_INIT(0);

+struct dentry *ras_get_debugfs_root(void)
+{
+	return ras_debugfs_dir;
+}
+EXPORT_SYMBOL_GPL(ras_get_debugfs_root);
+
 int ras_userspace_consumers(void)
 {
 	return atomic_read(&trace_count);
diff --git a/drivers/ras/debugfs.h b/drivers/ras/debugfs.h
index c07443b462ad..4749ccdeeba1 100644
--- a/drivers/ras/debugfs.h
+++ b/drivers/ras/debugfs.h
@@ -4,6 +4,6 @@

 #include <linux/debugfs.h>

-extern struct dentry *ras_debugfs_dir;
+struct dentry *ras_get_debugfs_root(void);

 #endif /* __RAS_DEBUGFS_H__ */
--
2.34.1
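
For readers following along, here is a minimal userspace sketch of the pattern this patch applies: the directory pointer becomes file-local and other code reaches it only through an accessor that may return NULL. The names `struct dir`, `get_root()`, `set_root()` and `consumer_setup()` are stand-ins invented for illustration; they are not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct dentry (assumption, illustration only). */
struct dir {
	const char *name;
};

/* File-local, like ras_debugfs_dir after this patch. */
static struct dir *root_dir;

/* Accessor: the only way other code reaches the root. May return NULL. */
struct dir *get_root(void)
{
	return root_dir;
}

/* Hypothetical init hook standing in for the core creating its directory. */
void set_root(struct dir *d)
{
	root_dir = d;
}

/* A consumer checks for NULL and skips its setup when the root is absent. */
int consumer_setup(void)
{
	struct dir *root = get_root();

	if (!root)
		return -1;

	assert(root->name != NULL);
	return 0;
}
```

Hiding the variable means a consumer can no longer overwrite the pointer by accident, and every caller has to handle the NULL case explicitly, as the cec.c hunk does.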


2024-03-01 14:38:39

by Yazen Ghannam

Subject: [PATCH v2 3/3] RAS/AMD/FMPM: Add debugfs interface to print record entries

It is helpful to see the saved record entries during run time in
human-readable format. This is useful for testing during module
development. And it can be used by system admins to quickly and easily
see the state of the system.

Provide a sequential file in debugfs to print fields of interest from
the FRU records and their entries.
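
As a rough illustration of how one output line maps to a record position, the sketch below mirrors the arithmetic in fmpm_show(): line 0 is the header, and every later line is a flat index that is split into a FRU index and an entry index. `struct entry_pos` and `line_to_pos()` are hypothetical names that do not appear in the patch.

```c
#include <assert.h>

/* Hypothetical holder for the decoded position; not part of the patch. */
struct entry_pos {
	unsigned int fru_idx;	/* which FRU record */
	unsigned int entry;	/* descriptor entry within that record */
};

/*
 * Map an output line number to a record position. Line 0 is the header,
 * so data lines start at 1. The caller must pass line >= 1 and a
 * non-zero max_nr_entries.
 */
static struct entry_pos line_to_pos(unsigned int line,
				    unsigned int max_nr_entries)
{
	struct entry_pos pos;
	unsigned int spa_entry;

	assert(line >= 1 && max_nr_entries > 0);

	spa_entry = line - 1;
	pos.fru_idx = spa_entry / max_nr_entries;
	pos.entry = spa_entry % max_nr_entries;

	return pos;
}
```

With 8 entries per FRU, line 9 decodes to FRU 1, entry 0. The same flat index is what the patch uses to look up spa_entries[].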

Don't fail to load the module if the debugfs interface is not available.
This is a convenience feature which does not affect other module
functionality.

The new interface reads the record entries and should hold the mutex.
Expand the mutex code comment to clarify when it should be held.

Signed-off-by: Yazen Ghannam <[email protected]>
---
Link:
https://lore.kernel.org/r/[email protected]

v1->v2:
* Update based on patch 1 and 2 changes.

drivers/ras/amd/fmpm.c | 131 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 131 insertions(+)

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index a7bb36eb60cb..d670aa11aef4 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -54,6 +54,8 @@
 #include <asm/cpu_device_id.h>
 #include <asm/mce.h>

+#include "../debugfs.h"
+
 #define INVALID_CPU UINT_MAX

 /* Validation Bits */
@@ -116,6 +118,9 @@ static u64 *spa_entries;

 #define INVALID_SPA ~0ULL

+static struct dentry *fmpm_dfs_dir;
+static struct dentry *fmpm_dfs_entries;
+
 #define CPER_CREATOR_FMP \
 	GUID_INIT(0xcd5c2993, 0xf4b2, 0x41b2, 0xb5, 0xd4, 0xf9, 0xc3, \
 		  0xa0, 0x33, 0x08, 0x75)
@@ -152,6 +157,11 @@ static unsigned int spa_nr_entries;
  * Protect the local records cache in fru_records and prevent concurrent
  * writes to storage. This is only needed after init once notifier block
  * registration is done.
+ *
+ * The majority of a record is fixed at module init and will not change
+ * during run time. The entries within a record will be updated as new
+ * errors are reported. The mutex should be held whenever the entries are
+ * accessed during run time.
  */
 static DEFINE_MUTEX(fmpm_update_mutex);

@@ -813,6 +823,124 @@ static int allocate_records(void)
 	return ret;
 }

+static void *fmpm_start(struct seq_file *f, loff_t *pos)
+{
+	if (*pos >= (spa_nr_entries + 1))
+		return NULL;
+	return pos;
+}
+
+static void *fmpm_next(struct seq_file *f, void *data, loff_t *pos)
+{
+	if (++(*pos) >= (spa_nr_entries + 1))
+		return NULL;
+	return pos;
+}
+
+static void fmpm_stop(struct seq_file *f, void *data)
+{
+}
+
+#define SHORT_WIDTH	8
+#define U64_WIDTH	18
+#define TIMESTAMP_WIDTH	19
+#define LONG_WIDTH	24
+#define U64_PAD		(LONG_WIDTH - U64_WIDTH)
+#define TS_PAD		(LONG_WIDTH - TIMESTAMP_WIDTH)
+static int fmpm_show(struct seq_file *f, void *data)
+{
+	unsigned int fru_idx, entry, spa_entry, line;
+	struct cper_fru_poison_desc *fpd;
+	struct fru_rec *rec;
+
+	line = *(loff_t *)data;
+	if (line == 0) {
+		seq_printf(f, "%-*s", SHORT_WIDTH, "fru_idx");
+		seq_printf(f, "%-*s", LONG_WIDTH, "fru_id");
+		seq_printf(f, "%-*s", SHORT_WIDTH, "entry");
+		seq_printf(f, "%-*s", LONG_WIDTH, "timestamp");
+		seq_printf(f, "%-*s", LONG_WIDTH, "hw_id");
+		seq_printf(f, "%-*s", LONG_WIDTH, "addr");
+		seq_printf(f, "%-*s", LONG_WIDTH, "spa");
+		goto out_newline;
+	}
+
+	spa_entry = line - 1;
+	fru_idx = spa_entry / max_nr_entries;
+	entry = spa_entry % max_nr_entries;
+
+	rec = fru_records[fru_idx];
+	if (!rec)
+		goto out;
+
+	seq_printf(f, "%-*u", SHORT_WIDTH, fru_idx);
+	seq_printf(f, "0x%016llx%-*s", rec->fmp.fru_id, U64_PAD, "");
+	seq_printf(f, "%-*u", SHORT_WIDTH, entry);
+
+	mutex_lock(&fmpm_update_mutex);
+
+	if (entry >= rec->fmp.nr_entries) {
+		seq_printf(f, "%-*s", LONG_WIDTH, "*");
+		seq_printf(f, "%-*s", LONG_WIDTH, "*");
+		seq_printf(f, "%-*s", LONG_WIDTH, "*");
+		seq_printf(f, "%-*s", LONG_WIDTH, "*");
+		goto out_unlock;
+	}
+
+	fpd = &rec->entries[entry];
+
+	seq_printf(f, "%ptT%-*s", &fpd->timestamp, TS_PAD, "");
+	seq_printf(f, "0x%016llx%-*s", fpd->hw_id, U64_PAD, "");
+	seq_printf(f, "0x%016llx%-*s", fpd->addr, U64_PAD, "");
+
+	if (spa_entries[spa_entry] == INVALID_SPA)
+		seq_printf(f, "%-*s", LONG_WIDTH, "*");
+	else
+		seq_printf(f, "0x%016llx%-*s", spa_entries[spa_entry], U64_PAD, "");
+
+out_unlock:
+	mutex_unlock(&fmpm_update_mutex);
+out_newline:
+	seq_putc(f, '\n');
+out:
+	return 0;
+}
+
+static const struct seq_operations fmpm_seq_ops = {
+	.start	= fmpm_start,
+	.next	= fmpm_next,
+	.stop	= fmpm_stop,
+	.show	= fmpm_show,
+};
+
+static int fmpm_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &fmpm_seq_ops);
+}
+
+static const struct file_operations fmpm_fops = {
+	.open		= fmpm_open,
+	.release	= seq_release,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+};
+
+static void setup_debugfs(void)
+{
+	struct dentry *dfs = ras_get_debugfs_root();
+
+	if (!dfs)
+		return;
+
+	fmpm_dfs_dir = debugfs_create_dir("fmpm", dfs);
+	if (!fmpm_dfs_dir)
+		return;
+
+	fmpm_dfs_entries = debugfs_create_file("entries", 0400, fmpm_dfs_dir, NULL, &fmpm_fops);
+	if (!fmpm_dfs_entries)
+		debugfs_remove(fmpm_dfs_dir);
+}
+
 static const struct x86_cpu_id fmpm_cpuids[] = {
 	X86_MATCH_VENDOR_FAM(AMD, 0x19, NULL),
 	{ }
@@ -854,6 +982,8 @@ static int __init fru_mem_poison_init(void)
 	if (ret)
 		goto out_free;

+	setup_debugfs();
+
 	retire_mem_records();

 	mce_register_decode_chain(&fru_mem_poison_nb);
@@ -870,6 +1000,7 @@ static int __init fru_mem_poison_init(void)
 static void __exit fru_mem_poison_exit(void)
 {
 	mce_unregister_decode_chain(&fru_mem_poison_nb);
+	debugfs_remove(fmpm_dfs_dir);
 	free_records();
 }

--
2.34.1
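
The fixed-width columns in fmpm_show() rely on the `%-*s` conversion: a value narrower than the column is followed by an empty string padded to the remaining width. Below is a small userspace sketch of that trick using snprintf() in place of seq_printf(). The two helper names are made up for this example; the width macros copy the values from the patch.

```c
#include <assert.h>
#include <stdio.h>

#define U64_WIDTH	18	/* "0x" plus 16 hex digits */
#define LONG_WIDTH	24	/* full column width */
#define U64_PAD		(LONG_WIDTH - U64_WIDTH)

/* Format one u64 column: the value, then U64_PAD spaces. */
static int format_u64_col(char *buf, size_t len, unsigned long long val)
{
	assert(buf && len > LONG_WIDTH);

	/* "%-*s" with an empty string emits exactly U64_PAD spaces. */
	return snprintf(buf, len, "0x%016llx%-*s", val, U64_PAD, "");
}

/* Format a placeholder column, left-justified and padded to LONG_WIDTH. */
static int format_star_col(char *buf, size_t len)
{
	assert(buf && len > LONG_WIDTH);

	return snprintf(buf, len, "%-*s", LONG_WIDTH, "*");
}
```

Both helpers produce exactly LONG_WIDTH characters, which keeps valid values and the "*" placeholders aligned under the same header.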


2024-03-01 14:39:03

by Yazen Ghannam

Subject: [PATCH v2 2/3] RAS/AMD/FMPM: Save SPA values

The system physical address (SPA) of an error is not a stable value. It
will change depending on the location of the memory: parts can be
swapped. And it will change depending on memory topology: NUMA nodes
and/or interleaving can be adjusted.

Therefore, the SPA value is not part of the "FRU Memory Poison" record
format. And it will not be saved to persistent storage.

However, the SPA values can be helpful during debug and for system
admins during run time.

Save the SPA values in a separate structure. This is updated when
records are restored and when new errors are saved.
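
To make the layout of the separate structure concrete, here is a simplified userspace sketch: one flat array in which each FRU record owns max_nr_entries consecutive slots. The `spa_index()` helper and its parameters are assumptions made for illustration. save_spa() in the patch walks the array in strides of max_nr_entries instead, but arrives at the same position.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the flat index for (rec, entry), or -1 if the position is
 * invalid. records[] holds nr_fru opaque record pointers; each record
 * owns max_nr_entries consecutive slots in the flat array.
 */
static long spa_index(void *const *records, unsigned int nr_fru,
		      unsigned int max_nr_entries,
		      const void *rec, unsigned int entry)
{
	unsigned int fru_idx;

	assert(records && max_nr_entries > 0);

	/* Reject entries beyond the per-FRU limit. */
	if (entry >= max_nr_entries)
		return -1;

	/* Find which FRU this record belongs to. */
	for (fru_idx = 0; fru_idx < nr_fru; fru_idx++) {
		if (records[fru_idx] == rec)
			return (long)fru_idx * max_nr_entries + entry;
	}

	/* Record not found. */
	return -1;
}
```

For example, with two records and 8 entries each, entry 3 of the second record lands at index 11.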

Signed-off-by: Yazen Ghannam <[email protected]>
---
Link:
https://lore.kernel.org/r/[email protected]

v1->v2:
* Changed variable names to remove "sys_" prefix. (Boris)
* Used "spa_" prefix to highlight that these are for SPA values. (Yazen)
* Added warning to "index out-of-bound" condition. (Boris)
* Reworked save_spa() flow to get a valid array position before saving
SPA value (Yazen).

drivers/ras/amd/fmpm.c | 68 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index 80dd112b720a..a7bb36eb60cb 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -111,6 +111,11 @@ struct fru_rec {
  */
 static struct fru_rec **fru_records;

+/* system physical addresses array */
+static u64 *spa_entries;
+
+#define INVALID_SPA ~0ULL
+
 #define CPER_CREATOR_FMP \
 	GUID_INIT(0xcd5c2993, 0xf4b2, 0x41b2, 0xb5, 0xd4, 0xf9, 0xc3, \
 		  0xa0, 0x33, 0x08, 0x75)
@@ -140,6 +145,9 @@ static unsigned int max_nr_fru;
 /* Total length of record including headers and list of descriptor entries. */
 static size_t max_rec_len;

+/* Total number of SPA entries across all FRUs. */
+static unsigned int spa_nr_entries;
+
 /*
  * Protect the local records cache in fru_records and prevent concurrent
  * writes to storage. This is only needed after init once notifier block
@@ -269,6 +277,52 @@ static bool rec_has_fpd(struct fru_rec *rec, struct cper_fru_poison_desc *fpd)
 	return false;
 }

+static void save_spa(struct fru_rec *rec, unsigned int entry,
+		     u64 addr, u64 id, unsigned int cpu)
+{
+	unsigned int i, fru_idx, spa_entry;
+	struct atl_err a_err;
+	unsigned long spa;
+
+	if (entry >= max_nr_entries) {
+		pr_warn_once("entry out-of-bounds\n");
+		return;
+	}
+
+	for (i = 0; i < spa_nr_entries; i += max_nr_entries) {
+		fru_idx = i / max_nr_entries;
+		if (fru_records[fru_idx] == rec)
+			break;
+	}
+
+	if (i >= spa_nr_entries) {
+		pr_warn_once("record not found");
+		return;
+	}
+
+	spa_entry = i + entry;
+	if (spa_entry >= spa_nr_entries) {
+		pr_warn_once("spa_entries[] index out-of-bounds\n");
+		return;
+	}
+
+	memset(&a_err, 0, sizeof(struct atl_err));
+
+	a_err.addr = addr;
+	a_err.ipid = id;
+	a_err.cpu  = cpu;
+
+	spa = amd_convert_umc_mca_addr_to_sys_addr(&a_err);
+	if (IS_ERR_VALUE(spa)) {
+		pr_debug("Failed to get system address\n");
+		return;
+	}
+
+	spa_entries[spa_entry] = spa;
+	pr_debug("fru_idx: %u, entry: %u, spa_entry: %u, spa: 0x%016llx\n",
+		 fru_idx, entry, spa_entry, spa_entries[spa_entry]);
+}
+
+
 static void update_fru_record(struct fru_rec *rec, struct mce *m)
 {
 	struct cper_sec_fru_mem_poison *fmp = &rec->fmp;
@@ -301,6 +355,7 @@ static void update_fru_record(struct fru_rec *rec, struct mce *m)
 	entry = fmp->nr_entries;

 save_fpd:
+	save_spa(rec, entry, m->addr, m->ipid, m->extcpu);
 	fpd_dest = &rec->entries[entry];
 	memcpy(fpd_dest, &fpd, sizeof(struct cper_fru_poison_desc));

@@ -385,6 +440,7 @@ static void retire_mem_fmp(struct fru_rec *rec)
 			continue;

 		retire_dram_row(fpd->addr, fpd->hw_id, err_cpu);
+		save_spa(rec, i, fpd->addr, fpd->hw_id, err_cpu);
 	}
 }

@@ -696,6 +752,8 @@ static int get_system_info(void)
 	if (!max_nr_entries)
 		max_nr_entries = FMPM_DEFAULT_MAX_NR_ENTRIES;

+	spa_nr_entries = max_nr_fru * max_nr_entries;
+
 	max_rec_len = sizeof(struct fru_rec);
 	max_rec_len += sizeof(struct cper_fru_poison_desc) * max_nr_entries;

@@ -714,6 +772,7 @@ static void free_records(void)
 		kfree(rec);

 	kfree(fru_records);
+	kfree(spa_entries);
 }

 static int allocate_records(void)
@@ -734,6 +793,15 @@ static int allocate_records(void)
 		}
 	}

+	spa_entries = kcalloc(spa_nr_entries, sizeof(u64), GFP_KERNEL);
+	if (!spa_entries) {
+		ret = -ENOMEM;
+		goto out_free;
+	}
+
+	for (i = 0; i < spa_nr_entries; i++)
+		spa_entries[i] = INVALID_SPA;
+
 	return ret;

 out_free:
--
2.34.1
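
One detail in save_spa() worth spelling out: when address translation fails, the function returns before the store, so the slot keeps its INVALID_SPA sentinel and the debugfs file can print "*" for it. The userspace sketch below imitates that flow. `is_err_value()` approximates what the kernel's IS_ERR_VALUE() checks, as far as I understand it (the top 4095 values of an unsigned long encode negative errno codes). The helper names and `MAX_ERRNO_SKETCH` are invented for this example.

```c
#include <assert.h>

#define MAX_ERRNO_SKETCH 4095UL	/* assumption: mirrors the kernel's MAX_ERRNO */

/* True if v falls in the top 4095 values, i.e. it encodes -1..-4095. */
static int is_err_value(unsigned long v)
{
	return v >= (unsigned long)-MAX_ERRNO_SKETCH;
}

/*
 * Store v into *slot only when it is a real address. On failure the
 * slot is left untouched, so a pre-set sentinel survives.
 */
static int store_if_valid(unsigned long long *slot, unsigned long v)
{
	assert(slot);

	if (is_err_value(v))
		return -1;

	*slot = v;
	return 0;
}
```

Initialising every slot to the sentinel up front, as allocate_records() does, is what makes the early return safe: a slot holds either a translated address or INVALID_SPA.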


2024-03-01 15:51:30

by Borislav Petkov

Subject: Re: [PATCH v2 2/3] RAS/AMD/FMPM: Save SPA values

On Fri, Mar 01, 2024 at 08:37:47AM -0600, Yazen Ghannam wrote:
> The system physical address (SPA) of an error is not a stable value. It
> will change depending on the location of the memory: parts can be
> swapped. And it will change depending on memory topology: NUMA nodes
> and/or interleaving can be adjusted.
>
> Therefore, the SPA value is not part of the "FRU Memory Poison" record
> format. And it will not be saved to persistent storage.
>
> However, the SPA values can be helpful during debug and for system
> admins during run time.
>
> Save the SPA values in a separate structure. This is updated when
> records are restored and when new errors are saved.
>
> Signed-off-by: Yazen Ghannam <[email protected]>
> ---
> Link:
> https://lore.kernel.org/r/[email protected]
>
> v1->v2:
> * Changed variable names to remove "sys_" prefix. (Boris)
> * Used "spa_" prefix to highlight that these are for SPA values. (Yazen)
> * Added warning to "index out-of-bound" condition. (Boris)
> * Reworked save_spa() flow to get a valid array position before saving
> SPA value (Yazen).
>
> drivers/ras/amd/fmpm.c | 68 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 68 insertions(+)

Fixups on top:

---

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index a7bb36eb60cb..8c3188488673 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -125,7 +125,7 @@ static u64 *spa_entries;
 		  0x12, 0x0a, 0x44, 0x58)

 /**
- * DOC: fru_poison_entries (byte)
+ * DOC: max_nr_entries (byte)
  * Maximum number of descriptor entries possible for each FRU.
  *
  * Values between '1' and '255' are valid.
@@ -285,10 +285,12 @@ static void save_spa(struct fru_rec *rec, unsigned int entry,
 	unsigned long spa;

 	if (entry >= max_nr_entries) {
-		pr_warn_once("entry out-of-bounds\n");
+		pr_warn_once("FRU descriptor entry %d out-of-bounds (max: %d)\n",
+			     entry, max_nr_entries);
 		return;
 	}

+	/* spa_nr_entries is always multiple of max_nr_entries */
 	for (i = 0; i < spa_nr_entries; i += max_nr_entries) {
 		fru_idx = i / max_nr_entries;
 		if (fru_records[fru_idx] == rec)
@@ -296,7 +298,7 @@ static void save_spa(struct fru_rec *rec, unsigned int entry,
 	}

 	if (i >= spa_nr_entries) {
-		pr_warn_once("record not found");
+		pr_warn_once("FRU record %d not found\n", i);
 		return;
 	}

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-02 09:50:19

by Borislav Petkov

Subject: Re: [PATCH v2 0/3] FMPM Debug Updates

On Fri, Mar 01, 2024 at 08:37:45AM -0600, Yazen Ghannam wrote:
> Hi all,
>
> This set adds two pieces of debug functionality.
> 1) Saving the system physical address of a recorded error.
> 2) Printing record entries through a debugfs file.
>
> I'd like to include Murali, Naveen, and Sathya as co-developers, since
> this is based on their previous work from here:
> https://lore.kernel.org/r/[email protected]
>
> v1 Link:
> https://lore.kernel.org/r/[email protected]
>
> v1->v2:
> * Patch 1 replaced with suggested patch from Boris.
> * Patch 2 update variable names and some code flow.
> * Patch 3 rebase on changes from 1 and 2.
>
> Thanks,
> Yazen
>
> Borislav Petkov (AMD) (1):
> RAS: Export helper to get ras_debugfs_dir
>
> Yazen Ghannam (2):
> RAS/AMD/FMPM: Save SPA values
> RAS/AMD/FMPM: Add debugfs interface to print record entries
>
> drivers/ras/amd/fmpm.c | 199 +++++++++++++++++++++++++++++++++++++++++
> drivers/ras/cec.c | 10 ++-
> drivers/ras/debugfs.c | 8 +-
> drivers/ras/debugfs.h | 2 +-
> 4 files changed, 215 insertions(+), 4 deletions(-)

Applied, thanks.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-04 05:13:48

by M K, Muralidhara

Subject: Re: [PATCH v2 0/3] FMPM Debug Updates

Hi Yazen and Boris,

For patch 2, "RAS/AMD/FMPM: Save SPA values", please include the tags below:

Co-developed-by: Naveen Krishna Chatradhi <[email protected]>
Signed-off-by: Naveen Krishna Chatradhi <[email protected]>
Co-developed-by: [email protected]
Signed-off-by: [email protected]
Tested-by: [email protected]

For patch 3, "RAS/AMD/FMPM: Add debugfs interface to print record
entries", please include the tag below:
Tested-by: [email protected]



On 3/1/2024 8:07 PM, Yazen Ghannam wrote:
> Hi all,
>
> This set adds two pieces of debug functionality.
> 1) Saving the system physical address of a recorded error.
> 2) Printing record entries through a debugfs file.
>
> I'd like to include Murali, Naveen, and Sathya as co-developers, since
> this is based on their previous work from here:
> https://lore.kernel.org/r/[email protected]
>
> v1 Link:
> https://lore.kernel.org/r/[email protected]
>
> v1->v2:
> * Patch 1 replaced with suggested patch from Boris.
> * Patch 2 update variable names and some code flow.
> * Patch 3 rebase on changes from 1 and 2.
>
> Thanks,
> Yazen
>
> Borislav Petkov (AMD) (1):
> RAS: Export helper to get ras_debugfs_dir
>
> Yazen Ghannam (2):
> RAS/AMD/FMPM: Save SPA values
> RAS/AMD/FMPM: Add debugfs interface to print record entries
>
> drivers/ras/amd/fmpm.c | 199 +++++++++++++++++++++++++++++++++++++++++
> drivers/ras/cec.c | 10 ++-
> drivers/ras/debugfs.c | 8 +-
> drivers/ras/debugfs.h | 2 +-
> 4 files changed, 215 insertions(+), 4 deletions(-)
>
>
> base-commit: 3513ecaa685c6627a943b1f610421754734301fa
>