LinuxLists.cc - [RFC PATCH 0/2] EDAC/ghes: Add EDAC device for recording the CPU error count

2020-12-08 17:33:36

Subject: [RFC PATCH 0/2] EDAC/ghes: Add EDAC device for recording the CPU error count

For the firmware-first error handling on ARM64 hardware platforms,
CPU cache corrected error count is not recorded.
Create an CPU EDAC device and device blocks for the CPU caches
for this purpose. The EDAC device blocks are created based on the
cache information from the cpu_cacheinfo.

User-space application could monitor the recorded corrected error
count for the predictive failure analysis.

More information in the patch headers.

Shiju Jose (2):
EDAC/ghes: Add EDAC device for the CPU caches
ACPI / APEI: Add reporting ARM64 CPU cache corrected error count

Documentation/ABI/testing/sysfs-devices-edac | 15 ++
drivers/acpi/apei/ghes.c | 76 ++++++++-
drivers/edac/Kconfig | 10 ++
drivers/edac/ghes_edac.c | 171 +++++++++++++++++++
include/acpi/ghes.h | 27 +++
include/linux/cper.h | 4 +
6 files changed, 299 insertions(+), 4 deletions(-)

--
2.17.1

2020-12-08 17:37:17

by Shiju Jose

[permalink] [raw]

Subject: [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for the CPU caches

The corrected error count on the CPU caches required
reporting to the user-space for the predictive failure
analysis. For this purpose, add an EDAC device and device
blocks for the CPU caches found.
The cache's corrected error count would be stored in the
/sys/devices/system/edac/cpu/cpu*/cache*/ce_count.

Issues and possible solutions,
1.Cache info is not available for the CPUs offline.
EDAC device interface supports only creating EDAC device
and device blocks for the all the CPU caches together.
It requires the number of caches as device blocks for
the creation. However, this info is not available for
the CPUs which offline and may become online later.

Tested Solution: Find the max number of caches among
online CPUs, create the EDAC device for CPUs caches
and get and populate the cache info for an offline
CPU later, when the error is reported on that CPU for
the first time.

2. Reporting error count for the Shared caches.
There are few possible solutions,
2.1 Kernel would report a new error count for a shared cache
per CPU through the EDAC device block for that CPU.
Then user-space application sums the total error count
for a shared cache from EDAC device blocks of all the
CPUs in the shared CPU list of that shared cache.
2.2 Kernel would report a new error count for a shared cache
through the EDAC device blocks for all the CPUs in the
shared CPU list of that shared cache.

Tested Solution: The current implementation used the solution 2.1

For the firmware-first error handling, add an interface in the
ghes_edac for reporting the CPU corrected error count for
a CPU core to the user-space through the CPU EDAC device.

Suggested-by: James Morse <[email protected]>
Signed-off-by: Shiju Jose <[email protected]>
---
Documentation/ABI/testing/sysfs-devices-edac | 15 ++
drivers/acpi/apei/ghes.c | 8 +-
drivers/edac/Kconfig | 10 ++
drivers/edac/ghes_edac.c | 171 +++++++++++++++++++
include/acpi/ghes.h | 27 +++
5 files changed, 230 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-edac b/Documentation/ABI/testing/sysfs-devices-edac
index 256a9e990c0b..56a18b0af419 100644
--- a/Documentation/ABI/testing/sysfs-devices-edac
+++ b/Documentation/ABI/testing/sysfs-devices-edac
@@ -155,3 +155,18 @@ Description: This attribute file displays the total count of uncorrectable
errors that have occurred on this DIMM. If panic_on_ue is set, this
counter will not have a chance to increment, since EDAC will panic the
system
+
+What: /sys/devices/system/edac/cpu/cpu*/cache*/ce_count
+Date: December 2020
+Contact: [email protected]
+Description: This attribute file displays the total count of correctable
+ errors that have occurred on this CPU cache. This count is very important
+ to examine. CEs provide early indications that a cache is beginning
+ to fail. This count field should be monitored for non-zero values
+ and report such information to the system administrator.
+
+What: /sys/devices/system/edac/cpu/cpu*/cache*/ue_count
+Date: December 2020
+Contact: [email protected]
+Description: This attribute file displays the total count of uncorrectable
+ errors that have occurred on this CPU cache.
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index fce7ade2aba9..e7b0edbda0f8 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -1452,4 +1452,10 @@ static int __init ghes_init(void)
err:
return rc;
}
-device_initcall(ghes_init);
+
+/*
+ * device_initcall_sync() is added instead of the device_initcall()
+ * because the CPU cacheinfo should be populated and needed for
+ * adding the CPU cache edac device blocks in the ghes_edac_register().
+ */
+device_initcall_sync(ghes_init);
diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 7a47680d6f07..c73eeab27ac9 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -74,6 +74,16 @@ config EDAC_GHES

In doubt, say 'Y'.

+config EDAC_GHES_CPU_ERROR
+ bool "EDAC device for reporting firmware-first BIOS detected CPU error count"
+ depends on EDAC_GHES
+ help
+ EDAC device for the firmware-first BIOS detected CPU error count reported
+ via ACPI APEI/GHES. By enabling this option, EDAC device for the CPU
+ hierarchy and edac device blocks for caches would be created.
+ The cpu error count is shared with the userspace via the CPU EDAC
+ device's sysfs interface.
+
config EDAC_AMD64
tristate "AMD64 (Opteron, Athlon64)"
depends on AMD_NB && EDAC_DECODE_MCE
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index a918ca93e4f7..e6a73a413b33 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -12,6 +12,9 @@
#include <acpi/ghes.h>
#include <linux/edac.h>
#include <linux/dmi.h>
+#if defined(CONFIG_EDAC_GHES_CPU_ERROR)
+#include <linux/cacheinfo.h>
+#endif
#include "edac_module.h"
#include <ras/ras_event.h>

@@ -497,6 +500,166 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
spin_unlock_irqrestore(&ghes_lock, flags);
}

+#if defined(CONFIG_EDAC_GHES_CPU_ERROR)
+static struct ghes_edac_cpu_block {
+ int cpu;
+ u8 level;
+ u8 type;
+ int block_nr;
+ bool info_populated;
+} *cpu_edac_block_list;
+
+static struct edac_device_ctl_info *cpu_edac_dev;
+static unsigned int max_number_of_caches;
+
+static int ghes_edac_add_cpu_device(struct device *dev)
+{
+ int rc;
+
+ cpu_edac_dev = edac_device_alloc_ctl_info(0, "cpu", num_possible_cpus(),
+ "cache", max_number_of_caches, 0, NULL,
+ 0, edac_device_alloc_index());
+ if (!cpu_edac_dev) {
+ pr_warn("ghes-edac cpu edac device registration failed\n");
+ return -ENOMEM;
+ }
+
+ cpu_edac_dev->dev = dev;
+ cpu_edac_dev->ctl_name = "cpu_edac_dev";
+ cpu_edac_dev->dev_name = "ghes";
+ cpu_edac_dev->mod_name = "ghes_edac.c";
+ rc = edac_device_add_device(cpu_edac_dev);
+ if (rc > 0) {
+ pr_warn("edac_device_add_device failed\n");
+ edac_device_free_ctl_info(cpu_edac_dev);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static void ghes_edac_delete_cpu_device(void)
+{
+ max_number_of_caches = 0;
+ if (cpu_edac_dev) {
+ edac_device_del_device(cpu_edac_dev->dev);
+ edac_device_free_ctl_info(cpu_edac_dev);
+ }
+ vfree(cpu_edac_block_list);
+}
+
+static int ghes_edac_get_cache_info(int cpu)
+{
+ struct ghes_edac_cpu_block *block;
+ struct cpu_cacheinfo *this_cpu_ci;
+ struct cacheinfo *this_leaf;
+ int i, num_caches;
+
+ this_cpu_ci = get_cpu_cacheinfo(cpu);
+ if (!this_cpu_ci || !this_cpu_ci->info_list || !this_cpu_ci->num_leaves)
+ return -EINVAL;
+
+ this_leaf = this_cpu_ci->info_list;
+ /*
+ * Cache info would not be available for a CPU which is offline. However EDAC device
+ * creation requires the number of device blocks (for example max number of caches
+ * among CPUs). Thus cache info in the cpu_edac_block_list would be populated when
+ * the error is reported on that cpu. Thus we need to restrict the number of caches
+ * if the CPU's num_leaves exceed the max_number_of_caches.
+ */
+ num_caches = min(max_number_of_caches, this_cpu_ci->num_leaves);
+
+ /*
+ * The edac CPU cache device blocks entries in the sysfs should match with the
+ * CPU cache structure in the sysfs so that the affected cpus for a shared cache
+ * can be easily extracted in the userspace.
+ */
+ for (i = 0; i < num_caches; i++) {
+ block = cpu_edac_block_list + (cpu * max_number_of_caches) + i;
+ block->cpu = cpu;
+ block->level = this_leaf->level;
+ block->type = this_leaf->type;
+ block->block_nr = i;
+ block->info_populated = true;
+ this_leaf++;
+ }
+
+ return 0;
+}
+
+static void ghes_edac_create_cpu_device(struct device *dev)
+{
+ int cpu;
+ struct cpu_cacheinfo *this_cpu_ci;
+
+ /*
+ * Find the maximum number of caches present in the CPU heirarchy
+ * among the online CPUs.
+ */
+ for_each_online_cpu(cpu) {
+ this_cpu_ci = get_cpu_cacheinfo(cpu);
+ if (!this_cpu_ci)
+ continue;
+ if (max_number_of_caches < this_cpu_ci->num_leaves)
+ max_number_of_caches = this_cpu_ci->num_leaves;
+ }
+ if (!max_number_of_caches)
+ return;
+
+ /*
+ * EDAC device interface only supports creating the CPU cache hierarchy for alls
+ * the CPUs together. Thus need to allocate cpu_edac_block_list for the
+ * max_number_of_caches among all the CPUs irrespective of the number of caches
+ * per CPU might vary.
+ */
+ cpu_edac_block_list = vzalloc(num_possible_cpus() * max_number_of_caches *
+ sizeof(*cpu_edac_block_list));
+ if (!cpu_edac_block_list)
+ return;
+
+ if (ghes_edac_add_cpu_device(dev))
+ goto error;
+
+ for_each_online_cpu(cpu)
+ ghes_edac_get_cache_info(cpu);
+
+ return;
+
+error:
+ ghes_edac_delete_cpu_device();
+}
+
+void ghes_edac_report_cpu_error(struct ghes_einfo_cpu *einfo)
+{
+ struct ghes_edac_cpu_block *block;
+ int i;
+
+ if (!einfo || !(einfo->ce_count) || !max_number_of_caches)
+ return;
+
+ /*
+ * EDAC device require the number of device blocks (for example max number of caches
+ * among CPUs) during the creation. For the CPUs that were offline in the cpu edac
+ * init and become online later, the cache info would be populated when the error is
+ * reported on that cpu.
+ */
+ block = cpu_edac_block_list + (einfo->cpu * max_number_of_caches);
+ if (!block->info_populated) {
+ if (ghes_edac_get_cache_info(einfo->cpu))
+ return;
+ }
+
+ for (i = 0; i < max_number_of_caches; i++) {
+ block = cpu_edac_block_list + (einfo->cpu * max_number_of_caches) + i;
+ if ((block->level == einfo->cache_level) && (block->type == einfo->cache_type)) {
+ edac_device_handle_ce_count(cpu_edac_dev, einfo->ce_count,
+ einfo->cpu, block->block_nr, "");
+ break;
+ }
+ }
+}
+#endif
+
/*
* Known systems that are safe to enable this module.
*/
@@ -624,6 +787,10 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
ghes_pvt = pvt;
spin_unlock_irqrestore(&ghes_lock, flags);

+#if defined(CONFIG_EDAC_GHES_CPU_ERROR)
+ ghes_edac_create_cpu_device(dev);
+#endif
+
/* only set on success */
refcount_set(&ghes_refcount, 1);

@@ -654,6 +821,10 @@ void ghes_edac_unregister(struct ghes *ghes)
if (!refcount_dec_and_test(&ghes_refcount))
goto unlock;

+#if defined(CONFIG_EDAC_GHES_CPU_ERROR)
+ ghes_edac_delete_cpu_device();
+#endif
+
/*
* Wait for the irq handler being finished.
*/
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 34fb3431a8f3..8702eba26afc 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -73,6 +73,24 @@ void ghes_unregister_vendor_record_notifier(struct notifier_block *nb);

int ghes_estatus_pool_init(int num_ghes);

+/**
+ * struct ghes_einfo_cpu - structure to pass the CPU error info to the edac
+ * @cpu: CPU index.
+ * @error_type: error type, cache/TLB/bus/ etc.
+ * @cache_level: cache level.
+ * @cache_type: ACPI cache type.
+ * @ue_count: CPU uncorrectable error count.
+ * @ce_count: CPU correctable error count.
+ */
+struct ghes_einfo_cpu {
+ int cpu;
+ u8 error_type;
+ u8 cache_level;
+ u8 cache_type;
+ u16 ue_count;
+ u16 ce_count;
+};
+
/* From drivers/edac/ghes_edac.c */

#ifdef CONFIG_EDAC_GHES
@@ -98,6 +116,15 @@ static inline void ghes_edac_unregister(struct ghes *ghes)
}
#endif

+#ifdef CONFIG_EDAC_GHES_CPU_ERROR
+void ghes_edac_report_cpu_error(struct ghes_einfo_cpu *einfo_cpu);
+
+#else
+static inline void ghes_edac_report_cpu_error(struct ghes_einfo_cpu *einfo_cpu)
+{
+}
+#endif
+
static inline int acpi_hest_get_version(struct acpi_hest_generic_data *gdata)
{
return gdata->revision >> 8;
--
2.17.1

2020-12-08 20:27:54

by Shiju Jose

[permalink] [raw]

Subject: [RFC PATCH 2/2] ACPI / APEI: Add reporting ARM64 CPU cache corrected error count

Add reporting ARM64 CPU cache corrected error count to the ghes_edac.
The error count would be updated in the EDAC CPU cache sysfs
interface.

Note: This patch would be recreated after the patch
"ACPI / APEI: do memory failure on the physical address reported by ARM processor error section"
would be merged on the mainline.

Signed-off-by: Shiju Jose <[email protected]>
---
drivers/acpi/apei/ghes.c | 68 ++++++++++++++++++++++++++++++++++++++--
include/linux/cper.h | 4 +++
2 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index e7b0edbda0f8..37f8b09d810d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -41,6 +41,7 @@
#include <linux/uuid.h>
#include <linux/ras.h>
#include <linux/task_work.h>
+#include <linux/cacheinfo.h>

#include <acpi/actbl1.h>
#include <acpi/ghes.h>
@@ -523,6 +524,69 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
#endif
}

+static u8 arm_err_transaction_type_to_cache_type(u8 trans_type)
+{
+ switch (trans_type) {
+ case CPER_ARM_CACHE_TRANS_TYPE_INSTRUCTION:
+ return CACHE_TYPE_INST;
+ case CPER_ARM_CACHE_TRANS_TYPE_DATA:
+ return CACHE_TYPE_DATA;
+ case CPER_ARM_CACHE_TRANS_TYPE_GENERIC:
+ default:
+ return CACHE_TYPE_UNIFIED;
+ }
+}
+
+static void ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata)
+{
+ struct cper_sec_proc_arm *error = acpi_hest_get_payload(gdata);
+ struct cper_arm_err_info *err_info;
+ struct ghes_einfo_cpu einfo;
+ u8 transaction_type;
+ u64 error_info;
+ int sec_sev;
+ int i;
+
+ log_arm_hw_error(error);
+
+ sec_sev = ghes_severity(gdata->error_severity);
+
+#if defined(CONFIG_ARM64)
+ if (sec_sev == GHES_SEV_CORRECTED) {
+ memset(&einfo, 0, sizeof(einfo));
+ einfo.cpu = get_logical_index(error->mpidr);
+ if (einfo.cpu == -EINVAL)
+ return;
+
+ /*
+ * ARM processor error types are cache/TLB/bus errors.
+ * Presently corrected error count for caches only
+ * is reported.
+ */
+ err_info = (struct cper_arm_err_info *)(error + 1);
+
+ for (i = 0; i < error->err_info_num; i++) {
+ if (err_info->type != CPER_ARM_CACHE_ERROR)
+ continue;
+ einfo.ce_count = err_info->multiple_error + 1;
+
+ error_info = err_info->error_info;
+ if (!(error_info & CPER_ARM_ERR_VALID_TRANSACTION_TYPE) ||
+ !(error_info & CPER_ARM_ERR_VALID_LEVEL))
+ continue;
+
+ transaction_type = ((error_info >> CPER_ARM_ERR_TRANSACTION_SHIFT)
+ & CPER_ARM_ERR_TRANSACTION_MASK);
+ einfo.cache_type = arm_err_transaction_type_to_cache_type(transaction_type);
+ einfo.cache_level = ((error_info >> CPER_ARM_ERR_LEVEL_SHIFT)
+ & CPER_ARM_ERR_LEVEL_MASK);
+ ghes_edac_report_cpu_error(&einfo);
+ err_info += 1;
+ }
+ }
+#endif
+}
+
static BLOCKING_NOTIFIER_HEAD(vendor_record_notify_list);

int ghes_register_vendor_record_notifier(struct notifier_block *nb)
@@ -605,9 +669,7 @@ static bool ghes_do_proc(struct ghes *ghes,
ghes_handle_aer(gdata);
}
else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
- struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
-
- log_arm_hw_error(err);
+ ghes_handle_arm_hw_error(gdata);
} else {
void *err = acpi_hest_get_payload(gdata);

diff --git a/include/linux/cper.h b/include/linux/cper.h
index 6a511a1078ca..0ea966af6ad9 100644
--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -314,6 +314,10 @@ enum {
#define CPER_ARM_ERR_ACCESS_MODE_SHIFT 43
#define CPER_ARM_ERR_ACCESS_MODE_MASK GENMASK(0,0)

+#define CPER_ARM_CACHE_TRANS_TYPE_INSTRUCTION 0
+#define CPER_ARM_CACHE_TRANS_TYPE_DATA 1
+#define CPER_ARM_CACHE_TRANS_TYPE_GENERIC 2
+
/*
* All tables and structs must be byte-packed to match CPER
* specification, since the tables are provided by the system BIOS
--
2.17.1

2020-12-31 16:55:19

by Borislav Petkov

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for the CPU caches

On Tue, Dec 08, 2020 at 05:29:58PM +0000, Shiju Jose wrote:
> The corrected error count on the CPU caches required
> reporting to the user-space for the predictive failure
> analysis. For this purpose, add an EDAC device and device
> blocks for the CPU caches found.
> The cache's corrected error count would be stored in the
> /sys/devices/system/edac/cpu/cpu*/cache*/ce_count.

This still doesn't begin to explain why the kernel needs this. I had
already asked whether errors in CPU caches are something which happen
often enough so that software should count them but nothing came. So pls
justify first why this wants to be added to the kernel.

> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index 7a47680d6f07..c73eeab27ac9 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -74,6 +74,16 @@ config EDAC_GHES
>
> In doubt, say 'Y'.
>
> +config EDAC_GHES_CPU_ERROR
> + bool "EDAC device for reporting firmware-first BIOS detected CPU error count"

Why a separate Kconfig item?

> + depends on EDAC_GHES
> + help
> + EDAC device for the firmware-first BIOS detected CPU error count reported

Well this is not what it is doing - you're talking about cache errors.
"CPU errors" can be a lot more than just cache errors.

> +static void ghes_edac_create_cpu_device(struct device *dev)
> +{
> + int cpu;
> + struct cpu_cacheinfo *this_cpu_ci;
> +
> + /*
> + * Find the maximum number of caches present in the CPU heirarchy
> + * among the online CPUs.
> + */
> + for_each_online_cpu(cpu) {
> + this_cpu_ci = get_cpu_cacheinfo(cpu);
> + if (!this_cpu_ci)
> + continue;
> + if (max_number_of_caches < this_cpu_ci->num_leaves)
> + max_number_of_caches = this_cpu_ci->num_leaves;

So this is counting the number of cache levels on the system? So you
want to count the errors per cache levels?

> + }
> + if (!max_number_of_caches)
> + return;
> +
> + /*
> + * EDAC device interface only supports creating the CPU cache hierarchy for alls
> + * the CPUs together. Thus need to allocate cpu_edac_block_list for the
> + * max_number_of_caches among all the CPUs irrespective of the number of caches
> + * per CPU might vary.
> + */

So this is lumping all the caches together into a single list? What for?
To untangle to the proper ones when the error gets reported?

Have you heard of percpu variables?

> @@ -624,6 +787,10 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
> ghes_pvt = pvt;
> spin_unlock_irqrestore(&ghes_lock, flags);
>
> +#if defined(CONFIG_EDAC_GHES_CPU_ERROR)
> + ghes_edac_create_cpu_device(dev);
> +#endif
> +

Init stuff belongs into ghes_scan_system().

...

Ok, I've seen enough. "required reporting to the user-space for the
predictive failure analysis." is not even trying to explain *why* you're
doing this, what *actual* problem it is addressing and why should the
kernel get it.

And without a proper problem definition of what you're trying to solve,
this is not going anywhere.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-01-15 11:13:42

by Shiju Jose

[permalink] [raw]

Subject: RE: [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for the CPU caches

Hi Boris,

Thanks for the feedback.
Apologies for the delay.

>-----Original Message-----
>From: Borislav Petkov [mailto:[email protected]]
>Sent: 31 December 2020 16:44
>To: Shiju Jose <[email protected]>
>Cc: [email protected]; [email protected]; linux-
>[email protected]; [email protected];
>[email protected]; [email protected]; [email protected];
>[email protected]; [email protected]; Linuxarm <[email protected]>;
>xuwei (O) <[email protected]>; Jonathan Cameron
><[email protected]>; John Garry <[email protected]>;
>tanxiaofei <[email protected]>; Shameerali Kolothum Thodi
><[email protected]>; Salil Mehta
><[email protected]>
>Subject: Re: [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for the CPU
>caches
>
>On Tue, Dec 08, 2020 at 05:29:58PM +0000, Shiju Jose wrote:
>> The corrected error count on the CPU caches required reporting to the
>> user-space for the predictive failure analysis. For this purpose, add
>> an EDAC device and device blocks for the CPU caches found.
>> The cache's corrected error count would be stored in the
>> /sys/devices/system/edac/cpu/cpu*/cache*/ce_count.
>
>This still doesn't begin to explain why the kernel needs this. I had already
>asked whether errors in CPU caches are something which happen often
>enough so that software should count them but nothing came. So pls justify
>first why this wants to be added to the kernel.

L2 cache corrected errors are detected occasionally on few of
our ARM64 hardware boards. Though it is rare, the probability of
the CPU cache errors frequently occurring can't be avoided.
The earlier failure detection by monitoring the cache corrected
errors for the frequent occurrences and taking preventive
action could prevent more serious hardware faults.

On Intel architectures, cache corrected errors are reported and
the affected cores are offline in the architecture specific method.
http://www.mcelog.org/cache.html

However for the firmware-first error reporting, specifically on
ARM64 architectures, there is no provision present for reporting
the cache corrected error count to the user-space and taking
preventive action such as offline the affected cores.
>
>> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig index
>> 7a47680d6f07..c73eeab27ac9 100644
>> --- a/drivers/edac/Kconfig
>> +++ b/drivers/edac/Kconfig
>> @@ -74,6 +74,16 @@ config EDAC_GHES
>>
>> In doubt, say 'Y'.
>>
>> +config EDAC_GHES_CPU_ERROR
>> + bool "EDAC device for reporting firmware-first BIOS detected CPU
>error count"
>
>Why a separate Kconfig item?
CONFIG_EDAC_GHES_CPU_CACHE_ERROR is added to make this
feature optional only for the platforms which need this and supported.

>
>> + depends on EDAC_GHES
>> + help
>> + EDAC device for the firmware-first BIOS detected CPU error count
>> +reported
>
>Well this is not what it is doing - you're talking about cache errors.
>"CPU errors" can be a lot more than just cache errors.
Sure. I will change.

>
>> +static void ghes_edac_create_cpu_device(struct device *dev) {
>> + int cpu;
>> + struct cpu_cacheinfo *this_cpu_ci;
>> +
>> + /*
>> + * Find the maximum number of caches present in the CPU heirarchy
>> + * among the online CPUs.
>> + */
>> + for_each_online_cpu(cpu) {
>> + this_cpu_ci = get_cpu_cacheinfo(cpu);
>> + if (!this_cpu_ci)
>> + continue;
>> + if (max_number_of_caches < this_cpu_ci->num_leaves)
>> + max_number_of_caches = this_cpu_ci->num_leaves;
>
>So this is counting the number of cache levels on the system? So you want to
>count the errors per cache levels?
Yes. This was the suggestion from James and to offline the affected cores for the shared cache.

>
>> + }
>> + if (!max_number_of_caches)
>> + return;
>> +
>> + /*
>> + * EDAC device interface only supports creating the CPU cache
>hierarchy for alls
>> + * the CPUs together. Thus need to allocate cpu_edac_block_list for
>the
>> + * max_number_of_caches among all the CPUs irrespective of the
>number of caches
>> + * per CPU might vary.
>> + */
>
>So this is lumping all the caches together into a single list? What for?
>To untangle to the proper ones when the error gets reported?
>
>Have you heard of percpu variables?
Yes. Changed the list to the percpu variable.

>
>> @@ -624,6 +787,10 @@ int ghes_edac_register(struct ghes *ghes, struct
>device *dev)
>> ghes_pvt = pvt;
>> spin_unlock_irqrestore(&ghes_lock, flags);
>>
>> +#if defined(CONFIG_EDAC_GHES_CPU_ERROR)
>> + ghes_edac_create_cpu_device(dev);
>> +#endif
>> +
>
>Init stuff belongs into ghes_scan_system().
>
Did you mean calling ghes_edac_create_cpu_device() in the ghes_scan_system()?

>...
>
>Ok, I've seen enough. "required reporting to the user-space for the predictive
>failure analysis." is not even trying to explain *why* you're doing this, what
>*actual* problem it is addressing and why should the kernel get it.
>
>And without a proper problem definition of what you're trying to solve, this
>is not going anywhere.
>
>--
>Regards/Gruss,
> Boris.
>

Thanks,
Shiju

2021-01-19 05:01:56

by Borislav Petkov

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for the CPU caches

On Fri, Jan 15, 2021 at 11:06:30AM +0000, Shiju Jose wrote:
> L2 cache corrected errors are detected occasionally on few of
> our ARM64 hardware boards. Though it is rare, the probability of
> the CPU cache errors frequently occurring can't be avoided.
> The earlier failure detection by monitoring the cache corrected
> errors for the frequent occurrences and taking preventive
> action could prevent more serious hardware faults.
>
> On Intel architectures, cache corrected errors are reported and
> the affected cores are offline in the architecture specific method.
> http://www.mcelog.org/cache.html
>
> However for the firmware-first error reporting, specifically on
> ARM64 architectures, there is no provision present for reporting
> the cache corrected error count to the user-space and taking
> preventive action such as offline the affected cores.

How hard was it to write that in your first submission? What do you
think would be the best way to persuade a patch reviewer/maintainer to
take a look at your submission?

> >Why a separate Kconfig item?
> CONFIG_EDAC_GHES_CPU_CACHE_ERROR is added to make this
> feature optional only for the platforms which need this and supported.
>
> >
> >> + depends on EDAC_GHES

depends on EDAC_GHES hardly expresses which platforms need it/support
it.

If anything, depends on ARM64.

> >Init stuff belongs into ghes_scan_system().
> >
> Did you mean calling ghes_edac_create_cpu_device() in the ghes_scan_system()?

I mean, all hardware discovery needs to happen in ghes_scan_system
- you don't need to call those from outside the driver, in
ghes_edac_register().

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-01-19 11:10:43

by Shiju Jose

[permalink] [raw]

Subject: RE: [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for the CPU caches

Hi Boris,

>-----Original Message-----
>From: Borislav Petkov [mailto:[email protected]]
>Sent: 18 January 2021 18:37
>To: Shiju Jose <[email protected]>
>Cc: [email protected]; [email protected]; linux-
>[email protected]; [email protected];
>[email protected]; [email protected]; [email protected];
>[email protected]; [email protected]; Jonathan Cameron
><[email protected]>; tanxiaofei <[email protected]>;
>[email protected]
>Subject: Re: [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for the CPU
>caches
>
>On Fri, Jan 15, 2021 at 11:06:30AM +0000, Shiju Jose wrote:
>> L2 cache corrected errors are detected occasionally on few of our
>> ARM64 hardware boards. Though it is rare, the probability of the CPU
>> cache errors frequently occurring can't be avoided.
>> The earlier failure detection by monitoring the cache corrected errors
>> for the frequent occurrences and taking preventive action could
>> prevent more serious hardware faults.
>>
>> On Intel architectures, cache corrected errors are reported and the
>> affected cores are offline in the architecture specific method.
>> http://www.mcelog.org/cache.html
>>
>> However for the firmware-first error reporting, specifically on
>> ARM64 architectures, there is no provision present for reporting the
>> cache corrected error count to the user-space and taking preventive
>> action such as offline the affected cores.
>
>How hard was it to write that in your first submission? What do you think
>would be the best way to persuade a patch reviewer/maintainer to take a
>look at your submission?
>
>> >Why a separate Kconfig item?
>> CONFIG_EDAC_GHES_CPU_CACHE_ERROR is added to make this feature
>> optional only for the platforms which need this and supported.
>>
>> >
>> >> + depends on EDAC_GHES
>
>depends on EDAC_GHES hardly expresses which platforms need it/support it.
>
>If anything, depends on ARM64.
Sure. I will add dependency on ARM64.
This EDAC code for the cache errors is architecture independent for the
firmware-first error reporting and could be used for other architectures,
though now we need it for the ARM64.

>
>> >Init stuff belongs into ghes_scan_system().
>> >
>> Did you mean calling ghes_edac_create_cpu_device() in the
>ghes_scan_system()?
>
>I mean, all hardware discovery needs to happen in ghes_scan_system
>- you don't need to call those from outside the driver, in
>ghes_edac_register().

sure. Will modify.
>
>--
>Regards/Gruss,
> Boris.

Thanks,
Shiju

2021-01-19 11:43:07

by Borislav Petkov

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] EDAC/ghes: Add EDAC device for the CPU caches

On Tue, Jan 19, 2021 at 10:04:23AM +0000, Shiju Jose wrote:
> This EDAC code for the cache errors is architecture independent for the
> firmware-first error reporting and could be used for other architectures,

I'm not so sure about that because you're lumping all the cache
hierarchies together and there might be L3 slices on some x86
incarnations, for example, which do not belong to the core you're
reporting the error for... It would need to be tested though.

Also, if this is a firmware-first mode, then I would expect the firmware
to also report which cache triggered the error and thus not need any OS
glue at all.

Therefore ARM only and I'd need an ACK from ARM folks whether they want
it this way.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette